NueSpeak Apex is NueForm's proprietary text-to-speech engine, purpose-built for conversational form interactions. It powers both the telephony voice agent and the in-browser TTS narration feature.
Architecture
NueSpeak Apex is a transformer-based neural TTS model with a multi-scale generative architecture. It processes text through a hierarchical pipeline — semantic understanding, prosody prediction, and acoustic waveform synthesis — to produce natural, human-like speech in real time.
Key Specifications
| Specification | Value |
|---|---|
| Model parameters | ~1.8 billion |
| Token generation rate | 11 Hz |
| Supported languages | English, Spanish, French, Chinese, Japanese, Korean, Hindi, Arabic, and 20+ additional languages |
| Voice cloning | Zero-shot from ≥10 seconds of audio |
| Latency (time to first audio) | < 280 ms (median) |
| Real-time factor | 0.04× (generates 25× faster than real time) |
| Audio output | 24 kHz, 16-bit PCM |
| Streaming | Chunk-based progressive delivery |
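
The real-time factor in the table can be read as simple arithmetic: synthesis time is the audio duration multiplied by the factor, and the speedup is its reciprocal. The sketch below only illustrates that relationship. The function names are hypothetical and are not part of any NueForm API, and the figure covers generation throughput only, not the time to first audio.

```python
# Illustrative arithmetic only; these helpers are hypothetical, not a NueForm API.

REAL_TIME_FACTOR = 0.04  # seconds of compute per second of audio (from the table)


def synthesis_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimate how long it takes to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


def speedup(rtf: float = REAL_TIME_FACTOR) -> float:
    """Return how many times faster than real time the engine generates audio."""
    return 1.0 / rtf


if __name__ == "__main__":
    print(f"10 s of audio takes about {synthesis_seconds(10.0):.2f} s to generate")
    print(f"speedup over real time: about {speedup():.0f}x")
```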
Features
Zero-Shot Voice Cloning
NueSpeak Apex can replicate a speaker's voice characteristics from a single audio sample of 10 seconds or more. The cloning pipeline extracts:
- Timbre — The unique tonal quality of the voice
- Pitch contour — Natural intonation patterns
- Speaking rate — Base cadence and rhythm
- Accent characteristics — Regional pronunciation markers
No fine-tuning is required. The cloned voice is available as soon as sample processing completes (typically 2–4 seconds).
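Because cloning needs at least 10 seconds of audio, it can be useful to check a sample's length before uploading it. The following is a minimal sketch of such a client-side check, assuming the sample is an uncompressed WAV file. The function names are hypothetical, and the document does not say which audio formats the cloning pipeline accepts.

```python
import io
import wave

MIN_CLONE_SECONDS = 10.0  # minimum sample length stated above


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an uncompressed WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough_for_cloning(wav_bytes: bytes) -> bool:
    """Hypothetical pre-upload check against the 10-second minimum."""
    return wav_duration_seconds(wav_bytes) >= MIN_CLONE_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(is_long_enough_for_cloning(make_silent_wav(12.0)))
    print(is_long_enough_for_cloning(make_silent_wav(4.0)))
```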
Voice Design
Beyond cloning, NueSpeak Apex supports text-based voice design. Describe the voice you want in natural language — for example, "a warm, professional female voice with a slight British accent" — and the engine synthesizes a matching voice profile.
Prosody Control
The engine provides fine-grained control over speech prosody:
- Speed — Adjustable from 0.5× to 2.0× normal rate
- Emphasis — Mark words or phrases for stress
- Pauses — Insert natural pauses of configurable duration
- Emotion — Subtle emotional coloring (neutral, warm, energetic, calm)
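The documented ranges can be captured in a small settings object that rejects out-of-range values before a synthesis request is made. This is a sketch under assumptions: the `ProsodySettings` class and its field names are hypothetical and do not reflect the engine's actual request format. Only the speed range and the emotion labels come from the list above.

```python
from dataclasses import dataclass

# Limits taken from the list above; everything else here is hypothetical.
MIN_SPEED = 0.5
MAX_SPEED = 2.0
ALLOWED_EMOTIONS = frozenset({"neutral", "warm", "energetic", "calm"})


@dataclass(frozen=True)
class ProsodySettings:
    """Hypothetical container for prosody options, validated on creation."""

    speed: float = 1.0
    emotion: str = "neutral"

    def __post_init__(self) -> None:
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            raise ValueError(f"speed must be within {MIN_SPEED} and {MAX_SPEED}")
        if self.emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"emotion must be one of {sorted(ALLOWED_EMOTIONS)}")


if __name__ == "__main__":
    print(ProsodySettings(speed=1.25, emotion="warm"))
```

Validating on construction keeps invalid combinations from reaching the engine at all, which is usually easier to debug than a rejected request.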
Multilingual Synthesis
NueSpeak Apex natively supports 28+ languages without model switching. The engine automatically detects the input language and applies appropriate phoneme mappings, prosody rules, and accent models. Code-switching within a single utterance is supported.
Telephony Optimization
For phone calls, NueSpeak Apex applies additional processing:
- 8 kHz / μ-law encoding compatibility for PSTN delivery
- Noise floor management — Minimizes artifacts audible on phone speakers
- Adaptive pacing — Slightly slower delivery for phone contexts to improve comprehension
- Caching — Generated audio is cached per text segment, eliminating redundant synthesis
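For readers unfamiliar with μ-law, the sketch below shows standard G.711 μ-law companding of 16-bit PCM samples into 8-bit bytes. It illustrates the encoding format only and is not NueSpeak Apex's internal code. The resampling step from 24 kHz to 8 kHz, which needs a low-pass filter, is deliberately left out.

```python
# Sketch of G.711 mu-law companding; not the engine's actual implementation.

BIAS = 0x84   # 132, added so every magnitude falls into one of eight segments
CLIP = 32635  # largest magnitude that still fits in 15 bits after the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = magnitude.bit_length() - 8            # segment number, 0 to 7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F  # 4 bits below the leading bit
    # G.711 transmits the bitwise complement of sign, exponent, and mantissa.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_ulaw(samples) -> bytes:
    """Encode an iterable of 16-bit PCM samples to mu-law bytes."""
    return bytes(linear_to_ulaw(s) for s in samples)


if __name__ == "__main__":
    print(encode_ulaw([0, 1000, -1000, 32767]).hex())
```

Each 16-bit sample becomes one byte, so the encoded stream is half the size of the PCM input at the same sample rate.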
Performance
Benchmarks
Measured on production infrastructure under typical load:
| Metric | Value |
|---|---|
| Mean Opinion Score (MOS) | 4.32 / 5.0 |
| Character Error Rate (intelligibility) | 3.1% |
| Time to first byte (P50) | 245 ms |
| Time to first byte (P95) | 410 ms |
| Throughput | 4 concurrent streams per instance |
| Memory footprint | ~3.6 GB VRAM |
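
The throughput and memory figures lend themselves to a rough capacity estimate. The sketch below assumes the ~3.6 GB footprint applies per instance and that instances scale linearly. Neither assumption is stated in the table, and the helper names are hypothetical.

```python
import math

STREAMS_PER_INSTANCE = 4    # from the benchmark table
VRAM_GB_PER_INSTANCE = 3.6  # assumption: the footprint is per instance


def instances_needed(concurrent_streams: int) -> int:
    """Return how many instances cover the given number of simultaneous streams."""
    if concurrent_streams < 0:
        raise ValueError("concurrent_streams must be non-negative")
    return math.ceil(concurrent_streams / STREAMS_PER_INSTANCE)


def total_vram_gb(concurrent_streams: int) -> float:
    """Estimate total VRAM across all instances needed for the load."""
    return instances_needed(concurrent_streams) * VRAM_GB_PER_INSTANCE


if __name__ == "__main__":
    streams = 10
    print(f"{streams} streams: {instances_needed(streams)} instances, "
          f"about {total_vram_gb(streams):.1f} GB VRAM")
```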
Comparison with Industry Standards
| Feature | NueSpeak Apex | Cloud TTS (typical) | Open-source TTS |
|---|---|---|---|
| Latency | < 280 ms | 300–800 ms | 500–2000 ms |
| Voice cloning | Zero-shot | Fine-tuning required | Varies |
| Multilingual | 28+ languages | 40+ languages | 5–15 languages |
| Streaming | Yes | Partial | Rare |
| MOS | 4.32 | 4.0–4.3 | 3.5–4.0 |
Integration
NueSpeak Apex is deeply integrated into NueForm's platform:
- Form Builder — TTS audio is generated at publish time for all eligible questions.
- Telephony — Real-time synthesis during live phone calls with sub-300ms latency.
- Voice Designer — Create custom voices from text descriptions or audio samples.
- Caching Layer — Frequently used phrases are pre-synthesized and cached for instant delivery.
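The caching behavior can be pictured as a lookup keyed on the voice and the exact text of a segment, so that identical phrases are synthesized only once. This is a minimal in-memory sketch, not NueForm's caching layer. The `SegmentAudioCache` class, the key scheme, and the `synthesize` callback are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


class SegmentAudioCache:
    """Hypothetical per-segment audio cache keyed by voice and text."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize  # called only on a cache miss
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def _key(voice_id: str, text: str) -> str:
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        return hashlib.sha256(f"{voice_id}\x00{text}".encode("utf-8")).hexdigest()

    def get_audio(self, voice_id: str, text: str) -> bytes:
        """Return cached audio, synthesizing and storing it on first request."""
        key = self._key(voice_id, text)
        if key not in self._store:
            self._store[key] = self._synthesize(voice_id, text)
        return self._store[key]


if __name__ == "__main__":
    calls = []

    def fake_synthesize(voice_id: str, text: str) -> bytes:
        calls.append(text)
        return f"<audio:{voice_id}:{text}>".encode("utf-8")

    cache = SegmentAudioCache(fake_synthesize)
    cache.get_audio("voice-1", "What is your name?")
    cache.get_audio("voice-1", "What is your name?")
    print(f"synthesis calls: {len(calls)}")
```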
Audio Quality
NueSpeak Apex generates studio-quality speech at a 24 kHz sample rate. For telephony, audio is downsampled to 8 kHz and encoded as μ-law for phone network delivery while preserving intelligibility.
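The difference between the two formats is easiest to see as data rates. The sketch below assumes mono audio, which the document does not state. It uses the 16-bit depth from the specifications table and the 8 bits per sample that μ-law uses under G.711. The helper name is hypothetical.

```python
# Illustrative data-rate arithmetic; mono audio is an assumption.

def bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Return the raw data rate of an uncompressed audio stream."""
    return sample_rate_hz * bits_per_sample * channels // 8


NATIVE_RATE = bytes_per_second(24_000, 16)    # 24 kHz, 16-bit PCM
TELEPHONY_RATE = bytes_per_second(8_000, 8)   # 8 kHz, 8-bit mu-law

if __name__ == "__main__":
    print(f"native output: {NATIVE_RATE} bytes/s ({NATIVE_RATE * 8 // 1000} kbit/s)")
    print(f"telephony:     {TELEPHONY_RATE} bytes/s ({TELEPHONY_RATE * 8 // 1000} kbit/s)")
    print(f"reduction:     {NATIVE_RATE // TELEPHONY_RATE}x")
```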
The model excels at:
- Spelling and dictation — Clear character-by-character pronunciation for email addresses, names, and codes.
- Numbers and dates — Natural reading of numeric content with appropriate grouping.
- Conversational tone — Responses sound natural and engaging, not robotic.
Privacy & Security
- Voice samples used for cloning are stored encrypted and never shared with third parties.
- Audio is generated on NueForm's dedicated GPU infrastructure — no external API calls.
- Cloned voices are scoped to your account and cannot be accessed by other users.
- Voice data can be deleted at any time from the Telephony settings.