NueSpeak Apex | NueForm Docs

NueSpeak Apex is NueForm's proprietary text-to-speech engine, purpose-built for conversational form interactions. It powers both the telephony voice agent and the in-browser TTS narration feature.

Architecture

NueSpeak Apex is a transformer-based neural TTS model with a multi-scale generative architecture. It processes text through a three-stage hierarchical pipeline (semantic understanding, prosody prediction, and acoustic waveform synthesis) to produce natural-sounding speech in real time.

Key Specifications

Specification	Value
Model parameters	~1.8 billion
Token generation rate	11 Hz
Supported languages	English, Spanish, French, Chinese, Japanese, Korean, Hindi, Arabic, and 20+ additional languages
Voice cloning	Zero-shot from ≥10 seconds of audio
Latency (time to first audio)	< 280 ms (median)
Real-time factor	0.04 (synthesis runs 25× faster than real time)
Audio output	24 kHz, 16-bit PCM
Streaming	Chunk-based progressive delivery
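
The real-time factor and output format in the table translate into concrete numbers with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes mono output, since the table does not state a channel count.

```python
# Figures taken from the specification table above.
REAL_TIME_FACTOR = 0.04   # synthesis time divided by audio duration
SAMPLE_RATE_HZ = 24_000
BITS_PER_SAMPLE = 16
CHANNELS = 1              # assumption: mono; the table does not say


def synthesis_seconds(audio_seconds: float) -> float:
    """Estimated compute time needed to generate `audio_seconds` of speech."""
    return audio_seconds * REAL_TIME_FACTOR


def pcm_bytes(audio_seconds: float) -> int:
    """Size in bytes of the raw PCM output for `audio_seconds` of speech."""
    bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS
    return round(audio_seconds * bytes_per_second)


if __name__ == "__main__":
    # Ten seconds of speech: roughly 0.4 s of compute, 480000 bytes of PCM.
    print(synthesis_seconds(10.0))
    print(pcm_bytes(10.0))
```

Under these assumptions, a real-time factor of 0.04 is the same statement as "25× faster than real time", because 1 / 0.04 = 25.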

Features

Zero-Shot Voice Cloning

NueSpeak Apex can replicate a speaker's voice characteristics from a single audio sample of 10 seconds or more. The cloning pipeline extracts:

Timbre — The unique tonal quality of the voice
Pitch contour — Natural intonation patterns
Speaking rate — Base cadence and rhythm
Accent characteristics — Regional pronunciation markers

No fine-tuning is required. The cloned voice is available as soon as sample processing finishes, typically within 2–4 seconds.
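
Because cloning needs at least 10 seconds of audio, it can be useful to check a recording's length before uploading it. The following is a minimal client-side sketch using Python's standard `wave` module. It assumes the sample is an uncompressed WAV file; the function names are hypothetical and not part of any NueForm API.

```python
import wave

MIN_SAMPLE_SECONDS = 10.0  # minimum sample length stated in the docs


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_SAMPLE_SECONDS) -> bool:
    """True if the recording meets the minimum length for cloning."""
    return wav_duration_seconds(path) >= minimum
```

A check like this only covers duration. It says nothing about recording quality or background noise, which this page does not specify requirements for.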

Voice Design

Beyond cloning, NueSpeak Apex supports text-based voice design. Describe the voice you want in natural language — for example, "a warm, professional female voice with a slight British accent" — and the engine synthesizes a matching voice profile.

Prosody Control

The engine provides fine-grained control over speech prosody:

Speed — Adjustable from 0.5× to 2.0× normal rate
Emphasis — Mark words or phrases for stress
Pauses — Insert natural pauses of configurable duration
Emotion — Subtle emotional coloring (neutral, warm, energetic, calm)
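
The controls above can be pictured as a small settings object with validation. This is a sketch under stated assumptions: `ProsodySettings` and its field names are hypothetical and do not describe NueForm's actual configuration format. Only the speed range and the four emotion labels come from the list above.

```python
from dataclasses import dataclass
from typing import Tuple

# The four emotion labels listed in the docs.
ALLOWED_EMOTIONS = {"neutral", "warm", "energetic", "calm"}


@dataclass(frozen=True)
class ProsodySettings:
    """Hypothetical container for the prosody controls described above."""

    speed: float = 1.0                 # multiplier of the normal rate
    emotion: str = "neutral"
    pause_ms: int = 0                  # pause duration in milliseconds
    emphasis: Tuple[str, ...] = ()     # words or phrases to stress

    def __post_init__(self) -> None:
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        if self.emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion!r}")
        if self.pause_ms < 0:
            raise ValueError("pause_ms must not be negative")


if __name__ == "__main__":
    settings = ProsodySettings(speed=1.25, emotion="warm", emphasis=("required",))
    print(settings)
```

Validating at construction time means an out-of-range speed such as 3.0 fails early with a clear error instead of reaching the synthesis step.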

Multilingual Synthesis

NueSpeak Apex natively supports 28+ languages without model switching. The engine automatically detects the input language and applies appropriate phoneme mappings, prosody rules, and accent models. Code-switching within a single utterance is supported.

Telephony Optimization

For phone calls, NueSpeak Apex applies additional processing:

8 kHz μ-law encoding for compatibility with PSTN delivery
Noise floor management — Minimizes artifacts audible on phone speakers
Adaptive pacing — Slightly slower delivery for phone contexts to improve comprehension
Caching — Generated audio is cached per text segment, eliminating redundant synthesis
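
Per-segment caching of the kind described in the last item can be sketched as a lookup keyed on the text plus anything that changes how it sounds. The example below is an assumption about how such a cache might work, not a description of NueForm's implementation: `cache_key`, `SegmentCache`, and the `synthesize` callback are all hypothetical names.

```python
import hashlib
from typing import Callable, Dict


def cache_key(text: str, voice_id: str, speed: float = 1.0) -> str:
    """Build a stable key from the text and the settings that affect the audio."""
    payload = f"{voice_id}|{speed:.2f}|{text.strip()}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SegmentCache:
    """In-memory cache that synthesizes each distinct segment only once."""

    def __init__(self, synthesize: Callable[[str, str, float], bytes]) -> None:
        self._synthesize = synthesize       # hypothetical TTS callback
        self._store: Dict[str, bytes] = {}

    def get_audio(self, text: str, voice_id: str, speed: float = 1.0) -> bytes:
        key = cache_key(text, voice_id, speed)
        if key not in self._store:
            # Cache miss: run synthesis once and remember the result.
            self._store[key] = self._synthesize(text, voice_id, speed)
        return self._store[key]
```

Including the voice and speed in the key matters: the same question text read by a different voice is different audio and must not be served from the same cache entry.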

Performance

Benchmarks

Measured on production infrastructure under typical load:

Metric	Value
Mean Opinion Score (MOS)	4.32 / 5.0
Character Error Rate (intelligibility)	3.1%
Time to first byte (P50)	245 ms
Time to first byte (P95)	410 ms
Throughput	4 concurrent streams per instance
Memory footprint	~3.6 GB VRAM
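
The throughput and memory rows support a rough capacity estimate. The sketch below is illustrative arithmetic, not a sizing guide: it assumes the ~3.6 GB footprint applies per instance and that load spreads evenly, neither of which the table states. The function names are hypothetical.

```python
import math

STREAMS_PER_INSTANCE = 4      # from the benchmark table
VRAM_GB_PER_INSTANCE = 3.6    # assumption: the footprint is per instance


def instances_needed(concurrent_calls: int) -> int:
    """Smallest number of instances that can serve the given concurrency."""
    if concurrent_calls < 0:
        raise ValueError("concurrent_calls must not be negative")
    return math.ceil(concurrent_calls / STREAMS_PER_INSTANCE)


def total_vram_gb(concurrent_calls: int) -> float:
    """Approximate VRAM needed across all instances."""
    return instances_needed(concurrent_calls) * VRAM_GB_PER_INSTANCE


if __name__ == "__main__":
    # Ten simultaneous calls need three instances under these assumptions.
    print(instances_needed(10))
    print(total_vram_gb(10))
```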

Comparison with Industry Standards

Feature	NueSpeak Apex	Cloud TTS (typical)	Open-source TTS
Latency	< 280 ms	300–800 ms	500–2000 ms
Voice cloning	Zero-shot	Fine-tuning required	Varies
Multilingual	28+ languages	40+ languages	5–15 languages
Streaming	Yes	Partial	Rare
MOS	4.32	4.0–4.3	3.5–4.0

Integration

NueSpeak Apex is deeply integrated into NueForm's platform:

Form Builder — TTS audio is generated at publish time for all eligible questions.
Telephony — Real-time synthesis during live phone calls with sub-300ms latency.
Voice Designer — Create custom voices from text descriptions or audio samples.
Caching Layer — Frequently used phrases are pre-synthesized and cached for instant delivery.

Audio Quality

NueSpeak Apex generates speech at a 24 kHz sample rate with 16-bit PCM output. For telephony, audio is transcoded to 8 kHz μ-law for delivery over the phone network while preserving intelligibility.
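
To make the telephony transcoding step concrete, here is a sketch of converting 24 kHz 16-bit samples to 8 kHz μ-law bytes. It is not NueForm's pipeline. The encoder follows the widely used G.711 μ-law algorithm, and the downsampler simply averages groups of three samples, a crude stand-in for the proper low-pass resampling filter a production system would use. All function names are hypothetical.

```python
from typing import List, Sequence

BIAS = 0x84    # bias added before encoding in G.711 mu-law
CLIP = 32635   # largest magnitude that can be encoded after biasing


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): the position of the highest set bit,
    # where bit 14 maps to exponent 7 and bit 7 maps to exponent 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # Mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def downsample_by_3(samples: Sequence[int]) -> List[int]:
    """Reduce 24 kHz to 8 kHz by averaging each group of three samples."""
    return [sum(samples[i:i + 3]) // 3 for i in range(0, len(samples) - 2, 3)]


def transcode_to_ulaw(samples_24k: Sequence[int]) -> bytes:
    """Convert 24 kHz 16-bit samples to 8 kHz mu-law bytes."""
    return bytes(linear_to_ulaw(s) for s in downsample_by_3(samples_24k))
```

One second of 24 kHz input (24,000 samples) yields 8,000 μ-law bytes, which matches the 64 kbit/s rate of a standard telephony channel.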

The model excels at:

Spelling and dictation — Clear character-by-character pronunciation for email addresses, names, and codes.
Numbers and dates — Natural reading of numeric content with appropriate grouping.
Conversational tone — Responses sound natural and engaging, not robotic.

Privacy & Security

Voice samples used for cloning are stored encrypted and never shared with third parties.
Audio is generated on NueForm's dedicated GPU infrastructure — no external API calls.
Cloned voices are scoped to your account and cannot be accessed by other users.
Voice data can be deleted at any time from the Telephony settings.