NueForm

NueSpeak Apex


NueSpeak Apex is NueForm's proprietary text-to-speech engine, purpose-built for conversational form interactions. It powers both the telephony voice agent and the in-browser TTS narration feature.

Architecture

NueSpeak Apex is a transformer-based neural TTS model with a multi-scale generative architecture. It processes text through a three-stage hierarchical pipeline (semantic understanding, prosody prediction, and acoustic waveform synthesis) to produce natural-sounding speech in real time.

Key Specifications

Specification                   Value
Model parameters                ~1.8 billion
Token generation rate           11 Hz
Supported languages             English, Spanish, French, Chinese, Japanese, Korean, Hindi, Arabic, and 20+ additional languages
Voice cloning                   Zero-shot from ≥10 seconds of audio
Latency (time to first audio)   < 280 ms (median)
Real-time factor                0.04× (generates 25× faster than real time)
Audio output                    24 kHz, 16-bit PCM
Streaming                       Chunk-based progressive delivery
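
To make the real-time factor and audio format figures concrete, here is a small arithmetic sketch in Python. The constants come from the specifications above; the function names are illustrative, and single-channel (mono) output is an assumption, since the channel count is not stated.

```python
# Figures taken from the specification list above.
REAL_TIME_FACTOR = 0.04   # seconds of compute per second of audio
SAMPLE_RATE_HZ = 24_000   # 24 kHz output
BYTES_PER_SAMPLE = 2      # 16-bit PCM


def synthesis_seconds(audio_seconds: float) -> float:
    """Compute time needed to generate audio of the given duration."""
    return audio_seconds * REAL_TIME_FACTOR


def pcm_bytes(audio_seconds: float) -> int:
    """Raw PCM size for the given duration (assumes mono output)."""
    return round(audio_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


if __name__ == "__main__":
    # A 10-second prompt takes roughly 0.4 s of compute at 0.04x.
    print(f"{synthesis_seconds(10.0):.2f} s")  # prints "0.40 s"
    print(pcm_bytes(10.0))                     # prints 480000
```

Under these assumptions, one second of output is 48,000 bytes of raw PCM before any transcoding or container overhead.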

Features

Zero-Shot Voice Cloning

NueSpeak Apex can replicate a speaker's voice characteristics from a single audio sample of 10 seconds or more. The cloning pipeline extracts:

  • Timbre — The unique tonal quality of the voice
  • Pitch contour — Natural intonation patterns
  • Speaking rate — Base cadence and rhythm
  • Accent characteristics — Regional pronunciation markers

No fine-tuning is required. The cloned voice is available as soon as the sample has been processed, which typically takes 2–4 seconds.

Voice Design

Beyond cloning, NueSpeak Apex supports text-based voice design. Describe the voice you want in natural language — for example, "a warm, professional female voice with a slight British accent" — and the engine synthesizes a matching voice profile.

Prosody Control

The engine provides fine-grained control over speech prosody:

  • Speed — Adjustable from 0.5× to 2.0× normal rate
  • Emphasis — Mark words or phrases for stress
  • Pauses — Insert natural pauses of configurable duration
  • Emotion — Subtle emotional coloring (neutral, warm, energetic, calm)
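
The ranges above could be checked on the client side before a request is sent. The following is a minimal sketch only: `ProsodySettings` and its field names are hypothetical and are not part of any documented NueForm API. Only the speed range and the four emotion names come from the list above.

```python
from dataclasses import dataclass

# Emotion names as listed above.
ALLOWED_EMOTIONS = {"neutral", "warm", "energetic", "calm"}


@dataclass(frozen=True)
class ProsodySettings:
    """Hypothetical container for prosody options (not a NueForm API)."""

    speed: float = 1.0        # multiplier on the normal speaking rate
    emotion: str = "neutral"
    pause_ms: int = 0         # hypothetical field: pause length in milliseconds

    def __post_init__(self) -> None:
        # Speed range 0.5x to 2.0x is taken from the list above.
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError(f"speed must be within 0.5-2.0, got {self.speed}")
        if self.emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"unsupported emotion: {self.emotion!r}")
        if self.pause_ms < 0:
            raise ValueError("pause_ms must be non-negative")


if __name__ == "__main__":
    settings = ProsodySettings(speed=1.25, emotion="warm", pause_ms=300)
    print(settings)
```

Validating early like this keeps out-of-range values from reaching the synthesis step, whatever the real request format turns out to be.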

Multilingual Synthesis

NueSpeak Apex natively supports 28+ languages without model switching. The engine automatically detects the input language and applies appropriate phoneme mappings, prosody rules, and accent models. Code-switching (mixing languages within a single utterance) is supported.

Telephony Optimization

For phone calls, NueSpeak Apex applies additional processing:

  • 8 kHz / μ-law encoding compatibility for PSTN delivery
  • Noise floor management — Minimizes artifacts audible on phone speakers
  • Adaptive pacing — Slightly slower delivery for phone contexts to improve comprehension
  • Caching — Generated audio is cached per text segment, eliminating redundant synthesis
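
The per-segment caching described in the last bullet can be illustrated with a short sketch. This is not NueForm's implementation: the class name, the whitespace normalization, and the SHA-256 key are all assumptions chosen for illustration, and the synthesizer is a stand-in function.

```python
import hashlib
from typing import Callable, Dict


class SegmentAudioCache:
    """Hypothetical in-memory cache keyed by normalized segment text."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # number of times synthesis actually ran

    @staticmethod
    def key_for(text: str) -> str:
        # Collapse runs of whitespace so trivially different inputs share a key.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> bytes:
        key = self.key_for(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text)
        return self._store[key]


if __name__ == "__main__":
    # Stand-in synthesizer: returns the text bytes instead of real audio.
    cache = SegmentAudioCache(lambda text: text.encode("utf-8"))
    cache.get("What is your name?")
    cache.get("What is your name?")
    print(cache.misses)  # prints 1: the second call is served from the cache
```

A real cache key would presumably also need to include the voice and prosody settings, since the same text rendered in a different voice is different audio.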

Performance

Benchmarks

Measured on production infrastructure under typical load:

Metric                                   Value
Mean Opinion Score (MOS)                 4.32 / 5.0
Character Error Rate (intelligibility)   3.1%
Time to first byte (P50)                 245 ms
Time to first byte (P95)                 410 ms
Throughput                               4 concurrent streams per instance
Memory footprint                         ~3.6 GB VRAM
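
The throughput and memory figures lend themselves to rough capacity planning. The sketch below is back-of-the-envelope arithmetic only. It assumes that the ~3.6 GB footprint applies once per instance and that load spreads evenly, neither of which is stated in the benchmarks. The function names are illustrative.

```python
import math

# Figures taken from the benchmark list above.
STREAMS_PER_INSTANCE = 4
VRAM_GB_PER_INSTANCE = 3.6  # assumed to be per instance


def instances_needed(concurrent_calls: int) -> int:
    """Smallest number of instances that can serve the given call count."""
    if concurrent_calls < 0:
        raise ValueError("concurrent_calls must be non-negative")
    return math.ceil(concurrent_calls / STREAMS_PER_INSTANCE)


def total_vram_gb(concurrent_calls: int) -> float:
    """Approximate total VRAM across all required instances."""
    return instances_needed(concurrent_calls) * VRAM_GB_PER_INSTANCE


if __name__ == "__main__":
    print(instances_needed(10))          # prints 3
    print(f"{total_vram_gb(10):.1f} GB")  # prints "10.8 GB"
```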

Comparison with Industry Standards

Feature         NueSpeak Apex   Cloud TTS (typical)    Open-source TTS
Latency         < 280 ms        300–800 ms             500–2000 ms
Voice cloning   Zero-shot       Fine-tuning required   Varies
Multilingual    28+ languages   40+ languages          5–15 languages
Streaming       Yes             Partial                Rare
MOS score       4.32            4.0–4.3                3.5–4.0

Integration

NueSpeak Apex is deeply integrated into NueForm's platform:

  • Form Builder — TTS audio is generated at publish time for all eligible questions.
  • Telephony — Real-time synthesis during live phone calls with sub-300ms latency.
  • Voice Designer — Create custom voices from text descriptions or audio samples.
  • Caching Layer — Frequently used phrases are pre-synthesized and cached for instant delivery.

Audio Quality

NueSpeak Apex generates studio-quality speech at a 24 kHz sample rate. For telephony, audio is transcoded to 8 kHz μ-law, the encoding used for phone network delivery, while preserving intelligibility.
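
As a rough illustration of what the 24 kHz to 8 kHz μ-law step involves, the sketch below reduces the sample rate by a factor of three and applies the continuous μ-law companding curve with μ = 255. It is not NueForm's transcoder. Averaging groups of three samples is a crude stand-in for a proper anti-aliasing resampler, and real G.711 encoders use a piecewise-linear 8-bit approximation of this curve. The function names are illustrative.

```python
import math
from typing import List, Sequence

MU = 255  # μ-law parameter used for 8-bit telephony companding


def decimate_24k_to_8k(samples: Sequence[float]) -> List[float]:
    """Average each group of three samples (crude low-pass, 3:1 decimation)."""
    usable = len(samples) - len(samples) % 3  # drop a trailing partial group
    return [sum(samples[i:i + 3]) / 3 for i in range(0, usable, 3)]


def mu_law_compress(x: float) -> float:
    """Continuous μ-law companding of one sample in the range [-1.0, 1.0]."""
    x = max(-1.0, min(1.0, x))  # clamp out-of-range input
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


if __name__ == "__main__":
    one_second = [0.1] * 24_000               # one second of 24 kHz audio
    narrowband = decimate_24k_to_8k(one_second)
    print(len(narrowband))                    # prints 8000
    companded = [mu_law_compress(s) for s in narrowband]
    print(companded[0] > narrowband[0])       # prints True: quiet samples are boosted
```

The companding curve gives more of the available resolution to quiet samples, which is why 8-bit μ-law speech stays intelligible despite the narrow bandwidth.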

The model excels at:

  • Spelling and dictation — Clear character-by-character pronunciation for email addresses, names, and codes.
  • Numbers and dates — Natural reading of numeric content with appropriate grouping.
  • Conversational tone — Responses sound natural and engaging, not robotic.

Privacy & Security

  • Voice samples used for cloning are stored encrypted and never shared with third parties.
  • Audio is generated on NueForm's dedicated GPU infrastructure — no external API calls.
  • Cloned voices are scoped to your account and cannot be accessed by other users.
  • Voice data can be deleted at any time from the Telephony settings.

Last updated: April 6, 2026