Voice Settings
Voice can be complicated.
Voice Settings & Troubleshooting
Media support in browsers has evolved over time, and it is further complicated by material differences between desktop and mobile browsers.
Rest assured, Sapience takes care of 99% of the complexity for you and chooses sensible default settings based on your device and browser.
If you want to tweak the voice settings we use for dictation, you can find them under Menu > Settings & Customization > Advanced > Voice Settings.
Most users should not touch these settings. For the audiophiles among you - go to town.

Voice Recording Details
Configure how audio is recorded and processed for voice transcription.
Transcription Model
Whisper: widely used in the industry, for example by Whisper Flow and any number of AI voice startups. Handles noisy environments well, which is where most users are (keyboard typing, background air conditioning, etc.).
GPT Transcribe: supports a wider array of languages and makes fewer errors if you have a pristine audio environment, but is less tolerant of noisy audio.
Microphone Settings
Input Sample Rate
The frequency at which audio is captured from your microphone.
Option | Description | Recommendation |
16 kHz | Optimized for speech recognition | Recommended - Whisper/GPT models are trained on 16 kHz audio |
44.1 kHz | CD quality audio | Unnecessary for speech; larger files |
48 kHz | Professional audio standard | Unnecessary for speech; larger files |
Audio Channels
Whether to record in mono (single channel) or stereo (dual channel).
Option | Description | Recommendation |
Mono | Single audio channel | Recommended - Speech only needs one channel; smaller files |
Stereo | Left and right channels | Only useful for music or spatial audio |
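To see why higher sample rates and stereo produce larger files, here is a rough sketch of the arithmetic for uncompressed audio. It assumes 16-bit (2-byte) PCM samples; the function name `pcmBytes` is purely illustrative and is not part of Sapience.

```typescript
// Size in bytes of uncompressed PCM audio.
// Assumption: 16-bit samples (2 bytes each) unless stated otherwise.
function pcmBytes(
  sampleRateHz: number,
  channels: number,
  seconds: number,
  bytesPerSample: number = 2
): number {
  return sampleRateHz * channels * bytesPerSample * seconds;
}

// One minute of audio at each setting:
const speechMono = pcmBytes(16000, 1, 60); // 1,920,000 bytes
const cdStereo = pcmBytes(44100, 2, 60); // 10,584,000 bytes
const proStereo = pcmBytes(48000, 2, 60); // 11,520,000 bytes

console.log({ speechMono, cdStereo, proStereo });
```

Under these assumptions, one minute of 48 kHz stereo is six times the size of one minute of 16 kHz mono, with no benefit for speech.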
Echo Cancellation
Reduces echo from speakers being picked up by the microphone.
- Enable if you're not using headphones and speakers might create feedback
- Recommended: ON - The transcription API does not do this processing
Noise Suppression
Reduces background noise (fans, traffic, ambient sounds).
- Enable for noisy environments
- Recommended: ON - The transcription API does not do this processing
Auto Gain Control
Automatically adjusts microphone volume to maintain consistent levels.
- Enable to normalize volume if you speak at varying distances from the mic
- Recommended: ON - Helps ensure consistent audio levels
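For the curious, the microphone settings above correspond closely to the standard browser audio constraints (`sampleRate`, `channelCount`, `echoCancellation`, `noiseSuppression`, `autoGainControl`). The sketch below shows how such settings could be turned into a constraints object. The `MicSettings` type and `toAudioConstraints` function are hypothetical names for illustration; how Sapience does this internally is not documented here.

```typescript
// Hypothetical shape for the microphone settings on this page.
interface MicSettings {
  inputSampleRateHz: number;
  channels: 1 | 2; // 1 = mono, 2 = stereo
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

// Mirrors the standard audio track constraint names used by browsers.
interface AudioConstraints {
  sampleRate: number;
  channelCount: number;
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

function toAudioConstraints(s: MicSettings): { audio: AudioConstraints } {
  return {
    audio: {
      sampleRate: s.inputSampleRateHz,
      channelCount: s.channels,
      echoCancellation: s.echoCancellation,
      noiseSuppression: s.noiseSuppression,
      autoGainControl: s.autoGainControl,
    },
  };
}

const recommendedMic: MicSettings = {
  inputSampleRateHz: 16000,
  channels: 1,
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
};

// In a browser, this object could be passed to
// navigator.mediaDevices.getUserMedia(...).
console.log(toAudioConstraints(recommendedMic));
```

Browsers treat plain values like these as preferences rather than guarantees, so the device may still capture at a different rate. In a browser, `track.getSettings()` reports what was actually applied.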
Recording Output
Output Format
The audio format used for the recorded file.
Format | Codec | File Size | API Compatibility | Recommendation |
MP3 | MPEG Layer 3 | Small | Best | Recommended - Most reliable with gpt-4o-transcribe |
WebM | Opus | Smallest | Unreliable | May cause "invalid file format" errors with newer models |
WAV | PCM (uncompressed) | Large | Good | Works but creates unnecessarily large files |
Why MP3? OpenAI's gpt-4o-transcribe model has stricter format requirements than whisper-1. WebM/Opus files frequently cause "invalid file format" errors. MP3 provides the best balance of compatibility and file size.
Output Sample Rate
The sample rate of the recorded audio file.
- Should match Input Sample Rate for best quality
- Recommended: 16 kHz - Matches what transcription models expect
Buffer Size
Size of the audio processing buffer. Affects latency vs. stability.
Option | Trade-off |
4096 | Lower latency, may drop audio on slower devices |
8192 | Balanced |
16384 | Recommended - Most stable, slight latency increase |
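To get a feel for the latency side of this trade-off, the sketch below estimates how much audio one buffer holds. It assumes the buffer size is measured in sample frames (as it is for Web Audio processing buffers) and that the buffer runs at the sample rate you pass in; both are assumptions about the implementation, and `bufferDurationMs` is an illustrative name.

```typescript
// Duration of audio held in one buffer, in milliseconds.
// Assumption: bufferSizeFrames counts sample frames at sampleRateHz.
function bufferDurationMs(bufferSizeFrames: number, sampleRateHz: number): number {
  return (bufferSizeFrames * 1000) / sampleRateHz;
}

// At a 16 kHz sample rate:
console.log(bufferDurationMs(4096, 16000)); // 256
console.log(bufferDurationMs(8192, 16000)); // 512
console.log(bufferDurationMs(16384, 16000)); // 1024
```

Larger buffers give the device more time to process each chunk, which is why they are more stable on slower hardware. For dictation, where audio is transcribed after recording rather than streamed live, the added delay is rarely noticeable.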
Output Bitrate
Overall bitrate for the encoded audio.
Option | Quality | File Size |
64 kbps | Lower | Smallest |
96 kbps | Good | Small |
128 kbps | Recommended | Balanced |
192 kbps | High | Larger |
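As a rough guide to what these bitrates mean for file size, the sketch below estimates the size of an encoded recording. It assumes a constant bitrate and ignores container overhead, so real files will differ slightly; `encodedBytes` is an illustrative name.

```typescript
// Approximate size in bytes of audio encoded at a constant bitrate.
// Assumption: 1 kbps = 1000 bits per second; container overhead is ignored.
function encodedBytes(bitrateKbps: number, seconds: number): number {
  return (bitrateKbps * 1000 * seconds) / 8;
}

// One minute of audio at each option:
console.log(encodedBytes(64, 60)); // 480000
console.log(encodedBytes(96, 60)); // 720000
console.log(encodedBytes(128, 60)); // 960000
console.log(encodedBytes(192, 60)); // 1440000
```

Even at 192 kbps, a minute of encoded audio comes to under 1.5 MB, so the choice between these options matters more for upload time on slow connections than for storage.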
Encoder Bitrate
Internal encoder bitrate (primarily affects WebM/Opus encoding).
- Recommended: 96 kbps - Good quality for speech
MP3 Encoding
MP3 Encoder Bitrate
Bitrate used when converting audio to MP3 format.
Option | Use Case |
64 kbps | Minimize file size, acceptable quality |
96 kbps | Good balance |
128 kbps | Recommended - Clear speech quality |
192 kbps | High quality, larger files |
Transcription
Convert to MP3 Before Sending
Automatically converts audio to MP3 before sending to the transcription API.
- Enable if using WebM or WAV output format and experiencing API errors
- Not needed if Output Format is already set to MP3
- Adds slight processing time but ensures API compatibility
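The rule described in the bullets above can be summed up in a few lines. This is only a sketch of the decision, not Sapience's actual code; the `OutputFormat` type and `needsMp3Conversion` function are hypothetical names.

```typescript
type OutputFormat = "mp3" | "webm" | "wav";

// Conversion only does work when the toggle is on AND the recording
// is not already MP3.
function needsMp3Conversion(format: OutputFormat, convertBeforeSending: boolean): boolean {
  return convertBeforeSending && format !== "mp3";
}

console.log(needsMp3Conversion("webm", true)); // true
console.log(needsMp3Conversion("wav", true)); // true
console.log(needsMp3Conversion("mp3", true)); // false
console.log(needsMp3Conversion("webm", false)); // false
```

In other words, leaving the toggle on while recording in MP3 should be harmless under this reading; it simply has nothing to convert.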
Recommended Configuration
For the most reliable voice transcription experience:
Microphone:
Input Sample Rate: 16 kHz
Channels: Mono
Echo Cancellation: ON
Noise Suppression: ON
Auto Gain Control: ON
Recording Output:
Output Format: MP3
Output Sample Rate: 16 kHz
Buffer Size: 16384
Output Bitrate: 128 kbps
Encoder Bitrate: 96 kbps
MP3 Encoding:
MP3 Encoder Bitrate: 128 kbps
Transcription:
Convert to MP3 Before Sending: OFF (not needed when format is MP3)
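If you keep notes on your own setup, the recommended configuration can be written down as a single object and compared against what you currently have. The sketch below is illustrative only: the `VoiceSettings` type, its field names, and the `deviations` helper are hypothetical and do not reflect how Sapience stores its settings.

```typescript
// Hypothetical representation of the settings on this page.
interface VoiceSettings {
  inputSampleRateHz: number;
  channels: "mono" | "stereo";
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
  outputFormat: "mp3" | "webm" | "wav";
  outputSampleRateHz: number;
  bufferSize: number;
  outputBitrateKbps: number;
  encoderBitrateKbps: number;
  mp3EncoderBitrateKbps: number;
  convertToMp3BeforeSending: boolean;
}

// Values taken from the recommended configuration above.
const RECOMMENDED: VoiceSettings = {
  inputSampleRateHz: 16000,
  channels: "mono",
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
  outputFormat: "mp3",
  outputSampleRateHz: 16000,
  bufferSize: 16384,
  outputBitrateKbps: 128,
  encoderBitrateKbps: 96,
  mp3EncoderBitrateKbps: 128,
  convertToMp3BeforeSending: false,
};

// Lists the names of settings that differ from the recommended values.
function deviations(current: VoiceSettings): string[] {
  const keys = Object.keys(RECOMMENDED) as (keyof VoiceSettings)[];
  return keys.filter((k) => current[k] !== RECOMMENDED[k]);
}

const mine: VoiceSettings = { ...RECOMMENDED, outputFormat: "webm", bufferSize: 4096 };
console.log(deviations(mine)); // ["outputFormat", "bufferSize"]
```

A check like this is a quick first step when troubleshooting: anything it lists is a setting worth resetting before digging deeper.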
Troubleshooting
"Invalid file format" errors
- Change Output Format to MP3
- Or enable Convert to MP3 Before Sending
Audio sounds choppy or has gaps
- Increase Buffer Size to 16384
- Close other browser tabs using the microphone
Transcription misses words or is inaccurate
- Enable Noise Suppression to reduce background noise
- Ensure Input Sample Rate is 16 kHz
- Speak clearly and at a consistent distance from the microphone
Recording fails to start
- Check browser permissions for microphone access
- Ensure no other application is using the microphone
- Try refreshing the page
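When a recording fails to start, the browser usually reports why. A microphone request that is rejected carries an error name such as `NotAllowedError` (permission was denied) or `NotReadableError` (the device could not be read, for example because another application is holding it). The sketch below maps those names to the steps listed above; `micErrorHint` is a hypothetical helper for illustration, not something Sapience exposes.

```typescript
// Maps a browser microphone error name to a troubleshooting step
// from this page. Unknown errors fall back to the generic advice.
function micErrorHint(errorName: string): string {
  switch (errorName) {
    case "NotAllowedError":
      return "Check browser permissions for microphone access";
    case "NotReadableError":
      return "Ensure no other application is using the microphone";
    default:
      return "Try refreshing the page";
  }
}

console.log(micErrorHint("NotAllowedError"));
console.log(micErrorHint("NotReadableError"));
console.log(micErrorHint("SomethingElse"));
```

You can usually see the error name in the browser's developer console, which makes it easier to tell a permissions problem from a busy device.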