Audio transcription and speech generation
Transcribe audio files into text and generate spoken audio from text using AI models.
Why Use AI Audio
Your application needs to handle audio: a meeting recording you want to summarize, a voice memo you want to search, or a written response you want to read aloud. Building this from scratch means integrating directly with a speech provider, managing API keys, handling file uploads, and writing format conversion code.
Squid AI Audio gives you both transcription and text-to-speech in a single backend call:
- TypeScript
- Python
// Transcribe an audio file to text
const text = await this.squid.ai().audio().transcribe(audioFile, {
  modelName: 'whisper-1',
});

// Generate speech from text
const speechFile = await this.squid.ai().audio().createSpeech('Welcome to Squid', {
  modelName: 'tts-1',
  voice: 'nova',
});
# Transcribe an audio file to text
text = await self.squid.ai().audio().transcribe(
    audio_bytes,
    'recording.mp3',
    'audio/mpeg',
    options={'modelName': 'whisper-1'},
)

# Generate speech from text
speech_bytes = await self.squid.ai().audio().create_speech(
    'Welcome to Squid',
    {'modelName': 'tts-1', 'voice': 'nova'},
)
Overview
Squid AI Audio wraps OpenAI's Whisper (speech-to-text) and TTS (text-to-speech) models behind a single API. Your backend calls a method, and Squid handles authentication, file upload, format conversion, and response decoding.
When to use AI Audio
| Use Case | Recommendation |
|---|---|
| Convert spoken audio into searchable text | transcribe() |
| Read text aloud in a synthesized voice | createSpeech() |
| Build voice input into an AI agent chat experience | Use the enable-transcription flag on the AI chat widget |
| Custom voice for an AI agent | Use agents with voice options |
How it works
- You call this.squid.ai().audio().transcribe() or .createSpeech() from a backend service
- The Squid backend authenticates the request, looks up the OpenAI API key from your application configuration, and forwards the call
- For transcription, the audio file streams to the model and the transcript returns as text
- For speech generation, the model returns binary audio that Squid wraps in a File object (TypeScript) or bytes (Python)
Quick Start
Prerequisites
- A Squid backend project initialized with squid init
- The @squidcloud/backend package (TypeScript) or the squidcloud-backend package (Python)
- OpenAI configured as an AI provider on your application in the Squid Console
Step 1: Create an executable that wraps the audio call
Audio operations require admin access to your Squid resources, so they must run on the backend. Wrap them in an executable so the client can call them safely.
- TypeScript
- Python
import { executable, SquidFile, SquidService } from '@squidcloud/backend';

export class AudioService extends SquidService {
  @executable()
  async transcribeAudio(audio: SquidFile): Promise<string> {
    this.assertIsAuthenticated();
    // SquidFile carries the original filename and MIME type from the client.
    // Convert it to a native File so the audio client can stream it to OpenAI.
    const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
    return this.squid.ai().audio().transcribe(file, {
      modelName: 'whisper-1',
    });
  }
}
from squidcloud_backend import SquidFile, SquidService, executable

class AudioService(SquidService):
    @executable()
    async def transcribe_audio(self, audio: SquidFile) -> str:
        self.assert_is_authenticated()
        # SquidFile is a TypedDict carrying the file bytes plus metadata.
        return await self.squid.ai().audio().transcribe(
            audio['data'],
            audio['originalName'],
            audio['mimetype'],
            options={'modelName': 'whisper-1'},
        )
Step 2: Deploy or run the backend locally
squid start
To deploy to the cloud, see deploying your backend.
Step 3: Call the executable from the client
const fileInput = document.querySelector('input[type="file"]') as HTMLInputElement;
const audioFile = fileInput.files![0];
const transcript = await squid.executeFunction('transcribeAudio', audioFile);
console.log(transcript);
Authentication and Configuration
Audio methods require an authenticated Squid client with admin access. This is true for both transcribe() and createSpeech(). There are two recommended patterns:
- Wrap calls in executables. This is the standard pattern. The executable runs on the backend with admin context, and the client never sees the underlying API key. See executables.
- Call from a privileged backend service. Triggers, schedulers, webhooks, and other backend-only entry points already run with backend privileges, so you can call audio methods directly from them.
Calls from an unauthenticated browser client are rejected with an UNAUTHORIZED error.
OpenAI must be enabled as an external service in your Squid app. The API key is stored in the Squid Console; the audio client looks it up at request time.
Core Concepts
Transcription models
Squid supports three OpenAI transcription models:
| Model | Notes |
|---|---|
| whisper-1 | Default. Supports the widest set of response formats at the lowest cost. |
| gpt-4o-transcribe | Higher accuracy. Returns JSON only. |
| gpt-4o-mini-transcribe | Smaller and cheaper than gpt-4o-transcribe. Returns JSON only. |
Transcription options
AiAudioTranscribeOptions is a discriminated union on modelName. All variants share the following base fields:
| Field | Type | Required | Description |
|---|---|---|---|
| modelName | string | Yes | One of the supported transcription models |
| temperature | number | No | Sampling temperature |
| prompt | string | No | Optional text to guide the transcription (e.g. proper nouns) |
whisper-1 additionally supports:
| Field | Type | Description |
|---|---|---|
| responseFormat | string | One of 'json', 'text', 'srt', 'verbose_json', 'vtt'. Defaults to 'json'. |
gpt-4o-transcribe and gpt-4o-mini-transcribe only return 'json'.
The method always returns a plain string, regardless of the model's underlying response format.
Speech generation models
| Model | Notes |
|---|---|
| tts-1 | Faster, lower latency, slightly lower fidelity |
| tts-1-hd | Higher fidelity, slower |
| gpt-4o-mini-tts | Newer GPT-4o family TTS model |
Speech generation options
AiAudioCreateSpeechOptions fields:
| Field | Type | Required | Description |
|---|---|---|---|
| modelName | string | Yes | One of the supported TTS models |
| voice | string | No | One of 'alloy', 'ash', 'ballad', 'coral', 'echo', 'fable', 'onyx', 'nova', 'sage', 'shimmer', 'verse'. Defaults to 'alloy'. |
| responseFormat | string | No | Audio container format. One of 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'. Defaults to 'mp3'. |
| instructions | string | No | Free-form guidance for the speech style (e.g. "speak slowly and clearly") |
| speed | number | No | Playback speed multiplier. Defaults to 1.0. |
Return shapes
| Method | TypeScript return | Python return |
|---|---|---|
| transcribe() | Promise<string> | str |
| createSpeech() | Promise<File> | bytes |
In TypeScript the file's MIME type and extension are inferred from responseFormat. For example, a request with responseFormat: 'mp3' returns a file named audio.mp3 with MIME type audio/mpeg.
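The docs above only pin down the mp3 case; the sketch below shows one plausible shape for that inference, pairing each format name with its conventional MIME type. The mapping for formats other than mp3, and the helper itself, are assumptions for illustration, not Squid's actual implementation.

```typescript
// Hypothetical sketch: infer a filename and MIME type from responseFormat,
// consistent with the documented mp3 behavior (audio.mp3 / audio/mpeg).
const FORMAT_TO_MIME: Record<string, string> = {
  mp3: 'audio/mpeg',
  opus: 'audio/opus',
  aac: 'audio/aac',
  flac: 'audio/flac',
  wav: 'audio/wav',
};

function inferredFileMeta(responseFormat: string): { name: string; type: string } {
  return {
    name: `audio.${responseFormat}`,
    // Fall back to a generic binary type for formats without a common MIME name.
    type: FORMAT_TO_MIME[responseFormat] ?? 'application/octet-stream',
  };
}
```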
Code Examples
Transcribe audio uploaded by a client
- TypeScript
- Python
import { executable, SquidFile, SquidService } from '@squidcloud/backend';

export class AudioService extends SquidService {
  @executable()
  async transcribeRecording(audio: SquidFile, languageHint?: string): Promise<string> {
    this.assertIsAuthenticated();
    const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
    return this.squid.ai().audio().transcribe(file, {
      modelName: 'whisper-1',
      // Use the prompt to bias the model toward correct spellings or terminology.
      prompt: languageHint,
    });
  }
}
from squidcloud_backend import SquidFile, SquidService, executable

class AudioService(SquidService):
    @executable()
    async def transcribe_recording(
        self,
        audio: SquidFile,
        language_hint: str | None = None,
    ) -> str:
        self.assert_is_authenticated()
        return await self.squid.ai().audio().transcribe(
            audio['data'],
            audio['originalName'],
            audio['mimetype'],
            options={
                'modelName': 'whisper-1',
                # Use the prompt to bias the model toward correct spellings.
                'prompt': language_hint,
            },
        )
Generate spoken audio and return it to the client
This example creates an MP3 file from text and returns it as base64 so the client can play it.
- TypeScript
- Python
import { executable, SquidService } from '@squidcloud/backend';

export class SpeechService extends SquidService {
  @executable()
  async narrate(text: string): Promise<{ base64: string; mimeType: string }> {
    this.assertIsAuthenticated();
    const audioFile = await this.squid.ai().audio().createSpeech(text, {
      modelName: 'tts-1-hd',
      voice: 'nova',
      responseFormat: 'mp3',
      speed: 1.0,
    });
    // Convert the File to base64 so it can travel back to the client over JSON.
    const buffer = Buffer.from(await audioFile.arrayBuffer());
    return {
      base64: buffer.toString('base64'),
      mimeType: audioFile.type,
    };
  }
}
For long-lived storage of generated audio, upload the result to a storage connector instead of returning it inline.
import base64

from squidcloud_backend import SquidService, executable

class SpeechService(SquidService):
    @executable()
    async def narrate(self, text: str) -> dict:
        self.assert_is_authenticated()
        audio_bytes = await self.squid.ai().audio().create_speech(
            text,
            {
                'modelName': 'tts-1-hd',
                'voice': 'nova',
                'responseFormat': 'mp3',
                'speed': 1.0,
            },
        )
        # Return the audio as base64 so it can travel back to the client over JSON.
        return {
            'base64': base64.b64encode(audio_bytes).decode('ascii'),
            'mimeType': 'audio/mpeg',
        }
Round-trip: generate speech then transcribe it back
A useful pattern for integration tests or sanity checks.
- TypeScript
- Python
const speechFile = await this.squid.ai().audio().createSpeech('The quick brown fox jumps over the lazy dog.', {
  modelName: 'tts-1',
});
const transcript = await this.squid.ai().audio().transcribe(speechFile, {
  modelName: 'whisper-1',
});
console.log(transcript); // "The quick brown fox jumps over the lazy dog."
audio_bytes = await self.squid.ai().audio().create_speech(
    'The quick brown fox jumps over the lazy dog.',
    {'modelName': 'tts-1'},
)
transcript = await self.squid.ai().audio().transcribe(
    audio_bytes,
    'speech.mp3',
    'audio/mpeg',
    options={'modelName': 'whisper-1'},
)
print(transcript)  # "The quick brown fox jumps over the lazy dog."
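In practice the transcript rarely matches the input byte-for-byte: casing, punctuation, and whitespace can differ between runs. An integration test is more robust if it compares normalized text. A sketch of one such normalization (a hypothetical helper, not part of the Squid SDK):

```typescript
// Normalize text for a loose transcript comparison: lowercase, strip
// punctuation, and collapse runs of whitespace.
function normalizeTranscript(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '') // drop everything but letters, digits, whitespace
    .trim()
    .replace(/\s+/g, ' ');
}

// An assertion like this tolerates punctuation and casing drift:
//   normalizeTranscript(transcript) === normalizeTranscript(originalText)
```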
Error Handling
Common errors
| Error | Cause | Solution |
|---|---|---|
| UNAUTHORIZED | Audio call originated from a non-admin context (e.g. browser without backend) | Wrap the call in an executable or call from backend code |
| Unsupported audio model | modelName is not in the supported list | Use one of the listed transcription or TTS models |
| OpenAI external services are disabled | OpenAI is not enabled for the application | Enable OpenAI in the Squid Console and configure the API key |
| File too large | Uploaded audio exceeds the upstream provider limit | Whisper has a 25 MB upload limit. Split or compress audio above this size. |
Validate input before calling
The OpenAI Whisper API rejects files larger than 25 MB. Reject oversized uploads in your executable rather than waiting for the upstream error:
- TypeScript
- Python
@executable()
async transcribeAudio(audio: SquidFile): Promise<string> {
  this.assertIsAuthenticated();
  const MAX_BYTES = 25 * 1024 * 1024;
  if (audio.size > MAX_BYTES) {
    throw new Error('Audio file exceeds 25 MB. Split or compress before transcribing.');
  }
  if (!audio.mimetype.startsWith('audio/')) {
    throw new Error('Only audio files are accepted');
  }
  const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
  return this.squid.ai().audio().transcribe(file, { modelName: 'whisper-1' });
}
@executable()
async def transcribe_audio(self, audio: SquidFile) -> str:
    self.assert_is_authenticated()
    MAX_BYTES = 25 * 1024 * 1024
    if audio['size'] > MAX_BYTES:
        raise ValueError('Audio file exceeds 25 MB. Split or compress before transcribing.')
    if not audio['mimetype'].startswith('audio/'):
        raise ValueError('Only audio files are accepted')
    return await self.squid.ai().audio().transcribe(
        audio['data'],
        audio['originalName'],
        audio['mimetype'],
        options={'modelName': 'whisper-1'},
    )
Best Practices
- Always wrap audio calls in executables. Audio methods require admin access. Exposing them directly from a browser leaks API keys and bypasses authentication.
- Validate file size and MIME type before calling transcribe(). The upstream provider has its own limits, and rejecting bad input early gives users better error messages.
- Use the prompt field on transcription to feed in domain vocabulary, proper nouns, or expected language. This is the cheapest way to improve accuracy.
- Pick tts-1 for streaming or low-latency UX, and tts-1-hd for offline assets where quality matters more than latency.
- Apply rate limiting to your audio executables. Speech generation and transcription both consume paid API quota.
- Cache generated speech. Audio generated for the same text, model, and voice settings can be reused, so caching repeated requests saves quota.
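One way to implement the caching advice above is to key the cache on a hash of every input that affects the output: text, model, voice, format, and speed. A sketch using Node's built-in crypto module (the helper name and the defaults it fills in mirror the option defaults documented earlier, but the helper itself is illustrative):

```typescript
import { createHash } from 'node:crypto';

interface SpeechOptions {
  modelName: string;
  voice?: string;
  responseFormat?: string;
  speed?: number;
}

// Build a deterministic cache key from every input that affects the audio.
// Defaults are filled in so that explicit and implicit defaults hash alike.
function speechCacheKey(text: string, options: SpeechOptions): string {
  const canonical = JSON.stringify({
    text,
    modelName: options.modelName,
    voice: options.voice ?? 'alloy',
    responseFormat: options.responseFormat ?? 'mp3',
    speed: options.speed ?? 1.0,
  });
  return createHash('sha256').update(canonical).digest('hex');
}

// Before calling createSpeech, look this key up in a store (e.g. a Squid
// collection or a storage bucket); generate and store the audio only on a miss.
```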
See Also
- Executables - Wrap audio calls so they can be called from a client
- AI agent - Build an AI agent with voice input and output
- AI chat widget - The chat widget exposes voice input via enable-transcription
- Rate and quota limiting - Protect audio executables from abuse