Audio transcription and speech generation

Transcribe audio files into text and generate spoken audio from text using AI models.

Why Use AI Audio

Your application needs to handle audio: a meeting recording you want to summarize, a voice memo you want to search, or a written response you want to read aloud. Building this from scratch means integrating directly with a speech provider, managing API keys, handling file uploads, and writing format conversion code.

Squid AI Audio gives you both transcription and text-to-speech in a single backend call:

Backend code
// Transcribe an audio file to text
const text = await this.squid.ai().audio().transcribe(audioFile, {
  modelName: 'whisper-1',
});

// Generate speech from text
const speechFile = await this.squid.ai().audio().createSpeech('Welcome to Squid', {
  modelName: 'tts-1',
  voice: 'nova',
});

Overview

Squid AI Audio wraps OpenAI's Whisper (speech-to-text) and TTS (text-to-speech) models behind a single API. Your backend calls a method, and Squid handles authentication, file upload, format conversion, and response decoding.

When to use AI Audio

Use Case                                           | Recommendation
Convert spoken audio into searchable text          | transcribe()
Read text aloud in a synthesized voice             | createSpeech()
Build voice input into an AI agent chat experience | Use the enable-transcription flag on the AI chat widget
Custom voice for an AI agent                       | Use agents with voice options

How it works

  1. You call this.squid.ai().audio().transcribe() or .createSpeech() from a backend service
  2. The Squid backend authenticates the request, looks up the OpenAI API key from your application configuration, and forwards the call
  3. For transcription, the audio file streams to the model and the transcript returns as text
  4. For speech generation, the model returns binary audio that Squid wraps in a File object (TypeScript) or bytes (Python)

Quick Start

Prerequisites

  • A Squid backend project initialized with squid init
  • The @squidcloud/backend package (TypeScript) or squidcloud-backend package (Python)
  • OpenAI configured as an AI provider on your application in the Squid Console

Step 1: Create an executable that wraps the audio call

Audio operations require admin access to your Squid resources, so they must run on the backend. Wrap them in an executable so the client can call them safely.

Backend code
import { executable, SquidFile, SquidService } from '@squidcloud/backend';

export class AudioService extends SquidService {
  @executable()
  async transcribeAudio(audio: SquidFile): Promise<string> {
    this.assertIsAuthenticated();

    // SquidFile carries the original filename and MIME type from the client.
    // Convert it to a native File so the audio client can stream it to OpenAI.
    const file = new File([audio.data], audio.originalName, { type: audio.mimetype });

    return this.squid.ai().audio().transcribe(file, {
      modelName: 'whisper-1',
    });
  }
}

Step 2: Deploy or run the backend locally

squid start

To deploy to the cloud, see deploying your backend.

Step 3: Call the executable from the client

Client code
const fileInput = document.querySelector('input[type="file"]') as HTMLInputElement;
const audioFile = fileInput.files![0];

const transcript = await squid.executeFunction('transcribeAudio', audioFile);
console.log(transcript);

Authentication and Configuration

Audio methods require an authenticated Squid client with admin access. This is true for both transcribe() and createSpeech(). There are two recommended patterns:

  1. Wrap calls in executables. This is the standard pattern. The executable runs on the backend with admin context, and the client never sees the underlying API key. See executables.
  2. Call from a privileged backend service. Triggers, schedulers, webhooks, and other backend-only entry points already run with backend privileges, so you can call audio methods directly from them (see the scheduler sketch below).

Calls from an unauthenticated browser client are rejected with an UNAUTHORIZED error.
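
As an example of the second pattern, a scheduler can call the audio client directly, with no executable wrapper. This is a minimal sketch: it assumes the @scheduler decorator and CronExpression helper from @squidcloud/backend, and the scheduler name and narrated text are illustrative.

Backend code
import { CronExpression, scheduler, SquidService } from '@squidcloud/backend';

export class DigestService extends SquidService {
  // Backend-only entry points like schedulers already run with admin
  // privileges, so the audio client can be called directly.
  @scheduler('daily-digest', CronExpression.EVERY_DAY_AT_MIDNIGHT)
  async narrateDigest(): Promise<void> {
    const audioFile = await this.squid.ai().audio().createSpeech('Here is your daily digest.', {
      modelName: 'tts-1',
    });
    console.log(`Generated ${audioFile.name} (${audioFile.type}, ${audioFile.size} bytes)`);
  }
}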

OpenAI must be enabled as an external service in your Squid app. The API key is stored in the Squid Console; the audio client looks it up at request time.

Core Concepts

Transcription models

Squid supports three OpenAI transcription models:

Model                  | Notes
whisper-1              | Default. Supports the widest set of response formats and has the lowest cost.
gpt-4o-transcribe      | Higher accuracy. Returns JSON only.
gpt-4o-mini-transcribe | Smaller and cheaper than gpt-4o-transcribe. Returns JSON only.

Transcription options

AiAudioTranscribeOptions is a discriminated union on modelName. All variants share the following base fields:

Field       | Type   | Required | Description
modelName   | string | Yes      | One of the supported transcription models
temperature | number | No       | Sampling temperature
prompt      | string | No       | Optional text to guide the transcription (e.g. proper nouns)

whisper-1 additionally supports:

Field          | Type   | Description
responseFormat | string | One of 'json', 'text', 'srt', 'verbose_json', 'vtt'. Defaults to 'json'.

gpt-4o-transcribe and gpt-4o-mini-transcribe only return 'json'.

The method always returns a plain string, regardless of the model's underlying response format.
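
To make the distinction concrete, here is a short sketch of both model families; file is a native File as in the Quick Start, and the prompt text is illustrative.

Backend code
// whisper-1 accepts a responseFormat; the result still arrives as a string.
const subtitles = await this.squid.ai().audio().transcribe(file, {
  modelName: 'whisper-1',
  responseFormat: 'srt',
});

// The gpt-4o transcription models only return JSON, so pass just the base fields.
const transcript = await this.squid.ai().audio().transcribe(file, {
  modelName: 'gpt-4o-transcribe',
  prompt: 'Squid, SquidService, executable',
});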

Speech generation models

Model           | Notes
tts-1           | Faster, lower latency, slightly lower fidelity
tts-1-hd        | Higher fidelity, slower
gpt-4o-mini-tts | Newer GPT-4o family TTS model

Speech generation options

AiAudioCreateSpeechOptions fields:

Field          | Type   | Required | Description
modelName      | string | Yes      | One of the supported TTS models
voice          | string | No       | One of 'alloy', 'ash', 'ballad', 'coral', 'echo', 'fable', 'onyx', 'nova', 'sage', 'shimmer', 'verse'. Defaults to 'alloy'.
responseFormat | string | No       | Audio container format. One of 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'. Defaults to 'mp3'.
instructions   | string | No       | Free-form guidance for the speech style (e.g. "speak slowly and clearly")
speed          | number | No       | Playback speed multiplier. Defaults to 1.0.
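
A short sketch combining these options; the announcement text and instructions wording are illustrative.

Backend code
const announcement = await this.squid.ai().audio().createSpeech('Your export has finished.', {
  modelName: 'gpt-4o-mini-tts',
  voice: 'sage',
  responseFormat: 'wav',
  instructions: 'Speak slowly and clearly, with a calm tone.',
});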

Return shapes

Method         | TypeScript return | Python return
transcribe()   | Promise<string>   | str
createSpeech() | Promise<File>     | bytes

In TypeScript the file's MIME type and extension are inferred from responseFormat. For example, a request with responseFormat: 'mp3' returns a file named audio.mp3 with MIME type audio/mpeg.
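
A quick way to confirm this is to inspect the returned File:

Backend code
const audioFile = await this.squid.ai().audio().createSpeech('Hello', {
  modelName: 'tts-1',
  responseFormat: 'mp3',
});

console.log(audioFile.name); // "audio.mp3"
console.log(audioFile.type); // "audio/mpeg"
console.log(audioFile.size); // size of the generated audio in bytes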

Code Examples

Transcribe audio uploaded by a client

Backend code
import { executable, SquidFile, SquidService } from '@squidcloud/backend';

export class AudioService extends SquidService {
  @executable()
  async transcribeRecording(audio: SquidFile, languageHint?: string): Promise<string> {
    this.assertIsAuthenticated();

    const file = new File([audio.data], audio.originalName, { type: audio.mimetype });

    return this.squid.ai().audio().transcribe(file, {
      modelName: 'whisper-1',
      // Use the prompt to bias the model toward correct spellings or terminology.
      prompt: languageHint,
    });
  }
}

Generate spoken audio and return it to the client

This example creates an MP3 file from text and returns it as base64 so the client can play it.

Backend code
import { executable, SquidService } from '@squidcloud/backend';

export class SpeechService extends SquidService {
  @executable()
  async narrate(text: string): Promise<{ base64: string; mimeType: string }> {
    this.assertIsAuthenticated();

    const audioFile = await this.squid.ai().audio().createSpeech(text, {
      modelName: 'tts-1-hd',
      voice: 'nova',
      responseFormat: 'mp3',
      speed: 1.0,
    });

    // Convert the File to base64 so it can travel back to the client over JSON.
    const buffer = Buffer.from(await audioFile.arrayBuffer());
    return {
      base64: buffer.toString('base64'),
      mimeType: audioFile.type,
    };
  }
}
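
On the client, decode the base64 payload and play it with standard Web APIs. A minimal sketch; the narration text is illustrative.

Client code
const { base64, mimeType } = await squid.executeFunction('narrate', 'Welcome back!');

// Decode base64 into bytes, wrap them in a Blob, and play.
const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
const blob = new Blob([bytes], { type: mimeType });
const audio = new Audio(URL.createObjectURL(blob));
await audio.play();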

For long-lived storage of generated audio, upload the result to a storage connector instead of returning it inline.
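
A sketch of that approach, assuming a storage integration with the hypothetical ID 'audio-storage' and the uploadFile method on Squid's storage client; check the storage connector docs for the exact signature in your SDK version.

Backend code
const audioFile = await this.squid.ai().audio().createSpeech(text, {
  modelName: 'tts-1-hd',
  voice: 'nova',
});

// 'audio-storage' is a hypothetical storage integration ID; 'narrations' is a
// directory inside the bucket.
await this.squid.storage('audio-storage').uploadFile('narrations', audioFile);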

Round-trip: generate speech then transcribe it back

A useful pattern for integration tests or sanity checks.

Backend code
const speechFile = await this.squid.ai().audio().createSpeech('The quick brown fox jumps over the lazy dog.', {
  modelName: 'tts-1',
});

const transcript = await this.squid.ai().audio().transcribe(speechFile, {
  modelName: 'whisper-1',
});

console.log(transcript); // "The quick brown fox jumps over the lazy dog."

Error Handling

Common errors

Error                                 | Cause                                                             | Solution
UNAUTHORIZED                          | Audio call originated from a non-admin context (e.g. a browser without a backend) | Wrap the call in an executable or call from backend code
Unsupported audio model               | modelName is not in the supported list                            | Use one of the listed transcription or TTS models
OpenAI external services are disabled | OpenAI is not enabled for the application                         | Enable OpenAI in the Squid Console and configure the API key
File too large                        | Uploaded audio exceeds the upstream provider limit                | Whisper has a 25 MB upload limit; split or compress audio above this size
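
All of these surface as thrown errors, so a try/catch around the call lets you log the upstream failure and return a user-safe message. A minimal sketch:

Backend code
try {
  return await this.squid.ai().audio().transcribe(file, { modelName: 'whisper-1' });
} catch (error) {
  // Keep provider details in the logs; show users a generic message.
  console.error('Transcription failed', error);
  throw new Error('Could not transcribe the audio file. Please try again.');
}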

Validate input before calling

The OpenAI Whisper API rejects files larger than 25 MB. Reject oversized uploads in your executable rather than waiting for the upstream error:

Backend code
@executable()
async transcribeAudio(audio: SquidFile): Promise<string> {
  this.assertIsAuthenticated();

  const MAX_BYTES = 25 * 1024 * 1024;
  if (audio.size > MAX_BYTES) {
    throw new Error('Audio file exceeds 25 MB. Split or compress before transcribing.');
  }

  if (!audio.mimetype.startsWith('audio/')) {
    throw new Error('Only audio files are accepted');
  }

  const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
  return this.squid.ai().audio().transcribe(file, { modelName: 'whisper-1' });
}

Best Practices

  1. Always wrap audio calls in executables. Audio methods require admin access. Exposing them directly from a browser leaks API keys and bypasses authentication.
  2. Validate file size and MIME type before calling transcribe(). The upstream provider has its own limits, and rejecting bad input early gives users better error messages.
  3. Use the prompt field on transcription to feed in domain vocabulary, proper nouns, or expected language. This is the cheapest way to improve accuracy.
  4. Pick tts-1 for streaming or low-latency UX, and tts-1-hd for offline assets where quality matters more than latency.
  5. Apply rate limiting to your audio executables. Speech generation and transcription both consume paid API quota.
  6. Cache generated speech. Repeated requests for the same text, voice, and model spend quota to regenerate the same content, so cache the result and reuse it (see the sketch after this list).
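
As a sketch of the last point, here is an in-memory cache keyed by model, voice, and text; the service and method names are illustrative, and you would swap in a database or storage connector for anything beyond a single backend instance.

Backend code
import { SquidService } from '@squidcloud/backend';

export class CachedSpeechService extends SquidService {
  // In-memory cache; entries live for the lifetime of the backend process.
  private readonly speechCache = new Map<string, File>();

  async cachedSpeech(text: string, voice: 'alloy' | 'nova' = 'alloy'): Promise<File> {
    const key = `tts-1:${voice}:${text}`;
    const cached = this.speechCache.get(key);
    if (cached) return cached;

    const audioFile = await this.squid.ai().audio().createSpeech(text, {
      modelName: 'tts-1',
      voice,
    });
    this.speechCache.set(key, audioFile);
    return audioFile;
  }
}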

See Also