Audio transcription and speech generation
Transcribe audio files into text and generate spoken audio from text using AI models.
Why Use AI Audio
Your application needs to handle audio: a meeting recording you want to summarize, a voice memo you want to search, or a written response you want to read aloud. Building this from scratch means integrating directly with a speech provider, managing API keys, handling file uploads, and writing format conversion code.
Squid AI Audio gives you both transcription and text-to-speech in a single backend call:
- TypeScript
- Python
// Transcribe an audio file to text
const text = await this.squid.ai().audio().transcribe(audioFile, {
  modelName: 'whisper-1',
});

// Generate speech from text
const speechFile = await this.squid.ai().audio().createSpeech('Welcome to Squid', {
  modelName: 'tts-1',
  voice: 'nova',
});
# Transcribe an audio file to text
text = await self.squid.ai().audio().transcribe(
    audio_bytes,
    'recording.mp3',
    'audio/mpeg',
    options={'modelName': 'whisper-1'},
)

# Generate speech from text
speech_bytes = await self.squid.ai().audio().create_speech(
    'Welcome to Squid',
    {'modelName': 'tts-1', 'voice': 'nova'},
)
Overview
Squid AI Audio wraps OpenAI's Whisper (speech-to-text) and TTS (text-to-speech) models behind a single API. Your backend calls a method, and Squid handles authentication, file upload, format conversion, and response decoding.
When to use AI Audio
| Use Case | Recommendation |
|---|---|
| Convert spoken audio into searchable text | transcribe() |
| Read text aloud in a synthesized voice | createSpeech() |
| Build voice input into an AI agent chat experience | Use the enable-transcription flag on the AI chat widget |
| Custom voice for an AI agent | Use agents with voice options |
How it works
- You call this.squid.ai().audio().transcribe() or .createSpeech() from a backend service
- The Squid backend authenticates the request, looks up the OpenAI API key from your application configuration, and forwards the call
- For transcription, the audio file streams to the model and the transcript returns as text
- For speech generation, the model returns binary audio that Squid wraps in a File object (TypeScript) or bytes (Python)
Quick Start
Prerequisites
- A Squid backend project initialized with squid init
- The @squidcloud/backend package (TypeScript) or the squidcloud-backend package (Python)
- OpenAI configured as an AI provider on your application in the Squid Console
Step 1: Create an executable that wraps the audio call
Audio operations require admin access to your Squid resources, so they must run on the backend. Wrap them in an executable so the client can call them safely.
- TypeScript
- Python
import { executable, SquidFile, SquidService } from '@squidcloud/backend';

export class AudioService extends SquidService {
  @executable()
  async transcribeAudio(audio: SquidFile): Promise<string> {
    this.assertIsAuthenticated();
    // SquidFile carries the original filename and MIME type from the client.
    // Convert it to a native File so the audio client can stream it to OpenAI.
    const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
    return this.squid.ai().audio().transcribe(file, {
      modelName: 'whisper-1',
    });
  }
}
from squidcloud_backend import SquidFile, SquidService, executable

class AudioService(SquidService):
    @executable()
    async def transcribe_audio(self, audio: SquidFile) -> str:
        self.assert_is_authenticated()
        # SquidFile is a TypedDict carrying the file bytes plus metadata.
        return await self.squid.ai().audio().transcribe(
            audio['data'],
            audio['originalName'],
            audio['mimetype'],
            options={'modelName': 'whisper-1'},
        )
Step 2: Deploy or run the backend locally
squid start
To deploy to the cloud, see deploying your backend.
Step 3: Call the executable from the client
const fileInput = document.querySelector('input[type="file"]') as HTMLInputElement;
const audioFile = fileInput.files![0];
const transcript = await squid.executeFunction('transcribeAudio', audioFile);
console.log(transcript);
Authentication and Configuration
Audio methods require an authenticated Squid client with admin access. This is true for both transcribe() and createSpeech(). There are two recommended patterns:
- Wrap calls in executables. This is the standard pattern. The executable runs on the backend with admin context, and the client never sees the underlying API key. See executables.
- Call from a privileged backend service. Triggers, schedulers, webhooks, and other backend-only entry points already run with backend privileges, so you can call audio methods directly from them.
Calls from an unauthenticated browser client are rejected with an UNAUTHORIZED error.
OpenAI must be enabled as an external service in your Squid app. The API key is stored in the Squid Console; the audio client looks it up at request time.
Core Concepts
Transcription models
Squid supports three OpenAI transcription models:
| Model | Notes |
|---|---|
| whisper-1 | Default. Supports the widest set of response formats at the lowest cost. |
| gpt-4o-transcribe | Higher accuracy. Returns JSON only. |
| gpt-4o-mini-transcribe | Smaller and cheaper than gpt-4o-transcribe. Returns JSON only. |
Transcription options
AiAudioTranscribeOptions is a discriminated union on modelName. All variants share the following base fields:
| Field | Type | Required | Description |
|---|---|---|---|
| modelName | string | Yes | One of the supported transcription models |
| temperature | number | No | Sampling temperature |
| prompt | string | No | Optional text to guide the transcription (e.g. proper nouns) |
whisper-1 additionally supports:
| Field | Type | Description |
|---|---|---|
| responseFormat | string | One of 'json', 'text', 'srt', 'verbose_json', 'vtt'. Defaults to 'json'. |
gpt-4o-transcribe and gpt-4o-mini-transcribe only return 'json'.
The method always returns a plain string, regardless of the model's underlying response format.
Speech generation models
| Model | Notes |
|---|---|
| tts-1 | Faster, lower latency, slightly lower fidelity |
| tts-1-hd | Higher fidelity, slower |
| gpt-4o-mini-tts | Newer GPT-4o family TTS model |
Speech generation options
AiAudioCreateSpeechOptions fields:
| Field | Type | Required | Description |
|---|---|---|---|
| modelName | string | Yes | One of the supported TTS models |
| voice | string | No | One of 'alloy', 'ash', 'ballad', 'coral', 'echo', 'fable', 'onyx', 'nova', 'sage', 'shimmer', 'verse'. Defaults to 'alloy'. |
| responseFormat | string | No | Audio container format. One of 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'. Defaults to 'mp3'. |
| instructions | string | No | Free-form guidance for the speech style (e.g. "speak slowly and clearly") |
| speed | number | No | Playback speed multiplier. Defaults to 1.0. |
Return shapes
| Method | TypeScript return | Python return |
|---|---|---|
| transcribe() | Promise<string> | str |
| createSpeech() | Promise<File> | bytes |
In TypeScript the file's MIME type and extension are inferred from responseFormat. For example, a request with responseFormat: 'mp3' returns a file named audio.mp3 with MIME type audio/mpeg.
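The docs above only pin down the mp3 case; the sketch below shows one plausible shape for that inference, pairing each format name with its conventional MIME type. The mapping for formats other than mp3, and the helper itself, are assumptions for illustration, not Squid's actual implementation.

```typescript
// Hypothetical sketch: infer a filename and MIME type from responseFormat,
// consistent with the documented mp3 behavior (audio.mp3 / audio/mpeg).
const FORMAT_TO_MIME: Record<string, string> = {
  mp3: 'audio/mpeg',
  opus: 'audio/opus',
  aac: 'audio/aac',
  flac: 'audio/flac',
  wav: 'audio/wav',
};

function inferredFileMeta(responseFormat: string): { name: string; type: string } {
  return {
    name: `audio.${responseFormat}`,
    // Fall back to a generic binary type for formats without a common MIME name.
    type: FORMAT_TO_MIME[responseFormat] ?? 'application/octet-stream',
  };
}
```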
Code Examples
Transcribe audio uploaded by a client
- TypeScript
- Python
import { executable, SquidFile, SquidService } from '@squidcloud/backend';

export class AudioService extends SquidService {
  @executable()
  async transcribeRecording(audio: SquidFile, languageHint?: string): Promise<string> {
    this.assertIsAuthenticated();
    const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
    return this.squid.ai().audio().transcribe(file, {
      modelName: 'whisper-1',
      // Use the prompt to bias the model toward correct spellings or terminology.
      prompt: languageHint,
    });
  }
}
from squidcloud_backend import SquidFile, SquidService, executable

class AudioService(SquidService):
    @executable()
    async def transcribe_recording(
        self,
        audio: SquidFile,
        language_hint: str | None = None,
    ) -> str:
        self.assert_is_authenticated()
        return await self.squid.ai().audio().transcribe(
            audio['data'],
            audio['originalName'],
            audio['mimetype'],
            options={
                'modelName': 'whisper-1',
                # Use the prompt to bias the model toward correct spellings.
                'prompt': language_hint,
            },
        )
Generate spoken audio and return it to the client
This example creates an MP3 file from text and returns it as base64 so the client can play it.
- TypeScript
- Python
import { executable, SquidService } from '@squidcloud/backend';

export class SpeechService extends SquidService {
  @executable()
  async narrate(text: string): Promise<{ base64: string; mimeType: string }> {
    this.assertIsAuthenticated();
    const audioFile = await this.squid.ai().audio().createSpeech(text, {
      modelName: 'tts-1-hd',
      voice: 'nova',
      responseFormat: 'mp3',
      speed: 1.0,
    });
    // Convert the File to base64 so it can travel back to the client over JSON.
    const buffer = Buffer.from(await audioFile.arrayBuffer());
    return {
      base64: buffer.toString('base64'),
      mimeType: audioFile.type,
    };
  }
}
For long-lived storage of generated audio, upload the result to a storage connector instead of returning it inline.
import base64

from squidcloud_backend import SquidService, executable

class SpeechService(SquidService):
    @executable()
    async def narrate(self, text: str) -> dict:
        self.assert_is_authenticated()
        audio_bytes = await self.squid.ai().audio().create_speech(
            text,
            {
                'modelName': 'tts-1-hd',
                'voice': 'nova',
                'responseFormat': 'mp3',
                'speed': 1.0,
            },
        )
        # Return the audio as base64 so it can travel back to the client over JSON.
        return {
            'base64': base64.b64encode(audio_bytes).decode('ascii'),
            'mimeType': 'audio/mpeg',
        }
Round-trip: generate speech then transcribe it back
A useful pattern for integration tests or sanity checks.
- TypeScript
- Python
const speechFile = await this.squid.ai().audio().createSpeech('The quick brown fox jumps over the lazy dog.', {
  modelName: 'tts-1',
});
const transcript = await this.squid.ai().audio().transcribe(speechFile, {
  modelName: 'whisper-1',
});
console.log(transcript); // "The quick brown fox jumps over the lazy dog."
audio_bytes = await self.squid.ai().audio().create_speech(
    'The quick brown fox jumps over the lazy dog.',
    {'modelName': 'tts-1'},
)
transcript = await self.squid.ai().audio().transcribe(
    audio_bytes,
    'speech.mp3',
    'audio/mpeg',
    options={'modelName': 'whisper-1'},
)
print(transcript)  # "The quick brown fox jumps over the lazy dog."
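In practice the transcript rarely matches the input byte-for-byte: casing, punctuation, and whitespace can differ between runs. An integration test is more robust if it compares normalized text. A sketch of one such normalization (a hypothetical helper, not part of the Squid SDK):

```typescript
// Normalize text for a loose transcript comparison: lowercase, strip
// punctuation, and collapse runs of whitespace.
function normalizeTranscript(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '') // drop everything but letters, digits, whitespace
    .trim()
    .replace(/\s+/g, ' ');
}

// An assertion like this tolerates punctuation and casing drift:
//   normalizeTranscript(transcript) === normalizeTranscript(originalText)
```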
Error Handling
Common errors
| Error | Cause | Solution |
|---|---|---|
| UNAUTHORIZED | Audio call originated from a non-admin context (e.g. browser without backend) | Wrap the call in an executable or call from backend code |
| Unsupported audio model | modelName is not in the supported list | Use one of the listed transcription or TTS models |
| OpenAI external services are disabled | OpenAI is not enabled for the application | Enable OpenAI in the Squid Console and configure the API key |
| File too large | Uploaded audio exceeds the upstream provider limit | Whisper has a 25 MB upload limit. Split or compress audio above this size. |
Validate input before calling
The OpenAI Whisper API rejects files larger than 25 MB. Reject oversized uploads in your executable rather than waiting for the upstream error:
- TypeScript
- Python
@executable()
async transcribeAudio(audio: SquidFile): Promise<string> {
  this.assertIsAuthenticated();
  const MAX_BYTES = 25 * 1024 * 1024;
  if (audio.size > MAX_BYTES) {
    throw new Error('Audio file exceeds 25 MB. Split or compress before transcribing.');
  }
  if (!audio.mimetype.startsWith('audio/')) {
    throw new Error('Only audio files are accepted');
  }
  const file = new File([audio.data], audio.originalName, { type: audio.mimetype });
  return this.squid.ai().audio().transcribe(file, { modelName: 'whisper-1' });
}
@executable()
async def transcribe_audio(self, audio: SquidFile) -> str:
    self.assert_is_authenticated()
    MAX_BYTES = 25 * 1024 * 1024
    if audio['size'] > MAX_BYTES:
        raise ValueError('Audio file exceeds 25 MB. Split or compress before transcribing.')
    if not audio['mimetype'].startswith('audio/'):
        raise ValueError('Only audio files are accepted')
    return await self.squid.ai().audio().transcribe(
        audio['data'],
        audio['originalName'],
        audio['mimetype'],
        options={'modelName': 'whisper-1'},
    )
Best Practices
- Always wrap audio calls in executables. Audio methods require admin access. Exposing them directly from a browser leaks API keys and bypasses authentication.
- Validate file size and MIME type before calling transcribe(). The upstream provider has its own limits, and rejecting bad input early gives users better error messages.
- Use the prompt field on transcription to feed in domain vocabulary, proper nouns, or expected language. This is the cheapest way to improve accuracy.
- Pick tts-1 for streaming or low-latency UX, and tts-1-hd for offline assets where quality matters more than latency.
- Apply rate limiting to your audio executables. Speech generation and transcription both consume paid API quota.
- Cache generated speech. Audio generated for the same text, model, and voice settings can be reused, so caching repeated requests saves quota.
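One way to implement the caching advice above is to key the cache on a hash of every input that affects the output: text, model, voice, format, and speed. A sketch using Node's built-in crypto module (the helper name and the defaults it fills in mirror the option defaults documented earlier, but the helper itself is illustrative):

```typescript
import { createHash } from 'node:crypto';

interface SpeechOptions {
  modelName: string;
  voice?: string;
  responseFormat?: string;
  speed?: number;
}

// Build a deterministic cache key from every input that affects the audio.
// Defaults are filled in so that explicit and implicit defaults hash alike.
function speechCacheKey(text: string, options: SpeechOptions): string {
  const canonical = JSON.stringify({
    text,
    modelName: options.modelName,
    voice: options.voice ?? 'alloy',
    responseFormat: options.responseFormat ?? 'mp3',
    speed: options.speed ?? 1.0,
  });
  return createHash('sha256').update(canonical).digest('hex');
}

// Before calling createSpeech, look this key up in a store (e.g. a Squid
// collection or a storage bucket); generate and store the audio only on a miss.
```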
See Also
- Executables - Wrap audio calls so they can be called from a client
- AI agent - Build an AI agent with voice input and output
- AI chat widget - The chat widget exposes voice input via enable-transcription
- Rate and quota limiting - Protect audio executables from abuse