
Audio & Speech

The `WhisperClient` provides speech-to-text capabilities powered by OpenAI's Whisper model. It supports transcription in multiple languages and translation of non-English audio to English.

Quick Start

A single call transcribes or translates an audio file. The client handles file upload, API communication, and retry logic.

WhisperClient whisper = new WhisperClient();

// Transcribe audio from a file
String text = whisper.transcribe(new File("speech.mp3"));

// Translate non-English audio to English
String english = whisper.translate(new File("french_speech.mp3"));

Builder

When you need to use a custom API key or base URL (for example, if you are running a Whisper-compatible server), use the builder.

WhisperClient whisper = WhisperClient.builder()
    .model("whisper-1")          // Model name (default: "whisper-1")
    .apiKey("sk-...")            // API key (default: OPENAI_API_KEY env var)
    .baseUrl("https://...")      // Base URL (default: OPENAI_BASE_URL env var, or the OpenAI endpoint)
    .build();

Transcription

Convert speech audio into text. Three overloads are available, from a simple one-liner to a fully configurable version with language hints, timestamps, and custom response formats.

// 1. File in, text out
String text = whisper.transcribe(new File("speech.mp3"));

// 2. AudioPart in, result out (default options)
TranscriptionResult result = whisper.transcribe(AudioPart.fromFile(new File("speech.mp3")));

// 3. AudioPart + options, result out
TranscriptionResult result = whisper.transcribe(
    AudioPart.fromFile(new File("meeting.wav")),
    TranscriptionOptions.builder()
        .language("en")
        .responseFormat(ResponseFormat.VERBOSE_JSON)
        .timestampGranularities(List.of("word", "segment"))
        .temperature(0.0f)
        .prompt("Technical meeting about AI architecture")
        .build()
);

// Access result fields
String text = result.getText();
result.getLanguage().ifPresent(lang -> System.out.println("Detected: " + lang));
result.getDuration().ifPresent(dur -> System.out.println("Duration: " + dur + "s"));

if (result.hasWords()) {
    result.getWords().forEach(w -> System.out.println(w));
}
if (result.hasSegments()) {
    result.getSegments().forEach(s -> System.out.println(s));
}

TranscriptionOptions

Fine-tune the transcription by specifying the language, providing a context prompt, or requesting word-level timestamps.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | `String` | Auto-detect | ISO-639-1 language code (e.g., `"en"`, `"tr"`, `"fr"`) |
| `prompt` | `String` | None | Optional prompt to guide style or continue a previous segment |
| `responseFormat` | `ResponseFormat` | `JSON` | Output format (see below) |
| `temperature` | `float` | Provider default | Sampling temperature (0.0 = deterministic) |
| `timestampGranularities` | `List<String>` | Empty | `"word"` and/or `"segment"` (requires `VERBOSE_JSON` format) |

TranscriptionResult

The result always includes the transcribed text. When using VERBOSE_JSON format, you also get the detected language, audio duration, and optional word/segment timestamps.

| Method | Return Type | Description |
|---|---|---|
| `getText()` | `String` | The transcribed text |
| `getLanguage()` | `Optional<String>` | Detected language (verbose JSON only) |
| `getDuration()` | `Optional<Double>` | Audio duration in seconds (verbose JSON only) |
| `getSegments()` | `List<Map<String, Object>>` | Segment-level timestamps |
| `getWords()` | `List<Map<String, Object>>` | Word-level timestamps |
| `hasSegments()` | `boolean` | Whether segment data is present |
| `hasWords()` | `boolean` | Whether word data is present |
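
Word-level timestamps arrive as generic maps, so a small formatting helper can make them easier to inspect. The sketch below is a hypothetical utility, not part of the library; it assumes each map carries `"word"`, `"start"`, and `"end"` keys (as in Whisper's verbose JSON output), so adjust the keys to your actual payload.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class WordTimestampFormatter {
    // Renders word-level timestamp maps as "start-end  word" lines.
    // Assumed keys: "word" (String), "start"/"end" (numeric seconds).
    public static String format(List<Map<String, Object>> words) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, Object> w : words) {
            double start = ((Number) w.get("start")).doubleValue();
            double end = ((Number) w.get("end")).doubleValue();
            // Locale.ROOT keeps the decimal separator a dot regardless of system locale
            sb.append(String.format(Locale.ROOT, "%6.2f-%6.2f  %s%n",
                    start, end, w.get("word")));
        }
        return sb.toString();
    }
}
```

With a verbose-JSON result, you would pass `result.getWords()` straight to `format(...)` after checking `result.hasWords()`.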

Translation

Translate audio in any supported language into English text. The source language is detected automatically -- you do not need to specify it.

// Simple file translation
String english = whisper.translate(new File("turkish_speech.mp3"));

// With options
String english = whisper.translate(
    AudioPart.fromFile(new File("german_lecture.wav")),
    TranslationOptions.builder()
        .responseFormat(ResponseFormat.TEXT)
        .temperature(0.0f)
        .prompt("Academic lecture on physics")
        .build()
);

TranslationOptions

Similar to transcription options but without a language parameter, since translation always auto-detects the source language.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `String` | None | Optional prompt to guide translation style |
| `responseFormat` | `ResponseFormat` | `JSON` | Output format |
| `temperature` | `float` | Provider default | Sampling temperature |

Translation always outputs English. There is no language parameter -- the source language is detected automatically.

ResponseFormat

Choose the output format based on what you need. Use TEXT for simple transcriptions, VERBOSE_JSON for timestamps and metadata, or SRT/VTT for subtitle generation.

| Value | Description |
|---|---|
| `JSON` | Returns `{"text": "..."}` |
| `TEXT` | Returns plain text |
| `SRT` | Returns SubRip subtitle format |
| `VTT` | Returns WebVTT subtitle format |
| `VERBOSE_JSON` | Returns text plus language, duration, segments, and word timestamps |

// Subtitle generation
TranscriptionResult srt = whisper.transcribe(
    AudioPart.fromFile(new File("video.mp4")),
    TranscriptionOptions.builder()
        .responseFormat(ResponseFormat.SRT)
        .build()
);

AudioPart

AudioPart is a content wrapper from tnsai-core that handles the details of encoding and formatting audio data for API submission. Create one from whichever source you have -- file, bytes, Base64 string, or URL.

// From file (reads and Base64-encodes)
AudioPart audio = AudioPart.fromFile(new File("speech.wav"));

// From Base64 string
AudioPart audio = AudioPart.fromBase64(base64String, "audio/mp3");

// From byte array
AudioPart audio = AudioPart.fromBytes(rawBytes, "audio/wav");

// From URL
AudioPart audio = AudioPart.fromUrl("https://example.com/audio.mp3");

Supported Audio Formats

The following audio formats are accepted by the Whisper API: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac, aac, aiff.

Maximum file size: 25 MB.
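
Rejecting an oversized or unsupported file before uploading saves a round trip. The helper below is a hypothetical client-side pre-check built only from the limits stated above (the format list and the 25 MB cap); the server remains the final authority on what it accepts.

```java
import java.io.File;
import java.util.Set;

public class AudioPrecheck {
    private static final long MAX_BYTES = 25L * 1024 * 1024; // 25 MB API limit
    private static final Set<String> FORMATS = Set.of(
        "mp3", "mp4", "mpeg", "mpga", "m4a", "wav",
        "webm", "ogg", "flac", "aac", "aiff");

    // Returns true if the file has a recognized extension
    // and a size at or under the 25 MB limit.
    public static boolean isAcceptable(File f) {
        String name = f.getName().toLowerCase();
        int dot = name.lastIndexOf('.');
        boolean knownFormat = dot >= 0 && FORMATS.contains(name.substring(dot + 1));
        return knownFormat && f.length() <= MAX_BYTES;
    }
}
```

Run the check before `transcribe(...)` or `translate(...)` to fail fast with a clear local error instead of an API rejection.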

Configuration

Set your OpenAI API key to authenticate with the Whisper service. An optional base URL override is available for self-hosted or proxy deployments.

| Environment Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | OpenAI API key |
| `OPENAI_BASE_URL` | No | Custom base URL (default: `https://api.openai.com/v1`) |
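
If you prefer to resolve configuration explicitly rather than rely on the builder's defaults, the same lookup can be sketched in plain Java. This is an illustrative helper (`WhisperConfig` is not part of the library) that mirrors the defaults in the table above: key from `OPENAI_API_KEY`, base URL from `OPENAI_BASE_URL` with the public endpoint as fallback.

```java
public class WhisperConfig {
    // Fails fast with a clear message when the required key is missing.
    public static String apiKey() {
        String key = System.getenv("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("OPENAI_API_KEY is not set");
        }
        return key;
    }

    // Falls back to the public OpenAI endpoint when no override is set.
    public static String baseUrl() {
        String url = System.getenv("OPENAI_BASE_URL");
        return (url == null || url.isBlank()) ? "https://api.openai.com/v1" : url;
    }
}
```

The resolved values can then be passed to `WhisperClient.builder().apiKey(...).baseUrl(...)` directly.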

Error Handling

The client includes built-in resilience so you do not need to implement retry logic yourself. Transient errors (network issues, rate limits) are retried up to 3 times with exponential backoff. Permanent failures throw LLMException.

try {
    String text = whisper.transcribe(new File("speech.mp3"));
} catch (LLMException e) {
    System.err.println("Transcription failed: " + e.getMessage());
} catch (IllegalArgumentException e) {
    System.err.println("File too large or invalid: " + e.getMessage());
}
