Audio & Speech
The `WhisperClient` provides speech-to-text capabilities powered by OpenAI's Whisper model. It supports transcription in multiple languages and translation of non-English audio to English.
Quick Start
Two lines of code to transcribe or translate any audio file. The client handles file upload, API communication, and retry logic.
```java
WhisperClient whisper = new WhisperClient();

// Transcribe audio from a file
String text = whisper.transcribe(new File("speech.mp3"));

// Translate non-English audio to English
String english = whisper.translate(new File("french_speech.mp3"));
```

Builder
When you need to use a custom API key or base URL (for example, if you are running a Whisper-compatible server), use the builder.
```java
WhisperClient whisper = WhisperClient.builder()
        .model("whisper-1")     // Model name (default: "whisper-1")
        .apiKey("sk-...")       // API key (default: OPENAI_API_KEY env var)
        .baseUrl("https://...") // Base URL (default: OPENAI_BASE_URL or OpenAI)
        .build();
```

Transcription
Convert speech audio into text. Three overloads are available, from a simple one-liner to a fully configurable version with language hints, timestamps, and custom response formats.
```java
// 1. File in, text out
String text = whisper.transcribe(new File("speech.mp3"));

// 2. AudioPart in, result out (default options)
TranscriptionResult result = whisper.transcribe(AudioPart.fromFile(new File("speech.mp3")));

// 3. AudioPart + options, result out
TranscriptionResult result = whisper.transcribe(
        AudioPart.fromFile(new File("meeting.wav")),
        TranscriptionOptions.builder()
                .language("en")
                .responseFormat(ResponseFormat.VERBOSE_JSON)
                .timestampGranularities(List.of("word", "segment"))
                .temperature(0.0f)
                .prompt("Technical meeting about AI architecture")
                .build()
);

// Access result fields
String text = result.getText();
result.getLanguage().ifPresent(lang -> System.out.println("Detected: " + lang));
result.getDuration().ifPresent(dur -> System.out.println("Duration: " + dur + "s"));
if (result.hasWords()) {
    result.getWords().forEach(w -> System.out.println(w));
}
if (result.hasSegments()) {
    result.getSegments().forEach(s -> System.out.println(s));
}
```

TranscriptionOptions
Fine-tune the transcription by specifying the language, providing a context prompt, or requesting word-level timestamps.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | `String` | Auto-detect | ISO-639-1 language code (e.g., `"en"`, `"tr"`, `"fr"`) |
| `prompt` | `String` | None | Optional prompt to guide style or continue a previous segment |
| `responseFormat` | `ResponseFormat` | `JSON` | Output format (see below) |
| `temperature` | `float` | Provider default | Sampling temperature (0.0 = deterministic) |
| `timestampGranularities` | `List<String>` | Empty | `"word"` and/or `"segment"` (requires `VERBOSE_JSON` format) |
TranscriptionResult
The result always includes the transcribed text. When using the `VERBOSE_JSON` format, you also get the detected language, audio duration, and optional word/segment timestamps.
| Method | Return Type | Description |
|---|---|---|
| `getText()` | `String` | The transcribed text |
| `getLanguage()` | `Optional<String>` | Detected language (verbose JSON only) |
| `getDuration()` | `Optional<Double>` | Audio duration in seconds (verbose JSON only) |
| `getSegments()` | `List<Map<String, Object>>` | Segment-level timestamps |
| `getWords()` | `List<Map<String, Object>>` | Word-level timestamps |
| `hasSegments()` | `boolean` | Whether segment data is present |
| `hasWords()` | `boolean` | Whether word data is present |
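Word and segment entries are returned as plain maps rather than typed objects. A minimal sketch of formatting one word entry, assuming the `"word"`, `"start"`, and `"end"` keys of the Whisper verbose JSON schema (verify these against your actual result maps):

```java
import java.util.Map;

public class WordFormatter {
    // Renders one entry from getWords() as "start - end  word".
    // Key names follow the Whisper verbose_json schema; they are an
    // assumption here, not guaranteed by TranscriptionResult itself.
    static String formatWord(Map<String, Object> w) {
        return String.format("%6.2fs - %6.2fs  %s",
                ((Number) w.get("start")).doubleValue(),
                ((Number) w.get("end")).doubleValue(),
                w.get("word"));
    }

    public static void main(String[] args) {
        Map<String, Object> sample = Map.of("word", "hello", "start", 0.0, "end", 0.42);
        System.out.println(formatWord(sample));
    }
}
```

Casting through `Number` keeps the helper safe whether the JSON layer deserializes timestamps as `Double` or `Integer`.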
Translation
Translate audio in any supported language into English text. The source language is detected automatically -- you do not need to specify it.
```java
// Simple file translation
String english = whisper.translate(new File("turkish_speech.mp3"));

// With options
String english = whisper.translate(
        AudioPart.fromFile(new File("german_lecture.wav")),
        TranslationOptions.builder()
                .responseFormat(ResponseFormat.TEXT)
                .temperature(0.0f)
                .prompt("Academic lecture on physics")
                .build()
);
```

TranslationOptions
Similar to transcription options but without a language parameter, since translation always auto-detects the source language.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `String` | None | Optional prompt to guide translation style |
| `responseFormat` | `ResponseFormat` | `JSON` | Output format |
| `temperature` | `float` | Provider default | Sampling temperature |
Translation always outputs English. There is no language parameter -- the source language is detected automatically.
ResponseFormat
Choose the output format based on what you need. Use `TEXT` for simple transcriptions, `VERBOSE_JSON` for timestamps and metadata, or `SRT`/`VTT` for subtitle generation.
| Value | Description |
|---|---|
| `JSON` | Returns `{"text": "..."}` |
| `TEXT` | Returns plain text |
| `SRT` | Returns SubRip subtitle format |
| `VTT` | Returns WebVTT subtitle format |
| `VERBOSE_JSON` | Returns text plus language, duration, segments, and word timestamps |
```java
// Subtitle generation
TranscriptionResult srt = whisper.transcribe(
        AudioPart.fromFile(new File("video.mp4")),
        TranscriptionOptions.builder()
                .responseFormat(ResponseFormat.SRT)
                .build()
);
```

AudioPart
`AudioPart` is a content wrapper from tnsai-core that handles the details of encoding and formatting audio data for API submission. Create one from whichever source you have -- file, bytes, Base64 string, or URL.
```java
// From file (reads and Base64-encodes)
AudioPart audio = AudioPart.fromFile(new File("speech.wav"));

// From Base64 string
AudioPart audio = AudioPart.fromBase64(base64String, "audio/mp3");

// From byte array
AudioPart audio = AudioPart.fromBytes(rawBytes, "audio/wav");

// From URL
AudioPart audio = AudioPart.fromUrl("https://example.com/audio.mp3");
```

Supported Audio Formats
The following audio formats are accepted by the Whisper API: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac, aac, aiff.
Maximum file size: 25 MB.
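The limits above can be checked client-side before uploading. A minimal sketch of such a guard -- the extension set and the 25 MB cap come from this section, but `WhisperClient` may perform its own validation, so treat this as an optional pre-flight check rather than the library's behavior:

```java
import java.io.File;
import java.util.Set;

public class AudioPrecheck {
    // Extensions accepted per the list above.
    private static final Set<String> SUPPORTED = Set.of(
            "mp3", "mp4", "mpeg", "mpga", "m4a", "wav",
            "webm", "ogg", "flac", "aac", "aiff");

    // 25 MB cap from the documentation (interpreted here as 25 * 1024 * 1024 bytes).
    private static final long MAX_BYTES = 25L * 1024 * 1024;

    static boolean isUploadable(File f) {
        String name = f.getName().toLowerCase();
        int dot = name.lastIndexOf('.');
        String ext = dot >= 0 ? name.substring(dot + 1) : "";
        return SUPPORTED.contains(ext) && f.length() <= MAX_BYTES;
    }
}
```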
Configuration
Set your OpenAI API key to authenticate with the Whisper service. An optional base URL override is available for self-hosted or proxy deployments.
| Environment Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | OpenAI API key |
| `OPENAI_BASE_URL` | No | Custom base URL (default: `https://api.openai.com/v1`) |
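A typical shell setup for the variables above; both values are placeholders to substitute with your own:

```shell
# Required: your OpenAI API key
export OPENAI_API_KEY="sk-..."

# Optional: only set this for a self-hosted or proxy Whisper-compatible endpoint
export OPENAI_BASE_URL="https://api.openai.com/v1"
```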
Error Handling
The client includes built-in resilience so you do not need to implement retry logic yourself. Transient errors (network issues, rate limits) are retried up to 3 times with exponential backoff. Permanent failures throw `LLMException`.
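The exact retry delays are internal to the client; a minimal sketch of how an exponential-backoff schedule is typically computed (the 500 ms base and 8 s cap are illustrative assumptions, not `WhisperClient`'s actual values):

```java
public class Backoff {
    // Delay doubles with each retry attempt (0, 1, 2, ...) up to a cap,
    // so repeated transient failures back off quickly without waiting forever.
    static long delayMillis(int attempt) {
        long base = 500;                      // assumed base delay in ms
        return Math.min(base << attempt, 8_000); // cap at an assumed 8 s
    }
}
```

Production implementations usually also add random jitter to the delay so that many clients retrying at once do not hit the API in lockstep.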
```java
try {
    String text = whisper.transcribe(new File("speech.mp3"));
} catch (LLMException e) {
    System.err.println("Transcription failed: " + e.getMessage());
} catch (IllegalArgumentException e) {
    System.err.println("File too large or invalid: " + e.getMessage());
}
```

Reasoning
Advanced reasoning strategies for complex problem solving. TnsAI provides multiple reasoning executors based on recent AI research, from simple chain-of-thought to graph-based reasoning with merging and refinement.
LLM Caching
Reduce latency and cost with semantic response caching. The cache uses similarity matching so that near-identical prompts return cached responses without hitting the API.