
Audio & Speech

The `WhisperClient` provides speech-to-text capabilities powered by OpenAI's Whisper model. It supports transcription in multiple languages and translation of non-English audio to English.

Quick Start

A single call transcribes or translates an audio file. The client handles file upload, API communication, and retry logic.

WhisperClient whisper = new WhisperClient();

// Transcribe audio from a file
String text = whisper.transcribe(new File("speech.mp3"));

// Translate non-English audio to English
String english = whisper.translate(new File("french_speech.mp3"));

Builder

When you need to use a custom API key or base URL (for example, if you are running a Whisper-compatible server), use the builder.

WhisperClient whisper = WhisperClient.builder()
    .model("whisper-1")          // Model name (default: "whisper-1")
    .apiKey("sk-...")            // API key (default: OPENAI_API_KEY env var)
    .baseUrl("https://...")      // Base URL (default: OPENAI_BASE_URL env var, or the OpenAI endpoint)
    .build();

Transcription

Convert speech audio into text. Three overloads are available, from a simple one-liner to a fully configurable version with language hints, timestamps, and custom response formats.

// 1. File in, text out
String text = whisper.transcribe(new File("speech.mp3"));

// 2. AudioPart in, result out (default options)
TranscriptionResult result = whisper.transcribe(AudioPart.fromFile(new File("speech.mp3")));

// 3. AudioPart + options, result out
TranscriptionResult result = whisper.transcribe(
    AudioPart.fromFile(new File("meeting.wav")),
    TranscriptionOptions.builder()
        .language("en")
        .responseFormat(ResponseFormat.VERBOSE_JSON)
        .timestampGranularities(List.of("word", "segment"))
        .temperature(0.0f)
        .prompt("Technical meeting about AI architecture")
        .build()
);

// Access result fields
String text = result.getText();
result.getLanguage().ifPresent(lang -> System.out.println("Detected: " + lang));
result.getDuration().ifPresent(dur -> System.out.println("Duration: " + dur + "s"));

if (result.hasWords()) {
    result.getWords().forEach(w -> System.out.println(w));
}
if (result.hasSegments()) {
    result.getSegments().forEach(s -> System.out.println(s));
}

TranscriptionOptions

Fine-tune the transcription by specifying the language, providing a context prompt, or requesting word-level timestamps.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | `String` | Auto-detect | ISO-639-1 language code (e.g., `"en"`, `"tr"`, `"fr"`) |
| `prompt` | `String` | None | Optional prompt to guide style or continue a previous segment |
| `responseFormat` | `ResponseFormat` | `JSON` | Output format (see below) |
| `temperature` | `float` | Provider default | Sampling temperature (0.0 = deterministic) |
| `timestampGranularities` | `List<String>` | Empty | `"word"` and/or `"segment"` (requires `VERBOSE_JSON` format) |

TranscriptionResult

The result always includes the transcribed text. When using VERBOSE_JSON format, you also get the detected language, audio duration, and optional word/segment timestamps.

| Method | Return Type | Description |
|---|---|---|
| `getText()` | `String` | The transcribed text |
| `getLanguage()` | `Optional<String>` | Detected language (verbose JSON only) |
| `getDuration()` | `Optional<Double>` | Audio duration in seconds (verbose JSON only) |
| `getSegments()` | `List<Map<String, Object>>` | Segment-level timestamps |
| `getWords()` | `List<Map<String, Object>>` | Word-level timestamps |
| `hasSegments()` | `boolean` | Whether segment data is present |
| `hasWords()` | `boolean` | Whether word data is present |
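
Word-level timestamps arrive as generic maps, so a small formatting helper can make them easier to inspect. The sketch below is a hypothetical utility, not part of the library; it assumes each map carries `"word"`, `"start"`, and `"end"` keys (as in Whisper's verbose JSON output), so adjust the keys to your actual payload.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class WordTimestampFormatter {
    // Renders word-level timestamp maps as "start-end  word" lines.
    // Assumed keys: "word" (String), "start"/"end" (numeric seconds).
    public static String format(List<Map<String, Object>> words) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, Object> w : words) {
            double start = ((Number) w.get("start")).doubleValue();
            double end = ((Number) w.get("end")).doubleValue();
            // Locale.ROOT keeps the decimal separator a dot regardless of system locale
            sb.append(String.format(Locale.ROOT, "%6.2f-%6.2f  %s%n",
                    start, end, w.get("word")));
        }
        return sb.toString();
    }
}
```

With a verbose-JSON result, you would pass `result.getWords()` straight to `format(...)` after checking `result.hasWords()`.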

Translation

Translate audio in any supported language into English text. The source language is detected automatically -- you do not need to specify it.

// Simple file translation
String english = whisper.translate(new File("turkish_speech.mp3"));

// With options
String english = whisper.translate(
    AudioPart.fromFile(new File("german_lecture.wav")),
    TranslationOptions.builder()
        .responseFormat(ResponseFormat.TEXT)
        .temperature(0.0f)
        .prompt("Academic lecture on physics")
        .build()
);

TranslationOptions

Similar to transcription options but without a language parameter, since translation always auto-detects the source language.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `String` | None | Optional prompt to guide translation style |
| `responseFormat` | `ResponseFormat` | `JSON` | Output format |
| `temperature` | `float` | Provider default | Sampling temperature |

Translation always outputs English. There is no language parameter -- the source language is detected automatically.

ResponseFormat

Choose the output format based on what you need. Use TEXT for simple transcriptions, VERBOSE_JSON for timestamps and metadata, or SRT/VTT for subtitle generation.

| Value | Description |
|---|---|
| `JSON` | Returns `{"text": "..."}` |
| `TEXT` | Returns plain text |
| `SRT` | Returns SubRip subtitle format |
| `VTT` | Returns WebVTT subtitle format |
| `VERBOSE_JSON` | Returns text plus language, duration, segments, and word timestamps |

// Subtitle generation
TranscriptionResult srt = whisper.transcribe(
    AudioPart.fromFile(new File("video.mp4")),
    TranscriptionOptions.builder()
        .responseFormat(ResponseFormat.SRT)
        .build()
);

AudioPart

AudioPart is a content wrapper from tnsai-core that handles the details of encoding and formatting audio data for API submission. Create one from whichever source you have -- file, bytes, Base64 string, or URL.

// From file (reads and Base64-encodes)
AudioPart audio = AudioPart.fromFile(new File("speech.wav"));

// From Base64 string
AudioPart audio = AudioPart.fromBase64(base64String, "audio/mp3");

// From byte array
AudioPart audio = AudioPart.fromBytes(rawBytes, "audio/wav");

// From URL
AudioPart audio = AudioPart.fromUrl("https://example.com/audio.mp3");

Supported Audio Formats

The following audio formats are accepted by the Whisper API: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac, aac, aiff.

Maximum file size: 25 MB.
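
Rejecting an oversized or unsupported file before uploading saves a round trip. The helper below is a hypothetical client-side pre-check built only from the limits stated above (the format list and the 25 MB cap); the server remains the final authority on what it accepts.

```java
import java.io.File;
import java.util.Set;

public class AudioPrecheck {
    private static final long MAX_BYTES = 25L * 1024 * 1024; // 25 MB API limit
    private static final Set<String> FORMATS = Set.of(
        "mp3", "mp4", "mpeg", "mpga", "m4a", "wav",
        "webm", "ogg", "flac", "aac", "aiff");

    // Returns true if the file has a recognized extension
    // and a size at or under the 25 MB limit.
    public static boolean isAcceptable(File f) {
        String name = f.getName().toLowerCase();
        int dot = name.lastIndexOf('.');
        boolean knownFormat = dot >= 0 && FORMATS.contains(name.substring(dot + 1));
        return knownFormat && f.length() <= MAX_BYTES;
    }
}
```

Run the check before `transcribe(...)` or `translate(...)` to fail fast with a clear local error instead of an API rejection.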

Configuration

Set your OpenAI API key to authenticate with the Whisper service. An optional base URL override is available for self-hosted or proxy deployments.

| Environment Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | OpenAI API key |
| `OPENAI_BASE_URL` | No | Custom base URL (default: `https://api.openai.com/v1`) |
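
If you prefer to resolve configuration explicitly rather than rely on the builder's defaults, the same lookup can be sketched in plain Java. This is an illustrative helper (`WhisperConfig` is not part of the library) that mirrors the defaults in the table above: key from `OPENAI_API_KEY`, base URL from `OPENAI_BASE_URL` with the public endpoint as fallback.

```java
public class WhisperConfig {
    // Fails fast with a clear message when the required key is missing.
    public static String apiKey() {
        String key = System.getenv("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("OPENAI_API_KEY is not set");
        }
        return key;
    }

    // Falls back to the public OpenAI endpoint when no override is set.
    public static String baseUrl() {
        String url = System.getenv("OPENAI_BASE_URL");
        return (url == null || url.isBlank()) ? "https://api.openai.com/v1" : url;
    }
}
```

The resolved values can then be passed to `WhisperClient.builder().apiKey(...).baseUrl(...)` directly.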

Error Handling

The client includes built-in resilience so you do not need to implement retry logic yourself. Transient errors (network issues, rate limits) are retried up to 3 times with exponential backoff. Permanent failures throw LLMException.

try {
    String text = whisper.transcribe(new File("speech.mp3"));
} catch (LLMException e) {
    System.err.println("Transcription failed: " + e.getMessage());
} catch (IllegalArgumentException e) {
    System.err.println("File too large or invalid: " + e.getMessage());
}
