Multimodal Tools
Three aggregator toolkits in tnsai-tools give an agent text→image, text→speech, and speech→text capability without the consumer having to write provider plumbing. Each toolkit is a function-shape POJO (RFC #188) that exposes one @Tool-annotated method per backend provider, so the LLM can pick a provider at call time based on quality, latency, or cost.
| Toolkit | BuiltInTool enum | Methods | Modality |
|---|---|---|---|
| ImageGenTools | IMAGE_GEN_TOOLS | dalle3_generate, flux_generate, stability_generate | Text → image |
| TextToSpeechTools | TEXT_TO_SPEECH_TOOLS | elevenlabs_tts, cartesia_tts, deepgram_tts | Text → speech |
| SpeechToTextTools | SPEECH_TO_TEXT_TOOLS | deepgram_transcribe, assemblyai_transcribe, replicate_whisper | Audio file → text |
These toolkits pair with the older OpenAI-only MEDIA_TOOLS (openai_tts, whisper_transcribe). Register both when you want OpenAI plus a non-OpenAI fallback in the same agent.
Quick start
Register the toolkits the same way as any other built-in tool. The framework instantiates the backing POJO, scans for @Tool methods, and exposes each as a separate function the LLM can call:
```java
import com.tnsai.agents.AgentBuilder;
import com.tnsai.enums.BuiltInTool;
import com.tnsai.llm.providers.OpenAIClient;

Agent artist = AgentBuilder.create()
    .llm(new OpenAIClient("gpt-4o"))
    .role(myRole)
    .builtInTools(
        BuiltInTool.IMAGE_GEN_TOOLS,      // dalle3_generate, flux_generate, stability_generate
        BuiltInTool.TEXT_TO_SPEECH_TOOLS, // elevenlabs_tts, cartesia_tts, deepgram_tts
        BuiltInTool.SPEECH_TO_TEXT_TOOLS  // deepgram_transcribe, assemblyai_transcribe, replicate_whisper
    )
    .build();

String response = artist.chat("Draw me a watercolor painting of a giraffe on Mars");
// LLM picks dalle3_generate, calls it; the agent reply embeds the URL.
```

Each @Tool reads its API key from a process environment variable on first call. Missing keys throw IllegalStateException with the exact variable name in the message.
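The first-call key lookup can be pictured with a small sketch. `LazyApiKey` is a hypothetical helper written for illustration, not a class from tnsai-tools; the injectable lookup function is likewise an assumption, added only so the behavior can be exercised without touching the real environment.

```java
import java.util.function.Function;

/** Hypothetical helper: resolves a provider API key lazily, on first use. */
final class LazyApiKey {
    private final String variableName;
    private final Function<String, String> env; // injectable lookup, defaults to System.getenv
    private String cached;

    LazyApiKey(String variableName) {
        this(variableName, System::getenv);
    }

    LazyApiKey(String variableName, Function<String, String> env) {
        this.variableName = variableName;
        this.env = env;
    }

    /** Reads the variable on the first call; fails with the variable name if it is unset or blank. */
    synchronized String get() {
        if (cached == null) {
            String value = env.apply(variableName);
            if (value == null || value.isBlank()) {
                throw new IllegalStateException(
                        "Missing environment variable: " + variableName);
            }
            cached = value;
        }
        return cached;
    }
}
```

Constructing `new LazyApiKey("STABILITY_API_KEY")` never touches the environment; only `get()` does, which is why registering a toolkit without keys does not fail at build time.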
Image generation — IMAGE_GEN_TOOLS
| Method | Backend | Auth | Output shape |
|---|---|---|---|
| dalle3_generate | OpenAI DALL-E 3 | OPENAI_API_KEY | {provider, model, urls[], revised_prompt?} |
| flux_generate | Black Forest Labs FLUX (via Replicate sync) | REPLICATE_API_TOKEN | {provider, model, urls[]} |
| stability_generate | Stability AI Stable Image v2 | STABILITY_API_KEY | {provider, model, urls[], finish_reason?} |
dalle3_generate and flux_generate return provider-hosted URLs. DALL-E URLs expire roughly an hour after generation — fetch and restage if you need persistence. FLUX URLs live for the duration documented by Replicate.
stability_generate returns image bytes, which the tool encodes inline as a data:image/png;base64,… data URI. The same pattern applies to every TTS method below — uniform shape so the agent doesn't need per-provider plumbing.
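Turning raw bytes into an inline data URI takes only the JDK's Base64 encoder. The sketch below shows the general technique; `DataUris` and `toDataUri` are hypothetical names chosen for this example, and the toolkit's internal encoding code may differ.

```java
import java.util.Base64;

/** Hypothetical helper illustrating the data-URI encoding step. */
final class DataUris {
    private DataUris() {}

    /** Builds "data:<mimeType>;base64,<payload>" from raw bytes. */
    static String toDataUri(String mimeType, byte[] payload) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(payload);
    }
}
```

Calling `DataUris.toDataUri("image/png", pngBytes)` yields a string usable directly as an `<img src=…>` value. Base64 inflates the payload by roughly a third, which is the trade-off for skipping a hosting step.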
```java
// Each method validates parameters before the HTTP call so a bad LLM
// argument fails fast with a clear message instead of a provider 4xx.
String json = new ImageGenTools().dalle3Generate(
    "A watercolor giraffe on Mars",
    "1024x1024", // 1024x1024 | 1792x1024 | 1024x1792
    "hd",        // standard | hd
    1            // DALL-E 3 supports n=1 only; tool rejects n>1 upfront
);
```

Defaults are provider-appropriate: DALL-E 1024x1024 standard, FLUX schnell model with 1:1 aspect ratio in webp, Stability core model with 1:1 aspect ratio.
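The fail-fast checks described for dalle3Generate could look roughly like the following. This is a sketch built only from the constraints listed in the call's comments; `Dalle3Params` and its error messages are hypothetical and not taken from the toolkit source.

```java
import java.util.Set;

/** Hypothetical pre-flight check for the DALL-E 3 arguments. */
final class Dalle3Params {
    private static final Set<String> SIZES = Set.of("1024x1024", "1792x1024", "1024x1792");
    private static final Set<String> QUALITIES = Set.of("standard", "hd");

    private Dalle3Params() {}

    /** Throws IllegalArgumentException naming the allowed values; returns normally otherwise. */
    static void validate(String prompt, String size, String quality, int n) {
        if (prompt == null || prompt.isBlank()) {
            throw new IllegalArgumentException("prompt must not be blank");
        }
        // Set.of(...) throws on null lookups, so test for null explicitly first.
        if (size == null || !SIZES.contains(size)) {
            throw new IllegalArgumentException(
                    "size must be one of 1024x1024, 1792x1024, 1024x1792; got: " + size);
        }
        if (quality == null || !QUALITIES.contains(quality)) {
            throw new IllegalArgumentException(
                    "quality must be one of standard, hd; got: " + quality);
        }
        if (n != 1) {
            throw new IllegalArgumentException("DALL-E 3 supports n=1 only; got: " + n);
        }
    }
}
```

Because the message lists the allowed values, an LLM that passed `"512x512"` has enough information to retry with a valid size without a round trip to the provider.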
Text-to-speech — TEXT_TO_SPEECH_TOOLS
| Method | Backend | Auth | Strength |
|---|---|---|---|
| elevenlabs_tts | ElevenLabs Multilingual v2 (4 model variants) | ELEVENLABS_API_KEY | Quality leader, voice cloning |
| cartesia_tts | Cartesia Sonic-2 | CARTESIA_API_KEY | Latency leader (~75 ms TTFB) |
| deepgram_tts | Deepgram Aura (12 voice presets) | DEEPGRAM_API_KEY | Cheapest of the three |
All three return the same envelope:
```json
{
  "provider": "elevenlabs",
  "model": "eleven_multilingual_v2",
  "audio_uri": "data:audio/mpeg;base64,SUQzBAAAAAAAJ...",
  "audio_bytes": 45920
}
```

audio_uri is ready for <audio src=…> playback or attachment to a message channel without a separate hosting step. The 2000-character per-call cap is enforced before the HTTP request. It is the most conservative of the three providers' limits, so the LLM gets a clean validation error instead of an upstream 4xx.
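When the audio needs to go to disk or to a player that does not accept data URIs, the payload can be decoded back to bytes. The sketch below is a minimal illustration; `AudioUriDecoder` is a hypothetical name, and the assumption that audio_bytes equals the decoded length is inferred from the field name rather than confirmed.

```java
import java.util.Base64;

/** Hypothetical helper: recovers raw audio bytes from a base64 data URI. */
final class AudioUriDecoder {
    private AudioUriDecoder() {}

    /** Decodes the payload of "data:<mime>;base64,<payload>"; rejects anything else. */
    static byte[] decode(String dataUri) {
        int comma = dataUri.indexOf(',');
        if (!dataUri.startsWith("data:") || comma < 0
                || !dataUri.substring(0, comma).endsWith(";base64")) {
            throw new IllegalArgumentException("Not a base64 data URI");
        }
        return Base64.getDecoder().decode(dataUri.substring(comma + 1));
    }
}
```

The resulting array can be written with `java.nio.file.Files.write(path, bytes)`. Comparing its length against audio_bytes is a cheap sanity check, under the assumption stated above.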
ElevenLabs defaults to the well-known "Rachel" voice (21m00Tcm4TlvDq8ikWAM); override via voiceId. Deepgram picks the voice through the model parameter directly (aura-asteria-en default, aura-orion-en for male voices, etc.).
Speech-to-text — SPEECH_TO_TEXT_TOOLS
| Method | Backend | Auth | Strength |
|---|---|---|---|
| deepgram_transcribe | Deepgram Nova-2 (sync) | DEEPGRAM_API_KEY | Fastest, cheapest |
| assemblyai_transcribe | AssemblyAI Universal-2 (async, polled) | ASSEMBLYAI_API_KEY | Best accuracy |
| replicate_whisper | OpenAI Whisper hosted on Replicate | REPLICATE_API_TOKEN | Same key as FLUX |
Uniform output shape:
```json
{
  "provider": "deepgram",
  "model": "nova-2",
  "text": "The full transcript...",
  "language": "en",
  "confidence": 0.97
}
```

All three accept MP3, WAV, M4A, OGG, FLAC, and (where the upstream supports it) WebM. The 50 MB shared cap is enforced before the upload; the OpenAI-side path in MEDIA_TOOLS#whisperTranscribe keeps its own 25 MB cap because the upstream rejects larger files.
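A pre-upload check along these lines keeps oversized or unsupported files from ever reaching the network. The sketch is an assumption-laden illustration: `AudioUploadCheck` is a hypothetical name, the cap is interpreted as 50 MiB, and the format is judged by file extension only, which may not match how the toolkit actually detects it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-upload check for speech-to-text input files. */
final class AudioUploadCheck {
    static final long MAX_BYTES = 50L * 1024 * 1024; // assumes "50 MB" means 50 MiB
    private static final Set<String> EXTENSIONS =
            Set.of("mp3", "wav", "m4a", "ogg", "flac", "webm");

    private AudioUploadCheck() {}

    /** Validates by name and size so the rule can be tested without a real file. */
    static void check(String fileName, long sizeBytes) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!EXTENSIONS.contains(ext)) {
            throw new IllegalArgumentException(
                    "Unsupported audio format '" + ext
                            + "'; allowed: mp3, wav, m4a, ogg, flac, webm");
        }
        if (sizeBytes > MAX_BYTES) {
            throw new IllegalArgumentException(
                    "File is " + sizeBytes + " bytes; the cap is " + MAX_BYTES + " bytes");
        }
    }

    /** Convenience overload that reads the size from disk. */
    static void check(Path file) throws IOException {
        check(file.getFileName().toString(), Files.size(file));
    }
}
```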
assemblyai_transcribe is internally upload → submit → poll; the loop has a 5-minute hard cap so a @Tool call returns deterministically. Longer transcripts should batch through AssemblyAI's webhook flow, which is out of scope for the synchronous tool surface.
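The poll step with a hard deadline can be sketched generically. `DeadlinePoller` below is a hypothetical utility, not the toolkit's code: the status check is abstracted as a supplier that returns an empty Optional while the job is still running, and the interval and cap are parameters rather than the tool's real values.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical utility: polls until a result appears or a hard cap elapses. */
final class DeadlinePoller {
    private DeadlinePoller() {}

    static <T> T poll(Supplier<Optional<T>> check, Duration interval, Duration hardCap)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + hardCap.toNanos();
        while (true) {
            Optional<T> result = check.get();
            if (result.isPresent()) {
                return result.get();
            }
            // Compare via subtraction so nanoTime wrap-around cannot skew the check.
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("No result within " + hardCap);
            }
            // Never sleep past the deadline; sleep at least 1 ms to avoid a busy loop.
            long sleepMillis = Math.min(interval.toMillis(),
                    Math.max(1L, remainingNanos / 1_000_000L));
            Thread.sleep(sleepMillis);
        }
    }
}
```

With `Duration.ofMinutes(5)` as the cap, the caller gets either a transcript or a TimeoutException in bounded time, which is the property the synchronous tool surface depends on.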
Environment variables
| Variable | Used by |
|---|---|
| OPENAI_API_KEY | dalle3_generate, plus MEDIA_TOOLS (whisper_transcribe, openai_tts) |
| REPLICATE_API_TOKEN | flux_generate, replicate_whisper (one key, two providers) |
| STABILITY_API_KEY | stability_generate |
| ELEVENLABS_API_KEY | elevenlabs_tts |
| CARTESIA_API_KEY | cartesia_tts |
| DEEPGRAM_API_KEY | deepgram_tts, deepgram_transcribe (one key, two methods) |
| ASSEMBLYAI_API_KEY | assemblyai_transcribe |
Every key is read on first call, not at agent build time — registering a toolkit with no keys present is fine; the IllegalStateException only fires when the agent actually invokes that specific method.
Cost notes
List pricing as of April 2026; providers may change it. Check each provider's current pricing page before relying on these numbers in production:
| Method | Approximate cost |
|---|---|
| dalle3_generate | $0.04 / image (standard 1024×1024); $0.08 (HD or wide) |
| flux_generate | ~$0.003 / image (schnell), ~$0.025 (dev), ~$0.055 (pro) |
| stability_generate | $0.03 / image (core), $0.035 (sd3), $0.08 (ultra) |
| elevenlabs_tts | ~$0.18 / 1k chars (Creator tier) |
| cartesia_tts | ~$0.030 / 1k chars |
| deepgram_tts | ~$0.015 / 1k chars |
| deepgram_transcribe | $0.0043 / minute |
| assemblyai_transcribe | $0.27 / hour (~$0.0045 / minute) |
| replicate_whisper | ~$0.0007 / minute (Nvidia T4) |
Cost is not yet emitted as a ToolEvent attribute — when that ships, multimodal spend will flow through the same Cost Governance layer as LLM calls and surface as tnsai_multimodal_cost_usd_total{provider, model, modality} in Prometheus.
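Until cost is emitted by the framework, a rough client-side estimate can be computed from the table. The sketch below hard-codes the three TTS rates listed above; `MultimodalCostEstimate` is a hypothetical class, and the figures are the approximate list prices from this page, not live pricing.

```java
import java.math.BigDecimal;
import java.util.Map;

/** Hypothetical estimator using the approximate per-1k-character TTS rates from the table. */
final class MultimodalCostEstimate {
    // USD per 1,000 characters; approximate list prices, subject to change.
    private static final Map<String, BigDecimal> TTS_USD_PER_1K_CHARS = Map.of(
            "elevenlabs_tts", new BigDecimal("0.18"),
            "cartesia_tts", new BigDecimal("0.030"),
            "deepgram_tts", new BigDecimal("0.015"));

    private MultimodalCostEstimate() {}

    /** Returns rate * characters / 1000 as an exact decimal. */
    static BigDecimal ttsCostUsd(String method, int characters) {
        BigDecimal rate = method == null ? null : TTS_USD_PER_1K_CHARS.get(method);
        if (rate == null) {
            throw new IllegalArgumentException("Unknown TTS method: " + method);
        }
        return rate.multiply(BigDecimal.valueOf(characters))
                .divide(BigDecimal.valueOf(1000));
    }
}
```

For a maximum-length request of 2000 characters, this works out to about $0.03 on deepgram_tts, $0.06 on cartesia_tts, and $0.36 on elevenlabs_tts, so the spread between the cheapest and the most expensive provider is roughly 12x.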
Implementation notes
- No vendor SDKs. Every method uses OkHttp + Jackson directly against the provider REST API. Keeps the tnsai-tools transitive graph lean and avoids version-conflict diamond problems with consumers who already pull the same SDK at a different version.
- Test seam via package-private constructors. Each toolkit has a test-only constructor that swaps the provider base URL for a MockWebServer and shares a single OkHttpClient. The public no-arg constructor (the only one BuiltInTool ever calls) wires the real upstream.
- Per-provider input validation. Each method checks size / model / aspect-ratio enums before the HTTP call so a bad LLM argument fails fast with a message naming the allowed values, not an opaque upstream 4xx.
Future work
- MediaStore SPI. A pluggable store (LocalFilesystemMediaStore, S3MediaStore, InMemoryMediaStore, ExpiringUrlPassthroughStore) so that DALL-E URLs can be persisted past their 1-hour TTL and TTS bytes can be served from a stable URL instead of inline data URIs.
- ToolEvent.cost_usd attribute. Once the tool-event hook system carries cost, multimodal spend flows into CostBudget enforcement and a tnsai_multimodal_cost_usd_total Prometheus metric.
- Async job pattern. Video generation, 8K image variants, and long-form audiobook TTS exceed the synchronous @Tool model. A future surface returns {job_id, status: "pending"} and lets the agent poll or subscribe via webhook.
- ContentModerationHook. Optional pre-call hook for local models (Stable Diffusion, Piper) that ship without server-side moderation. Hosted providers already enforce their own policy and surface violations as standard error codes.
- Additional providers. Imagen, Ideogram, Piper local TTS, and a Replicate-platform meta-tool were in the original scope; not shipped yet.
- Integration tests behind RUN_INTEGRATION_TESTS. The current test suite uses MockWebServer for canned responses; an opt-in env-flag tier would call live providers when API keys are present.
See also
- Catalog: full shipped toolkit inventory; multimodal entries to be added under media and image categories
- Audio & Speech: direct-call WhisperClient for non-agent transcription workflows (no @Tool wrapping)
- Cost Governance: budget layer that multimodal spend will join once ToolEvent.cost_usd is wired
- Custom Tools: write your own @Tool methods following the same function-shape POJO pattern