Three aggregator toolkits in tnsai-tools give an agent text→image, text→speech, and speech→text capability without the consumer having to write provider plumbing. Each toolkit is a function-shape POJO (RFC #188) that exposes one @Tool-annotated method per backend provider, so the LLM can pick a provider at call time based on quality, latency, or cost.

| Toolkit | `BuiltInTool` enum | Methods | Modality |
| --- | --- | --- | --- |
| `ImageGenTools` | `IMAGE_GEN_TOOLS` | `dalle3_generate`, `flux_generate`, `stability_generate` | Text → image |
| `TextToSpeechTools` | `TEXT_TO_SPEECH_TOOLS` | `elevenlabs_tts`, `cartesia_tts`, `deepgram_tts` | Text → speech |
| `SpeechToTextTools` | `SPEECH_TO_TEXT_TOOLS` | `deepgram_transcribe`, `assemblyai_transcribe`, `replicate_whisper` | Audio file → text |

These toolkits pair with the older OpenAI-only MEDIA_TOOLS (openai_tts, whisper_transcribe). Register both when you want OpenAI plus a non-OpenAI fallback in the same agent.

Quick start

Register the toolkits the same way as any other built-in tool. The framework instantiates the backing POJO, scans for @Tool methods, and exposes each as a separate function the LLM can call:

import com.tnsai.agents.AgentBuilder;
import com.tnsai.enums.BuiltInTool;
import com.tnsai.llm.providers.OpenAIClient;

Agent artist = AgentBuilder.create()
    .llm(new OpenAIClient("gpt-4o"))
    .role(myRole)
    .builtInTools(
        BuiltInTool.IMAGE_GEN_TOOLS,        // dalle3_generate, flux_generate, stability_generate
        BuiltInTool.TEXT_TO_SPEECH_TOOLS,   // elevenlabs_tts, cartesia_tts, deepgram_tts
        BuiltInTool.SPEECH_TO_TEXT_TOOLS    // deepgram_transcribe, assemblyai_transcribe, replicate_whisper
    )
    .build();

String response = artist.chat("Draw me a watercolor painting of a giraffe on Mars");
// LLM picks dalle3_generate, calls it; the agent reply embeds the URL.

Each @Tool reads its API key from a process environment variable on first call. Missing keys throw IllegalStateException with the exact variable name in the message.

Image generation — `IMAGE_GEN_TOOLS`

| Method | Backend | Auth | Output shape |
| --- | --- | --- | --- |
| `dalle3_generate` | OpenAI DALL-E 3 | `OPENAI_API_KEY` | `{provider, model, urls[], revised_prompt?}` |
| `flux_generate` | Black Forest Labs FLUX (via Replicate sync) | `REPLICATE_API_TOKEN` | `{provider, model, urls[]}` |
| `stability_generate` | Stability AI Stable Image v2 | `STABILITY_API_KEY` | `{provider, model, urls[], finish_reason?}` |

dalle3_generate and flux_generate return provider-hosted URLs. DALL-E URLs expire roughly an hour after generation — fetch and restage if you need persistence. FLUX URLs live for the duration documented by Replicate.

stability_generate returns image bytes, which the tool encodes inline as a data:image/png;base64,… data URI. The same pattern applies to every TTS method below — uniform shape so the agent doesn't need per-provider plumbing.
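The inline encoding described above can be sketched with the JDK alone. This is a minimal illustration rather than the toolkit's actual code: `DataUriSketch` and `toDataUri` are hypothetical names, and the MIME type is supplied by the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of tnsai-tools: illustrates wrapping
// provider-returned bytes in a data URI so no hosting step is needed.
final class DataUriSketch {

    /** Builds "data:<mimeType>;base64,<payload>" from raw bytes. */
    static String toDataUri(String mimeType, byte[] bytes) {
        String base64 = Base64.getEncoder().encodeToString(bytes);
        return "data:" + mimeType + ";base64," + base64;
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real call would pass the PNG body from the provider.
        byte[] fakeImage = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toDataUri("image/png", fakeImage));
    }
}
```

Base64 inflates the payload by roughly a third, which is one reason a hosted URL may still be preferable for large assets.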

// Each method validates parameters before the HTTP call so a bad LLM
// argument fails fast with a clear message instead of a provider 4xx.
String json = new ImageGenTools().dalle3Generate(
    "A watercolor giraffe on Mars",
    "1024x1024",      // 1024x1024 | 1792x1024 | 1024x1792
    "hd",             // standard | hd
    1                 // DALL-E 3 supports n=1 only; tool rejects n>1 upfront
);

Defaults are provider-appropriate: DALL-E 1024x1024 standard, FLUX schnell model with 1:1 aspect ratio in webp, Stability core model with 1:1 aspect ratio.
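To illustrate the fail-fast validation, the sketch below checks the DALL-E 3 arguments listed above before any request would be sent. `Dalle3ArgsSketch` is a hypothetical class; the choice of `IllegalArgumentException` and the message wording are assumptions, since the source does not say which exception the real tool throws.

```java
import java.util.Set;

// Hypothetical validator, not the toolkit's code: rejects bad LLM-supplied
// arguments locally and names the allowed values in the error message.
final class Dalle3ArgsSketch {
    static final Set<String> SIZES = Set.of("1024x1024", "1792x1024", "1024x1792");
    static final Set<String> QUALITIES = Set.of("standard", "hd");

    static void validate(String prompt, String size, String quality, int n) {
        if (prompt == null || prompt.isBlank()) {
            throw new IllegalArgumentException("prompt must not be blank");
        }
        if (!SIZES.contains(size)) {
            throw new IllegalArgumentException(
                "size must be one of 1024x1024, 1792x1024, 1024x1792; got: " + size);
        }
        if (!QUALITIES.contains(quality)) {
            throw new IllegalArgumentException(
                "quality must be one of standard, hd; got: " + quality);
        }
        if (n != 1) {
            throw new IllegalArgumentException("DALL-E 3 supports n=1 only; got: " + n);
        }
    }

    public static void main(String[] args) {
        validate("A watercolor giraffe on Mars", "1024x1024", "hd", 1); // passes
        try {
            validate("A watercolor giraffe on Mars", "512x512", "hd", 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```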

Text-to-speech — `TEXT_TO_SPEECH_TOOLS`

| Method | Backend | Auth | Strength |
| --- | --- | --- | --- |
| `elevenlabs_tts` | ElevenLabs Multilingual v2 (4 model variants) | `ELEVENLABS_API_KEY` | Quality leader, voice cloning |
| `cartesia_tts` | Cartesia Sonic-2 | `CARTESIA_API_KEY` | Latency leader (~75 ms TTFB) |
| `deepgram_tts` | Deepgram Aura (12 voice presets) | `DEEPGRAM_API_KEY` | Cheapest of the three |

All three return the same envelope:

{
  "provider": "elevenlabs",
  "model": "eleven_multilingual_v2",
  "audio_uri": "data:audio/mpeg;base64,SUQzBAAAAAAAJ...",
  "audio_bytes": 45920
}

audio_uri is ready for <audio src=…> playback or attachment to a message channel without a separate hosting step. The 2000-character per-call cap is enforced before the HTTP request — it's the most conservative of the three providers' limits, so the LLM gets a clean validation error instead of an upstream 4xx.
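When a consumer needs the raw audio rather than a playable URI (for example, to write an MP3 file), the audio_uri value can be decoded back to bytes. The following is a minimal sketch under the assumption that the value always has the `data:<mime>;base64,<payload>` form shown above; `AudioUriSketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical consumer-side helper: splits a base64 data URI into its
// MIME type and decoded payload.
final class AudioUriSketch {

    /** Returns the MIME type, e.g. "audio/mpeg". */
    static String mimeType(String dataUri) {
        return header(dataUri).substring("data:".length(), header(dataUri).length() - ";base64".length());
    }

    /** Returns the decoded audio bytes. */
    static byte[] decode(String dataUri) {
        String payload = dataUri.substring(header(dataUri).length() + 1);
        return Base64.getDecoder().decode(payload);
    }

    // Everything before the first comma, validated as "data:...;base64".
    private static String header(String dataUri) {
        int comma = dataUri.indexOf(',');
        if (comma < 0 || !dataUri.startsWith("data:")
                || !dataUri.substring(0, comma).endsWith(";base64")) {
            throw new IllegalArgumentException("not a base64 data URI");
        }
        return dataUri.substring(0, comma);
    }

    public static void main(String[] args) {
        String uri = "data:audio/mpeg;base64,SUQz"; // "SUQz" is the base64 of the ID3 tag marker
        System.out.println(mimeType(uri) + ", " + decode(uri).length + " bytes, starts with "
            + new String(decode(uri), StandardCharsets.US_ASCII));
    }
}
```

The decoded bytes can then be passed to `java.nio.file.Files.write` or streamed to a client as needed.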

ElevenLabs defaults to the well-known "Rachel" voice (21m00Tcm4TlvDq8ikWAM); override via voiceId. Deepgram picks the voice through the model parameter directly (aura-asteria-en default, aura-orion-en for male voices, etc.).

Speech-to-text — `SPEECH_TO_TEXT_TOOLS`

| Method | Backend | Auth | Strength |
| --- | --- | --- | --- |
| `deepgram_transcribe` | Deepgram Nova-2 (sync) | `DEEPGRAM_API_KEY` | Fastest, cheapest |
| `assemblyai_transcribe` | AssemblyAI Universal-2 (async, polled) | `ASSEMBLYAI_API_KEY` | Best accuracy |
| `replicate_whisper` | OpenAI Whisper hosted on Replicate | `REPLICATE_API_TOKEN` | Same key as FLUX |

Uniform output shape:

{
  "provider": "deepgram",
  "model": "nova-2",
  "text": "The full transcript...",
  "language": "en",
  "confidence": 0.97
}

All three accept MP3, WAV, M4A, OGG, FLAC, and (where the upstream supports it) WebM. The 50 MB shared cap is enforced before the upload; the OpenAI-side path in MEDIA_TOOLS#whisperTranscribe keeps its own 25 MB cap because the upstream rejects larger files.
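A pre-upload check along these lines might look like the sketch below. It is illustrative only: `AudioPrecheckSketch` is a hypothetical name, format detection by file extension is an assumption, and "50 MB" is interpreted here as 50 × 1024 × 1024 bytes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-upload guard: rejects unsupported or oversized audio
// before any bytes are sent to a provider.
final class AudioPrecheckSketch {
    static final long MAX_BYTES = 50L * 1024 * 1024; // assumed meaning of the 50 MB cap
    static final Set<String> EXTENSIONS = Set.of("mp3", "wav", "m4a", "ogg", "flac", "webm");

    static void check(String fileName, long sizeBytes) {
        String name = fileName.toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1);
        if (!EXTENSIONS.contains(ext)) {
            throw new IllegalArgumentException(
                "unsupported audio format '" + ext + "'; allowed: mp3, wav, m4a, ogg, flac, webm");
        }
        if (sizeBytes > MAX_BYTES) {
            throw new IllegalArgumentException(
                "file is " + sizeBytes + " bytes; the cap is " + MAX_BYTES + " bytes");
        }
    }

    public static void main(String[] args) {
        check("interview.MP3", 12_000_000L); // passes: extension match is case-insensitive
        try {
            check("interview.wav", MAX_BYTES + 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In practice the size would come from `java.nio.file.Files.size(path)` on the file the LLM passed in.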

Internally, assemblyai_transcribe runs upload → submit → poll; the polling loop has a 5-minute hard cap so that a @Tool call always returns within a bounded time. Jobs that need longer should go through AssemblyAI's webhook flow, which is out of scope for the synchronous tool surface.
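The bounded polling step can be sketched generically as follows. This is not the toolkit's implementation: `PollSketch`, the `Supplier`-based status check, and the use of `TimeoutException` are all assumptions made for illustration.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical poll loop with a hard deadline, so the caller always gets
// either a result or a timeout within a bounded time.
final class PollSketch {

    static <T> T pollUntil(Supplier<Optional<T>> check, Duration interval, Duration hardCap)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + hardCap.toNanos();
        while (true) {
            Optional<T> result = check.get();   // e.g. one "get transcript status" request
            if (result.isPresent()) {
                return result.get();
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("no result within " + hardCap);
            }
            // Never sleep past the deadline.
            long sleepMillis = Math.min(interval.toMillis(), Math.max(1, remainingNanos / 1_000_000));
            Thread.sleep(sleepMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated job that completes on the third status check.
        String status = PollSketch.<String>pollUntil(
            () -> ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("completed"),
            Duration.ofMillis(10), Duration.ofMinutes(5));
        System.out.println(status + " after " + calls[0] + " checks");
    }
}
```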

Environment variables

| Variable | Used by |
| --- | --- |
| `OPENAI_API_KEY` | `dalle3_generate`, plus `MEDIA_TOOLS` (`whisper_transcribe`, `openai_tts`) |
| `REPLICATE_API_TOKEN` | `flux_generate`, `replicate_whisper` (one token, two Replicate-hosted models) |
| `STABILITY_API_KEY` | `stability_generate` |
| `ELEVENLABS_API_KEY` | `elevenlabs_tts` |
| `CARTESIA_API_KEY` | `cartesia_tts` |
| `DEEPGRAM_API_KEY` | `deepgram_tts`, `deepgram_transcribe` (one key, two methods) |
| `ASSEMBLYAI_API_KEY` | `assemblyai_transcribe` |

Every key is read on first call, not at agent build time — registering a toolkit with no keys present is fine; the IllegalStateException only fires when the agent actually invokes that specific method.
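The lazy lookup can be pictured with a small holder class. This is a sketch of the behavior described, not the toolkit's code: `LazyApiKeySketch` is a hypothetical name, the injectable lookup function exists only to make the example testable, and the exception message wording is assumed.

```java
import java.util.function.Function;

// Hypothetical lazy key holder: construction never touches the environment,
// so registering a toolkit without keys succeeds; the first get() resolves it.
final class LazyApiKeySketch {
    private final String variable;
    private final Function<String, String> env;
    private String cached;

    LazyApiKeySketch(String variable) {
        this(variable, System::getenv);
    }

    LazyApiKeySketch(String variable, Function<String, String> env) {
        this.variable = variable;
        this.env = env;
    }

    synchronized String get() {
        if (cached == null) {
            String value = env.apply(variable);
            if (value == null || value.isBlank()) {
                throw new IllegalStateException("Missing environment variable " + variable);
            }
            cached = value;
        }
        return cached;
    }

    public static void main(String[] args) {
        LazyApiKeySketch key = new LazyApiKeySketch("DEEPGRAM_API_KEY"); // no lookup yet
        try {
            System.out.println("key length: " + key.get().length());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```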

Cost notes

April 2026 list pricing — providers may change. Check each provider's current pricing page before relying on these numbers in production:

| Method | Approximate cost |
| --- | --- |
| `dalle3_generate` | $0.04 / image (standard 1024×1024); $0.08 (HD or wide) |
| `flux_generate` | ~$0.003 / image (`schnell`), ~$0.025 (`dev`), ~$0.055 (`pro`) |
| `stability_generate` | $0.03 / image (`core`), $0.035 (`sd3`), $0.08 (`ultra`) |
| `elevenlabs_tts` | ~$0.18 / 1k chars (Creator tier) |
| `cartesia_tts` | ~$0.030 / 1k chars |
| `deepgram_tts` | ~$0.015 / 1k chars |
| `deepgram_transcribe` | $0.0043 / minute |
| `assemblyai_transcribe` | $0.27 / hour (~$0.0045 / minute) |
| `replicate_whisper` | ~$0.0007 / minute (Nvidia T4) |

Cost is not yet emitted as a ToolEvent attribute — when that ships, multimodal spend will flow through the same Cost Governance layer as LLM calls and surface as tnsai_multimodal_cost_usd_total{provider, model, modality} in Prometheus.
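Until cost is emitted automatically, a rough estimate can be computed by hand from the table. The sketch below is only arithmetic on the approximate rates listed above; `CostEstimateSketch` is a hypothetical helper and the rates are passed in rather than hard-coded, since providers may change them.

```java
import java.util.Locale;

// Hypothetical estimator: multiplies usage by the approximate list rates
// from the cost table. Results are estimates, not billing figures.
final class CostEstimateSketch {

    /** TTS is priced per 1,000 input characters. */
    static double ttsCostUsd(int characters, double usdPer1kChars) {
        return characters / 1000.0 * usdPer1kChars;
    }

    /** STT is priced per minute of audio. */
    static double sttCostUsd(double audioMinutes, double usdPerMinute) {
        return audioMinutes * usdPerMinute;
    }

    public static void main(String[] args) {
        int chars = 2000; // the per-call TTS cap
        System.out.printf(Locale.ROOT, "elevenlabs_tts: ~$%.3f%n", ttsCostUsd(chars, 0.18));
        System.out.printf(Locale.ROOT, "cartesia_tts:   ~$%.3f%n", ttsCostUsd(chars, 0.030));
        System.out.printf(Locale.ROOT, "deepgram_tts:   ~$%.3f%n", ttsCostUsd(chars, 0.015));
        System.out.printf(Locale.ROOT, "deepgram_transcribe, 10 min: ~$%.4f%n", sttCostUsd(10, 0.0043));
    }
}
```

At the 2000-character cap, a single call works out to roughly $0.36 on ElevenLabs, $0.06 on Cartesia, and $0.03 on Deepgram at the listed rates.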

Implementation notes

No vendor SDKs. Every method uses OkHttp + Jackson directly against the provider REST API. This keeps the tnsai-tools transitive dependency graph lean and avoids diamond-dependency version conflicts with consumers who already pull the same SDK at a different version.
Test seam via package-private constructors. Each toolkit has a test-only constructor that swaps the provider base URL for a MockWebServer and shares a single OkHttpClient. The public no-arg constructor (the only one BuiltInTool ever calls) wires the real upstream.
Per-provider input validation. Each method checks size / model / aspect-ratio enums before the HTTP call so a bad LLM argument fails fast with a message naming the allowed values, not an opaque upstream 4xx.
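The test-seam pattern from the notes above can be sketched in isolation. Everything here is illustrative: `ToolkitSeamSketch`, the placeholder base URL, and the `endpoint` helper are hypothetical, and the HTTP client is omitted so the example stays JDK-only.

```java
import java.net.URI;

// Hypothetical toolkit skeleton showing the constructor seam. In a real
// toolkit the class would be public, so only the no-arg constructor is
// reachable from outside the package.
final class ToolkitSeamSketch {
    // Placeholder, not a real provider endpoint.
    static final String REAL_BASE_URL = "https://api.example.com/";

    private final URI baseUrl;

    /** Production path: wires the real upstream. */
    public ToolkitSeamSketch() {
        this(REAL_BASE_URL);
    }

    /** Test seam (package-private): tests pass a local mock server URL. */
    ToolkitSeamSketch(String baseUrl) {
        this.baseUrl = URI.create(baseUrl);
    }

    /** Resolves a relative API path against whichever base URL was wired. */
    URI endpoint(String relativePath) {
        return baseUrl.resolve(relativePath);
    }

    public static void main(String[] args) {
        System.out.println(new ToolkitSeamSketch().endpoint("v1/generate"));
        System.out.println(new ToolkitSeamSketch("http://localhost:8080/").endpoint("v1/generate"));
    }
}
```

A test in the same package would construct the toolkit with the mock server's URL and assert on the recorded request, leaving the production constructor untouched.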

Future work

MediaStore SPI. A pluggable store (LocalFilesystemMediaStore, S3MediaStore, InMemoryMediaStore, ExpiringUrlPassthroughStore) so that DALL-E URLs can be persisted past their 1-hour TTL and TTS bytes can be served from a stable URL instead of inline data URIs.
ToolEvent.cost_usd attribute. Once the tool-event hook system carries cost, multimodal spend flows into CostBudget enforcement and a tnsai_multimodal_cost_usd_total Prometheus metric.
Async job pattern. Video generation, 8K image variants, and long-form audiobook TTS exceed the synchronous @Tool model. A future surface returns {job_id, status: "pending"} and lets the agent poll or subscribe via webhook.
ContentModerationHook. Optional pre-call hook for local models (Stable Diffusion, Piper) that ship without server-side moderation. Hosted providers already enforce their own policy and surface violations as standard error codes.
Additional providers. Imagen, Ideogram, Piper local TTS, and a Replicate-platform meta-tool were in the original scope; not shipped yet.
Integration tests behind RUN_INTEGRATION_TESTS. The current test suite uses MockWebServer for canned responses; an opt-in env-flag tier would call live providers when API keys are present.

Multimodal Tools