Google’s Gemini audio models get sharper voice agents, live speech translation
Gemini 2.5 Flash Native Audio improves function calling, instruction following and multi‑turn dialogue. A new live speech translation beta starts in the Google Translate app, including for Android users in India.
Google has upgraded its Gemini audio stack to make voice interactions more natural and reliable, introducing an updated Gemini 2.5 Flash Native Audio for live voice agents and a live speech translation beta in the Google Translate app.
The company said the model update enhances complex task handling, while the translation feature preserves a speaker’s intonation and pacing, and is rolling out on Android in the United States, Mexico and India from today. iOS support and more regions will follow, according to the company.
What is new in Gemini 2.5 Flash Native Audio
Google says the refreshed model strengthens three areas that matter for real‑time agents. It improves function calling so the system can reliably decide when to fetch live data and blend it back into the response.
It tightens instruction adherence, with the company reporting a 90 percent adherence rate, up from 84 percent. It also smooths multi‑turn conversations by retrieving context from earlier turns more effectively.
On the ComplexFuncBench Audio evaluation, the model leads with a score of 71.5 percent, according to Google’s own benchmarks.
The update is available to developers through Google AI Studio and Vertex AI, and is beginning to roll out in consumer experiences such as Gemini Live and Search Live, bringing native audio to Search Live for the first time, the company noted.
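For readers unfamiliar with function calling, the pattern described above can be sketched in a few lines of Python. This is an illustration only and not Google's implementation or the Gemini API: the `get_weather` tool, the `TOOLS` registry and the shape of the `turn` dictionary are all hypothetical, and the tool returns canned data rather than a live lookup.

```python
from typing import Any, Callable, Dict


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical tool: stands in for a live data lookup with canned data."""
    return {"city": city, "temp_c": 31}


# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"get_weather": get_weather}


def handle_model_turn(turn: Dict[str, Any]) -> str:
    """Run the tool the model asked for, if any, and fold the result into a reply.

    `turn` is an assumed structure: either {"text": ...} for a direct answer,
    or {"function_call": {"name": ..., "args": {...}}} for a tool request.
    """
    call = turn.get("function_call")
    if call is None:
        # The model answered directly, so no live data is needed.
        return turn["text"]

    tool = TOOLS.get(call["name"])
    if tool is None:
        # The model asked for a tool this agent does not have.
        return f"Sorry, I cannot use the tool '{call['name']}'."

    result = tool(**call["args"])
    # A real agent would usually hand the result back to the model to phrase
    # the reply; this sketch formats the sentence itself.
    return f"It is {result['temp_c']} degrees Celsius in {result['city']}."


if __name__ == "__main__":
    print(handle_model_turn({"text": "Hello there."}))
    print(handle_model_turn(
        {"function_call": {"name": "get_weather", "args": {"city": "Pune"}}}
    ))
```

A failed hand‑off in this sketch is the branch where the requested tool is missing; better function calling means the model picks a valid tool and valid arguments more often.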
Live speech translation comes to Translate
Alongside voice agents, Google is piloting streaming speech‑to‑speech translation for headphones. The beta supports continuous listening as well as two‑way conversations, automatically switching the output language based on who is speaking.
Google says it covers more than 70 languages across 2,000 language pairs, adds auto‑detection and noise robustness, and retains style features such as pitch and pacing.
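The two‑way switching described above can be pictured with a small Python sketch. It assumes a language detector has already labelled each utterance with a language code; the function name and the codes used are hypothetical, and Google has not published how its own system makes this decision.

```python
from typing import Tuple


def pick_output_language(detected: str, pair: Tuple[str, str]) -> str:
    """Return the other language of a two-way conversation pair.

    `detected` is the language code an assumed upstream detector assigned
    to the current utterance; `pair` holds the two conversation languages.
    """
    first, second = pair
    if detected == first:
        return second
    if detected == second:
        return first
    # The speaker used a language outside the configured pair.
    raise ValueError(f"'{detected}' is not part of the pair {pair}")


if __name__ == "__main__":
    conversation = ("en", "hi")
    # An English speaker is translated into Hindi, and vice versa.
    print(pick_output_language("en", conversation))
    print(pick_output_language("hi", conversation))
```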
How does live speech translation preserve the speaker’s voice?
According to Google, the system combines Gemini’s native audio understanding with multilingual modelling to capture intonation, pacing and pitch, then renders the translated output with style transfer so the translated voice sounds closer to the original speaker.
This is designed to make cross‑language conversations feel less robotic, particularly in noisy or outdoor settings, where the model also applies noise filtering.
Availability for builders
- Gemini 2.5 Flash Native Audio is generally available on Vertex AI and in preview via the Gemini API, and can be tried in Google AI Studio.
- Gemini 2.5 Flash and 2.5 Pro text‑to‑speech models are available via the Gemini API in Google AI Studio.
- The live speech translation beta is rolling out on Android via the Translate app, with iOS and more regions slated to follow.
For developers and enterprises, sharper function calling and instruction following can reduce failed hand‑offs to tools and shorten long agent scripts, potentially lowering support costs and improving customer satisfaction.
Native audio that better maintains context can also make brainstorming, tutoring and customer service scenarios feel more human, industry watchers said. The India launch of live translation on Android, if it works reliably in noisy environments, could be especially useful for travel, retail counters and public‑facing services.
The updates also fit within Google’s wider push to embed Gemini across products this year.


