Google rolls out latest AI models at its flagship developer conference I/O

Just a day after OpenAI released its powerful new language model, Google paraded its own advances in vision, language, and multimodal AI capabilities at its developer conference.

Wednesday May 15, 2024, 6 min read

Google on Tuesday unveiled a barrage of multimodal AI models and integrations at its annual I/O developer conference.

Just a day after OpenAI released its powerful new language model, Google chief executive officer (CEO) Sundar Pichai made several big-ticket announcements focused on artificial intelligence (AI) at the search giant's I/O developer event in California. Here are some of the key announcements.

Google will do the Googling for you

Google is doubling down on AI for Search, with AI Overviews becoming the default experience for millions of users in the US.

AI Overviews, powered by customised Gemini models, use generative AI to synthesise information and provide concise summaries and key website links for queries. However, there are ongoing concerns about incorrect or plagiarised information in AI-generated overviews, issues already seen in the Nvidia-backed AI search engine Perplexity.

Google is leaning on Gemini's multi-step reasoning capabilities to handle more complex searches with nuanced criteria, such as finding yoga studios based on location, popularity, and membership offers. New AI-powered planning tools will suggest meal plans drawn from web recipes, linking back to publisher sites when users want full details.

The tension between enhancing search with AI while driving traffic and visibility for web publishers will be something to watch as these AI-powered features roll out at scale. User trust in the AI outputs will also be a key factor for adoption.

Google is also upgrading its visual search, allowing users to get troubleshooting help by uploading videos of malfunctioning devices without needing to describe intricate details in text. Users can also add a document for more context-specific search in the Gemini app. Crucially, these features are, for now, exclusive to Android.
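The consumer feature has no public API of its own, but the underlying capability, asking questions about an uploaded video, is already reachable through the Gemini API. A minimal sketch using the google-generativeai Python SDK, where the file name, API key, and prompt are placeholders:

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the clip; video files are processed asynchronously
video = genai.upload_file(path="broken_record_player.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    [video, "What might be wrong with the device in this video?"]
)
print(response.text)
```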

Gemini gets an upgrade

Google unveiled a series of updates to its Gemini family of multimodal AI models at the I/O conference. Gemini 1.5 Flash, the newest addition to the Gemini series, can quickly summarise conversations, caption images and videos, and extract data from large documents and tables.
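These capabilities are exposed to developers through the Gemini API. A minimal summarisation sketch with the google-generativeai Python SDK, where the API key and transcript file are placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Placeholder conversation log
transcript = open("support_chat.txt").read()

response = model.generate_content(
    "Summarise this conversation in three bullet points:\n\n" + transcript
)
print(response.text)
```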

The tech giant also announced an improved Gemini 1.5 Pro, with better code generation, reasoning, conversation, and audio/image understanding. Its context window now extends to 2 million tokens. Additionally, the Nano model now understands multimodal inputs beyond just text. The context window is the amount of text a model can consider at once, so it determines how much the model can recall during a session; a longer window lets LLMs stay coherent and contextually relevant across long passages and documents.
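A quick way to check whether a document fits in that window is the SDK's token counter; a sketch, assuming a local text file and that the 2-million-token window is enabled for the account:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Placeholder document
document = open("long_report.txt").read()

count = model.count_tokens(document).total_tokens
print(f"{count:,} tokens of a 2,000,000-token window")
```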

Looking ahead, Google previewed Gemma 2, the next generation of its open models built on the same architecture as Gemini. Google said that at 27 billion parameters, Gemma 2 delivers performance comparable to Meta's Llama 3 70B at less than half the size.

Google Photos is also getting a major AI upgrade with "Ask Photos", a new experimental feature powered by Gemini models. It allows users to find specific memories or information by asking natural language questions like "Show me photos from national parks I've visited." Gemini's multimodal capabilities understand photo contents to provide relevant results and can even create trip highlights with captions.

More Gemini integrations

Google is infusing more AI capabilities from its Gemini models into Workspace apps like Gmail, Docs, Sheets, and Drive. The highlight is Gemini 1.5 Pro coming to the sidebar across these apps, allowing users to get AI-powered assistance while working across different files and data.

Gemini can now summarise entire email threads and suggest contextual smart replies based on the conversation. A new Gmail Q&A feature lets you query Gemini to find specific information buried across your inbox and Drive files.

Beyond email, Gemini's multi-app access lets you ask it to compile information from disparate sources like expense receipts across emails and documents into organised sheets and folders.

Google is even prototyping a "Gemini-powered teammate" called Chip that has its own Google account to live within chats and collaboratively assemble outlines or track projects using information across your Workspace data.

Google is also launching customisable AI assistants called Gems, powered by Gemini, that can be tuned for specific tasks and personalities. Users will be able to create their own Gems, similar to OpenAI's custom GPTs, by providing instructions to Gemini. Gems are coming soon for Gemini Advanced users.
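Gems themselves are a consumer feature, but the underlying idea, a model primed with a persistent set of instructions, can be approximated with the Gemini API's system_instruction parameter. A sketch, with an illustrative persona:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# A persistent persona, conceptually similar to a Gem
coach = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction=(
        "You are a patient running coach. Keep answers brief and "
        "always end with one actionable tip."
    ),
)

print(coach.generate_content("How do I avoid shin splints?").text)
```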

As Google infuses more AI assistance directly into its productivity software, it aims to reduce app switching while harnessing Gemini's reasoning abilities tuned specifically to each user's data and workflows.

Gen AI: text-to-video and images

Google also unveiled Veo, its most advanced video generation model to date. Veo can generate high-quality 1080p videos lasting over a minute, across a wide range of cinematic and visual styles.

Veo's language understanding capabilities allow it to capture the nuances and tones specified in prompts, while enabling cinematic effects like time-lapses or aerial shots. It supports masked video editing by regenerating specific areas based on additional prompts. Veo can also condition its video generations based on reference images provided alongside text descriptions.

Some of its features will debut through an experimental "VideoFX" tool, with select creator access initially. Future integration into products like YouTube Shorts is also planned as Google explores creative applications of this powerful video synthesis capability.

Google also unveiled Imagen 3, a high-quality text-to-image model capable of generating detailed images with richer lighting and fewer artifacts. Google said the model has an improved understanding of natural language and can render a wide range of visual styles and the nuanced details spelled out in longer prompts.

Imagen 3 will be available in multiple versions optimised for different tasks--from quick sketches to high-resolution images. It is being launched as a private preview for select creators within Google's ImageFX tool, with upcoming integration into Vertex AI.

Watermarking Gen AI outputs

Google is expanding its SynthID digital watermarking toolkit to cover AI-generated text from its Gemini app/web experience, and videos from its new Veo generative video model. SynthID embeds imperceptible watermarks during the content generation process itself.

For text, it modulates the probability scores of tokens (words/phrases) being generated, creating a unique pattern that can be detected as AI-generated. For video, it watermarks every individual frame by modifying pixel values invisibly to the human eye.
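Google has not published SynthID's exact scheme, but the general idea behind probability-modulation watermarks, as in academic "green list" approaches, can be sketched in a few lines: deterministically favour a context-keyed subset of tokens during generation, then detect by measuring how often tokens land in that subset. A toy illustration, not Google's algorithm:

```python
import hashlib
import random

def green_list(prev_token, vocab, fraction=0.5):
    # Key a reproducible split of the vocabulary to the previous token
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    shuffled = list(vocab)
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def sample_watermarked(probs, prev_token, boost=2.0):
    # Nudge the probabilities of "green" tokens upward before sampling
    greens = green_list(prev_token, list(probs))
    weights = {t: p * (boost if t in greens else 1.0) for t, p in probs.items()}
    return random.choices(list(weights), weights=list(weights.values()))[0]

def green_fraction(tokens, vocab):
    # Detection: watermarked text scores well above the ~50% chance level
    hits = sum(t in green_list(prev, vocab) for prev, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```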

While not a complete solution, SynthID watermarking aims to help identify AI-generated content at scale and prevent malicious misuse like spreading misinformation. Google plans to open-source the text watermarking technique this summer through its Responsible Generative AI Toolkit.

New cloud hardware

The company also unveiled its sixth-generation Tensor Processing Units (TPUs) called Trillium, designed to power the next wave of advanced AI models and applications. Trillium TPUs deliver a 4.7x increase in peak compute performance per chip compared to the previous TPU v5e, along with doubled high-bandwidth memory capacity and interconnect bandwidth.

This leap in performance and efficiency enables faster training of foundation models like Gemini 1.5 Flash and Imagen 3, while serving them with lower latency and at lower cost. Critically, Trillium TPUs are Google's most energy-efficient yet, at over 67% more power-efficient than TPU v5e.
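A back-of-envelope reading of those two figures, treating "over 67% more power-efficient" as roughly 1.67x performance per watt (an assumption; Google did not publish absolute power numbers):

```python
perf_ratio = 4.7       # peak compute per chip vs TPU v5e (stated)
perf_per_watt = 1.67   # efficiency vs TPU v5e (stated as "over 67%")

# Implied per-chip power draw relative to TPU v5e
power_ratio = perf_ratio / perf_per_watt
print(f"~{power_ratio:.1f}x the per-chip power of TPU v5e")  # ~2.8x
```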


Edited by Megha Reddy