Google announced significant advancements in artificial intelligence at I/O 2024. The company is shaping the future of AI with new additions to the Gemini model family, faster and more efficient models, generative media tools, innovative search experiences and Trillium, the 6th generation of Google Cloud TPU.
Gemini is now faster and smarter!
Google DeepMind CEO Demis Hassabis announced updates to the Gemini model family. Gemini 1.0, the first natively multimodal model, launched in December 2023 in three sizes (Ultra, Pro and Nano) and was soon followed by 1.5 Pro, which improved performance and introduced a context window of 1 million tokens.
Developers and enterprise customers have found 1.5 Pro’s long context window, multimodal reasoning capabilities and impressive overall performance very useful. In response to feedback that some applications need lower latency and a lower cost to serve, Google has added a new member to the Gemini family: 1.5 Flash.
Gemini 1.5 Flash
Optimized for speed and efficiency, this lightweight model is a cost-effective choice for high-volume, high-frequency tasks. With a 1 million token context window, 1.5 Flash excels at tasks such as summarization, chat applications, image and video captioning, and data extraction from long documents and tables. It was trained through “distillation”, in which the core knowledge and skills of the larger 1.5 Pro model are transferred to a smaller, more efficient model.
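The “distillation” mentioned above is a standard machine-learning technique: a smaller “student” model is trained to match the softened output distribution of a larger “teacher” model. The sketch below illustrates the core idea only; the logits, temperature and loss are toy values, not anything from Google’s actual training setup.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature "softens" the distribution, exposing more of
    # the teacher's relative preferences between classes.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how far the student's distribution q is from the
    # teacher's distribution p; this is the quantity distillation minimizes.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [2.0, 1.0, 0.1]   # illustrative teacher outputs
student_logits = [1.8, 1.1, 0.2]   # illustrative student outputs
T = 2.0

teacher_probs = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)
loss = kl_divergence(teacher_probs, student_probs)
print(loss)
```

During training, the student's weights would be updated to drive this loss toward zero, so it mimics the teacher's behavior at a fraction of the size.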
Gemini 1.5 Pro
Google has also significantly improved 1.5 Pro, its best model for overall performance. The context window has been expanded to 2 million tokens. Data and algorithmic improvements have enhanced code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding.
1.5 Pro can now follow increasingly complex and nuanced instructions, including those that specify product-level behavior such as role, format and style. Control over the model’s responses has been improved for specific use cases, such as crafting the persona and response style of a chat agent or automating workflows through multiple function calls. Users can steer model behavior by setting system instructions.
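To make the system-instructions idea concrete, here is a minimal sketch of what a request to the Gemini API might look like with a system instruction attached. The field names follow Google’s publicly documented REST request shape, but the instruction text and the helper function are invented for illustration; consult the current API docs before relying on this.

```python
import json

def build_request(system_text: str, user_text: str) -> dict:
    # Assemble a request body: the system instruction sets persona and
    # response style, while "contents" carries the user's actual prompt.
    return {
        "system_instruction": {"parts": [{"text": system_text}]},
        "contents": [{"role": "user", "parts": [{"text": user_text}]}],
    }

payload = build_request(
    "You are a concise support agent. Answer in at most two sentences.",
    "How do I reset my password?",
)
print(json.dumps(payload, indent=2))
```

The system instruction persists across the conversation, so every response stays in character without repeating the guidance in each user message.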
Gemini Nano
Moving beyond text-only input, Gemini Nano can now also process images. Starting with Pixel phones, applications that use Gemini Nano with Multimodality will be able to understand the world the way people do: not just through text, but also through sight, sound and spoken language.
The future of AI assistants: Project Astra
As part of its mission to responsibly develop AI to benefit humanity, Google DeepMind announced Project Astra, with the goal of developing universal AI agents that can help in everyday life. Astra aims to develop AI agents that can understand and act on context in the same way that humans understand and react to the complex world.
These agents will serve as proactive, approachable and personalized assistants. Users will be able to talk to these agents naturally and without delay. Astra is designed to process and remember video and speech input.
Built on the Gemini model and other mission-specific models, these agents process information faster by continuously encoding video frames, combining video and speech input into an event timeline, and caching that information for efficient recall. Some of Astra’s features will be integrated into Google products such as the Gemini app later this year.
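The caching idea described above, interleaving encoded video frames and speech into a single time-ordered record with bounded memory, can be sketched in a few lines. Everything here is hypothetical: the class, field names and capacity are invented to illustrate the concept, not Astra’s actual implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float
    kind: str      # "video" or "speech"
    payload: str   # stands in for an encoded representation

class EventTimeline:
    """Bounded, time-ordered cache of multimodal events for fast recall."""

    def __init__(self, capacity: int = 1000):
        # deque with maxlen evicts the oldest event automatically.
        self.events: deque = deque(maxlen=capacity)

    def add(self, event: Event) -> None:
        self.events.append(event)

    def recall(self, since: float) -> list:
        # Retrieve everything observed at or after a given timestamp.
        return [e for e in self.events if e.timestamp >= since]

timeline = EventTimeline(capacity=3)
timeline.add(Event(0.0, "video", "frame_0"))
timeline.add(Event(0.5, "speech", "hello"))
timeline.add(Event(1.0, "video", "frame_1"))
timeline.add(Event(1.5, "video", "frame_2"))  # evicts frame_0

recent = timeline.recall(since=0.5)
print([e.payload for e in recent])
```

The bounded buffer captures the trade-off the paragraph hints at: continuous input streams are unbounded, so an agent must decide what to keep in order to recall recent context efficiently.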
New generative media models and tools
Google also introduced new generative media models and tools for creative work:
Veo
Veo is Google’s most capable video creation model to date, capable of creating high-quality 1080p videos that can exceed one minute in length. Supporting a variety of cinematic and visual styles, Veo understands natural language and visual semantics to create videos that reflect the user’s creative vision. The model also understands cinematic terms such as ‘timelapse’ or ‘aerial shot of a landscape’, providing unprecedented creative control.
It creates consistent and coherent shots, with people, animals and objects moving realistically throughout the footage. Google is inviting a range of filmmakers and content creators to test the model to discover how Veo can best support the storyteller’s creative process.
Imagen 3
Imagen 3, Google’s highest-quality text-to-image model to date, produces incredibly detailed, photorealistic images. By better understanding natural language and the intent behind prompts, Imagen 3 can incorporate small details from long prompts and render text within an image.
Music AI Sandbox
Music AI Sandbox, a suite of tools that helps musicians create new instrumental pieces from scratch, transform sounds and support their creative process, aims to open up new possibilities for creativity.
Developing AI responsibly
Google is committed not only to advancing technology, but also to doing so responsibly. That’s why measures are being taken to address the challenges posed by generative technologies and help people work responsibly with AI-generated content.
These measures include collaborating with the creative community and other stakeholders, gathering insights for safe and responsible development and deployment of technologies, listening to feedback, and giving creators a voice. Google believes that AI technologies should be used to benefit humanity and is working to ensure that they are developed ethically, responsibly and fairly.