Introduction
The newest flagship model from Google AI, Gemini Ultra 2, is here, and it sets new records on almost every benchmark that matters. Introduced at the Google I/O event, it is now available to everyone via the Gemini API and the Google One AI Premium service.
This release is more significant for Google than yet another product rollout: it marks Google’s biggest push yet against OpenAI’s dominance in both consumer-grade and enterprise-grade AI services. After spending many months in reaction mode, scrambling to catch up with OpenAI since the spectacular success of ChatGPT in late 2022, Google has finally reorganized all of its AI research teams around a unified Gemini product family.
Gemini Ultra 2 is the first product to come out of this reorganization effort, and the first that is truly capable of living up to the expectations Google has set for it.
How good is Gemini Ultra 2 in practice – does it deliver results that are better, worse, or on par with those of the competitors?
New Features in Gemini Ultra 2
Video understanding, up to 4K resolution. The model can comprehend up to two-hour-long movies in their entirety, extract all important scenes and plots, create summaries of movie content, and even detect inconsistencies in the continuity of events. What is even more important, Ultra 2 can do this without using any sampling techniques that rely on treating video files as just a sequence of still images – the model actually understands video as a temporal entity with motion and logical flow between scenes.
The feature proved extremely valuable during initial developer testing of Ultra 2, in particular for video content moderation, for creating chapter outlines and indexes for long-form educational videos, and for accessibility work such as generating audio descriptions of video content.
Language understanding was also greatly improved in Gemini Ultra 2: the model now operates with a massive context window of 2 million tokens. That’s roughly equivalent to four whole novels held in a single session. Developers have already started building legal review tools and full-codebase analysis utilities on top of it.
As for real-world use cases, Ultra 2 could help your legal department to review all documents and communications from a multi-party case and spot all discrepancies and contradictions across the entire body of text.
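To get a feel for what a 2-million-token window means for a document set like this, a rough size check can help. The sketch below is only an estimate: it assumes about 4 characters per token, a common rule of thumb for English text rather than a property of Gemini’s tokenizer, and the function names and the output reserve are hypothetical.

```python
# Rough check of whether a set of documents fits in a 2-million-token window.
# ASSUMPTION: ~4 characters per token is a rule of thumb for English text;
# real tokenizers differ, so treat the result as an estimate only.

CONTEXT_WINDOW_TOKENS = 2_000_000  # figure quoted in the article
CHARS_PER_TOKEN = 4                # heuristic, not a Gemini tokenizer value


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text from its character length."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(documents: list[str], reserve_for_output: int = 8_000) -> bool:
    """Return True if all documents plus an output reserve fit in the window."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW_TOKENS
```

For a real deployment, the estimate would be replaced by the token count reported by the API itself; the heuristic is only useful for quick planning.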
Audio comprehension is also greatly improved: Gemini Ultra 2 is capable of generating transcripts, performing speaker identification, and analyzing semantics and meaning of audio tracks. All this can be done fast enough to serve as a basis for creating meeting minutes based on automatically generated transcripts of recorded meetings.
Benchmarks & Performance
According to Google, Gemini Ultra 2 reaches a new state-of-the-art score on the MMLU (Massive Multitask Language Understanding) benchmark, the MATH test of mathematical problem solving, and the HumanEval coding test. Stanford University’s research group CRFM independently confirmed most of Google’s findings, albeit reporting some inconsistent performance on common-sense reasoning tests.
On the MMLU benchmark, Ultra 2 answered 91.2% of questions correctly, beating Gemini Ultra 1’s 90.0% and GPT-4o’s 88.7%. On MATH, which is known for its challenging competition-level problem sets, the new version of Gemini solved 83.4% of problems correctly, a state-of-the-art result that no other available model beats.
Finally, on the HumanEval test, Ultra 2 answered 88.1% of instruction-following code completion prompts correctly, trailing slightly behind Claude 3.5 Sonnet, the best available model on this test.
A direct comparison of Gemini Ultra 2 with GPT-4o and Claude 3 Opus on various standardized tests shows that Ultra 2 beats both at multimodal tasks but trails slightly behind them on nuanced instruction following. How much this distinction matters depends on the type of tasks you use your AI model for.
For tasks where understanding images and video is needed, Gemini Ultra 2 is superior to both GPT-4o and Claude 3 Opus. For tasks that involve instruction following and completing multi-step tasks according to a very detailed script, Gemini still performs close to or slightly worse than its competitors.
Latency results, which are critical for consumer-grade interactive applications, are not as favorable. Because of the way the model is structured, its multimodal functionality adds some latency, although the output is of much higher quality than GPT-4o’s. Google addresses this issue with a streaming mode that starts returning tokens as soon as they are generated instead of waiting for the complete response.
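Why streaming helps with perceived latency is easiest to see by measuring time to first token separately from total response time. The sketch below does not call the Gemini API: `fake_token_stream` is a hypothetical stand-in that simulates a streamed response with a fixed per-token delay, and `measure_stream` is an illustrative helper.

```python
import time
from typing import Iterator


def fake_token_stream(tokens: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streamed model response."""
    for token in tokens:
        time.sleep(delay_s)  # simulated per-token generation time
        yield token


def measure_stream(stream: Iterator[str]) -> tuple[str, float, float]:
    """Consume a stream; return (text, time_to_first_token, total_time) in seconds."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            # The user can start reading at this point.
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    if first_token_at is None:  # empty stream
        first_token_at = total
    return "".join(parts), first_token_at, total
```

With a non-streaming call the user waits for the total time before seeing anything; with streaming, the wait shrinks to the time to first token, even though the total generation time stays the same.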
Pricing & Availability
Ultra 2 is accessible to all subscribers of the Google One AI Premium plan, which costs $19.99/month and also includes 2 TB of Google Drive storage and other Google Workspace features. Users who already pay for expanded Google Drive storage are, in effect, already on a premium Google One plan, so the additional price for AI Premium is negligible for them.
For developers, API access is tiered. You get a free monthly quota (its size depends on your usage) and then pay $21.00 per million output tokens and $7.00 per million input tokens. By comparison, OpenAI (for GPT-4o) and Anthropic (for Claude 3 Opus) charge $30/$10 per million output/input tokens.
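To turn these per-million rates into the cost of a concrete request, a small calculation is enough. The sketch below uses only the prices quoted above; it ignores the free monthly quota because its size is not specified, and the function name is hypothetical.

```python
from decimal import Decimal

# Per-million-token prices in USD, as quoted in the article.
INPUT_PRICE_PER_M = Decimal("7.00")
OUTPUT_PRICE_PER_M = Decimal("21.00")
MILLION = Decimal(1_000_000)


def api_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the paid-tier cost in USD, rounded to cents.

    ASSUMPTION: the free monthly quota is already used up.
    """
    cost = (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    ) / MILLION
    return cost.quantize(Decimal("0.01"))


# A 1.5-million-token document set with a 20,000-token report:
print(api_cost(1_500_000, 20_000))  # 10.92
```

Because input tokens dominate long-context workloads, the input rate matters far more than the output rate for use cases such as whole-case legal review.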
Google Workspace corporate customers can use Gemini Ultra 2 via Duet AI, which packages all AI-based features, including Deep Research, document summarization, and meeting intelligence. According to Google, Gemini Ultra 2 is being gradually rolled out to Google Workspace apps such as Docs, Sheets, and Slides, which is great news for enterprise users.
Another advantage of enterprise-level access to Gemini Ultra 2, requested by several customers in regulated industries, is the Data Region policy, which lets you specify the regions where your data is processed for compliance reasons. As far as we know, EU financial institutions are the main group that would benefit from this feature.
Deep Research
By far the most impressive tool that came with Gemini Ultra 2 is the Deep Research feature, which allows you to perform complex web research without writing a line of code. It uses the model’s massive context window and internet connectivity to split your task into subtasks, search the Internet for information related to these subtasks, synthesize the results, and provide you with a comprehensive report with citations.
During our trials, this function performed impressively well, compiling reports of the kind that would typically take a junior analyst two to four hours of work. Of course, results are far from perfect: source selection is skewed towards high-authority domains rather than recent publications, and the synthesis sometimes misses disagreements between the sources it works with. Still, having all the basics on the subject covered is an excellent starting point for further in-depth investigation.
Deep Research also has a robust citation system, which is quite rare in AI-driven products: every claim in the report is backed by references, and the report distinguishes single-source claims from claims backed by multiple sources.
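The single-source versus multi-source distinction can be illustrated with a few lines of code. This is not how Deep Research is implemented, which Google has not described here; it is a minimal sketch that assumes citations are available as (claim, source URL) pairs, and the function name and labels are hypothetical.

```python
from collections import defaultdict


def classify_claims(citations: list[tuple[str, str]]) -> dict[str, str]:
    """Map each claim to 'single-source' or 'multi-source'.

    `citations` is a list of (claim, source_url) pairs. Duplicate pairs
    count once, so the label depends on the number of distinct sources.
    """
    sources_by_claim: dict[str, set[str]] = defaultdict(set)
    for claim, source in citations:
        sources_by_claim[claim].add(source)
    return {
        claim: "multi-source" if len(sources) > 1 else "single-source"
        for claim, sources in sources_by_claim.items()
    }
```

A reader can then treat single-source claims as the ones to verify first, which matters given that the synthesis sometimes misses disagreements between sources.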
Conclusion
Gemini Ultra 2 is a remarkable AI update that advances the state of the art in multimodal AI. Its video comprehension capabilities, combined with an enormous context window, are groundbreaking. Anyone who works extensively within the Google ecosystem should consider trying this product.
Gemini Ultra 2 is not perfect – for example, it shows some latency on multimodal tasks, is verbose, and has a strong ecosystem lock-in. Nevertheless, this is a remarkable achievement in AI engineering that demonstrates what AI companies like Google are able to build today.

