In a significant leap forward for artificial intelligence, Google unveiled Gemini just a few hours ago. It is the company's most advanced AI model yet, heralding a new era in AI's evolution. Sundar Pichai, CEO of Google and Alphabet, views this transition as more profound than the shifts to mobile and the web, with AI poised to revolutionize various aspects of life and drive unparalleled innovation.
Developed by Google DeepMind, led by Demis Hassabis, Gemini represents the culmination of years of AI research and development. Unlike its predecessors, Gemini is a multimodal model adept at understanding and processing a diverse array of data types, including text, images, and audio. This advanced capability enables Gemini to tackle complex tasks with remarkable efficiency and accuracy.
Check out the video below to see just how capable Gemini can be.
Gemini comes in three distinct sizes: Gemini Ultra, Pro, and Nano. Each is optimized for different use cases, from high-complexity tasks (Ultra) to on-device efficiency (Nano). Remarkably, Gemini Ultra has demonstrated performance superior to human experts on various tasks, including language understanding, a testament to its advanced capabilities. Keep an eye on Nano specifically, as it could bring a sea change to on-device LLMs.
One of Gemini's standout features is its native multimodality, which allows it to understand and reason across different types of inputs seamlessly, as you have seen in the video. This marks a significant improvement over previous models that required separate training for different data types. Gemini's sophistication in handling complex data sets suggests its potential for groundbreaking applications in fields like science and finance.
Google's new benchmarking approach to MMLU enables Gemini to use its reasoning capabilities to think more carefully before answering difficult questions, leading to significant improvements over relying on its first impression alone. You can see the metrics below.
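For intuition, here is a minimal sketch of what such an uncertainty-routed, think-before-answering strategy could look like. Everything in it is hypothetical, including the sample_answer helper; Google has not published Gemini's actual implementation.

```python
from collections import Counter

def sample_answer(model, question: str, use_reasoning: bool) -> str:
    """Hypothetical stand-in for a model call: returns the model's answer,
    optionally produced after an explicit chain of reasoning."""
    raise NotImplementedError  # placeholder, not a real API

def uncertainty_routed_answer(model, question: str, k: int = 32,
                              consensus_threshold: float = 0.6) -> str:
    # Sample k chain-of-thought answers and look for a consensus.
    samples = [sample_answer(model, question, use_reasoning=True)
               for _ in range(k)]
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / k >= consensus_threshold:
        return answer  # The sampled reasoning chains agree; trust them.
    # Low consensus: fall back to the direct, first-impression answer.
    return sample_answer(model, question, use_reasoning=False)
```

The idea is simple: when the model's sampled reasoning chains agree, trust the consensus; when they don't, fall back to the direct answer.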
Gemini Ultra set a new standard with a score of 59.4% on the MMMU benchmark (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI), which spans multimodal tasks across many disciplines that demand deliberate reasoning.
In image benchmarks, Gemini Ultra surpassed previous top-performing models without relying on optical character recognition (OCR) systems, which typically extract text from images for further processing. This achievement underscores Gemini's inherent multimodal proficiency and suggests its potential for more advanced reasoning capabilities.
Another notable aspect is Gemini's coding capabilities. It understands and generates code in several programming languages, surpassing existing models in coding benchmarks. This positions Gemini as a foundational model for coding, significantly advancing AI's role in software development. Check out the video below.
The infrastructure behind Gemini is also impressive. It's trained on Google’s AI-optimized Tensor Processing Units (TPUs), ensuring high performance and efficiency. These TPUs have been central to Google's AI-powered products and have enabled large-scale AI model training.
By integrating Gemini across its products, Google is bringing its advanced AI capabilities to a broader audience. Gemini Pro is set to enhance Google products like Bard, a significant upgrade since its launch. Gemini Nano will debut in the Pixel 8 Pro smartphone, indicating Google's commitment to integrating AI into mobile devices. This rollout hints at Gemini's extensive applicability across various platforms and services, including Search, Ads, and Chrome.
For developers and enterprise customers, access to Gemini Pro will be available through Google AI Studio and Google Cloud Vertex AI. This accessibility ensures that a wider range of users can leverage Gemini's advanced capabilities for various applications.
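As a quick taste of that developer access, here is a minimal sketch using the google-generativeai Python SDK with an API key from Google AI Studio; the model name and call pattern reflect the launch announcement, so treat the official docs as authoritative.

```python
import google.generativeai as genai

# Configure the SDK with an API key created in Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# "gemini-pro" is the text model announced for the API launch.
model = genai.GenerativeModel("gemini-pro")

response = model.generate_content(
    "In two sentences, explain what makes a natively multimodal model different."
)
print(response.text)
```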
Comparatively, Gemini represents a significant stride beyond existing AI models like OpenAI's ChatGPT. While ChatGPT has gained acclaim for its conversational abilities and knowledge processing, Gemini's multimodal capabilities and integration into a broader range of applications set it apart. Furthermore, Gemini's deployment in consumer-facing products like the Pixel series and Bard indicates Google's strategy to embed AI more deeply into everyday technology, potentially changing how billions interact with digital platforms.
As Google continues to refine and expand Gemini, it remains committed to safety and ethical AI development, per the blog. The company has undertaken comprehensive evaluations for bias and toxicity, ensuring that Gemini aligns with its AI principles. This responsible approach to AI development underlines Google's dedication to advancing AI in a way that's beneficial and safe for all users. I hope to see this upheld in the future.
Gemini marks a pivotal moment in AI development. Its advanced multimodal capabilities, integration into consumer products, and robust infrastructure position it as a significant player in the AI landscape, promising to reshape the way we interact with technology and access information.
Key Highlights of Gemini 1.0 Rollout
Gemini Pro in Google Products: Gemini is now integrated into Google products, enhancing Bard with advanced reasoning and planning. The upgraded Bard is available in English in over 170 countries, with support for more languages and modalities planned.
Gemini in Pixel 8 Pro: The Pixel 8 Pro is the first smartphone to use Gemini Nano, powering features like Summarize in the Recorder app and Smart Reply in Gboard for WhatsApp, with more apps to follow.
Broader Integration: Gemini will soon be available in other Google services like Search, Ads, Chrome, and Duet AI. It has already improved the Search experience, reducing latency by 40%.
Access for Developers: From December 13, developers and enterprise customers can access Gemini Pro via the Gemini API in Google AI Studio or Google Cloud Vertex AI. Gemini Nano will be available for Android developers through AICore in Android 14 on Pixel 8 Pro devices.
Gemini Ultra: Currently undergoing safety checks and refinements, Gemini Ultra will soon be available for select groups for feedback before a broader release next year. Bard Advanced, featuring Gemini Ultra, will also launch next year.
This marks the beginning of a new era of AI at Google, with ongoing efforts to expand Gemini's capabilities and realize its potential to transform how people live and work around the world.
What about Apple?
Let's think about Apple for a second. Is Apple staying silent and letting everyone else ride the AI wave, despite having such a huge ecosystem of devices and services? The answer is no. In my opinion, their release of MLX, an array framework for machine learning on Apple silicon, makes it evident that they are quietly testing the waters on their own devices. I think this is analogous to their introduction of HTTP Live Streaming many years ago, which later became a standard that many others adopted.
Credit: @awnihannun from Apple Machine Learning Research
On a funny note, someone highlighted MLX as shown below.
Some key features of MLX include (from their GitHub):
Familiar APIs: MLX has a Python API that closely follows NumPy. MLX also has a fully featured C++ API, which closely mirrors the Python API. MLX has higher-level packages like mlx.nn and mlx.optimizers with APIs that closely follow PyTorch to simplify building more complex models.
Composable function transformations: MLX has composable function transformations for automatic differentiation, automatic vectorization, and computation graph optimization.
Lazy computation: Computations in MLX are lazy. Arrays are only materialized when needed.
Dynamic graph construction: Computation graphs in MLX are built dynamically. Changing the shapes of function arguments does not trigger slow compilations, and debugging is simple and intuitive.
Multi-device: Operations can run on any of the supported devices (currently, the CPU and GPU).
Unified memory: A notable difference between MLX and other frameworks is the unified memory model. Arrays in MLX live in shared memory. Operations on MLX arrays can be performed on any of the supported device types without moving data.
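To make these features concrete, here is a minimal sketch touching a few of them, based on MLX's public documentation; treat it as illustrative rather than canonical.

```python
import mlx.core as mx

# Familiar NumPy-like API; arrays live in unified memory.
a = mx.array([1.0, 2.0, 3.0])
b = mx.ones(3)

# Lazy computation: c is only materialized when needed (or on mx.eval).
c = a + b
mx.eval(c)

# Composable transformation: mx.grad turns a function into its gradient.
def loss(x):
    return (x ** 2).sum()

grad_fn = mx.grad(loss)
print(grad_fn(a))  # d/dx sum(x^2) = 2x -> [2.0, 4.0, 6.0]

# Unified memory: run an op on a chosen device without copying data.
d = mx.add(a, b, stream=mx.cpu)
```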
MLX was created by machine learning researchers, for machine learning researchers. The framework is intended to be user-friendly but still efficient for training and deploying models. The design of the framework itself is also conceptually simple. We intend to make it easy for researchers to extend and improve MLX, with the goal of quickly exploring new ideas.
The design of MLX is inspired by frameworks like NumPy, PyTorch, Jax, and ArrayFire.
Examples
The MLX examples repo has a variety of examples, including:
Transformer language model training.
Large-scale text generation with LLaMA and finetuning with LoRA.
Generating images with Stable Diffusion.
Speech recognition with OpenAI's Whisper.
I am planning to try MLX over the holidays.
Overall, I see Gemini as the next step in the evolution that ChatGPT kicked off a year ago, arriving with a whole lot more capability.