Google is pushing ahead with Gemini, its big bundle of AI models, apps, and services. It looks pretty cool in some ways, but there are definitely areas where it’s not quite hitting the mark, at least according to our informal look into it.
So, what exactly is Gemini? How do you even use it? And is it any better or worse than what other companies are offering?
To help you stay in the loop with all things Gemini, we’ve whipped up this handy guide. We’ll keep it updated with any new Gemini stuff that Google puts out, whether it’s new models, features, or just their plans for the whole Gemini shebang.
What is Gemini?
Gemini is Google’s long-promised, next-gen GenAI model family, developed by Google’s AI research labs DeepMind and Google Research. It comes in three flavors:
- Gemini Ultra, the flagship Gemini model.
- Gemini Pro, a “lite” Gemini model.
- Gemini Nano, a smaller “distilled” model that runs on mobile devices like the Pixel 8 Pro.
All Gemini models were trained to be “natively multimodal” — in other words, able to work with and use more than just words. They were pretrained and fine-tuned on a variety of audio, images and videos, a large set of codebases and text in different languages.
This sets Gemini apart from models such as Google’s own LaMDA, which was trained exclusively on text data. LaMDA can’t understand or generate anything other than text (e.g., essays, email drafts), but that isn’t the case with Gemini models.
What’s the difference between the Gemini apps and Gemini models?
Google, proving once again that it lacks a knack for branding, didn’t make it clear from the outset that the Gemini models are separate and distinct from the Gemini apps on the web and mobile (formerly Bard). The Gemini apps are simply an interface through which certain Gemini models can be accessed. Think of them as clients for Google’s GenAI.
Incidentally, the Gemini apps and models are also totally independent from Imagen 2, Google’s text-to-image model that’s available in some of the company’s dev tools and environments. Don’t worry — you’re not the only one confused by this.
What can Gemini do?
Because the Gemini models are multimodal, they can in theory perform a range of multimodal tasks, from transcribing speech to captioning images and videos to generating artwork. Few of these capabilities have reached the product stage yet (more on that later), but Google’s promising all of them — and more — at some point in the not-too-distant future.
Of course, it’s a bit hard to take the company at its word.
Google seriously underdelivered with the original Bard launch. And more recently it ruffled feathers with a video purporting to show Gemini’s capabilities that turned out to be heavily doctored and more or less aspirational.
