Chameleon: Meta's mixed-modal AI outshines GPT-4 and Gemini

Meta's Chameleon is a unique large language model that can generate text and images together. Here is how this AI beats other models at mixed-media tasks.


Wednesday July 03, 2024, 3 min read

The AI boom is here, and things are getting interesting. As the world struggled with monotonous tasks, AI came to the rescue. Today, anyone can generate customised text, images, audio, and video with AI models.

But what if there were an AI that could generate images along with text, or vice versa? Mark Zuckerberg-led Meta has achieved this with its new multi-modal LLM, and it has found that its AI outperforms competitors like GPT-4 and Gemini on certain tasks. Let's explore this AI in detail.

What is Meta Chameleon?


The Fundamental AI Research (FAIR) team at Meta recently launched five new AI models, including a new family of models called CM3leon (pronounced "chameleon"). This mixed-modal AI, named Chameleon, can not just understand but also generate both text and images.

Key capabilities of Chameleon AI

For context, most LLMs today generate a single kind of output, for instance, converting text into voice. According to Meta's blog, however, its new mixed-modal AI goes a step further by simultaneously processing and producing text and image outputs. What is more interesting is that Chameleon is trained with a different approach called multi-token prediction.

While it sounds very technical, the logic is quite simple. Large language models are typically trained to predict the next word, using the context provided by the preceding text. Standard LLMs do this one token at a time, but Chameleon is trained to predict several future tokens at once.
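To make the idea concrete, here is a minimal, hypothetical sketch in Python (using PyTorch). It is not Meta's implementation; it only illustrates the difference between a single next-token loss and a multi-token loss with several prediction heads. The head count, sizes, and the embedding-only "trunk" are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy illustration only: a shared trunk with K output heads, each predicting
# a token further into the future. Sizes and architecture are assumptions,
# not Meta's actual multi-token-prediction setup.
VOCAB, HIDDEN, K = 32000, 512, 4

trunk = nn.Embedding(VOCAB, HIDDEN)            # stands in for the transformer body
heads = nn.ModuleList(nn.Linear(HIDDEN, VOCAB) for _ in range(K))
loss_fn = nn.CrossEntropyLoss()

def multi_token_loss(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, seq_len) integer token ids."""
    hidden = trunk(tokens)                     # (batch, seq_len, HIDDEN)
    total = 0.0
    for k, head in enumerate(heads, start=1):
        # Head k is trained to predict the token k positions ahead.
        logits = head(hidden[:, :-k, :])       # predictions for positions t+k
        targets = tokens[:, k:]                # the actual future tokens
        total = total + loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
    return total / K

# Standard next-token training is just the K = 1 special case of this loss.
batch = torch.randint(0, VOCAB, (2, 16))
print(multi_token_loss(batch))
```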

Also, to enhance performance, Meta trained its AI using a single token-based representation for both text and images. In simple terms, this architecture lets Meta's AI produce mixed-media outputs as well as text-only responses.
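A rough sketch of what such a unified token stream could look like, assuming a hypothetical text tokenizer and an image tokenizer that quantises a picture into discrete codes. The function names, marker-token ids, and the 1,024-token image length are illustrative assumptions, not Chameleon's actual interface.

```python
# Illustrative only: interleave text tokens and discrete image tokens into
# one sequence so a single transformer can model both. The tokenizers and
# special-token ids below are hypothetical stand-ins.
BOI, EOI = 50001, 50002          # assumed "begin/end of image" marker tokens

def tokenize_text(text: str) -> list[int]:
    # Stand-in for a real subword tokenizer.
    return [ord(c) % 50000 for c in text]

def tokenize_image(path: str) -> list[int]:
    # Stand-in for a VQ-style image tokenizer mapping an image to a fixed
    # number of discrete codebook ids (1,024 here, by assumption).
    return [hash((path, i)) % 8192 + 50100 for i in range(1024)]

def build_mixed_sequence(segments: list[tuple[str, str]]) -> list[int]:
    """segments: list of ("text", ...) or ("image", ...) pieces, in order."""
    seq: list[int] = []
    for kind, payload in segments:
        if kind == "text":
            seq.extend(tokenize_text(payload))
        else:
            seq.extend([BOI, *tokenize_image(payload), EOI])
    return seq

# One example mixing a caption, an image, and a follow-up question.
example = build_mixed_sequence([
    ("text", "A photo of a chameleon on a branch."),
    ("image", "chameleon.jpg"),
    ("text", "What colour is it?"),
])
print(len(example))
```

Because text and image live in the same token stream, the same next-token (or multi-token) training objective covers both modalities without separate encoders and decoders.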

Meta built this AI in two sizes: Chameleon-7B and Chameleon-34B, with 7 billion and 34 billion parameters respectively. Both models were pre-trained on more than 4 trillion tokens of mixed text and image data, and were then fine-tuned on smaller datasets to ensure proper alignment and safety.
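Summarised as a small reference table in code (the field names are my own; the parameter counts and token budget are the figures reported above):

```python
# Reported sizes and training budget, collected in one place for reference.
# Field names are illustrative; the figures come from the description above.
CHAMELEON_VARIANTS = {
    "chameleon-7b":  {"parameters": 7_000_000_000,  "pretraining_tokens": 4_000_000_000_000},
    "chameleon-34b": {"parameters": 34_000_000_000, "pretraining_tokens": 4_000_000_000_000},
}

for name, cfg in CHAMELEON_VARIANTS.items():
    print(f"{name}: {cfg['parameters'] / 1e9:.0f}B params, "
          f"over {cfg['pretraining_tokens'] / 1e12:.0f}T pretraining tokens")
```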


Meta's AI beats GPT-4 in mixed-media tasks

In its research paper, Meta revealed that its "state-of-the-art" AI model shows exceptional performance on several benchmarks, including visual question answering and image captioning.

Meta's Chameleon outperforms models like Flamingo, LLaVA-1.5, and IDEFICS on visual question answering and image captioning benchmarks. Additionally, it is on par with Mixtral 8x7B and Gemini-Pro on common-sense reasoning and reading comprehension.

Going a step further, Meta also tested its AI model with humans as judges, comparing Chameleon's output against baseline models like ChatGPT and Gemini. In these pairwise comparisons, Chameleon-34B was preferred over Gemini-Pro 60.4% of the time and over GPT-4 51.6% of the time.

Is Meta Chameleon free to use?

As of now, Meta has not officially released this AI model to the public due to safety concerns. However, a modified version is available on request, but only under a research-only license.

The bottom line

AI startups and tech giants are in fierce competition with their AI models, each claiming that theirs is the best. In this race, Meta has released five new models, including Chameleon, which is trained using an approach we have not seen before. Few other AI models can generate interleaved text and images in a single output, which is why Meta's AI stands out as a pioneer.