March 20 2024

#163 Apple Makes Its Entrance During Cocktail Hour!



Regular readers are well aware that I often discuss the notion that the Generative Seven, or G7, holds what seems to be an exclusive privilege when it comes to pioneering foundational models. The sheer scale of resources required for the development of these models means that, typically, only entities backed by the G7 or receiving their explicit support can venture into this domain.

Mistral stands out as a notable outlier in this scenario, yet it's important to remember that exceptions don't create the rule.

Of the G7 (Microsoft, Meta, Tesla, Amazon, Nvidia, Google, and Apple), Apple was the only one that had not made any announcement in this field. Like a missing element in the periodic table, it had to turn up sooner or later, and it did so last week in the form of the MM1 paper, which focuses on building performant Multimodal Large Language Models (MLLMs).

MLLMs: image(s) + text => text.

This paper highlights a key technique known as Ablation, a term borrowed from biology. Ablation involves methodically eliminating or adjusting specific elements of a model or training process to assess their individual effects on overall performance. Let's take a moment to revisit some concepts we have previously discussed.
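To make the idea of ablation concrete, here is a minimal sketch. It assumes a toy `evaluate` function and made-up component names and scores; in a real study each call would be a full train-and-evaluate run, and none of these numbers come from the MM1 paper.

```python
# Hypothetical contribution of each component to a benchmark score.
# These names and values are invented purely for illustration.
CONTRIBUTION = {
    "image_encoder_pretrain": 3.0,
    "interleaved_data": 2.0,
    "text_only_data": 0.5,
}


def evaluate(config):
    """Toy stand-in for 'train the model with this config and score it'."""
    return 50.0 + sum(CONTRIBUTION[name] for name, on in config.items() if on)


def ablate(baseline):
    """Switch off one component at a time and record the drop in score."""
    base_score = evaluate(baseline)
    deltas = {}
    for name in baseline:
        variant = dict(baseline, **{name: False})  # copy with one piece removed
        deltas[name] = base_score - evaluate(variant)
    return deltas


baseline = {name: True for name in CONTRIBUTION}
deltas = ablate(baseline)
for name, drop in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"removing {name}: score drops by {drop:.1f}")
```

The component whose removal hurts the most is the one that matters the most, which is the whole point of the exercise.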

Quick Review: Training Stages

There are three stages in building performant large language models:

  1. Pre-training

  2. Supervised Fine Tuning (SFT)

  3. The (in)famous Reinforcement Learning from Human Feedback (RLHF)

All three have forward and backward passes but use different information before putting the car in reverse gear. Pre-training does not require explicit labeling; the labels are generated from the text itself, typically by asking the model to predict the next token. The idea is to learn general patterns from a general corpus.
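A small sketch of how pre-training gets its labels for free: every prefix of a sentence becomes an input, and the word that follows becomes the target. Whitespace splitting is an assumption made for readability; real models use subword tokenizers, and `next_token_pairs` is a hypothetical helper name.

```python
def next_token_pairs(tokens):
    """Turn a token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()
for context, target in next_token_pairs(tokens):
    print(" ".join(context), "->", target)
```

No human ever labeled anything here; the text supervises itself.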

In SFT, the model is trained on a smaller set of labeled data. This labeling is done to make the model a specialist for a specific purpose or set of purposes. A commonly cited example is training a model to answer questions.

The RLHF stage is not about what the accurate answer is but about how humans feel about it. You can easily see why it is important, and also why it is the stage most prone to abuse, and liberally abused it is. Whenever a model is accused of bias, it is mostly this stage that is responsible.
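One common way to turn "how humans feel" into a training signal is a pairwise preference loss on a reward model: a human picks the better of two answers, and the reward model is penalized when it scores the rejected answer higher. The sketch below shows that loss in isolation, assuming the reward scores are already computed; it is an illustration of the general idea, not a claim about how any particular vendor runs this stage.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(chosen - rejected)), written as log(1 + e^-margin)."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# The loss is small when the human-preferred answer already scores higher...
print(preference_loss(2.0, 0.0))
# ...and large when the reward model disagrees with the human.
print(preference_loss(0.0, 2.0))
```

Notice that accuracy appears nowhere in this formula; only the human's preference does, which is exactly why this stage is so sensitive to who does the rating.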

Tasting Rounds: Zero-Shot, One-Shot, and Few-Shot

Zero-shot learning is a challenging scenario where the language model is not provided with any explicit examples or task-specific training data. The model is expected to perform a task solely based on its pre-existing knowledge and understanding of language and concepts. Zero-shot learning evaluates the model's ability to transfer its learned knowledge and skills to new and unseen tasks without any additional guidance or adaptation.

In one-shot learning, the language model is provided with a single example of a task or prompt along with the desired output. The model is then expected to generate accurate responses or complete similar tasks based on this single example. One-shot learning tests the model's ability to quickly learn and generalize from a single instance, without the need for extensive fine-tuning or additional training data.

Few-shot learning involves providing the language model with a small number of examples (usually between 2 and 10) for a specific task. The model is expected to learn from these few examples and generate accurate responses or complete similar tasks based on the limited information provided. Few-shot learning assesses the model's capability to quickly adapt and generalize from a small set of examples, demonstrating its sample efficiency and transferability.
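The three tasting rounds differ only in how many worked examples are placed in the prompt. Here is a minimal sketch; the `Input:`/`Output:` layout, the sentiment task, and the `build_prompt` helper are all assumptions chosen for illustration.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt from a task description, worked examples, and a query."""
    parts = [task]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")  # the model completes this line
    return "\n\n".join(parts)


task = "Classify the sentiment of each review as positive or negative."
query = "The drinks were watered down."
examples = [
    ("I loved it", "positive"),
    ("Terrible service", "negative"),
    ("Best meal ever", "positive"),
]

zero_shot = build_prompt(task, [], query)           # no examples
one_shot = build_prompt(task, examples[:1], query)  # a single example
few_shot = build_prompt(task, examples, query)      # a handful of examples

print(few_shot)
```

In all three cases the model's weights stay untouched; the examples live only in the prompt.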

Now tell me how to make a cocktail like a Mixer!

Now, let's dive into the key focus of the MM1 paper: crafting the perfect blend of ingredients to create a high-performing multi-modal Large Language Model (MLLM). Just like a skilled mixologist carefully selects and combines various elements to create a delightful cocktail, the researchers at Apple have meticulously explored the optimal composition of image-caption, interleaved image-text, and text-only data to achieve state-of-the-art (SOTA) few-shot results across multiple benchmarks.

The goal is to create a model that excels in multi-image reasoning and few-shot prompting, much like a master mixologist who can effortlessly create a wide range of cocktails with just a few key ingredients. By finding the right balance and proportion of these data types, the MM1 model aims to deliver exceptional performance in various tasks, showcasing its versatility and adaptability.
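In code, a data "recipe" is little more than a set of sampling weights over the data sources. The sketch below draws training examples according to such weights. The specific proportions are placeholders chosen for illustration, not figures quoted from this article, and `sample_mixture` is a hypothetical helper.

```python
import random
from collections import Counter


def sample_mixture(weights, n, seed=0):
    """Pick a data source for each of n training examples, proportional to weights."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=n)


# Placeholder proportions for the three ingredient types.
mixture = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

batch = sample_mixture(mixture, 1000)
counts = Counter(batch)
for source in mixture:
    print(f"{source}: {counts[source]} of {len(batch)} examples")
```

Changing the recipe means changing these weights and rerunning training, which is where the ablations described earlier come in.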

Once the perfect recipe is discovered, the next challenge is to scale it up. Just as a successful cocktail recipe can be scaled to serve a larger crowd, the MM1 model is scaled to create a family of models with up to 30B parameters, including both dense models and Mixture of Experts (MoE) variants. This scaling process ensures that the model can handle more complex tasks and accommodate a wider range of applications, much like a mixologist expanding their repertoire to cater to diverse tastes and preferences.
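For readers new to Mixture of Experts, the core trick is that a router picks only a few experts per input, so the model can hold many parameters while running just a fraction of them. The toy sketch below uses top-2 routing over four scalar "experts". The gate scores are hard-coded here as an assumption (a real router computes them from the input), and this is a generic illustration rather than MM1's actual routing scheme.

```python
import math


def softmax(xs):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_logits, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Four toy experts; a dense model would effectively run all of its parameters.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2, lambda x: -x]
gate_logits = [0.1, 2.0, 1.5, -1.0]  # assumed router scores for this input

output, chosen = moe_forward(3.0, experts, gate_logits)
print("experts used:", chosen, "output:", round(output, 3))
```

Only two of the four experts do any work for this input, which is how MoE variants keep the bar tab under control as the menu grows.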


In this article, we aimed to provide a concise overview of Apple's recent advancements in the field of Multimodal Large Language Models (MLLMs). While Apple's innovations may not be considered groundbreaking from a fundamental research perspective, their contributions lie in the applied domain, focusing on practical implementation and performance optimization. However, it is important to acknowledge that future developments may challenge this thesis, as Apple continues to push the boundaries of what is possible with MLLMs.

Regardless of the extent of Apple's fundamental innovations, their work has played a crucial role in shedding light on the future of open-weight models. By openly sharing their research and insights, Apple has contributed to the growing trend of transparency and collaboration within the AI community. This shift towards open-weight models as first-class citizens marks a significant milestone in the evolution of LLMs and their applications.