December 7, 2023

#115 An In-Depth Look at Elo and MMLU Scores for Leading Language Models

generative AI, large language models


The landscape of generative AI is continually shifting, bringing to the fore groundbreaking models that redefine our understanding of artificial intelligence. My original plan was to delve into the nuances of the Elo measurement system, a key tool for assessing the performance of language models. Yet, in a classic case of scope creep, the recent announcement of Google's Gemini model has shifted my focus. This development nudges us towards a broader, more encompassing exploration, where we compare and contrast two pivotal evaluation metrics in the AI arena: Elo and MMLU.

The Divergent Scales of Elo Ratings and MMLU

Elo ratings, adapted from chess, have emerged as a key metric for ranking Large Language Models (LLMs). Rather than grading answers against a key, they are computed from head-to-head comparisons: two models respond to the same prompt, the better response wins, and each model's rating moves accordingly. Take, for example, OpenAI's GPT-4 Turbo, which holds an Elo rating of 1210. That number alone doesn't fully capture the model's capabilities, though, because no corresponding MMLU score is available for a more holistic evaluation.

In contrast, the Massive Multitask Language Understanding (MMLU) score offers a broader perspective, examining a model's ability to reason and comprehend across various subjects. A prime example is Gemini Ultra, which boasts a notable MMLU score of 90.0, highlighting its extensive knowledge base and understanding. Yet, without an Elo rating for a direct comparison, the assessment of its full potential remains somewhat obscured.

Elo Ratings: An Objective Measure

Borrowed from chess, the Elo system gives LLMs a relative measure of capability. A model's rating depends only on the outcomes of its matchups against other models, so it yields a clear competitive standing in the AI landscape without anyone having to define an absolute scale of quality.

The Strategy of Scoring

One of the key features of the Elo system is that the size of a rating adjustment depends on how surprising the result was. For instance, when a lower-rated LLM (say at 900) outperforms a higher-rated one (at 1100), far more points change hands than if the higher-rated model had won as expected. Within a single matchup the transfer is symmetric: when both models use the same K-factor, the winner gains exactly as many points as the loser gives up. The asymmetry lies between upsets and expected results, which keeps the ratings in line with expected performance levels and reflects the significance of an upset.
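The update rule behind this adjustment can be sketched in a few lines of Python. This is a minimal illustration of the standard Elo formula, not the exact procedure any particular LLM leaderboard uses; the K-factor of 32 and the function names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new ratings for A and B.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k = 32 is an assumed K-factor, used here only for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses: the transfer is zero-sum.
    return rating_a + delta, rating_b - delta


# Upset: the 900-rated model beats the 1100-rated model.
upset = update(900, 1100, score_a=1.0)
print([round(r, 2) for r in upset])      # [924.31, 1075.69]

# Expected result: the 1100-rated model wins.
expected = update(900, 1100, score_a=0.0)
print([round(r, 2) for r in expected])   # [892.31, 1107.69]
```

Under these assumptions the upset moves about 24 points, while the expected result moves fewer than 8. In both cases the two changes cancel exactly.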

Reflecting on the LLM Elo Spectrum and Chess Expertise

GrandMasters: Supreme performance, rated 2500 or higher.

Chess Example: Magnus Carlsen.

LLM Example: [Future LLMs could be here as AI technology advances].

International Masters: Advanced performance with ratings between 2400 and 2499.

Chess Example: Judit Polgar.

LLM Example: [This spot awaits future AI developments].

FIDE Masters: Demonstrating significant expertise, rated between 2300 and 2399.

Chess Example: Hou Yifan.

LLM Example: [Anticipated future achievements of AI models].

FIDE Candidate Masters/National Masters: Competence with ratings between 2200 and 2299.

Chess Example: Wei Yi.

LLM Example: [A potential milestone for upcoming LLMs].

Experts/National Candidate Masters: Proficient, their Elo between 2000 and 2199.

Chess Example: Alireza Firouzja.

LLM Example: [A goal for future language models].

Class A Players: Strong performers rated between 1800 and 1999.

Chess Example: Tania Sachdev.

LLM Example: [An attainable target for next-gen AI].

Class B Players: Solid with Elo scores between 1600 and 1799.

Chess Example: Rucha Pujari.

LLM Example: [Aspiring for future AI models].

Class C Players: Reliable, their ratings fall between 1400 and 1599.

Chess Example: Praggnanandhaa R.

LLM Example: [An objective for emerging LLMs].

Class D Players: Showing promise, rated between 1200 and 1399.

Chess Example: Vaishali R.

LLM Example: GPT-4 Turbo.

Class E Players: Novice, with ratings between 1000 and 1199.

Chess Example: Leon Mendonca.

LLM Example: GPT-4, GPT-3.5, and Llama 2 70b.

Class F Players: Learners in the field, rated between 800 and 999.

Chess Example: Jennifer Shahade.

LLM Example: PaLM, Alpaca-13b

The more precise Elo rankings of LLMs below provide a revealing snapshot of the state of generative AI:

- GPT-4-Turbo: At the top with an Elo score of 1210.

- GPT-4: Following with a rating of 1159.

- Claude-1: Not far behind at 1146.

- Claude-2: Holding its ground with 1125.

- Claude-Instant-1: A contender at 1106.

- GPT-3.5-Turbo: Competing with a score of 1100.
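Rating gaps translate into expected head-to-head results. The sketch below applies the standard Elo expected-score formula to the ratings listed above; the function and variable names are illustrative, and the formula counts a tie as half a win, so read the output as an expected score rather than a strict win rate.

```python
# Ratings as quoted in the list above.
ARENA_ELO = {
    "GPT-4-Turbo": 1210,
    "GPT-4": 1159,
    "Claude-1": 1146,
    "Claude-2": 1125,
    "Claude-Instant-1": 1106,
    "GPT-3.5-Turbo": 1100,
}


def win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


top = "GPT-4-Turbo"
for name, rating in ARENA_ELO.items():
    if name == top:
        continue
    p = win_probability(ARENA_ELO[top], rating)
    print(f"{top} vs {name}: {p:.0%}")
```

By this formula, a 110-point lead over GPT-3.5-Turbo corresponds to an expected score of roughly 65%, and the 51-point lead over GPT-4 to roughly 57%. The gaps at the top of the table are real but far from decisive.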

Read against the chess scale, the highest-ranking LLM sits at a Class D level, a rank characterized by developing skill. The comparison is loose, since chess players and LLMs are rated in separate pools and an Elo number is only meaningful relative to others in the same pool, but it fits the broader picture: while we've made significant strides, we are still in the early days of this technological evolution.

Deciphering MMLU Scores for Language Models

MMLU: A Multidimensional Metric

The Massive Multitask Language Understanding (MMLU) score is a comprehensive benchmark that offers a panoramic view of a language model's cognitive prowess. Unlike the Elo system's competitive ranking, MMLU evaluates a model's comprehension and reasoning across a spectrum of knowledge domains, from literature and history to science and mathematics.

Evaluating Cognitive Breadth and Depth

MMLU poses multiple-choice questions spanning dozens of subjects and reports the share answered correctly, assessing an LLM's ability to navigate nuanced and contextually rich tasks. It's an attempt to measure an AI's equivalent of academic and intellectual versatility, pushing the boundaries of what it means for a machine to understand language and concepts.

While MMLU doesn't sort LLMs into classes, the scores it assigns are telling of their overall linguistic and cognitive abilities. A higher score is indicative of a model's advanced comprehension skills, akin to a well-rounded education in a vast array of subjects.
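Since an MMLU score is ultimately an accuracy figure over multiple-choice questions, the arithmetic is easy to sketch. The records below are invented for illustration and are not real benchmark data. The sketch also shows two ways of averaging across subjects, because published numbers do not always state which one they use.

```python
from collections import defaultdict

# Hypothetical graded answers: (subject, model's choice, correct choice).
records = [
    ("history", "B", "B"),
    ("history", "C", "A"),
    ("mathematics", "D", "D"),
    ("mathematics", "A", "A"),
    ("mathematics", "B", "C"),
    ("mathematics", "C", "C"),
]


def subject_accuracy(rows):
    """Fraction of correct answers per subject."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, answer in rows:
        totals[subject][0] += int(predicted == answer)
        totals[subject][1] += 1
    return {s: correct / total for s, (correct, total) in totals.items()}


def micro_average(rows):
    """Accuracy over all questions pooled together."""
    return sum(p == a for _, p, a in rows) / len(rows)


def macro_average(rows):
    """Unweighted mean of the per-subject accuracies."""
    per_subject = subject_accuracy(rows)
    return sum(per_subject.values()) / len(per_subject)


print(subject_accuracy(records))          # {'history': 0.5, 'mathematics': 0.75}
print(round(micro_average(records), 3))   # 0.667
print(round(macro_average(records), 3))   # 0.625
```

The two averages differ whenever subjects have unequal numbers of questions, which is one reason scores from different reports are not always directly comparable.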

MMLU Scores in Perspective

The latest MMLU scores for leading models reflect their current capabilities:

- Gemini Ultra: Standing out with an MMLU score of 90.0, it demonstrates exceptional understanding and reasoning across diverse disciplines.

- GPT-4: Not far behind, it boasts an MMLU score of 86.4, showing a strong command over a wide range of topics and complex problem-solving abilities.

- Claude-1: With a respectable score of 77, it shows proficiency in handling multifaceted tasks.

- Claude-2: Scoring 78.5, it too showcases its capability in understanding and reasoning within a varied context.

- Claude-Instant-1: With a score of 73.4, this model indicates competitive cognitive skills.

- GPT-3.5-Turbo: Earns a score of 70, suggesting a solid foundational understanding across subjects, though there's room for growth.
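Five of these models also appear in the Elo list, which makes it possible to check how closely the two metrics agree on ordering. A rough sketch using Spearman's rank correlation, with the scores quoted in this article; the helper names are my own, and with only five data points the result is indicative at best.

```python
# (Elo rating, MMLU score) for models that appear in both lists above.
SCORES = {
    "GPT-4": (1159, 86.4),
    "Claude-1": (1146, 77.0),
    "Claude-2": (1125, 78.5),
    "Claude-Instant-1": (1106, 73.4),
    "GPT-3.5-Turbo": (1100, 70.0),
}


def ranks(values):
    """Rank values from highest (1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


elo = [pair[0] for pair in SCORES.values()]
mmlu = [pair[1] for pair in SCORES.values()]
print(round(spearman(elo, mmlu), 2))  # 0.9
```

The only disagreement is that Claude-1 outranks Claude-2 on Elo while Claude-2 edges ahead on MMLU. Otherwise the two metrics order these models identically.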

Conclusion

Both OpenAI and Google have showcased their strengths: GPT-4 Turbo leads the Elo rankings cited here, while Gemini Ultra posts the highest MMLU score. As innovation progresses, the momentum behind it only intensifies. At RoostGPT, we proudly offer seamless integration of both GPT and Vertex models, and we celebrate the triumph of the ecosystem.
