MMLU

Among the many benchmarks used to compare Llama 2's performance with that of its competitors, MMLU is the first one cited in the paper introducing the model.

Llama 2's MMLU score

The 70B version of Llama 2 scores 68.9 on this benchmark, which puts it roughly on par with GPT-3.5 but far behind GPT-4 (86.4), which leads all LLMs on this metric.

Description of MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.

https://paperswithcode.com/dataset/mmlu
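
For readers who want to inspect the benchmark directly, here is a minimal sketch that loads MMLU through the Hugging Face `datasets` library. It assumes the community `cais/mmlu` mirror of the dataset; the field names (`question`, `choices`, `answer`) match the example record shown below.

```python
# Minimal sketch, assuming the community "cais/mmlu" mirror on the
# Hugging Face Hub; field names match the example record shown below.
from datasets import load_dataset

# The "all" config concatenates the 57 subjects; a single subject
# such as "professional_law" can be loaded by name instead.
mmlu = load_dataset("cais/mmlu", "all")

sample = mmlu["test"][0]
print(sample["question"])  # question text
print(sample["choices"])   # list of 4 answer options
print(sample["answer"])    # index (0-3) of the correct option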

Example of a question from MMLU (available on Hugging Face):

question (string)

"Davis decided to kill Adams. He set out for Adams's house. Before he got there he saw Brooks, who resembled Adams. Thinking that Brooks was Adams, Davis shot at Brooks. The shot missed Brooks but wounded Case, who was some distance away. Davis had not seen Case. In a prosecution under a statute that proscribes any attempt to commit murder, the district attorney should indicate that the intended victim(s) was/were"

choices (sequence)

["Adams only.", "Brooks only.", "Case only.", "Adams and Brooks"]

answer (class label)

1 (B)
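
As an illustration, here is a hypothetical sketch of how such a record is typically assembled into a multiple-choice prompt. The exact template (the `A.`/`B.`/`C.`/`D.` listing and the trailing `Answer:`) is an assumption, since each evaluation harness uses its own wording; the point is that the `answer` class label indexes into `choices`, so label 1 maps to choice B.

```python
# Hypothetical sketch: turning an MMLU record into a multiple-choice
# prompt. The exact template varies between evaluation harnesses.
LETTERS = "ABCD"

def format_prompt(record: dict) -> str:
    lines = [record["question"]]
    lines += [f"{letter}. {choice}"
              for letter, choice in zip(LETTERS, record["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

record = {
    "question": "Davis decided to kill Adams. ...",  # abridged from above
    "choices": ["Adams only.", "Brooks only.", "Case only.", "Adams and Brooks"],
    "answer": 1,  # class labels are 0-indexed, so 1 maps to "B"
}

print(format_prompt(record))
print("Expected:", LETTERS[record["answer"]])  # -> Expected: B
```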

Ranking of Llama 2 against its competitors

| Rank | Model | Average (%) | Parameters (Billions) | Tokens (Billions) | Year | Tags |
|---|---|---|---|---|---|---|
| 1 | GPT-4 | 86.4 | | | 2023 | few-shot (k=5) |
| 2 | Flan-PaLM 2 (L) | 81.2 | | | 2023 | |
| 3 | PaLM 2 (large) | 78.3 | | | 2023 | |
| 4 | Flan-PaLM | 75.2 | 540 | | 2022 | fine-tuned (5-shot, CoT + SC) |
| 5 | Flan-U-PaLM 540B | 74.1 | 540 | | 2022 | fine-tuned |
| 6 | Flan-PaLM | 72.2 | 540 | | 2022 | fine-tuned (5-shot) |
| 7 | Codex + REPLUG LSR | 71.8 | | | 2023 | few-shot (k=5) |
| 8 | Codex + REPLUG | 71.4 | | | 2023 | few-shot (k=5) |
| 9 | Flan-PaLM 540B | 70.9 | 540 | | 2022 | fine-tuned (CoT) |
| 10 | U-PaLM | 70.7 | 540 | | 2022 | few-shot (k=5) |
| 11 | Flan-PaLM | 70.2 | 540 | | 2022 | fine-tuned (5-shot, CoT) |
| 12 | GPT-3.5 | 70 | | | 2023 | few-shot (k=5) |
| 13 | Flan-U-PaLM | 69.8 | 540 | | 2022 | fine-tuned (CoT) |
| 14 | PaLM 540B | 69.3 | 540 | 780 | 2022 | few-shot (k=5) |
| 15 | LLaMA 65B | 68.9 | 65 | 1400 | 2023 | fine-tuned |
| 16 | LLaMA 2 70B | 68.9 | 70 | | 2023 | few-shot (k=5) |
| 17 | Codex | 68.3 | 175 | | 2023 | few-shot (k=5) |
| 18 | Chinchilla | 67.5 | 70 | 1400 | 2022 | few-shot (k=5) |
| 19 | Flan-cont-PaLM | 66.1 | 62 | | 2022 | |
| 20 | LLaMA 65B | 63.4 | 65 | 1400 | 2023 | few-shot (k=5) |
| 21 | Flan-cont-PaLM | 62 | 540 | | 2022 | CoT |
| 22 | Gopher | 60.0 | 280 | 300 | 2021 | few-shot (k=5) |
| 23 | Flan-PaLM 62B | 59.6 | 62 | | 2022 | |
| 24 | LLaMA 33B | 57.8 | 33 | 1400 | 2023 | few-shot (k=5) |
| 25 | Flan-PaLM 62B | 56.9 | | | 2022 | fine-tuned (CoT) |
| 26 | Flan-T5-XXL | 55.1 | 11 | | 2022 | |
| 27 | GPT-3 | 53.9 | 175 | 300 | 2020 | fine-tuned |
| 28 | GAL 120B | 52.6 | 120 | 450 | 2022 | zero-shot, few-shot (zero-shot) |
| 29 | Flan-T5-XL | 52.4 | 3 | | 2022 | |
| 30 | Flan-PaLM 8B | 49.3 | 8 | | 2022 | |
| 31 | UnifiedQA | 48.9 | 11 | | 2020 | fine-tuned |
| 32 | Flan-T5-XXL | 48.6 | | | 2022 | CoT |
| 33 | Atlas | 47.9 | 11 | | 2022 | few-shot (k=5) |
| 34 | LLaMA 13B | 46.9 | 13 | | 2023 | few-shot (k=5) |
| 35 | Flan-T5-XL | 45.5 | | | 2022 | CoT |
| 36 | Flan-T5-Large 780M | 45.1 | | | 2022 | |
| 37 | GLM-130B | 44.8 | | | 2022 | |
| 38 | GPT-3 175B | 43.9 | | | 2020 | few-shot (k=5) |
| 39 | GPT-3 6.7B | 43.2 | 6.7 | | 2020 | fine-tuned |
| 40 | Flan-PaLM 8B | 41.3 | | | 2022 | CoT |
| 41 | Flan-T5-Large | 40.5 | | | 2022 | CoT |
| 42 | Bloomberg GPT | 39.18 | | | 2023 | few-shot (k=5) |
| 43 | BLOOM 176B | 39.13 | 176 | | 2023 | few-shot (k=5) |
| 44 | OPT 66B | 35.99 | 66 | | 2023 | few-shot (k=5) |
| 45 | GPT-NeoX | 35.95 | | | 2023 | few-shot (k=5) |
| 46 | Flan-T5-Base 250M | 35.9 | | | 2022 | |
| 47 | LLaMA 7B | 35.1 | 7 | | 2023 | few-shot (k=5) |
| 48 | Flan-T5-Base | 33.7 | | | 2022 | fine-tuned (CoT) |
| 49 | GPT-NeoX-20B | 33.6 | 20 | 300 | 2022 | few-shot (k=5) |
| 50 | GPT-2 1.5B | 32.4 | 1.5 | 300 | 2019 | fine-tuned |
| 51 | Gopher-7.1B | 29.5 | 7.1 | 300 | 2021 | few-shot (k=5) |
| 52 | Flan-T5-Small 80M | 28.7 | | | 2022 | |
| 53 | GPT-NeoX-20B | 28.6 | 20 | 300 | 2022 | zero-shot |
| 54 | RoBERTa | 27.9 | 0.354 | | 2019 | fine-tuned |
| 55 | GPT-J-6B | 27.3 | 6 | 300 | 2021 | zero-shot |
| 56 | Gopher-1.4B | 27.3 | 1.4 | 300 | 2021 | few-shot (k=5) |
| 57 | ALBERT | 27.1 | 0.031 | | 2019 | fine-tuned |
| 58 | GPT-3 13B | 26 | 13 | | 2020 | few-shot (k=5) |
| 59 | GPT-3 2.7B | 25.9 | 2.7 | | 2020 | few-shot (k=5) |
| 60 | Gopher-0.4B | 25.7 | 0.4 | 300 | 2021 | few-shot (k=5) |
| 61 | Random Baseline | 25.0 | | | 2020 | |
| 62 | GPT-3 6.7B | 24.9 | 6.7 | | 2020 | few-shot (k=5) |
| 63 | Flan-T5-Small | 12.1 | | | 2022 | CoT |
| 64 | GPT-3 175B | | 175 | | 2020 | few-shot (k=5) |
| 65 | Minerva 540B-maj1@16 | | 540 | | 2022 | few-shot (k=5) |
| 66 | Minerva 540B | | 540 | | 2022 | few-shot (k=5) |
| 67 | Minerva 62B-maj1@16 | | 62 | | 2022 | few-shot (k=5) |
| 68 | Minerva 62B | | 62 | | 2022 | few-shot (k=5) |
| 69 | Minerva 8B-maj1@16 | | 8 | | 2022 | few-shot (k=5) |
| 70 | PaLM 62B | | 62 | | 2022 | few-shot (k=5) |
| 71 | Minerva 8B | | 8 | | 2022 | few-shot (k=5) |
| 72 | PaLM 8B | | 8 | | 2022 | few-shot (k=5) |
| 73 | Flan-T5-Small | | 0.08 | | 2022 | |
| 74 | Flan-T5-Base | | 0.25 | | 2022 | |
| 75 | Flan-T5-Large | | 0.78 | | 2022 | |
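
Most rows above are tagged "few-shot (k=5)". In MMLU this means five worked examples from the subject's dev split (which holds exactly five per subject) are prepended, each completed with its answer letter, before the test question. A hypothetical sketch, reusing the record layout from the example earlier:

```python
# Hypothetical sketch of the "few-shot, k=5" setting: five dev-split
# examples, each completed with its answer letter, precede the test item.
LETTERS = "ABCD"

def format_prompt(record: dict) -> str:
    lines = [record["question"]]
    lines += [f"{l}. {c}" for l, c in zip(LETTERS, record["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def five_shot_prompt(dev_records: list, test_record: dict) -> str:
    shots = [format_prompt(r) + " " + LETTERS[r["answer"]]
             for r in dev_records[:5]]
    return "\n\n".join(shots + [format_prompt(test_record)])
```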
