MMLU

Among the many benchmarks used to compare Llama 2's performance with that of its competitors, MMLU is the first one cited in the paper introducing the model.

Llama 2's MMLU score

The 70B version of Llama 2 scores 68.9 on this benchmark, which puts it roughly on par with GPT-3.5 but far behind GPT-4 (86.4), which leads all LLMs on this metric.

Description of MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.

https://paperswithcode.com/dataset/mmlu
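
For readers who want to inspect the benchmark directly, here is a minimal sketch that loads MMLU through the Hugging Face `datasets` library. It assumes the community `cais/mmlu` mirror of the dataset; the field names (`question`, `choices`, `answer`) match the example record shown below.

```python
# Minimal sketch, assuming the community "cais/mmlu" mirror on the
# Hugging Face Hub; field names match the example record shown below.
from datasets import load_dataset

# The "all" config concatenates the 57 subjects; a single subject
# such as "professional_law" can be loaded by name instead.
mmlu = load_dataset("cais/mmlu", "all")

sample = mmlu["test"][0]
print(sample["question"])  # question text
print(sample["choices"])   # list of 4 answer options
print(sample["answer"])    # index (0-3) of the correct option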

Example of a question from MMLU (available on Hugging Face):

question (string)

"Davis decided to kill Adams. He set out for Adams's house. Before he got there he saw Brooks, who resembled Adams. Thinking that Brooks was Adams, Davis shot at Brooks. The shot missed Brooks but wounded Case, who was some distance away. Davis had not seen Case. In a prosecution under a statute that proscribes any attempt to commit murder, the district attorney should indicate that the intended victim(s) was/were"

choices (sequence)

["Adams only.", "Brooks only.", "Case only.", "Adams and Brooks"]

answer (class label)

1 (B)
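
As an illustration, here is a hypothetical sketch of how such a record is typically assembled into a multiple-choice prompt. The exact template (the `A.`/`B.`/`C.`/`D.` listing and the trailing `Answer:`) is an assumption, since each evaluation harness uses its own wording; the point is that the `answer` class label indexes into `choices`, so label 1 maps to choice B.

```python
# Hypothetical sketch: turning an MMLU record into a multiple-choice
# prompt. The exact template varies between evaluation harnesses.
LETTERS = "ABCD"

def format_prompt(record: dict) -> str:
    lines = [record["question"]]
    lines += [f"{letter}. {choice}"
              for letter, choice in zip(LETTERS, record["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

record = {
    "question": "Davis decided to kill Adams. ...",  # abridged from above
    "choices": ["Adams only.", "Brooks only.", "Case only.", "Adams and Brooks"],
    "answer": 1,  # class labels are 0-indexed, so 1 maps to "B"
}

print(format_prompt(record))
print("Expected:", LETTERS[record["answer"]])  # -> Expected: B
```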

Ranking of Llama 2 against its competitors

| Rank | Model | Average (%) | Parameters (Billions) | Tokens (Billions) | Year | Tags |
|---|---|---|---|---|---|---|
| 1 | GPT-4 | 86.4 | | | 2023 | few-shot (k=5) |
| 2 | Flan-PaLM 2 (L) | 81.2 | | | 2023 | |
| 3 | PaLM 2 (large) | 78.3 | | | 2023 | |
| 4 | Flan-PaLM | 75.2 | 540 | | 2022 | fine-tuned (5-shot, CoT + SC) |
| 5 | Flan-U-PaLM 540B | 74.1 | 540 | | 2022 | fine-tuned |
| 6 | Flan-PaLM | 72.2 | 540 | | 2022 | fine-tuned (5-shot) |
| 7 | Codex + REPLUG LSR | 71.8 | | | 2023 | few-shot (k=5) |
| 8 | Codex + REPLUG | 71.4 | | | 2023 | few-shot (k=5) |
| 9 | Flan-PaLM 540B | 70.9 | 540 | | 2022 | fine-tuned (CoT) |
| 10 | U-PaLM | 70.7 | 540 | | 2022 | few-shot (k=5) |
| 11 | Flan-PaLM | 70.2 | 540 | | 2022 | fine-tuned (5-shot, CoT) |
| 12 | GPT-3.5 | 70 | | | 2023 | few-shot (k=5) |
| 13 | Flan-U-PaLM | 69.8 | 540 | | 2022 | fine-tuned (CoT) |
| 14 | PaLM 540B | 69.3 | 540 | 780 | 2022 | few-shot (k=5) |
| 15 | LLaMA 65B | 68.9 | 65 | 1400 | 2023 | fine-tuned |
| 16 | LLaMA 2 70B | 68.9 | 70 | | 2023 | few-shot (k=5) |
| 17 | Codex | 68.3 | 175 | | 2023 | few-shot (k=5) |
| 18 | Chinchilla | 67.5 | 70 | 1400 | 2022 | few-shot (k=5) |
| 19 | Flan-cont-PaLM | 66.1 | 62 | | 2022 | |
| 20 | LLaMA 65B | 63.4 | 65 | 1400 | 2023 | few-shot (k=5) |
| 21 | Flan-cont-PaLM | 62 | 540 | | 2022 | CoT |
| 22 | Gopher | 60.0 | 280 | 300 | 2021 | few-shot (k=5) |
| 23 | Flan-PaLM 62B | 59.6 | 62 | | 2022 | |
| 24 | LLaMA 33B | 57.8 | 33 | 1400 | 2023 | few-shot (k=5) |
| 25 | Flan-PaLM 62B | 56.9 | | | 2022 | fine-tuned (CoT) |
| 26 | Flan-T5-XXL | 55.1 | 11 | | 2022 | |
| 27 | GPT-3 | 53.9 | 175 | 300 | 2020 | fine-tuned |
| 28 | GAL 120B | 52.6 | 120 | 450 | 2022 | zero-shot, few-shot (zero-shot) |
| 29 | Flan-T5-XL | 52.4 | 3 | | 2022 | |
| 30 | Flan-PaLM 8B | 49.3 | 8 | | 2022 | |
| 31 | UnifiedQA | 48.9 | 11 | | 2020 | fine-tuned |
| 32 | Flan-T5-XXL | 48.6 | | | 2022 | CoT |
| 33 | Atlas | 47.9 | 11 | | 2022 | few-shot (k=5) |
| 34 | LLaMA 13B | 46.9 | 13 | | 2023 | few-shot (k=5) |
| 35 | Flan-T5-XL | 45.5 | | | 2022 | CoT |
| 36 | Flan-T5-Large 780M | 45.1 | | | 2022 | |
| 37 | GLM-130B | 44.8 | | | 2022 | |
| 38 | GPT-3 175B | 43.9 | | | 2020 | few-shot (k=5) |
| 39 | GPT-3 6.7B | 43.2 | 6.7 | | 2020 | fine-tuned |
| 40 | Flan-PaLM 8B | 41.3 | | | 2022 | CoT |
| 41 | Flan-T5-Large | 40.5 | | | 2022 | CoT |
| 42 | Bloomberg GPT | 39.18 | | | 2023 | few-shot (k=5) |
| 43 | BLOOM 176B | 39.13 | 176 | | 2023 | few-shot (k=5) |
| 44 | OPT 66B | 35.99 | 66 | | 2023 | few-shot (k=5) |
| 45 | GPT-NeoX | 35.95 | | | 2023 | few-shot (k=5) |
| 46 | Flan-T5-Base 250M | 35.9 | | | 2022 | |
| 47 | LLaMA 7B | 35.1 | 7 | | 2023 | few-shot (k=5) |
| 48 | Flan-T5-Base | 33.7 | | | 2022 | fine-tuned (CoT) |
| 49 | GPT-NeoX-20B | 33.6 | 20 | 300 | 2022 | few-shot (k=5) |
| 50 | GPT-2 1.5B | 32.4 | 1.5 | 300 | 2019 | fine-tuned |
| 51 | Gopher-7.1B | 29.5 | 7.1 | 300 | 2021 | few-shot (k=5) |
| 52 | Flan-T5-Small 80M | 28.7 | | | 2022 | |
| 53 | GPT-NeoX-20B | 28.6 | 20 | 300 | 2022 | zero-shot |
| 54 | RoBERTa | 27.9 | 0.354 | | 2019 | fine-tuned |
| 55 | GPT-J-6B | 27.3 | 6 | 300 | 2021 | zero-shot |
| 56 | Gopher-1.4B | 27.3 | 1.4 | 300 | 2021 | few-shot (k=5) |
| 57 | ALBERT | 27.1 | 0.031 | | 2019 | fine-tuned |
| 58 | GPT-3 13B | 26 | 13 | | 2020 | few-shot (k=5) |
| 59 | GPT-3 2.7B | 25.9 | 2.7 | | 2020 | few-shot (k=5) |
| 60 | Gopher-0.4B | 25.7 | 0.4 | 300 | 2021 | few-shot (k=5) |
| 61 | Random Baseline | 25.0 | | | 2020 | |
| 62 | GPT-3 6.7B | 24.9 | 6.7 | | 2020 | few-shot (k=5) |
| 63 | Flan-T5-Small | 12.1 | | | 2022 | CoT |
| 64 | GPT-3 175B | | 175 | | 2020 | few-shot (k=5) |
| 65 | Minerva 540B-maj1@16 | | 540 | | 2022 | few-shot (k=5) |
| 66 | Minerva 540B | | 540 | | 2022 | few-shot (k=5) |
| 67 | Minerva 62B-maj1@16 | | 62 | | 2022 | few-shot (k=5) |
| 68 | Minerva 62B | | 62 | | 2022 | few-shot (k=5) |
| 69 | Minerva 8B-maj1@16 | | 8 | | 2022 | few-shot (k=5) |
| 70 | PaLM 62B | | 62 | | 2022 | few-shot (k=5) |
| 71 | Minerva 8B | | 8 | | 2022 | few-shot (k=5) |
| 72 | PaLM 8B | | 8 | | 2022 | few-shot (k=5) |
| 73 | Flan-T5-Small | | 0.08 | | 2022 | |
| 74 | Flan-T5-Base | | 0.25 | | 2022 | |
| 75 | Flan-T5-Large | | 0.78 | | 2022 | |
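
Most rows above are tagged "few-shot (k=5)". In MMLU this means five worked examples from the subject's dev split (which holds exactly five per subject) are prepended, each completed with its answer letter, before the test question. A hypothetical sketch, reusing the record layout from the example earlier:

```python
# Hypothetical sketch of the "few-shot, k=5" setting: five dev-split
# examples, each completed with its answer letter, precede the test item.
LETTERS = "ABCD"

def format_prompt(record: dict) -> str:
    lines = [record["question"]]
    lines += [f"{l}. {c}" for l, c in zip(LETTERS, record["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def five_shot_prompt(dev_records: list, test_record: dict) -> str:
    shots = [format_prompt(r) + " " + LETTERS[r["answer"]]
             for r in dev_records[:5]]
    return "\n\n".join(shots + [format_prompt(test_record)])
```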
