Among the many benchmarks used to compare Llama 2's performance against its competitors, Natural Questions is the third one cited in the paper introducing the model.
Llama 2's score on Natural Questions
On this benchmark, the 70B version of Llama 2 scores 33, which places it first in the comparison table presented in the paper.
ChatGPT was not tested. GPT-3 175B was tested few-shot and scored 29.9. Llama 2 appears to have the best score among the LLMs tested one-shot in the paper's comparison, but in the few-shot setting the other LLMs' results are better.
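The zero-shot / one-shot / few-shot labels that recur in the ranking below refer to how many solved question-answer pairs are placed in the prompt ahead of the target question. Here is a minimal Python sketch of k-shot prompt construction; the `Q:`/`A:` template and the demonstration pairs are illustrative assumptions, not the format actually used in the Llama 2 paper.

```python
# Sketch of k-shot prompt construction for Natural Questions-style evaluation.
# The "Q:/A:" template and the demo pairs are illustrative assumptions, not
# the exact format used in the Llama 2 paper.

def build_prompt(question: str, demos: list[tuple[str, str]], k: int) -> str:
    """Prepend k solved Q/A pairs to the target question.

    k=0 gives zero-shot, k=1 one-shot, larger k few-shot
    (e.g. k=64 in several entries of the ranking below).
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos[:k]]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical in-context examples.
demos = [
    ("who wrote the novel moby dick", "Herman Melville"),
    ("what is the capital of australia", "Canberra"),
]

print(build_prompt("when was the eiffel tower completed", demos, k=1))  # one-shot
```

With k=0 the model sees only the bare question; with k=64 it first sees 64 solved examples, which is why few-shot scores are generally higher than one-shot scores for the same model.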
Description of Natural Questions
We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples 5-way annotated sequestered as test data.
https://aclanthology.org/Q19-1026/
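The EM column in the ranking below stands for exact match: a prediction counts as correct only if, after light normalization, it is string-identical to one of the annotated short answers. Here is a minimal sketch of the usual SQuAD-style normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace); the exact scoring scripts behind individual leaderboard entries may differ.

```python
# Sketch of the exact-match (EM) metric: SQuAD-style answer normalization,
# then strict string equality against any gold short answer. Actual
# leaderboard scoring scripts may differ in details.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())                # collapse whitespace

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))   # True
print(exact_match("Melville", ["Herman Melville"]))        # False: EM is strict
```

Note that EM is strict: a partially correct answer such as "Melville" against a gold answer of "Herman Melville" scores zero.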
Ranking of Llama 2 among its competitors
| Rank | Model | EM (exact match) |
|---|---|---|
| 1 | Atlas (full, Wiki-dec-2018 index) | 64 |
| 2 | Atlas (full, Wiki-dec-2021+CC index) | 60.4 |
| 3 | FiE | 58.4 |
| 4 | R2-D2 (full) | 55.9 |
| 5 | ReAtt | 54.7 |
| 6 | FiD-KD (full) | 54.7 |
| 7 | EMDR^2 | 52.5 |
| 8 | FiD (full) | 51.4 |
| 9 | RETRO + DPR (full) | 45.5 |
| 10 | Codex + REPLUG LSR (few-shot) | 45.5 |
| 11 | Atlas (few-shot, k=64, Wiki-dec-2018 index) | 45.1 |
| 12 | Codex + REPLUG (few-shot) | 44.7 |
| 13 | RAG | 44.5 |
| 14 | Atlas (few-shot, k=64, Wiki-dec-2021+CC index) | 42.4 |
| 15 | DPR | 41.5 |
| 16 | REALM | 40.4 |
| 17 | LLaMA 65B (few-shot, k=64) | 39.9 |
| 18 | PaLM-540B (few-shot, k=64) | 39.6 |
| 19 | PaLM 2-L (one-shot) | 37.5 |
| 20 | Chinchilla (few-shot, k=64) | 35.5 |
| 21 | LLaMA 65B (few-shot, k=5) | 35 |
| 22 | LLaMA 2 70B (one-shot) | 33 |
| 23 | GLaM 62B/64E (few-shot) | 32.5 |
| 24 | PaLM 2-M (one-shot) | 32 |
| 25 | LLaMA 65B (one-shot) | 31 |
| 26 | GPT-3 175B (few-shot, k=64) | 29.9 |
| 27 | PaLM-540B (one-shot) | 29.3 |
| 28 | Gopher (few-shot, k=64) | 28.2 |
| 29 | GLaM 62B/64E (one-shot) | 26.3 |
| 30 | PaLM 2-S (one-shot) | 25.3 |
| 31 | LLaMA 33B (zero-shot) | 24.9 |
| 32 | GLaM 62B/64E (zero-shot) | 24.7 |
| 33 | PaLM-540B (zero-shot) | 21.2 |
| 34 | Neo-6B (QA) | 19.7 |
| 35 | Neo-6B (QA + WS) | 19.6 |
| 36 | Neo-6B (few-shot) | 13.7 |