Tuning de Llama 2 1/n – Llama2 par BackProp

L’article de référence retenu pour expliquer le tuning de Llama 2 est celui ci : Fine-Tune Your Own Llama 2 Model in a Colab Notebook – A practical introduction to LLM fine-tuning de Maxime Labonne (il existe une autre version ici).

L’article date de fin juillet 2023.

J’ai fait quelques modifications au code afin de l’adapter à mes besoins. Mon notebook est ici et le notebook originel est ici. Il y a peu de différences entre les deux. Elles sont signalées dans cet article.

Introduction

J’ai choisi de travailler avec ce notebook pour plusieurs raisons :

il part du modèle Llama-2 en version 7B et non des versions supérieures qui requièrent beaucoup de mémoire
il utilise la version d’HuggingFace, ce qui simplifie la vie
Il utilise la version chat (ce que je veux étudier ici)
l’article est très clair et est parfait pour une présentation pédagogique.

La version choisie est Llama-2-7b-chat-hf.

HW

J’ai essayé ce notebook sur une version standard de Colab avec un GPU T4 mais j’ai eu un problème de mémoire. J’ai choisi de travailler avec la version V100.

La version V100 devrait être suffisante.

NVIDIA T4 – NVIDIA T4 focuses explicitly on deep learning, machine learning, and data analytics. With the ability to perform a high-speed computational system, it offers various features. This advanced GPU model is quite energy-efficient.

NVIDIA V100 – NVIDIA V100 offers advanced features in the world of data science and AI. It comes with the facility of optimum memory usage. The 32 GB models of NVIDIA V100 graphics card can compile the tasks of 100 computers into one computer at a time. As a result, it becomes pretty efficient.

NVIDIA A100 – NVIDIA A100 is an evolutionary step that enables more high-speed computation than the previous models. It comprises a single AI infrastructure that contains all the interference and analytics together. Comparing this graphics card with the previous ones provides computational speed up to 20 times. It comes with both 40 GB and 80GB models. The 80 GB model provides the fastest bandwidth in the world compared to other models (2 TB/s).
https://www.linkedin.com/pulse/nvidia-a30-vs-t4-v100-a100-rtx-8000-gpu-cards-kuldeep-saxena/

!nvidia-smi -L

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-5fda5759-68b8-faaa-01c8-01b5a4b1908f)

Fine Tuning

Pour ceux qui ne sont pas familiers avec le tuning des large language models (LLM) c’est parfois un peu ésotérique. Je vais donc y aller pas à pas en essayant d’expliquer chaque ligne de code compliquée ou qui nécessite un peu de background LLM.

L’objectif du notebook décrit dans l’article est le suivant :

In this article, we will see why instruction tuning works and how to implement it in a Google Colab notebook to create your own Llama 2 model.
https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32

Nous allons effectuer un fine tuning de Llama 2.

La méthode utilisée ici est le SFT (Supervised Fine-Tuning). Le SFT est intéressant sous conditions :

In other words, fine-tuning will be of little help if the model has never seen the kind of data you’re interested in. However, if that’s the case, SFT can be extremely performant.
https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32

Mémoire

Les poids de Llama 2–7b sont de (7b × 2 bytes = 14 GB en FP16) ce qui ne permet pas de tenir un mémoire avec une configuration T4 d’autant plus qu’il faut ajouter de l’overhead à tout ça. Raison pour laquelle on utiliser des méthodes de PEFT (Parameter-efficient fine-tuning) telles que LoRA ou QLoRA

Installation de librairies

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

accelerate

🤗 Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! In short, training and inference at scale made simple, efficient and adaptable.
https://huggingface.co/docs/accelerate/index

🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

🤗 Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.
https://pypi.org/project/accelerate/

A la date d’aujourd’hui, la version est la 0.21.0.

Bizarrement, je ne vois aucune appel à accelerate dans le code.

peft

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. Fine-tuning large-scale PLMs is often prohibitively costly. In this regard, PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. Recent State-of-the-Art PEFT techniques achieve performance comparable to that of full fine-tuning.

Seamlessly integrated with 🤗 Accelerate for large scale models leveraging DeepSpeed and Big Model Inference.
https://pypi.org/project/peft/

PEFT, dans sa version actuelle 0.4.0, supporte LoRA.

bitsandbytes

The bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
https://pypi.org/project/bitsandbytes/

La version courante est la 0.41 alors que celle qui est installée est la 0.40.2

transformers

je n’insiste pas sur cette librairie phare de HuggingFace.

trl

trl is a full stack library where we provide a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step. The library is built on top of the transformers library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via transformers. At this point most of decoder architectures and encoder-decoder architectures are supported. Refer to the documentation or the examples/ folder for example code snippets and how to run these tools.
https://pypi.org/project/trl/

La version courante est la 0.5.0 et celle installée la 0.4.7.

Variables système

J’ajoute systématiquement à mon code, celui ci-dessous, qui me permet de lire dans Google Drive un fichier qui contient mes variables système, donc celle liée à Hugging Face.

Pour travailler avec les fichiers Llama 2 il faut d’abord une autorisation de Meta. Une fois celle-ci obtenue, le token d’HF permet de récupérer les fichiers.

!pip install colab-env -qU
!pip install openai

import colab_env
import os
from dotenv import load_dotenv
import openai

load_dotenv()

# API configuration
openai.api_key = os.getenv("OPENAI_API_KEY")

g_api_key = os.getenv("GOOGLE_API_KEY")
g_cse_id = os.getenv("GOOGLE_CSE_ID")
my_activeloop_org_id = os.getenv("ACTIVELOOP_ORG_ID")
hf_api_key = os.getenv("HUGGINGFACEHUB_API_TOKEN")

os.environ["GOOGLE_CSE_ID"] = g_cse_id
os.environ["GOOGLE_API_KEY"] = g_api_key

Imports

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

J’explique plus bas la raison de ces imports.

Modèle

# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

Le modèle qu’on entraîne

Le nom du modèle est Llama-2-7b-chat-hf c’est à dire la version 2 de Llama en 7 Billion (milliards) de paramètres, entraînée pour le chat et disponible au format d’Hugging Face.

Pourquoi ne pas avoir choisi meta-llama/Llama-2-7b-chat-hf ? c’eût été plus logique.

Instruction

Le dataset mlabonne/guanaco-llama2-1k est un sous-ensemble (de 1000 exemples) d’un fichier créé par Tim Dettmers, formaté pour être compatible avec les prompts Llama 2

This is a subset (1k samples) of the excellent timdettmers/openassistant-guanaco dataset, processed to match Llama 2’s prompt format as described in this article.

Il se présente de la façon suivante (en plusieurs langues) :

« <s>[INST] J’habite dans le centre de Paris et je ne sais pas où aller faire mes courses pour ne pas payer trop cher. Quel supermarché est considéré comme étant le moins cher ? [/INST] Les drives sont les supermarchés les moins chers. Voici une liste des drives de Paris du 2ème au 6ème arrondissement: Paris 2ème arrondissement Auchan Drive – Paris Réaumur (2ème) Courses U Drive – Paris Petits Carreaux (2ème) Paris 3ème arrondissement Courses U Drive – Paris Temple (3ème) Paris 4ème arrondissement Carrefour Drive – Paris Beaubourg (4ème) Paris 5ème arrondissement Auchan Drive – Paris Pierre Nicole (5ème) Auchan Drive – Piéton Paris Monge (5ème) Auchan Drive – Piéton Paris Saint Germain (5ème) Auchan Drive – Piéton Paris Claude Bernard (5ème) Carrefour Drive – Paris Pascal (5ème) Carrefour Drive – Paris Lacépède (5ème) Drive Intermarché – Paris Fossés Saint-Bernard (5ème) Paris 6ème arrondissement Auchan Drive – Piéton Paris Assas (6ème) Auchan Drive – Piéton Paris Bréa (6ème) Carrefour Drive – Paris Notre Dame des Champs (6ème) </s>\ <s>[INST] Qu’est ce qu’un drive pour piéton ? Un drive = voiture ? [/INST] En effet, « drive » vient de l’anglais et veut dire voiture. En France, cet anglicisme est utilisé pour le service de retrait en magasin avec sa voiture. Pour indiquer que le service est disponible aux piétons, certaines enseignes utilisent le terme « drive piéton », qui est un oxymore. Cette distinction existe parce que certaines enseignes, notamment en centre-ville, n’ont pas la place pour un parking voiture, et n’offre donc que le service piéton. </s> »
https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k/viewer/mlabonne–guanaco-llama2-1k/train?row=5

Dans le cas de Llama 2, les auteurs utilisent un template de la forme :

<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>
https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32

La structure du prompt Llama 2 est :

<s>: the beginning of the entire sequence.<<SYS>>\n: the beginning of the system message.

\n<</SYS>>\n\n: the end of the system message.

[INST]: the beginning of some instructions.

[/INST]

Nouveau modèle

Le nom du nouveau modèle qu’on va créer est llama-2-7b-miniguanaco.

Après avoir exécuté le notebook, vous devriez avoir quelque chose comme ça dans votre environnement HF.

QLoRA

Pourquoi ces valeurs ont-elles été choisies ? Je n’en sais rien. L’expérience sûrement.

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

On retrouvera ces paramètres dans PEFT quelques lignes plus loin :

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

lora_r

r (int): Lora attention dimension.

r: the rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
https://huggingface.co/docs/peft/conceptual_guides/lora

Dans notre exemple, sa valeur est 64.

lora_alpha

lora_alpha (int): The alpha parameter for Lora scaling.

Dans notre exemple, sa valeur est 16.

alpha is a scaling factor that adjusts the magnitude of the combined result (original model output plus low-rank adaptation)
https://lightning.ai/pages/community/article/lora-llm/

lora_dropout

lora_dropout (float): The dropout probability for Lora layers.

Dans notre exemple, sa valeur est 0.1.

bitsandbytes

Cette librairie nous permet la quantization, c’est à dire la réduction en mémoire du modèle tout en gardant une performance similaire.

Experimentially, we have discovered that instead of using the 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half-precision, which halves the model size. It’d be amazing to cut it further, but the inference quality outcome starts to drop dramatically at lower precision.

To remediate that, we introduce 8-bit quantization. This method uses a quarter precision, thus needing only 1/4th of the model size! But it’s not done by just dropping another half of the bits.

Quantization is done by essentially “rounding” from one data type to another. For example, if one data type has the range 0..9 and another 0..4, then the value “4” in the first data type would be rounded to “2” in the second data type. However, if we have the value “3” in the first data type, it lies between 1 and 2 of the second data type, then we would usually round to “2”. This shows that both values “4” and “3” of the first data type have the same value “2” in the second data type. This highlights that quantization is a noisy process that can lead to information loss, a sort of lossy compression.
https://huggingface.co/blog/hf-bitsandbytes-integration

🤗 Transformers is closely integrated with most used modules on bitsandbytes. You can load your model in 8-bit precision with few lines of code. This is supported by most of the GPU hardwares since the 0.37.0 release of bitsandbytes.
https://huggingface.co/docs/transformers/main_classes/quantization

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

use_4bit

By using load_in_4bit=True when calling the .from_pretrained method, you can divide your memory use by 4 (roughly)
https://huggingface.co/docs/transformers/main_classes/quantization

Note that once a model has been loaded in 4-bit it is currently not possible to push the quantized weights on the Hub. Note also that you cannot train 4-bit weights as this is not supported yet.
https://huggingface.co/docs/transformers/main_classes/quantization

bnb_4bit_compute_dtype

bnb_4bit_compute_dtype (torch.dtype or str, optional, defaults to torch.float32) — This sets the computational type which might be different than the input time. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
https://huggingface.co/docs/transformers/main_classes/quantization

Le type retenu ici est float16

bnb_4bit_quant_type

bnb_4bit_quant_type (str, {fp4, nf4}, defaults to fp4) — This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types which are specified by fp4 or nf4.
https://huggingface.co/docs/transformers/main_classes/quantization

Dans notre cas, nf4

You can also use the NF4 data type, which is a new 4bit datatype adapted for weights that have been initialized using a normal distribution.
https://huggingface.co/docs/transformers/main_classes/quantization

use_nested_quant

We introduce nested quantization, a method that defines multiple quantization levels with nested quantization grids, and progressively refines all latents from the coarsest to the finest quantization level.
https://arxiv.org/abs/2102.02913

Pas de nested quantization pour nous. Pourquoi ? Je ne sais pas.

Suite dans le prochain article.