Tuning Llama 2 2/n

This article is the continuation of 1/n.

The reference article chosen to explain Llama 2 tuning is this one: Fine-Tune Your Own Llama 2 Model in a Colab Notebook – A practical introduction to LLM fine-tuning, by Maxime Labonne.

The article dates from the end of July 2023.

I made a few changes to the code to adapt it to my needs. My notebook is here and the original notebook is here. There are few differences between the two; they are pointed out in this article.

TrainingArguments parameters

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25
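
For context, these variables are not used on their own; later in the notebook they are gathered into a single transformers.TrainingArguments object. A minimal sketch of that step, assuming the variables above are in scope (argument names follow the standard transformers API):

from transformers import TrainingArguments

# Collect the hyperparameters defined above into one configuration object
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)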

I won't comment on every training parameter, because many of these values come more from experience than from theoretical considerations. For example, why a batch size of 4 rather than 8 as is often used, why a learning rate of 2e-4, why weight_decay = 0.001, and so on.

What surprises me most is the number of epochs. Why only one?

As a reminder:

In easy words:
Epoch: one epoch is one full pass over the entire dataset.
Steps: in TensorFlow, the total number of steps is the number of epochs multiplied by the number of examples, divided by the batch size.

https://stackoverflow.com/questions/38340311/what-is-the-difference-between-steps-and-epochs-in-tensorflow
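
To make the relationship concrete, here is the arithmetic with the values used here (the dataset size of 1,000 examples is a hypothetical figure, only for illustration):

# Hypothetical dataset size, for illustration only
num_examples = 1000

# With per_device_train_batch_size = 4 and gradient_accumulation_steps = 1,
# one update step consumes 4 examples
steps_per_epoch = num_examples // (per_device_train_batch_size * gradient_accumulation_steps)  # 250

# With num_train_epochs = 1, the total number of update steps is simply
total_steps = steps_per_epoch * num_train_epochs  # 250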

From what I have read, Llama 2 was trained with the AdamW optimizer, which explains why it shows up here (the same goes for the learning-rate schedule, even though the parameter values are different).

Optimizer and Hyperparameters: Llama 2 models are trained using the AdamW optimizer with specific hyperparameters. The learning rate schedule follows a cosine decay, with a weight decay of 0.1 and gradient clipping of 1.0. The models use a warmup strategy of 2,000 steps to stabilize training, and the learning rate and batch size vary according to the model size.

https://www.e2enetworks.com/blog/llama-2-the-new-open-source-language-model
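
Putting the two sets of values side by side (the pretraining figures come from the quote above, the fine-tuning figures from the notebook) makes the differences easier to see; paged_adamw_32bit is a paged variant of AdamW provided through bitsandbytes:

# Hyperparameters reported for Llama 2 pretraining (from the quote above)
llama2_pretraining = {
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "weight_decay": 0.1,
    "gradient_clipping": 1.0,
    "warmup": "2,000 steps",
}

# Hyperparameters used for fine-tuning in the notebook
fine_tuning = {
    "optimizer": "paged_adamw_32bit",  # paged AdamW variant
    "lr_schedule": "cosine",
    "weight_decay": 0.001,
    "gradient_clipping": 0.3,  # max_grad_norm
    "warmup": "warmup_ratio = 0.03",
}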

SFT Parameters

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

As a reminder, SFT (Supervised Fine-Tuning) is:

However, in some cases, updating the knowledge of the model is not enough and you want to modify the behavior of the LLM. In these situations, you will need a supervised fine-tuning (SFT) dataset, which is a collection of prompts and their corresponding responses. SFT datasets can be manually curated by users or generated by other LLMs. Supervised fine-tuning is especially important for LLMs such as ChatGPT, which have been designed to follow user instructions and stay on a specific task across long stretches of text.

http://wiki.backprop.fr/index.php?title=SFT
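
As an illustration, one entry in such a dataset is just a prompt and the expected answer. For a Llama 2 chat model, the pair is usually serialized with the [INST] ... [/INST] template; the content below is a made-up example, not taken from the notebook's dataset:

# A made-up SFT example: a prompt and the response the model should learn
sft_example = {
    "text": "<s>[INST] What does gradient checkpointing do? [/INST] "
            "It trades compute for memory by recomputing activations "
            "during the backward pass instead of storing them. </s>"
}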

max_seq_length is left at None, which means the default value applies, since the HF documentation says this:

Make sure to pass a correct value for max_seq_length as the default value will be set to min(tokenizer.model_max_length, 1024).

https://huggingface.co/docs/trl/main/en/sft_trainer

Packing is set to False.

SFTTrainer supports example packing, where multiple short examples are packed in the same input sequence to increase training efficiency. This is done with the ConstantLengthDataset utility class that returns constant length chunks of tokens from a stream of examples. 

https://huggingface.co/docs/trl/main/en/sft_trainer

I don't know whether setting packing to True fundamentally changes the training…
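
To my understanding, packing works roughly like this: tokenized examples are concatenated into one stream, and the stream is cut into fixed-length chunks, so that little padding is wasted. A simplified, pure-Python sketch of the idea (not trl's actual implementation):

# Simplified illustration of packing: concatenate tokenized examples,
# then cut the stream into constant-length chunks (here, length 8)
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14]]

stream = [tok for ex in examples for tok in ex]
chunk_size = 8
packed = [stream[i:i + chunk_size] for i in range(0, len(stream) - chunk_size + 1, chunk_size)]
# packed == [[1, 2, 3, 4, 5, 6, 7, 8]]; leftover tokens wait for more examples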

As for the rest, it is logical that device_map = {"": 0}, since we are working on a GPU.
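
For completeness, here is roughly where these three values end up: device_map goes to the model loading call, while max_seq_length and packing go to the SFTTrainer. This is only a sketch, assuming the trl and transformers APIs as they were at the time of the article; model_name, dataset, peft_config and tokenizer are assumed to be defined elsewhere in the notebook, training_arguments is the object built above, and other arguments used in the notebook are omitted:

from transformers import AutoModelForCausalLM
from trl import SFTTrainer

# device_map places the whole model on GPU 0 at load time
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)

# max_seq_length and packing are passed to the supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)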

