Fine-tuning RoBERTa base on SQuAD v2¶

This Jupyter notebook demonstrates the process of fine-tuning a RoBERTa model for extractive question answering using PyTorch and the Hugging Face 🤗 libraries (Transformers, Datasets, Evaluate, and Accelerate).

  • The fine-tuned model is available on the 🤗 Hub as etweedy/roberta-base-squad-v2 (see its model card for more details), and a demonstration app for the model is hosted as a 🤗 Space.
  • The model is a fine-tuned version of roberta-base for extractive question answering.
  • It was fine-tuned for context-based extractive question answering on the SQuAD v2 dataset, a dataset of English-language context-question-answer triples designed for training and benchmarking extractive question answering models.
  • Version 2 of SQuAD (Stanford Question Answering Dataset) contains the 100,000 examples from SQuAD Version 1.1, along with 50,000 additional "unanswerable" questions, i.e. questions whose answer cannot be found in the provided context.
  • The original RoBERTa (Robustly Optimized BERT Pretraining Approach) model was introduced in the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019) and released in the fairseq repository.

Load the necessary libraries

In [ ]:
!pip install -qU transformers evaluate accelerate
!pip install -qU torch torchvision torchaudio
!pip install -qU huggingface-hub
In [1]:
from huggingface_hub import notebook_login, HfApi, HfFolder
from datasets import load_dataset
import torch
import ipywidgets as widgets
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    DefaultDataCollator,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    default_data_collator,
    get_scheduler,
)
from accelerate import Accelerator
from tqdm.auto import tqdm
import numpy as np
import collections
import evaluate

Log in to the Hugging Face Hub with your API token:

In [10]:
notebook_login()

SQuAD v2 dataset¶

First, we'll download the dataset from the Hugging Face Hub.

Load the data¶

In [11]:
squad = load_dataset('squad_v2', use_auth_token=True)
Reusing dataset squad_v2 (/root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)
  0%|          | 0/2 [00:00<?, ?it/s]
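
Each record in squad_v2 contains 'id', 'title', 'context', 'question', and 'answers' fields; for unanswerable questions, the 'text' and 'answer_start' lists inside 'answers' are empty. As a quick sanity check, we can inspect the splits and one raw training record:

In [ ]:
# Show the sizes of the train/validation splits and one raw training record.
print(squad)
print(squad['train'][0])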

Tokenize and preprocess the data¶

In [12]:
model_checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_auth_token=True)

We will preprocess our training and validation datasets using a custom function preprocess_examples.

We'll do the following:

  • For each sample, tokenize the questions and context. All questions are fairly short, but the contexts can be quite long.
    • We use truncation while tokenizing the contexts to keep the pieces short.
    • This truncation is done according to the parameters return_overflowing_tokens, max_length, stride, and padding. Each tokenized context is broken into token sequences of length at most max_length, and consecutive sequences overlap by 128 tokens (in order to make sure the entire answer appears in at least one sequence).
    • Then all sequences are padded at the end using the padding token to become sequences of length max_length.
  • We'll use several important pieces of data from the output of the tokenizer:
    • The overflow_to_sample_mapping which, for each tokenized sequence, provides the index of the sample from where that sequence came.
    • The offset_mapping which, for each token in each tokenized sequence, provides a pair (start,end) giving the character positions spanned by that token in the sample.
    • The sequence_ids which, for each sequence, give a list containing entries 0 (for tokens coming from the question), 1 (for tokens coming from the context piece), and None (for special tokens)
  • Record for each sequence the starting and ending token position of the provided answer in the context:
    • If the answer is not in that sequence, record start_position = end_position = 0
    • If the answer is in that sequence, then:
      • retrieve the start and end positions of the context piece from sequence_ids
      • step inwards from start and end positions until we locate the answer, and record those positions
  • During evaluation, it will be helpful to have two additional columns. However, retrieving them slows down the mapping process, so we'll only include them for the evaluation set:
    • a modified version of the offset_mapping pairs - where the entries are the actual offset_mapping pairs for context tokens and None otherwise; this information comes from sequence_ids
    • a column containing the example_id the sequence came from
  • Drop the columns from the original training data, so that the resulting dataset has columns:
    • 'input_ids', 'attention_mask', 'start_positions', 'end_positions'
    • in the case of the validation set, also the modified 'offset_mapping' and 'example_id'

This is accomplished via a custom function preprocess_examples.
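
The actual preprocess_examples helper lives in lib.utils and isn't reproduced in this notebook. For reference, a minimal sketch of what such a function might look like - following the standard 🤗 Transformers question-answering preprocessing recipe, with max_length=384 and stride=128 assumed - is:

In [ ]:
# Hypothetical reconstruction of lib.utils.preprocess_examples; the values of
# max_length and stride are assumptions, not taken from lib.utils.
max_length = 384
stride = 128

def preprocess_examples(examples, tokenizer, is_test=False):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",       # only the context gets truncated
        stride=stride,                  # consecutive pieces overlap by `stride` tokens
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs["overflow_to_sample_mapping"]
    offset_mapping = inputs["offset_mapping"]
    answers = examples["answers"]
    start_positions, end_positions, example_ids = [], [], []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        sequence_ids = inputs.sequence_ids(i)
        example_ids.append(examples["id"][sample_idx])

        # Start and end token indices of the context piece in this sequence
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        answer = answers[sample_idx]
        if len(answer["answer_start"]) == 0:
            # Unanswerable question: label both positions with token index 0
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_char = answer["answer_start"][0]
            end_char = start_char + len(answer["text"][0])
            if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
                # The answer is not fully contained in this context piece
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Step inwards from the ends of the context until we locate the answer
                idx = context_start
                while idx <= context_end and offsets[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)
                idx = context_end
                while idx >= context_start and offsets[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)

        if is_test:
            # Keep offset pairs only for context tokens; None everywhere else
            offset_mapping[i] = [
                o if sequence_ids[k] == 1 else None for k, o in enumerate(offsets)
            ]

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    if is_test:
        inputs["example_id"] = example_ids
    else:
        inputs.pop("offset_mapping")
    inputs.pop("overflow_to_sample_mapping")
    return inputs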

Training data¶

In [13]:
from lib.utils import preprocess_examples

train_dataset = squad['train'].map(
    preprocess_examples,
    batched=True,
    remove_columns=squad['train'].column_names,
    fn_kwargs = {
        'tokenizer':tokenizer,
    }
)
Loading cached processed dataset at /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-9cf05c80fcf103ec.arrow

Validation data¶

Set is_test=True to retrieve the additional columns.

In [14]:
validation_dataset = squad['validation'].map(
    preprocess_examples,
    batched=True,
    remove_columns=squad['validation'].column_names,
    fn_kwargs = {
        'tokenizer':tokenizer,
        'is_test':True,
    }
)
  0%|          | 0/12 [00:00<?, ?ba/s]

Initialize dataloaders¶

We prepare our dataloaders. This training is best run on a CUDA device, i.e. a GPU. If you're running into 'CUDA out of memory' errors, try lowering the batch_size.

In [7]:
train_dataset.set_format("torch")
eval_dataset = validation_dataset.remove_columns(["example_id", "offset_mapping"])
eval_dataset.set_format("torch")

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=16
)
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=default_data_collator,
    batch_size=16
)

Training the model¶

We first get set up for training:

  • We'll use the 🤗 Accelerate library to handle training on our GPU. Accelerate also automatically handles distributed training if you're on a machine with multiple devices.
  • We'll use the AdamW (Adam with decoupled weight decay) optimization algorithm, with a linear learning rate scheduler.
  • We'll train for 3 epochs with a base learning rate of 3e-5.

Model, optimizer, accelerator, and learning rate scheduler¶

In [16]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

optimizer = AdamW(model.parameters(), lr=3e-5)

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

num_train_epochs=3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps,
)

output_dir = 'roberta-finetuned-squad-v2-accelerate'
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training loop¶

Finally, we'll execute our training loop. Notice that we don't need to explicitly move the model or batches to our CUDA device - 🤗 Accelerate handles this automatically.

In [9]:
from lib.utils import compute_metrics
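
compute_metrics is another helper from lib.utils that isn't reproduced in this notebook. Roughly speaking, for each validation example it collects the highest-scoring start/end logit pairs across all of that example's tokenized pieces, maps the best valid span back to characters using the modified offset_mapping (or predicts the empty string if no valid span is found), and scores the predictions with the squad_v2 metric from 🤗 Evaluate. A minimal sketch along those lines - with n_best=20 and max_answer_length=30 assumed - might look like:

In [ ]:
# Hypothetical reconstruction of lib.utils.compute_metrics; n_best and
# max_answer_length are assumptions. Uses collections, numpy (np), and
# evaluate, which were imported at the top of the notebook.
metric = evaluate.load("squad_v2")

def compute_metrics(start_logits, end_logits, features, examples,
                    n_best=20, max_answer_length=30):
    # Group feature (tokenized piece) indices by the example they came from
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip spans outside the context, reversed spans, and overly long spans
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    answers.append({
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    })

        if answers:
            best = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best["text"], "no_answer_probability": 0.0}
            )
        else:
            # No valid span found: predict "unanswerable"
            predicted_answers.append(
                {"id": example_id, "prediction_text": "", "no_answer_probability": 1.0}
            )

    references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=references)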
In [11]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training step
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    
    # Eval step    
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print('Evaluation!')
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)    
        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())
    
    # Concatenate logit arrays from batches    
    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]
    
    # Compute and report metrics
    metrics = compute_metrics(
        start_logits, end_logits, validation_dataset, squad['validation']
    )
    print(f"epoch {epoch}:", metrics)
    
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir,save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
  0%|          | 0/24717 [00:00<?, ?it/s]
Evaluation!
  0%|          | 0/761 [00:00<?, ?it/s]
  0%|          | 0/11873 [00:00<?, ?it/s]
epoch 0: {'exact': 77.7478312136781, 'f1': 80.8702752323304, 'total': 11873, 'HasAns_exact': 78.91363022941971, 'HasAns_f1': 85.16747264397094, 'HasAns_total': 5928, 'NoAns_exact': 76.58536585365853, 'NoAns_f1': 76.58536585365853, 'NoAns_total': 5945, 'best_exact': 77.7478312136781, 'best_exact_thresh': 0.0, 'best_f1': 80.87027523233047, 'best_f1_thresh': 0.0}
Evaluation!
  0%|          | 0/761 [00:00<?, ?it/s]
  0%|          | 0/11873 [00:00<?, ?it/s]
epoch 1: {'exact': 79.81133664617198, 'f1': 82.94151794597373, 'total': 11873, 'HasAns_exact': 78.2051282051282, 'HasAns_f1': 84.4744673705377, 'HasAns_total': 5928, 'NoAns_exact': 81.41295206055509, 'NoAns_f1': 81.41295206055509, 'NoAns_total': 5945, 'best_exact': 79.81133664617198, 'best_exact_thresh': 0.0, 'best_f1': 82.94151794597376, 'best_f1_thresh': 0.0}
Evaluation!
  0%|          | 0/761 [00:00<?, ?it/s]
  0%|          | 0/11873 [00:00<?, ?it/s]
epoch 2: {'exact': 80.45986692495578, 'f1': 83.52543495807724, 'total': 11873, 'HasAns_exact': 78.69433198380567, 'HasAns_f1': 84.83425932139885, 'HasAns_total': 5928, 'NoAns_exact': 82.22035323801514, 'NoAns_f1': 82.22035323801514, 'NoAns_total': 5945, 'best_exact': 80.45986692495578, 'best_exact_thresh': 0.0, 'best_f1': 83.52543495807726, 'best_f1_thresh': 0.0}
In [12]:
tokenizer.save_pretrained(output_dir)
Out[12]:
('roberta-finetuned-squad-v2-accelerate-run2/tokenizer_config.json',
 'roberta-finetuned-squad-v2-accelerate-run2/special_tokens_map.json',
 'roberta-finetuned-squad-v2-accelerate-run2/vocab.json',
 'roberta-finetuned-squad-v2-accelerate-run2/merges.txt',
 'roberta-finetuned-squad-v2-accelerate-run2/added_tokens.json',
 'roberta-finetuned-squad-v2-accelerate-run2/tokenizer.json')
Run 1:¶
  • 3 epochs
  • base_lr = 3e-5
  • linear scheduler
  • warmup = 0
In [ ]:
{'exact': 80.45986692495578,
 'f1': 83.52543495807724,
 'total': 11873,
 'HasAns_exact': 78.69433198380567,
 'HasAns_f1': 84.83425932139885,
 'HasAns_total': 5928,
 'NoAns_exact': 82.22035323801514,
 'NoAns_f1': 82.22035323801514,
 'NoAns_total': 5945}
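
Before we can load the fine-tuned model from the 🤗 Hub in the next section, it needs to be uploaded there. That upload step isn't shown in this notebook; a minimal sketch using push_to_hub (with the repository name taken from the inference section below) might look like:

In [ ]:
# Push the fine-tuned model and tokenizer saved in output_dir to the 🤗 Hub.
# Requires being logged in via notebook_login(); the repo name is assumed.
model_to_upload = AutoModelForQuestionAnswering.from_pretrained(output_dir)
tokenizer_to_upload = AutoTokenizer.from_pretrained(output_dir)
model_to_upload.push_to_hub('etweedy/roberta-base-squad-v2')
tokenizer_to_upload.push_to_hub('etweedy/roberta-base-squad-v2')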

Model inference via the Hugging Face Hub¶

Running inference with our saved model doesn't require much code - but don't forget to set the handle_impossible_answer option so that the pipeline handles unanswerable questions correctly: it will output '' (the empty string) for such a question.

In [17]:
repo_id = 'etweedy/roberta-base-squad-v2'
model = AutoModelForQuestionAnswering.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
Downloading (…)lve/main/config.json:   0%|          | 0.00/681 [00:00<?, ?B/s]
Downloading model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]
Downloading (…)okenizer_config.json:   0%|          | 0.00/351 [00:00<?, ?B/s]
Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]
Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]
Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]
In [18]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

repo_id = "etweedy/roberta-base-squad-v2"

QA_pipeline = pipeline('question-answering', model=repo_id, tokenizer=repo_id, handle_impossible_answer=True)
In [19]:
input = {
    'question': 'Who invented Twinkies?',
    'context': 'Twinkies were invented on April 6, 1930, by Canadian-born baker James Alexander Dewar for the Continental Baking Company in Schiller Park, Illinois.'
}
response = QA_pipeline(**input)
response
Out[19]:
{'score': 0.9599111080169678,
 'start': 64,
 'end': 85,
 'answer': 'James Alexander Dewar'}
In [21]:
input = {
    'question': 'When was James Alexander Dewar born?',
    'context': 'Twinkies were invented on April 6, 1930, by Canadian-born baker James Alexander Dewar for the Continental Baking Company in Schiller Park, Illinois.'
}
response = QA_pipeline(**input)
response
Out[21]:
{'score': 0.9994915127754211, 'start': 0, 'end': 0, 'answer': ''}