LAMBADA Method: How to use Data Augmentation in NLU?


Better NLG and NLU with Data Augmentation

One of our previous articles covered the LAMBADA method that makes use of Natural Language Generation (NLG) to generate training utterances for a Natural Language Understanding (NLU) task, namely intent classification. In this tutorial we walk you through the code to reproduce our PoC implementation of LAMBADA.

Before you go ahead with this tutorial, we suggest having a look at our article that explains the fundamental ideas and concepts applied by LAMBADA in more detail. In this tutorial we illustrate the crucial methods in an interactive COLAB notebook. We explain the key points of the code and demonstrate how to adjust parameters to match your requirements, while omitting the less important parts. You can copy the notebook using your Google Account in order to follow along with the code. For training and testing you can insert your own data or use the data that we provide.

Data Augmentation in NLU: Step 1 – Setting up the environment

We use distilBERT as a classification model and GPT-2 as a text generation model. For both, we load pretrained weights and fine-tune them. In the case of GPT-2 we use the Huggingface Transformers library to bootstrap a pretrained model and subsequently to fine-tune it. To load and fine-tune distilBERT we use KTrain, a library that provides a high-level interface for language models, eliminating the need to worry about tokenization and other pre-processing tasks.

First, we install both libraries in our COLAB runtime:

!pip install ktrain 
!pip install transformers

Chatbots with Data Augmentation: Step 2 – Data

We use one of the pre-labelled chitchat data sets from Microsoft’s Azure QnA Maker. Next, we split the chitchat data set such that we obtain ten intents with ten utterances each as an initial training data set and the remaining 1047 samples as a test data set. In the following, we use the test data set in order to benchmark the different intent classifiers we train in this tutorial.
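As a rough sketch of how such a few-shot split could be produced, the snippet below draws a fixed number of utterances per intent for training and leaves the rest for testing. The helper name split_few_shot, the random seed and the toy rows are our own assumptions for illustration; this is not the code that produced train.csv and eval.csv.

```python
import random
from collections import defaultdict

def split_few_shot(rows, per_intent=10, seed=42):
    """Pick `per_intent` utterances per intent for training; the rest is test data.

    `rows` is a list of (intent, utterance) tuples. Hypothetical helper,
    not part of the original notebook.
    """
    by_intent = defaultdict(list)
    for intent, utterance in rows:
        by_intent[intent].append(utterance)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for intent, utterances in by_intent.items():
        shuffled = utterances[:]
        rng.shuffle(shuffled)
        train += [(intent, u) for u in shuffled[:per_intent]]
        test += [(intent, u) for u in shuffled[per_intent:]]
    return train, test

# Toy data: 3 intents with 5 utterances each (made up for illustration).
toy_rows = [(f"intent_{i}", f"utterance {i}-{j}") for i in range(3) for j in range(5)]
train_rows, test_rows = split_few_shot(toy_rows, per_intent=2)
print(len(train_rows), len(test_rows))  # 6 9
```

Sampling per intent rather than over the whole data set keeps the small training set balanced, which matters when each intent only has a handful of examples.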

Subsequently, we load the training data from the file train.csv and split it so that we obtain six utterances per intent for training and four utterances per intent for validation.


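import pandas
from sklearn.model_selection import train_test_split

data_train = pandas.read_csv('train.csv')
intents = data_train['intent'].unique()

X_train = []
X_valid = []
y_train = []
y_valid = []
for intent in intents:
    # Split each intent separately so that every intent gets the same share.
    intent_X_train, intent_X_valid, intent_y_train, intent_y_valid = train_test_split(
        data_train[data_train['intent'] == intent]['utterance'],
        data_train[data_train['intent'] == intent]['intent'],
        # Assumption: the remaining arguments were cut off in the source;
        # train_size=6 yields the 6/4 split described above.
        train_size=6,
    )
    X_train.extend(intent_X_train)
    X_valid.extend(intent_X_valid)
    y_train.extend(intent_y_train)
    y_valid.extend(intent_y_valid)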


NLU with the LAMBADA Method: Step 3 – Training the initial intent classifier

We download the pretrained distilBERT model, transform the training and validation data from plain text into the format our model expects, and initialize a learner object, which KTrain uses to train the model.

import ktrain
from ktrain import text

# Download the pretrained distilBERT model and configure it for our intent labels.
distil_bert = text.Transformer('distilbert-base-cased', maxlen=50, classes=intents)
# Tokenize and encode the raw utterances.
processed_train = distil_bert.preprocess_train(X_train, y_train)
processed_test = distil_bert.preprocess_test(X_valid, y_valid)
model = distil_bert.get_classifier()
learner = ktrain.get_learner(model, train_data=processed_train,
                             val_data=processed_test, batch_size=10)

Now it’s time to train the model. We feed the training data to the network multiple times, as specified by the number of epochs. In the beginning both monitored metrics should indicate that the model improves with each epoch: the loss decreases and the accuracy increases. However, after training the model for a while the validation loss will start to increase and the validation accuracy will drop. This is a result of overfitting the training data, and it is time to stop feeding the same data to the network.

The optimal number of epochs depends on your data set, model and training parameters. If you do not know the right number of epochs beforehand you can use a high number of epochs and activate checkpoints by setting the checkpoint_folder parameter to select the best performing model afterwards.

N_TRAINING_EPOCHS = 12  # the training log below shows 12 epochs
learner.fit_onecycle(5e-5, N_TRAINING_EPOCHS)
begin training using onecycle policy with max lr of 5e-05...
Train for 6 steps, validate for 2 steps
Epoch 1/12
6/6 [==============================] - 8s 1s/step - loss: 2.3111 - accuracy: 0.0667 - val_loss: 2.2881 - val_accuracy: 0.0750
Epoch 2/12
6/6 [==============================] - 0s 68ms/step - loss: 2.3042 - accuracy: 0.0833 - val_loss: 2.2772 - val_accuracy: 0.1750
...
Epoch 11/12
6/6 [==============================] - 0s 66ms/step - loss: 0.4738 - accuracy: 1.0000 - val_loss: 0.9785 - val_accuracy: 0.8500
Epoch 12/12
6/6 [==============================] - 0s 68ms/step - loss: 0.4434 - accuracy: 1.0000 - val_loss: 0.9687 - val_accuracy: 0.8500

<tensorflow.python.keras.callbacks.History at 0x7fa19a7134a8>
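If you train for many epochs and keep checkpoints, you still have to decide which epoch to keep. A minimal sketch of that decision is shown here: pick the epoch with the lowest validation loss. The helper best_epoch and the loss values are made up for illustration; they are not taken from the training run above.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Made-up validation losses: improving at first, then rising again (overfitting).
example_val_losses = [2.29, 1.80, 1.35, 1.05, 0.97, 1.02, 1.10]
print(best_epoch(example_val_losses))  # 5
```

In this toy example the model starts to overfit after epoch 5, so the checkpoint written after that epoch would be the one to load.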

To check the performance of our trained classifier, we use our test data in the eval.csv file.

import numpy

data_test = pandas.read_csv('eval.csv')
test_intents = data_test["intent"].tolist()
test_utterances = data_test["utterance"].tolist()

predictor = ktrain.get_predictor(learner.model, preproc=distil_bert)
predictions = predictor.predict(test_utterances)

np_test_intents = numpy.array(test_intents)
np_predictions = numpy.array(predictions)

result = (np_test_intents == np_predictions)

print("Accuracy: {:.2f}%".format(result.sum()/len(result)*100))

Note that thanks to the KTrain interface we can simply feed the list of utterances to the predictor without the need to pre-process the raw strings beforehand. We get the accuracy of our classifier as an output:

Accuracy: 84.24%
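A single accuracy figure can hide intents that perform poorly. As a small sketch, the overall number could be broken down per intent like this; the helper accuracy_per_intent and the toy labels are our own illustration, not results from the notebook.

```python
from collections import Counter

def accuracy_per_intent(true_intents, predicted_intents):
    """Share of correctly classified utterances for every true intent."""
    total = Counter(true_intents)
    correct = Counter(t for t, p in zip(true_intents, predicted_intents) if t == p)
    return {intent: correct[intent] / total[intent] for intent in total}

# Toy labels, made up for illustration.
toy_true = ["get_location", "get_location", "make_sing", "make_sing"]
toy_pred = ["get_location", "ask_purpose", "make_sing", "make_sing"]
per_intent = accuracy_per_intent(toy_true, toy_pred)
print(per_intent)  # {'get_location': 0.5, 'make_sing': 1.0}
```

With the real data you would pass test_intents and predictions from the code above instead of the toy lists.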

NLU with the LAMBADA Method: Step 4 – Fine-tune GPT-2 to generate utterances

To fine-tune GPT-2, we use a Python script made available by Huggingface in their GitHub repository. Among others, we specify the following parameters:

  • the pretrained model that we want to use (gpt2-medium). Larger models typically generate better text output. Please note that these models require a large amount of memory during training, so make sure you pick a model that fits into your (GPU) memory.
  • the number of epochs. This parameter specifies how many times the training data is fed through the network. If the number of epochs is too small, the model will not learn to generate useful utterances. If it is too large, the model will likely overfit and the variability of the generated text will be limited: the model will basically just memorize the training data.
  • the batch size. This determines how many utterances are used for training in parallel. The larger the batch size, the faster the training; larger batch sizes require more memory, though.
  • the block size. The block size defines an upper bound on the number of tokens that are used from each training data instance. Make sure that this number is large enough so that utterances are not cropped.
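The fine-tuning script reads its training data from a plain text file. Judging from the seed format and the training examples shown later in this article, each line has the form <intent>,<utterance>. A minimal sketch of writing such a file follows; the helper name and the file name train_gpt2.txt are assumptions for illustration.

```python
import os
import tempfile

def write_finetuning_file(rows, path):
    """Write one '<intent>,<utterance>' line per training example."""
    with open(path, "w", encoding="utf-8") as f:
        for intent, utterance in rows:
            f.write(f"{intent},{utterance}\n")

rows = [("inform_hungry", "I want a snack"), ("inform_hungry", "Need food")]
path = os.path.join(tempfile.mkdtemp(), "train_gpt2.txt")  # hypothetical file name
write_finetuning_file(rows, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['inform_hungry,I want a snack', 'inform_hungry,Need food']
```

Putting the intent first is what later allows us to steer generation: the model learns that text after 'inform_hungry,' should be an utterance for that intent.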

Let’s load our model and generate some utterances! To trigger the generation of new utterances for a specific intent we provide the model with this intent as seed ('<intent>,', e.g. ‘inform_hungry,’).

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
# Load the fine-tuned PyTorch checkpoint into a TensorFlow model.
model = TFGPT2LMHeadModel.from_pretrained('/content/transformers/output/',
    pad_token_id=tokenizer.eos_token_id, from_pt=True)

# The intent followed by a comma is the seed for generation.
input_ids = tokenizer.encode('inform_hungry,', return_tensors='tf')
# NOTE: the arguments of this call were cut off in the source; the values
# below are illustrative assumptions, not the original settings.
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.92,
    num_return_sequences=4,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: inform_hungry,I want a snack!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
1: inform_hungry,I want to eat!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2: inform_hungry,I want some food!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
3: inform_hungry,I'm so hungry I could eat!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!...

This looks good! The artificially generated utterances fit the intent, but in order to be a useful addition and to improve our model, these utterances must differ from utterances used for training. The training data for the intent inform_hungry was the following:

Intent: Utterance:

inform_hungry,I want a snack

inform_hungry,I am very hungry

inform_hungry,I’m hangry

inform_hungry,Need food

inform_hungry,I want to eat

inform_hungry,I’m a bit peckish

inform_hungry,My stomach is rumbling

inform_hungry,I’m so hungry I could eat a horse

inform_hungry,I’m feeling hangry

inform_hungry,I could eat

We can see that the two utterances “I want some food” and “I’m so hungry I could eat” are not part of the training data.
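This comparison can also be done programmatically. The sketch below strips the seed and the trailing padding characters from the generated lines and keeps only utterances that are not in the training data. The helper clean_generated is hypothetical, the generated lines are shortened versions of the output above, and apostrophes are written as straight quotes for simplicity.

```python
def clean_generated(text, intent):
    """Strip the '<intent>,' seed and trailing '!', '.' and spaces from a generated line."""
    utterance = text[len(intent) + 1:]
    return utterance.rstrip("!. ").strip()

training = {
    "I want a snack", "I am very hungry", "I'm hangry", "Need food", "I want to eat",
    "I'm a bit peckish", "My stomach is rumbling", "I'm so hungry I could eat a horse",
    "I'm feeling hangry", "I could eat",
}
generated = [
    "inform_hungry,I want a snack!!!!",
    "inform_hungry,I want to eat!!!!",
    "inform_hungry,I want some food!!!!",
    "inform_hungry,I'm so hungry I could eat!!!!",
]
novel = [
    utterance
    for utterance in (clean_generated(g, "inform_hungry") for g in generated)
    if utterance not in training
]
print(novel)  # ['I want some food', "I'm so hungry I could eat"]
```

Note that stripping trailing punctuation this aggressively also removes a genuine final full stop or exclamation mark, which is acceptable for a novelty check but may not be what you want for the final training data.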

If we are not satisfied with our generated utterances because they are all very similar or if they do not match the underlying intent, we can adjust the variability of the generated output by modifying the following parameters:

  • do_sample. This parameter must be set to True, otherwise the model will keep returning the same output.
  • top_k. This parameter specifies the number of most likely tokens that are considered for sampling at each step. The higher you set this parameter, the more diverse the output will be.
  • top_p. This parameter specifies the cumulative probability of the most likely tokens considered for sampling, e.g. with top_p = 0.92 the model samples from the smallest set of most likely tokens whose probabilities add up to at least 92%. The higher top_p, the more diverse the output. The maximum value is 1.
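To make the effect of top_k and top_p more tangible, here is a simplified sketch of both filters applied to a toy next-token distribution. It illustrates the idea only and is not the implementation used inside the Transformers library; the function names and the probabilities are made up.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Toy distribution over the next token (made up for illustration).
toy = {"food": 0.5, "snack": 0.3, "pizza": 0.15, "horse": 0.05}
print(list(top_k_filter(toy, 2)))     # ['food', 'snack']
print(list(top_p_filter(toy, 0.92)))  # ['food', 'snack', 'pizza']
```

With do_sample enabled, the next token is then drawn at random from the filtered distribution, so a larger k or p leaves more candidates and produces more varied utterances.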

Step 5 – Generate and filter new utterances

We now generate the new utterances for all intents. To have a sufficiently large sample that we can choose the best utterances from, we generate 200 per intent.

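NUMBER_OF_GENERATED_UTTERANCES_PER_INTENT = 200  # 200 per intent, as stated above

def generate_utterances_df(n_generated, tokenizer, model, intent):
  input_ids = tokenizer.encode(intent + ',', return_tensors='tf')
  # NOTE: the arguments of this call were cut off in the source; the sampling
  # values below are illustrative assumptions.
  sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.92,
    num_return_sequences=n_generated,
  )

  # Pair each decoded utterance (without the '<intent>,' seed) with its intent.
  list_of_intent_and_utterances = [
    (intent, tokenizer.decode(sample_output, skip_special_tokens=True)[len(intent)+1:])
    for sample_output in sample_outputs
  ]

  return pandas.DataFrame(list_of_intent_and_utterances, columns=['intent', 'utterance'])

intents = data_train["intent"].unique()

generated_frames = []

for intent in intents:
  print("Generating for intent " + intent)
  generated_frames.append(generate_utterances_df(NUMBER_OF_GENERATED_UTTERANCES_PER_INTENT, tokenizer, model, intent))

# DataFrame.append was removed in pandas 2.0, so the per-intent frames are concatenated instead.
generated_utterances_df = pandas.concat(generated_frames)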

After a while the data is generated, and we can have a closer look at it. First, we use our old distilBERT classifier to predict the intent for all generated utterances. We also keep track of the prediction probability indicating the level of confidence of each individual prediction made by our model.

# Assumption: generated_data refers to the utterances generated above.
generated_data = generated_utterances_df

predictions_for_generated = numpy.array(predictor.predict(generated_data['utterance'].tolist()))
# return_proba=True returns the class probabilities instead of the predicted label.
proba_for_predictions_for_gen = predictor.predict(generated_data['utterance'].tolist(), return_proba=True)
predicted_proba = numpy.array([max(probas) for probas in proba_for_predictions_for_gen])

generated_data_predicted = pandas.DataFrame({"intent": generated_data['intent'],
                                             "utterance": generated_data['utterance'],
                                             "predicted_intent": predictions_for_generated,
                                             "prediction_proba": predicted_proba})

     intent                 utterance              predicted_intent       prediction_proba
0    body_related_question  Do you chew?           body_related_question  0.701058
1    body_related_question  Do you have a stomach  body_related_question  0.737520

Let’s have a look at some of the utterances for which the intent used for generation does not match the predicted intent.

generated_data_predicted[generated_data_predicted['intent'] !=
    generated_data_predicted['predicted_intent']]

     intent        utterance                                   predicted_intent       prediction_proba
7    ask_purpose   Where do you live?                          get_location           0.745748
20   ask_purpose   What was your greatest passion growing up?  needs_love             0.192401
70   ask_purpose   Why are you here?                           get_location           0.683455
182  ask_purpose   Where are you from?                         get_location           0.697122
49   get_location  Are you in a computer?                      body_related_question  0.498938
162  get_location  Tell me what you’re doing                   ask_purpose            0.571899
3    make_sing     I sing a song                               inform_hungry          0.358060
18   make_sing     I want to sing                              inform_hungry          0.604815
20   make_sing     You’re so cute                              needs_love             0.266433
41   make_sing     You’re singing                              inform_hungry          0.323076

We can see that in some cases the prediction is clearly wrong. However, there are also cases where the prediction matches the utterance, but doesn’t match the intent used for generation. This indicates that our GPT-2 model is not perfect as it doesn’t generate matching utterances for an intent all the time.

To avoid training our classifier on corrupt data, we drop all utterances for which the intent used for generation does not match the predicted intent. Of the remaining utterances, we keep only those with the highest prediction probabilities.

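correctly_predicted_data = generated_data_predicted[
    generated_data_predicted['intent'] == generated_data_predicted['predicted_intent']]
# Count the distinct rows that remain per intent.
# Assumption: the start of this statement was lost in the source; the sort keys mirror the ones used further below.
correctly_predicted_data.sort_values(by=['intent', 'prediction_proba'],
    ascending=[True, False]).drop_duplicates(keep='first').groupby('intent').count()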

              utterance  predicted_intent  prediction_proba
intent
suicide risk         67                67                67

We can see that for each intent, there are at least 35 mutually distinct utterances. To keep a balanced data set, we pick the top 30 utterances per intent according to the prediction probability.

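TOP_N = 30
# Assumption: the end of this statement was cut off in the source; sorting by
# descending probability and taking the first TOP_N rows per intent matches the description above.
top_predictions_per_intent = correctly_predicted_data.drop_duplicates(subset=
    'utterance', keep='first').sort_values(by=['intent', 'prediction_proba'],
    ascending=[True, False]).groupby('intent').head(TOP_N)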
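To spell out the filtering logic step by step, here is a plain Python sketch of the same idea on a handful of rows. The rows are taken from the tables above, with one row repeated on purpose to show the deduplication; the helper filter_generated is hypothetical and not part of the notebook.

```python
def filter_generated(rows, top_n):
    """Keep the top_n most confident, distinct utterances per intent."""
    # 1. Keep only rows where the classifier agrees with the intent used for generation.
    agreed = [r for r in rows if r["intent"] == r["predicted_intent"]]
    # 2. Sort by confidence so that the most confident utterances come first.
    agreed.sort(key=lambda r: r["prediction_proba"], reverse=True)
    # 3. Drop duplicate utterances and keep at most top_n per intent.
    seen, kept, per_intent = set(), [], {}
    for r in agreed:
        if r["utterance"] in seen:
            continue
        seen.add(r["utterance"])
        if per_intent.get(r["intent"], 0) < top_n:
            kept.append(r)
            per_intent[r["intent"]] = per_intent.get(r["intent"], 0) + 1
    return kept

def row(intent, utterance, predicted_intent, proba):
    return {"intent": intent, "utterance": utterance,
            "predicted_intent": predicted_intent, "prediction_proba": proba}

rows = [
    row("body_related_question", "Do you chew?", "body_related_question", 0.701058),
    row("body_related_question", "Do you have a stomach", "body_related_question", 0.737520),
    row("body_related_question", "Do you chew?", "body_related_question", 0.701058),  # duplicate
    row("ask_purpose", "Where do you live?", "get_location", 0.745748),   # mismatch, dropped
    row("make_sing", "I want to sing", "inform_hungry", 0.604815),        # mismatch, dropped
]
kept = filter_generated(rows, top_n=2)
print([r["utterance"] for r in kept])  # ['Do you have a stomach', 'Do you chew?']
```

Using the initial classifier as a filter means that only utterances it already recognizes with some confidence survive, which trades a bit of diversity for cleaner labels.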

NLU with LAMBADA’s Data Augmentation: Step 6 – Train the intent classifier with augmented data

We now combine the generated data with the initial training data and split the enriched data set into training and validation data.

# DataFrame.append was removed in pandas 2.0; pandas.concat does the same job.
data_train_aug = pandas.concat([data_train, top_predictions_per_intent[['intent', 'utterance']]], ignore_index=True)

intents = data_train_aug['intent'].unique()

X_train_aug = []
X_valid_aug = []
y_train_aug = []
y_valid_aug = []
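for intent in intents:
    intent_X_train, intent_X_valid, intent_y_train, intent_y_valid = train_test_split(
        data_train_aug[data_train_aug['intent'] == intent]['utterance'],
        data_train_aug[data_train_aug['intent'] == intent]['intent'],
        # Assumption: the remaining arguments were cut off in the source;
        # the 60/40 ratio simply mirrors the first split.
        train_size=0.6,
    )
    X_train_aug.extend(intent_X_train)
    X_valid_aug.extend(intent_X_valid)
    y_train_aug.extend(intent_y_train)
    y_valid_aug.extend(intent_y_valid)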


Now it’s time to train our new intent classification model. The code is like the one above:

distil_bert_augmented = text.Transformer('distilbert-base-cased', maxlen=50, classes=intents)
processed_train_aug = distil_bert_augmented.preprocess_train(X_train_aug, y_train_aug)
processed_test_aug = distil_bert_augmented.preprocess_test(X_valid_aug, y_valid_aug)
model_aug = distil_bert_augmented.get_classifier()
# Further arguments of this call (e.g. batch_size) were cut off in the source.
learner_aug = ktrain.get_learner(model_aug, train_data=processed_train_aug, val_data=processed_test_aug)
learner_aug.fit_onecycle(5e-5, N_TRAINING_EPOCHS_AUGMENTED)

Finally, we use our evaluation data set to check the accuracy of our new intent classifier.

Accuracy: 91.40%

We can see that the accuracy improved by roughly 7 percentage points, from 84.24% to 91.40%. Overall, the improvement in prediction accuracy was consistently more than 4% across all experiments we ran.
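To put the two accuracy figures into perspective, the short calculation below derives the absolute gain, the relative gain and the approximate number of misclassified test utterances before and after augmentation. It only uses the numbers reported in this article (84.24%, 91.40% and 1047 test samples); the variable names are our own.

```python
baseline_acc = 84.24   # accuracy (%) of the initial classifier
augmented_acc = 91.40  # accuracy (%) after adding generated utterances
n_test = 1047          # size of the test data set

gain_points = augmented_acc - baseline_acc
relative_gain = gain_points / baseline_acc * 100
errors_before = round(n_test * (100 - baseline_acc) / 100)
errors_after = round(n_test * (100 - augmented_acc) / 100)

print(f"{gain_points:.2f} percentage points")  # 7.16 percentage points
print(f"{relative_gain:.1f}% relative")        # 8.5% relative
print(errors_before, errors_after)             # 165 90
```

In other words, the number of misclassified test utterances drops from about 165 to about 90.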

Data Augmentation in NLU: Summary

We employed the LAMBADA method to augment data used for Natural Language Understanding (NLU) tasks. We trained a GPT-2 model to generate new training utterances and utilized them as training data for our intent classification model (distilBERT). The performance of the intent classification model improved by at least 4% in each of our tests.

Additionally, we saw that high-level libraries such as KTrain and Huggingface Transformers help to reduce the complexity of applying state-of-the-art transformer models for Natural Language Generation (NLG) and other Natural Language Processing (NLP) tasks such as classification and make these approaches broadly applicable.
