One of our previous articles covered the LAMBADA method that makes use of Natural Language Generation (NLG) to generate training utterances for a Natural Language Understanding (NLU) task, namely intent classification. In this tutorial we walk you through the code to reproduce our PoC implementation of LAMBADA.
Before you go ahead with this tutorial, we suggest having a look at our article that explains the fundamental ideas and concepts applied by LAMBADA in more detail. In this tutorial, we illustrate the crucial methods by providing an interactive Colab notebook. Overall, we explain some of the key points of the code and demonstrate how to adjust parameters to match your requirements, while omitting the less important parts. You can copy the notebook using your Google Account in order to follow along with the code. For training and testing, you can insert your own data or use the data we provide.
We use distilBERT as a classification model and GPT-2 as a text generation model. For both, we load pretrained weights and fine-tune them. In the case of GPT-2, we use the Huggingface Transformers library to bootstrap a pretrained model and subsequently fine-tune it. To load and fine-tune distilBERT, we use KTrain, a library that provides a high-level interface for language models, eliminating the need to worry about tokenization and other pre-processing tasks.
First, we install both libraries in our Colab runtime:
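A minimal sketch of the installation cell (version pins omitted; the notebook may pin specific releases):

```python
# Install KTrain (which pulls in TensorFlow) and Huggingface Transformers.
!pip install ktrain
!pip install transformers
```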
We use one of the pre-labelled chitchat data sets from Microsoft’s Azure QnA Maker. Next, we split the chitchat data set such that we obtain ten intents with ten utterances each as an initial training data set and the remaining 1047 samples as a test data set. In the following, we use the test data set in order to benchmark the different intent classifiers we train in this tutorial.
Subsequently, we load the training data from the file train.csv and split it in such a way that we obtain six utterances per intent for training and four utterances per intent for validation.
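A sketch of this step, assuming train.csv has the columns 'intent' and 'utterance'; a stratified 60/40 split yields exactly six and four utterances per intent when each intent has ten samples:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the initial training data (assumed columns: 'intent', 'utterance').
train_df = pd.read_csv('train.csv')

# Stratified split: six utterances per intent for training,
# four per intent for validation (4/10 = 40% validation share).
x_train, x_val, y_train, y_val = train_test_split(
    train_df['utterance'].tolist(),
    train_df['intent'].tolist(),
    test_size=0.4,
    stratify=train_df['intent'],
    random_state=42,
)
```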
We download the pretrained distilBERT model, transform the training and validation data from raw text into the format required by our model, and initialize a learner object, which KTrain uses to train the model.
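A sketch of this step with KTrain's Transformer wrapper; the sequence length and batch size are illustrative choices:

```python
import ktrain
from ktrain import text

# Wrap the pretrained distilBERT checkpoint; class_names are our ten intents.
t = text.Transformer('distilbert-base-uncased', maxlen=50,
                     class_names=sorted(set(y_train)))

# Tokenize and encode the raw strings for the model.
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)

# The learner object bundles the model and data for training.
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=6)
```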
Now it’s time to train the model. We feed the training data to the network multiple times, as specified by the number of epochs. In the beginning, both monitored metrics should indicate an improvement of the model with each epoch passed: the loss should decrease and the accuracy should increase. However, after training the model for a while, the validation loss will increase and the validation accuracy will drop. This is a result of overfitting the training data, and it is time to stop feeding the same data to the network.
The optimal number of epochs depends on your data set, model, and training parameters. If you do not know the right number of epochs beforehand, you can use a high number of epochs, activate checkpoints by setting the checkpoint_folder parameter, and select the best-performing model afterwards.
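A minimal training call; the learning rate and epoch count are illustrative values, not the notebook's exact settings:

```python
# Train with the one-cycle learning-rate policy; checkpoints let us
# recover the best-performing epoch if we train for too long.
learner.fit_onecycle(lr=3e-5, epochs=5, checkpoint_folder='/tmp/checkpoints')
```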
To check the performance of our trained classifier, we use our test data in the eval.csv file.
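A sketch of the evaluation step, assuming eval.csv mirrors the structure of train.csv:

```python
import numpy as np

# Load the held-out test data (assumed columns: 'intent', 'utterance').
eval_df = pd.read_csv('eval.csv')

# Wrap the trained model in a predictor that accepts raw strings.
predictor = ktrain.get_predictor(learner.model, preproc=t)

predictions = predictor.predict(eval_df['utterance'].tolist())
accuracy = np.mean(np.array(predictions) == eval_df['intent'].values)
print(f'Accuracy: {accuracy * 100:.2f}%')
```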
Note that thanks to the KTrain interface we can simply feed the list of utterances to the predictor without the need to pre-process the raw strings beforehand. We get the accuracy of our classifier as an output:
Accuracy: 84.24%
To fine-tune GPT-2, we use a Python script made available by Huggingface on their GitHub repository. Among others, we specify the following parameters:
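At the time of writing, the relevant example script was run_language_modeling.py in the Transformers repository; a hypothetical invocation (file names and hyperparameter values are illustrative assumptions) could look like this:

```python
# Fine-tune GPT-2 on a text file with one "intent,utterance" pair per line.
!python run_language_modeling.py \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=lambada_train.txt \
    --line_by_line \
    --num_train_epochs=3 \
    --output_dir=gpt2_finetuned \
    --overwrite_output_dir
```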
Let’s load our model and generate some utterances! To trigger the generation of new utterances for a specific intent, we provide the model with this intent as a seed (‘<intent>,’, e.g. ‘inform_hungry,’).
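A sketch of loading the fine-tuned model and sampling a few utterances; the output directory name follows the hypothetical fine-tuning call above, and the sampling parameters are illustrative:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2_finetuned')
gpt2 = GPT2LMHeadModel.from_pretrained('gpt2_finetuned')

# Seed the model with the intent name followed by a comma.
input_ids = tokenizer.encode('inform_hungry,', return_tensors='pt')
outputs = gpt2.generate(input_ids, max_length=30, do_sample=True,
                        top_k=50, top_p=0.95, num_return_sequences=5,
                        pad_token_id=tokenizer.eos_token_id)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```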
This looks good! The artificially generated utterances fit the intent, but in order to be a useful addition and to improve our model, these utterances must differ from utterances used for training. The training data for the intent inform_hungry was the following:
We can see that the two utterances “I want some food” and “I’m so hungry I could eat” are not part of the training data.
If we are not satisfied with our generated utterances because they are all very similar or if they do not match the underlying intent, we can adjust the variability of the generated output by modifying the following parameters:
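For instance, the sampling temperature and the top-k/top-p truncation thresholds of the generate method control how adventurous the sampling is (the values below are illustrative, not the notebook's settings):

```python
# Higher temperature and broader top_k/top_p yield more diverse output;
# lower values keep the generated utterances closer to the training data.
outputs = gpt2.generate(input_ids, max_length=30, do_sample=True,
                        temperature=1.2, top_k=100, top_p=0.98,
                        num_return_sequences=5,
                        pad_token_id=tokenizer.eos_token_id)
```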
We now generate the new utterances for all intents. To have a sufficiently large sample that we can choose the best utterances from, we generate 200 per intent.
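A sketch of the generation loop under the same assumptions as above; batching into rounds of ten sequences keeps memory usage in check:

```python
# Generate 200 candidate utterances per intent; the intent list is assumed
# to match the classes of our training data.
intents = sorted(set(y_train))
generated = []
for intent in intents:
    input_ids = tokenizer.encode(f'{intent},', return_tensors='pt')
    for _ in range(20):  # 20 rounds of 10 sequences = 200 candidates
        outputs = gpt2.generate(input_ids, max_length=30, do_sample=True,
                                top_k=50, top_p=0.95,
                                num_return_sequences=10,
                                pad_token_id=tokenizer.eos_token_id)
        for output in outputs:
            decoded = tokenizer.decode(output, skip_special_tokens=True)
            # Strip the "<intent>," seed so only the utterance remains.
            generated.append((intent, decoded.split(',', 1)[1].strip()))
```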
After a while the data is generated, and we can have a closer look at it. First, we use our old distilBERT classifier to predict the intent for all generated utterances. We also keep track of the prediction probability indicating the level of confidence of each individual prediction made by our model.
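A sketch of classifying the generated utterances and recording the confidence, using the predictor's return_proba option (the dataframe layout is a hypothetical continuation of the loop above):

```python
gen_df = pd.DataFrame(generated, columns=['intent', 'utterance'])

# Classify every generated utterance with the old model and record
# the winning class and its probability as a confidence score.
probas = predictor.predict(gen_df['utterance'].tolist(), return_proba=True)
classes = predictor.get_classes()
gen_df['predicted_intent'] = [classes[i] for i in probas.argmax(axis=1)]
gen_df['probability'] = probas.max(axis=1)
```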
Let’s have a look at some of the utterances for which the intent used for generation does not match the predicted intent.
We can see that in some cases the prediction is clearly wrong. However, there are also cases where the prediction matches the utterance, but doesn’t match the intent used for generation. This indicates that our GPT-2 model is not perfect as it doesn’t generate matching utterances for an intent all the time.
To avoid training our classifier on corrupt data, we drop all utterances for which the intent used for generation does not match the predicted intent. For those with matching intents, we only keep the ones with the highest prediction probability scores.
We can see that for each intent, there are at least 35 mutually distinct utterances. To keep a balanced data set, we pick the top 30 utterances per intent according to the prediction probability.
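A sketch of the filtering step with pandas, following the hypothetical column names above:

```python
# Keep only utterances whose generation intent agrees with the prediction,
# drop duplicates, and select the 30 most confident candidates per intent.
filtered = gen_df[gen_df['intent'] == gen_df['predicted_intent']]
filtered = filtered.drop_duplicates(subset=['intent', 'utterance'])
top30 = (filtered.sort_values('probability', ascending=False)
                 .groupby('intent')
                 .head(30))
```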
We now combine the generated data with the initial training data and split the enriched data set into training and validation data.
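A sketch of combining and re-splitting the data; the 80/20 split ratio is an illustrative choice:

```python
# Merge the filtered generated utterances with the original training data.
augmented = pd.concat([train_df, top30[['intent', 'utterance']]],
                      ignore_index=True)

x_train2, x_val2, y_train2, y_val2 = train_test_split(
    augmented['utterance'].tolist(),
    augmented['intent'].tolist(),
    test_size=0.2,
    stratify=augmented['intent'],
    random_state=42,
)
```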
Now it’s time to train our new intent classification model. The code is analogous to the one above:
Finally, we use our evaluation data set to check the accuracy of our new intent classifier.
We can see that the performance improved by a margin of 7%. Overall, the improvement in prediction accuracy was consistently more than 4% across all experiments we ran.
We employed the LAMBADA method to augment data used for Natural Language Understanding (NLU) tasks. We trained a GPT-2 model to generate new training utterances and utilized them as training data for our intent classification model (distilBERT). The performance of the intent classification model improved by at least 4% in each of our tests.
Additionally, we saw that high-level libraries such as KTrain and Huggingface Transformers reduce the complexity of applying state-of-the-art transformer models to Natural Language Generation (NLG) and other Natural Language Processing (NLP) tasks such as classification, making these approaches broadly applicable.