Methods from linguistics, combined with big data techniques and artificial intelligence (AI), enable computers to recognize and process natural language.
In practice, natural language processing is a diverse set of tools and techniques that enable a computer to process written or spoken human language. The steps for any particular goal may vary, but some are almost universal: for example, first detecting the language used in a text and then normalizing the text according to the rules of the detected language. Each step frequently mixes rules curated by humans with rules or models inferred by machine learning. However, with larger amounts of training data becoming available and computing power to match, hand-crafted rules are increasingly unnecessary, although they often remain helpful for covering special cases.
A typical NLP pipeline might start like this:
Language Detection
The simplest way to decide on the language of a text is to look for words that are typical of that language. The presence of “a”, “and”, “is”, “are” and so on is a strong indicator that the text is English.
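As a rough sketch of this idea in Python (the word lists and the two-language setup are illustrative assumptions, not a complete detector):

```python
# Illustrative stopword-based language guess; the word lists and the
# two-language setup are assumptions made for this sketch.
STOPWORDS = {
    "en": {"a", "and", "is", "are", "the", "of"},
    "de": {"und", "ist", "sind", "der", "die", "das"},
}

def guess_language(text):
    tokens = text.lower().split()
    # Count how many tokens appear in each language's stopword list.
    scores = {lang: sum(token in words for token in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("The cat is on the mat and the dog is outside"))  # -> en
print(guess_language("Der Hund ist im Garten und die Katze schläft"))  # -> de
```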
More sophisticated approaches use large amounts of text in different languages to train a classifier. The leading models determine a language with very high accuracy and can even detect the use of multiple languages in the same text.
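One way to try the classifier route is the langdetect library, a Python port of a well-known language-detection classifier; this is only a usage sketch, and results can vary slightly between runs because the detector is probabilistic:

```python
# Hedged example: requires `pip install langdetect`.
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # make the probabilistic detector reproducible

print(detect("This is clearly a short English sentence."))      # e.g. en
print(detect_langs("Das hier ist ein kurzer deutscher Satz."))  # e.g. [de:0.99...]
```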
Part-of-speech Tagging
The word “type” in “I can type 50 words a minute” functions as a verb, whereas in “He is not my type” it functions as a noun. Going further, we can determine the grammatical function of each word. These tags add context to the text and make it easier for a computer to understand the “correct” meaning.
Machine learning models can assign these tags and parse the dependencies between words, using different algorithmic approaches such as Hidden Markov Models, decision trees or neural networks. Pretrained models exist for many languages and frameworks.
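A sketch of part-of-speech tagging with spaCy's pretrained English pipeline (this assumes the en_core_web_sm model has been downloaded; the exact tags depend on the model version):

```python
# Hedged example: requires `pip install spacy` and
# `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ("I can type 50 words a minute", "He is not my type"):
    doc = nlp(sentence)
    # Each token carries a coarse part-of-speech tag and a syntactic head.
    print([(token.text, token.pos_, token.head.text) for token in doc])
# "type" should come out as a VERB in the first sentence and a NOUN in the second.
```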
Stemming and Lemmatizing
To normalize a text, we can split it into individual words, or tokens, and find the root of each token. Here, the language and part-of-speech information are useful. For example, the reduced form of “Reading” could be “read” if it is a verb form, but if we are talking about the English town it stays “Reading”. This process is called lemmatization. Stemming is a simpler but still effective approach that uses rules like “remove every trailing -s and -ity”. This transforms, for example, “rises” to “rise” and “plurality” to “plural”. A well-known set of such rules is the Porter stemmer.
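A sketch using NLTK's Porter stemmer and WordNet lemmatizer (the WordNet data needs a one-time download, and the part-of-speech hint is passed in by hand here):

```python
# Hedged example: requires `pip install nltk`; the lemmatizer additionally
# needs a one-time `nltk.download("wordnet")`.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("rises"))      # -> rise
print(stemmer.stem("plurality"))  # -> plural

lemmatizer = WordNetLemmatizer()
# The part-of-speech hint ("v" for verb) steers the lemmatizer.
print(lemmatizer.lemmatize("reading", pos="v"))  # -> read
print(lemmatizer.lemmatize("Reading", pos="n"))  # the town keeps its form: Reading
```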
Other steps in the pipeline could be named entity recognition (the detection of personal names, companies, products and so on) or the creation of embeddings (mapping words onto numeric vectors). Each step enriches and transforms the text, making it better suited for further tasks. For example, it becomes easier to derive meaningful features from the enriched text when feeding it to specialized algorithms that classify the text's topic, extract key phrases, or compare it to other documents.
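For illustration, named entities can be extracted with the same kind of pretrained spaCy pipeline used above (again assuming en_core_web_sm is installed; the exact entities and labels depend on the model version):

```python
# Hedged example: reuses the pretrained spaCy pipeline from the tagging step.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook presented the new Apple Watch in Cupertino.")

# Each detected entity carries a label such as PERSON, ORG or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```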
More recently, challenges in natural language processing have been tackled with deep learning techniques, with some remarkable successes.
These advances in NLP research through deep learning use huge amounts of training data and processing power to let a single neural network architecture deal with the complexities of natural language, instead of a chain of smaller, individually tailored steps.
BERT and GPT-2, for example, take (almost) unpreprocessed text, yet implicitly ‘understand’ how “rise” and “rises” are related. They also ‘understand’ from the context of a sentence the difference between a “shot of whiskey” and a “shot in the dark”, or the similarity between a “colleague” and a “co-worker”.
These models are trained by trying to predict the next word in a sentence or a word that has been masked out of a sentence. The impressive results on NLP tasks such as generating embeddings and classifying text are seen as almost a by-product of the high level of general language understanding the models achieve through this extensive training.
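As a sketch of the “predict the missing word” objective in action, the Hugging Face transformers library offers a fill-mask pipeline for BERT-style models (the model name here is just one choice, and the predictions depend on the model version):

```python
# Hedged example: requires `pip install transformers` plus a backend such as
# PyTorch; the model name is an assumption made for this sketch.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model proposes plausible words for the [MASK] token based on the context.
for prediction in fill_mask("He poured himself a [MASK] of whiskey."):
    print(prediction["token_str"], round(prediction["score"], 3))
```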
Many NLP techniques are useful on their own, like named entity recognition and language detection, while others enrich a text with syntactic or semantic information, like lemmatizing and part-of-speech tagging.
They all give a computer a better picture of the text's structure and intended “meaning”. Depending on the task at hand, we can tailor an NLP pipeline or use a single machine learning model to solve it.