Natural Language Processing (NLP) is currently among the most popular fields of data science, and the number of tools supporting it is growing accordingly. The best known among them are NLTK and spaCy, but there are also other well-established libraries such as Gensim.
Each of these libraries has its pros and cons, but they share several features, e.g. tokenization and stemming. Which library is the best choice depends strongly on the use case.
In addition, each library supports a different set of human languages. Users can, of course, always train their own language models, but this assumes that a suitable corpus is available.
The NLP libraries presented here are used to preprocess a corpus of documents, i.e. to perform all the steps required to clean up the texts and prepare them for machine learning algorithms. The models themselves are then trained with the standard data science libraries, such as scikit-learn and TensorFlow.
The Natural Language Toolkit (NLTK) is a Python library and one of the NLP libraries that has been around the longest. It supports tokenization, stemming, Part-of-Speech (POS) tagging, and entity recognition, and ships with over 50 corpora and lexical resources. Moreover, NLTK is a solid choice in both industry and research, since it offers a variety of algorithms to choose from.
These corpora provide support for several human languages besides English, such as Spanish, Portuguese, and Hindi. For languages that are not covered, users need to train their own models, which again requires an available corpus. Furthermore, NLTK supports, among others, the Snowball stemmer, which is currently available for most Indo-European languages and therefore offers good coverage.
In our experience, NLTK is the best library for NLP newcomers: it is very easy to set up and start working with, especially because it is string-based.
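As a minimal sketch of this string-based workflow (assuming NLTK is installed and the required resources have been downloaded; resource names can vary slightly between NLTK versions), tokenization, POS tagging, and Snowball stemming look like this:

```python
import nltk
from nltk.stem.snowball import SnowballStemmer

# One-time download of tokenizer and tagger resources
# (names may differ slightly across NLTK versions).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "The cats were running quickly through the gardens."

# Tokenization: split the raw string into word tokens.
tokens = nltk.word_tokenize(text)

# Part-of-Speech tagging on the token list.
pos_tags = nltk.pos_tag(tokens)

# Snowball stemming, here for English; other Indo-European
# languages are selected via the language argument.
stemmer = SnowballStemmer("english")
stems = [stemmer.stem(token) for token in tokens]

print(tokens)    # ['The', 'cats', 'were', 'running', ...]
print(pos_tags)  # [('The', 'DT'), ('cats', 'NNS'), ...]
print(stems)     # ['the', 'cat', 'were', 'run', ...]
```

Note how every step works directly on strings and lists of strings, which is what makes NLTK so approachable for beginners.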
SpaCy is currently the hottest NLP library; it is written in Python and Cython and has a large number of contributors.
SpaCy features neural models for tagging, parsing, entity recognition, and lemmatization – unlike NLTK, whose focus is on stemming rather than lemmatization.
Currently, spaCy offers pre-trained models for the most widely used European languages, and the coverage keeps growing.
This library is a good choice for most industrial applications. Overall, it supports fewer algorithms than NLTK, but the developers have integrated those that currently offer the best precision and speed for each purpose.
In our experience, spaCy is better suited to those who are already familiar with NLP, as the setup and the object-oriented implementation make it a more complex tool to use.
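As a minimal sketch of this object-oriented interface (assuming the small English model has been installed via python -m spacy download en_core_web_sm), a single call to the pipeline yields tags, lemmas, and entities:

```python
import spacy

# Load a pre-trained English pipeline
# (tagger, parser, named entity recognizer, lemmatizer).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each token object carries its annotations as attributes.
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities are exposed on the document object.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Everything is returned as objects with attributes rather than plain strings, which is exactly the design difference from NLTK mentioned above.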
Other notable libraries dedicated purely to NLP, such as the already mentioned Gensim, also exist.
Our own NLP work is strongly oriented towards NLTK and spaCy, in addition to the standard ML libraries (scikit-learn, TensorFlow, …).
As already mentioned, NLTK is often the first choice because its setup is very quick, but it also reaches its limits quickly – especially in terms of performance.
For more advanced NLP pipelines, we therefore rely heavily on spaCy, mainly for two reasons: it yields very accurate results and it is fast, meaning you get better results in a shorter time than with NLTK.
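One way this speed shows up in practice is spaCy's batch processing: nlp.pipe streams many texts through the pipeline at once instead of processing them one by one (a minimal sketch; the batch size is just an example value):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

texts = ["First example document.",
         "Second example document.",
         "Third example document."]

# nlp.pipe processes the texts in batches, which is much
# faster for large corpora than calling nlp() per text.
for doc in nlp.pipe(texts, batch_size=50):
    print([token.lemma_ for token in doc])
```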
The two leading NLP libraries are NLTK and spaCy, which offer very similar features and a similar coverage of pre-trained language models. Which one is best strongly depends on the use case and on the user's familiarity with Python and NLP. In our experience, NLTK offers a lot of freedom in the choice of algorithms and a very easy start, but for more complex pipelines spaCy is the better choice.
These libraries cover everything from cleaning up the input documents to creating features, which can then be used by classical ML libraries, e.g. scikit-learn, to train models and make predictions.
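As an illustrative sketch of this hand-over (the toy documents, labels, and filtering choices are our own assumptions, not a fixed recipe), a spaCy clean-up step can feed directly into scikit-learn:

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_sm")

# Toy corpus and hypothetical class labels for illustration.
docs = ["The first document is about cats.",
        "The second document talks about dogs."]
labels = [0, 1]

def preprocess(text):
    # Clean-up: keep the lemmas of alphabetic, non-stop-word tokens.
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc
                    if tok.is_alpha and not tok.is_stop)

cleaned = [preprocess(d) for d in docs]

# Feature creation: turn the cleaned texts into TF-IDF vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

# Train a classical ML model on the resulting features.
model = LogisticRegression().fit(X, labels)
```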