Named Entity Recognition Analysis
Named Entity Recognition (NER) is a natural language processing technique for identifying the entities in an input text. This article discusses the NER techniques we examined at botsplash for chatbot creation. NER tags an input sequence of words with entities such as person, place, organization, or date reference. The entities of interest vary with the requirements and the application. In our application, the chatbot uses NER to help process user requests.
e.g. Below is a user query
Entities to extract
The extracted entities serve as parameters for a query against the data store or database. The result returned from the database can then be used to build the chatbot's response.
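As a minimal sketch of that flow, the snippet below passes already-extracted entities to a parameterized query and turns the rows into a reply. The `listings` table, its columns, the sample rows, and the entity keys `city` and `max_price` are all hypothetical and chosen only for illustration; they are not taken from our actual data store.

```python
import sqlite3

# Hypothetical in-memory table standing in for the real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (address TEXT, city TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        ("100 N Tryon Street", "Charlotte", 289000),
        ("12 Oak Lane", "Charlotte", 250000),
        ("5 Elm Road", "Raleigh", 240000),
        ("9 Pine Court", "Charlotte", 410000),
    ],
)


def find_listings(conn, entities):
    """Use extracted entities as bound parameters of a query."""
    return conn.execute(
        "SELECT address, price FROM listings "
        "WHERE city = ? AND price <= ? ORDER BY price",
        (entities["city"], entities["max_price"]),
    ).fetchall()


def make_reply(rows, entities):
    """Turn the query result into a chatbot response string."""
    if not rows:
        return f"Sorry, I found no listings in {entities['city']}."
    items = ", ".join(f"{addr} (${price:,})" for addr, price in rows)
    return (
        f"I found {len(rows)} listings in {entities['city']} "
        f"under ${entities['max_price']:,}: {items}."
    )


# Entities as an NER step might return them (hypothetical keys and values).
entities = {"city": "Charlotte", "max_price": 300000}
rows = find_listings(conn, entities)
reply = make_reply(rows, entities)
```

Binding the entities as query parameters, rather than formatting them into the SQL string, keeps user-supplied text from being interpreted as SQL.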
A simple NER system can be built from a collection of regular expressions for simple entities such as currency or amounts. Complex entities are harder. The name of a place can also be the name of a person, so the only way to tell them apart is by context, and a date or time reference can be written in many different ways. For these we need a machine learning model that learns from examples rather than relying on specific rules. Various NLP packages include an NER implementation.
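The regular-expression approach for simple entities can be sketched as follows. The two patterns and the labels `MONEY` and `PERCENT` are illustrative assumptions, not the patterns we use in production, and they cover only a few common spellings.

```python
import re

# Hypothetical patterns for two "simple" entity types.
PATTERNS = {
    # Dollar sign, digits, optional thousands groups, optional decimals.
    "MONEY": re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?"),
    # Integer or decimal number followed by a percent sign.
    "PERCENT": re.compile(r"\d+(?:\.\d+)?\s?%"),
}


def extract_entities(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda entity: entity[2])


entities = extract_entities("I can put 20% down on a house under $250,000.")
```

Here `entities` holds one `PERCENT` match for "20%" and one `MONEY` match for "$250,000". Rules like these stay manageable for a handful of formats but grow quickly once context matters, which is where learned models come in.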
Some of the libraries with built-in NER are listed below:
1. Stanford CoreNLP
3. spaCy 2.0 (alpha version)
5. JULIE Lab Tools (Biomedical clinical, ecological and economic entities)
6. Balie (also called YooName)
7. Postech Biomedical NER System (Biomedical entities only)
8. ABNER (Biomedical entities only)
NLTK also provides an interface to the Stanford CoreNLP NER. Deep neural network models (LSTMs) are also used for NER and have proven effective. The spaCy alpha uses a neural network model.
For our application, we explored only Stanford CoreNLP, spaCy, and an LSTM model.
The Stanford CoreNLP package provides NER and is written in Java. The package can be run from the command line by passing it an input file, and from Python by launching it as an external process (for example with the subprocess module). It comes with pre-trained NER models that can be used immediately and that recognize several classes (Person, Organization, Location and Date). There is also a web demo where NER and the other package functionality can be tried out. The package supports retraining on custom entities and on multi-word entities such as multi-word city names (e.g. New York) or street addresses (e.g. 100 N Tryon Street). The training data needs to be in IO (inside, outside) encoding or IOB (inside, outside, beginning) encoding. IO cannot distinguish adjacent entities of the same category, but IOB can, because it marks a multi-word entity with a 'B-' prefix on the first word and an 'I-' prefix on the following words.
e.g. IO encoding
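The difference between the two encodings can be illustrated with a small sketch. The sentence, the `LOC` label, and the `encode` helper are made up for this illustration, and the exact tag spelling expected by a given toolkit may differ (IO tags are often written as the bare class name).

```python
def encode(tokens, spans, scheme="IOB"):
    """Tag tokens given entity spans of (start, end_exclusive, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                # IO: every entity token gets the same tag.
                tags[i] = "I-" + label
            else:
                # IOB: the first token of each entity is marked with B-.
                tags[i] = ("B-" if i == start else "I-") + label
    return tags


tokens = ["Compare", "Charlotte", "New", "York", "prices"]
# Two adjacent entities of the same category: "Charlotte" and "New York".
spans = [(1, 2, "LOC"), (2, 4, "LOC")]

io_tags = encode(tokens, spans, scheme="IO")
iob_tags = encode(tokens, spans, scheme="IOB")
```

With IO the three location tokens receive identical tags, so "Charlotte New York" looks like a single entity. With IOB the second `B-LOC` marks where "New York" begins.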
spaCy is a Python NLP package that provides NER, tokenization, sentence segmentation, sentiment analysis, coreference resolution, dependency parsing and POS tagging. It also comes with a pre-trained model that can recognize entities such as products, languages and events, and it supports retraining the model. For training, the data needs to be a list of tuples, where each tuple contains a sentence and a list of all the entities in it along with their locations in the sentence. The newer spaCy 2.0 is in alpha. It keeps most of the functionality of the original package, but the model implementation has changed: most of the models are now neural networks implemented with Thinc, and they support GPUs.
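The training format described above can be sketched in plain Python. The `make_example` helper, the sentences, and the labels are hypothetical, and the exact container spaCy expects around the entity list depends on the spaCy version, so treat this as an illustration of the character-offset idea rather than a drop-in training script.

```python
def make_example(sentence, entity_texts):
    """Build (sentence, [(start_char, end_char, label), ...]).

    entity_texts is a list of (substring, label) pairs. Offsets are
    character positions, found here with a simple first-match search.
    """
    entities = []
    for text, label in entity_texts:
        start = sentence.find(text)
        if start == -1:
            raise ValueError(f"{text!r} not found in sentence")
        entities.append((start, start + len(text), label))
    return (sentence, entities)


# Hypothetical training sentences with hypothetical labels.
TRAIN_DATA = [
    make_example(
        "Show me homes in New York under $500,000",
        [("New York", "GPE"), ("$500,000", "MONEY")],
    ),
    make_example("Is 100 N Tryon Street still available?",
                 [("100 N Tryon Street", "ADDRESS")]),
]
```

Storing character offsets instead of token indices keeps the annotations independent of how the tokenizer splits the sentence.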
Deep LSTM network
We built a many-to-many recurrent neural network model for NER using LSTM cells.
The network was written in TensorFlow and trained on real estate data to recognize the parts of an address (street, city, state, and zip code). The input layer combines word and character embeddings, and we use GloVe word vectors. Every input word is converted to its corresponding word vector and concatenated with its character vector. This input is then fed into the LSTM network. The logits from the LSTM layer are fed to a softmax layer that outputs the probability of each tag. The model was trained for 10 epochs. The final trained model had a test accuracy of 98.90 and an F1 score of 98.71. It was able to recognize street addresses, even made-up but syntactically correct ones, with high accuracy.
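The input and output ends of this pipeline can be sketched in plain Python. This is not our TensorFlow model: the LSTM itself is omitted, the tag names are assumed from the address parts listed above, and the vectors and logits are made-up numbers rather than real model output.

```python
import math

# Assumed tag set based on the address parts the model recognizes.
TAGS = ["O", "STREET", "CITY", "STATE", "ZIP"]


def build_input(word_vec, char_vec):
    """Concatenate a word vector with a character-level vector."""
    return list(word_vec) + list(char_vec)


def softmax(logits):
    """Convert logits to probabilities (max subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits_per_token):
    """Pick the most probable tag for each token."""
    result = []
    for logits in logits_per_token:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((TAGS[best], probs[best]))
    return result


# Toy input: a 2-dimensional word vector plus a 1-dimensional char vector.
features = build_input([0.1, 0.2], [0.3])

# Made-up logits for the tokens "Tryon" and "Charlotte".
predictions = decode([
    [0.1, 3.0, 0.2, 0.0, -1.0],
    [0.0, 0.5, 2.5, 0.1, 0.0],
])
```

In a many-to-many setup there is one logits vector per input token, so decoding yields one tag per word.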
The deep neural network model gave us good accuracy, but the biggest hurdle to good NER is the availability of data. The other libraries also performed well on the entities they were trained for, but when retraining these models we again face the dataset problem. Collecting data can be difficult and depends heavily on the domain. Datasets are available, such as the CoNLL-2002 shared task data, Language-Independent Named Entity Recognition at CoNLL-2003, or CLEF eHealth 2016. When building domain-specific models, however (real estate in our case), we either had to get data from people's conversations with real estate agents or find data that closely resembles it.