Topic modelling is a natural language processing and machine learning technique often used in text mining. As the name suggests, topic modelling deals with the discovery and extraction of topics from a collection of documents. These ‘topics’ can then be used to infer the theme of each document and, finally, to cluster documents by common theme.
The data we have is from the Fintech blogging site letstalkpayments (LTP). Generally speaking, the blogs on this site deal with a company, event, place or product related to the Fintech industry. We treat this collection of blogs as a collection of documents.
Why topic modelling
The documents we got from LTP are unlabeled, i.e. we do not have any attached labels or classes that could be used to train a supervised model. So our problem reduces to clustering. There are many clustering techniques, such as k-means. But we do not use k-means, because our aim is not only to cluster the documents but also to find the class or topic that makes documents in the same cluster similar.
The technique we use for topic modelling is Latent Dirichlet Allocation (LDA). The intuition behind this technique is that words which appear together in similar documents belong under a single topic. For example, if a document is about online payment processing we might expect words like ‘google wallet’, ‘PayPal’ or ‘merchant’ but not words like ‘region’, ‘insurance’ etc. These frequently occurring words can then be grouped under a single topic. Based on how often these words appear in a document, LDA can assign the document a relevance score for each topic. So when a document is passed through a trained LDA model, it outputs an LDA vector. This LDA vector is nothing but the topic distribution of that document.
For example, below is a typical LDA vector returned by the gensim LDA model. An LDA vector is a collection of tuples of the form (id, score).
Below is each ID translated to a topic name.
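To make the (id, score) shape concrete, here is a minimal sketch in plain Python of how such a vector can be read. The scores and topic names are invented for illustration and are not output from the LTP model, and `dominant_topic` is a hypothetical helper, not part of gensim.

```python
# Illustrative LDA vector: (topic id, score) pairs. The values are made up.
lda_vector = [(3, 0.62), (17, 0.25), (41, 0.13)]

# Hypothetical mapping from topic id to a human-assigned topic name.
topic_names = {3: "payments", 17: "lending", 41: "insurance"}

def dominant_topic(vector, names):
    """Return the name and score of the highest-scoring topic."""
    topic_id, score = max(vector, key=lambda pair: pair[1])
    return names[topic_id], score

print(dominant_topic(lda_vector, topic_names))
```

Picking the highest-scoring topic is one simple way to assign a document to a cluster; keeping the whole distribution is just as valid when a blog covers several themes.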
Data preprocessing is an important step in getting the data ready for training an LDA model. LDA works best when the data is not noisy, the topics are stable, and a reasonable number of documents belongs to each topic. As a preprocessing step, we first remove stopwords. We do this because stopwords occur very frequently in English text, so there is a high chance that they would dominate most of the topics. NLTK provides a stopword list, but it is not comprehensive enough, so we augment it with words from this website: Stop Word List 1.
The next step is removing punctuation from the text, as it too has a high chance of being detected as a topic word. A punctuation list is also provided by NLTK. Next, we pass the words through a lemmatizer to get their root forms. This can be done using the lemmatizer provided by NLTK or Stanford CoreNLP. This step ensures that different inflected forms of the same word are not detected as separate topic words.
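The stopword and punctuation steps can be sketched in plain Python as follows. This is only an illustration under a few assumptions: the stopword set is a tiny stand-in for the augmented NLTK list, `string.punctuation` from the standard library stands in for the punctuation list, lowercasing is our own addition, and `preprocess` is a hypothetical helper. Lemmatization is left out here; with NLTK it could be applied to the resulting tokens using `WordNetLemmatizer`.

```python
import string

# Tiny illustrative stopword set; a real run would use a much longer list.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "and", "to", "for"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase a text, strip punctuation, and drop stopwords."""
    # Replace every punctuation character with a space before splitting.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [tok for tok in tokens if tok not in stopwords]

print(preprocess("PayPal is a leader in online payments, and merchants agree."))
```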
Another step, which we added later after going through the training steps, is keeping only the nouns. This can be done using a POS tagger, which tags the words of an input sentence with their parts of speech, and then removing all words that are not nouns.
For example, the sentence “The payments industry in the US has been steadily rising as the hottest and most financially attractive segment within FinTech.” becomes “payments industry US segment FinTech”.
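As a rough sketch of the noun filter, the snippet below works on (word, tag) pairs in the shape a tagger such as `nltk.pos_tag` returns. The Penn Treebank tags here are written by hand for the example sentence, so they are an assumption rather than real tagger output, and `keep_nouns` is a hypothetical helper.

```python
# Hand-written Penn Treebank tags for the example sentence (an assumption;
# a real POS tagger would produce (word, tag) pairs in this shape).
tagged = [
    ("The", "DT"), ("payments", "NNS"), ("industry", "NN"), ("in", "IN"),
    ("the", "DT"), ("US", "NNP"), ("has", "VBZ"), ("been", "VBN"),
    ("steadily", "RB"), ("rising", "VBG"), ("as", "IN"), ("the", "DT"),
    ("hottest", "JJS"), ("and", "CC"), ("most", "RBS"),
    ("financially", "RB"), ("attractive", "JJ"), ("segment", "NN"),
    ("within", "IN"), ("FinTech", "NNP"), (".", "."),
]

def keep_nouns(tagged_tokens):
    """Keep only words whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

print(" ".join(keep_nouns(tagged)))
```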
Training LDA model
Before training, we first need to construct the document-term frequency matrix. To do this, we first build a dictionary using Gensim’s built-in function. We use this dictionary to construct a bag-of-words for each document. A bag-of-words is the collection of words in a text together with their frequencies. We can then build the document-term matrix from these bags-of-words.
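To show what this step produces, here is a plain-Python sketch of a dictionary and a bag-of-words corpus. In practice gensim's `corpora.Dictionary` and its `doc2bow` method do this work; `build_dictionary` and `to_bow` below are hypothetical stand-ins, the two toy documents are invented, and the ids gensim assigns may differ.

```python
from collections import Counter

# Two toy, already-preprocessed documents (invented for illustration).
docs = [
    ["payments", "merchant", "paypal", "payments"],
    ["insurance", "region", "merchant"],
]

def build_dictionary(documents):
    """Assign an integer id to each distinct word, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenised document into sorted (word id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

token2id = build_dictionary(docs)
corpus = [to_bow(doc, token2id) for doc in docs]
print(corpus)
```

Each inner list is one row of the document-term matrix in sparse form: only the words that actually occur in the document are stored.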
We use the gensim library and its implementation of LDA. The LDA model takes the document-term matrix as input. Other parameters we need to specify are the number of passes (these are important for the model to converge), the dictionary, the number of topics, update_every (the number of batches after which to update the model), chunksize (the size of the batches into which the documents are divided) and eval_every (the number of batches after which to calculate the perplexity of the model). For the rest of the parameters, we keep their default values.
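A minimal sketch of how these parameters might be passed to gensim's `LdaModel` is shown below. The 50 topics and 50 passes match one of the runs reported in this post; the `update_every`, `chunksize` and `eval_every` values are simply gensim's defaults used as placeholders, since the exact values we used are not listed here. Note that the gensim keyword is spelled `eval_every`. `lda_kwargs` and `train_lda` are hypothetical helpers.

```python
def lda_kwargs(num_topics, passes, alpha="symmetric"):
    """Collect the LdaModel keyword arguments discussed above."""
    return {
        "num_topics": num_topics,
        "passes": passes,
        "alpha": alpha,       # "symmetric", "asymmetric" or "auto"
        "update_every": 1,    # placeholder: gensim default
        "chunksize": 2000,    # placeholder: gensim default
        "eval_every": 10,     # placeholder: gensim default
    }

def train_lda(corpus, dictionary, **kwargs):
    """Train a gensim LDA model on a bag-of-words corpus."""
    # Imported lazily so lda_kwargs can be used without gensim installed.
    from gensim.models import LdaModel
    return LdaModel(corpus=corpus, id2word=dictionary, **kwargs)

params = lda_kwargs(num_topics=50, passes=50)
print(params)
```

With a gensim dictionary and bag-of-words corpus in hand, the call would look like `train_lda(corpus, dictionary, **params)`.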
We trained models on the complete set of documents as well as on subsets of the documents. We varied the number of passes, the number of topics and the alpha type (asymmetric, auto and symmetric).
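One way to enumerate such a set of runs is sketched below. The topic and pass counts are placeholders chosen for illustration, not the exact grid we tried; only the three alpha types come from the text above.

```python
from itertools import product

# Placeholder values; the exact grid that was tried is not listed here.
topic_counts = [20, 50]
pass_counts = [10, 50]
alphas = ["symmetric", "asymmetric", "auto"]

# One configuration per combination of topics, passes and alpha type.
configs = [
    {"num_topics": k, "passes": p, "alpha": a}
    for k, p, a in product(topic_counts, pass_counts, alphas)
]
print(len(configs))
```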
Below are the topics discovered by the 50-topic model after running for 50 passes on the complete data: