Topic Modelling is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.
Why do we need it?
- Discovering hidden topical patterns that are present across the collection.
- Annotating documents according to these topics.
- Using these annotations to organize, search and summarize texts.
Now to our topic, Top2Vec: an algorithm designed for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document, and word vectors. The model requires no stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics from your data. Awesome, right?
Let’s see how Top2Vec achieves this.
- Generates embedding vectors for documents and words using Doc2Vec, the Universal Sentence Encoder, or a BERT sentence transformer.
- Performs dimensionality reduction on the vectors using UMAP, which helps in finding dense areas.
- Uses HDBSCAN to automatically find those dense areas of documents, clusters them, and assigns a topic to each cluster.
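The steps above can be sketched in miniature with NumPy alone. This is only a conceptual toy: random vectors stand in for the trained joint embeddings, and a hard-coded label array stands in for UMAP + HDBSCAN. The key Top2Vec ideas it illustrates are that a topic vector is the centroid of a dense document cluster, and topic words are the word vectors nearest to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for jointly embedded document and word vectors
# (in Top2Vec these come from Doc2Vec, USE, or a BERT sentence transformer).
doc_vectors = rng.normal(size=(20, 8))
word_vectors = rng.normal(size=(50, 8))
vocab = [f"word_{i}" for i in range(50)]

# Toy "clustering": pretend HDBSCAN found two dense areas of documents.
labels = np.array([0] * 10 + [1] * 10)

# A topic vector is the centroid of the document vectors in its cluster.
topic_vectors = np.stack([doc_vectors[labels == k].mean(axis=0) for k in (0, 1)])

def top_words(topic_vec, n=5):
    # Topic words: word vectors nearest the topic vector by cosine similarity.
    sims = word_vectors @ topic_vec / (
        np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(topic_vec)
    )
    return [vocab[i] for i in np.argsort(-sims)[:n]]

for k, tv in enumerate(topic_vectors):
    print(k, top_words(tv))
```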
Let's get into the practical implementation.
Dataset used: Presidential Inaugural Addresses
Columns: US president name, address, date, and speech text
Installation
```shell
pip install top2vec
pip install top2vec[sentence_transformers]
pip install top2vec[sentence_encoders]
```
Reading Data
```python
from top2vec import Top2Vec
import pandas as pd

df = pd.read_csv("inaug_speeches.csv", engine='python', encoding='latin1')
df.head()
```
Cleaning Noisy Data
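The cleaning code itself isn't shown here; a minimal sketch (the `min_words` threshold and function name are my own assumptions, not from the original) might collapse whitespace and drop empty or very short speeches before training:

```python
import re

def clean_speeches(texts, min_words=50):
    """Collapse whitespace and drop empty or very short speeches.

    Hypothetical helper: threshold and name are illustrative only.
    """
    cleaned = []
    for t in texts:
        t = re.sub(r"\s+", " ", str(t)).strip()
        if len(t.split()) >= min_words:
            cleaned.append(t)
    return cleaned

docs = clean_speeches(["  Fellow citizens of the Senate ...  ", ""], min_words=1)
```

Note that Top2Vec itself needs no stop-word removal, stemming, or lemmatization, so cleaning here is only about removing genuinely broken rows.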
Now, we can train our model
```python
model = Top2Vec(df.text.values, embedding_model='universal-sentence-encoder')

'''
Models available:
universal-sentence-encoder
universal-sentence-encoder-multilingual
distiluse-base-multilingual-cased
'''
```
Get number of topics
```python
model.get_num_topics()
```
Output:
```
2
```
Get keywords for each topic
```python
model.topic_words
```
Output:
```
array([['constitutional', 'citizens', 'republic', 'oath', 'countrymen', 'democracy', 'constitution', 'nation', 'citizen', 'prosperity', 'against', 'respect', 'civil', 'freedom', 'without', 'honor', 'equal', 'congress', 'government', 'whose', 'who', 'liberty', 'powers', 'principles', 'national', 'rights', 'states', 'ourselves', 'principle', 'necessary', 'governments', 'nor', 'authority', 'shall', 'among', 'duty', 'even', 'free', 'executive', 'administration', 'each', 'between', 'every', 'others', 'under', 'president', 'called', 'individual', 'both', 'of'],
       ['freedom', 'prosperity', 'nation', 'citizens', 'countrymen', 'liberty', 'democracy', 'republic', 'citizen', 'oath', 'equal', 'peace', 'against', 'nations', 'without', 'constitutional', 'ourselves', 'beyond', 'free', 'constitution', 'respect', 'honor', 'who', 'national', 'president', 'strength', 'necessary', 'america', 'individual', 'country', 'hope', 'every', 'greater', 'united', 'world', 'principles', 'civil', 'strong', 'only', 'live', 'itself', 'even', 'nor', 'whose', 'stand', 'war', 'our', 'shall', 'governments', 'americans']], dtype='<U14')
```
Generate a WordCloud
```python
model.generate_topic_wordcloud(0)
```
Search topics by keywords
```python
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["citizens"], num_topics=2)
print(f"No of topics : {len(topic_nums)} and words are {topic_words[0]}")
```
Output:
```
No of topics : 2 and words are ['constitutional' 'citizens' 'republic' 'oath' 'countrymen' 'democracy' 'constitution' 'nation' 'citizen' 'prosperity' 'against' 'respect' 'civil' 'freedom' 'without' 'honor' 'equal' 'congress' 'government' 'whose' 'who' 'liberty' 'powers' 'principles' 'national' 'rights' 'states' 'ourselves' 'principle' 'necessary' 'governments' 'nor' 'authority' 'shall' 'among' 'duty' 'even' 'free' 'executive' 'administration' 'each' 'between' 'every' 'others' 'under' 'president' 'called' 'individual' 'both' 'of']
```
Get the First Topic
```python
topic_words, word_scores, topic_nums = model.get_topics(1)
print(topic_words)
```
Output:
```
[['constitutional' 'citizens' 'republic' 'oath' 'countrymen' 'democracy' 'constitution' 'nation' 'citizen' 'prosperity' 'against' 'respect' 'civil' 'freedom' 'without' 'honor' 'equal' 'congress' 'government' 'whose' 'who' 'liberty' 'powers' 'principles' 'national' 'rights' 'states' 'ourselves' 'principle' 'necessary' 'governments' 'nor' 'authority' 'shall' 'among' 'duty' 'even' 'free' 'executive' 'administration' 'each' 'between' 'every' 'others' 'under' 'president' 'called' 'individual' 'both' 'of']]
```
Get Similar Words
```python
# Semantic search: find words similar to a keyword
words, word_scores = model.similar_words(keywords=["constitutional"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")
```
Output:
```
constitution 0.7288548891777089
citizen 0.4723290988168851
citizens 0.4502901978727408
oath 0.430887191231749
government 0.4204032301028167
law 0.4152112776782078
rights 0.41506158969869855
congress 0.3923969740586745
democracy 0.3910433332140063
laws 0.37440227253656755
american 0.37049017996101846
president 0.36987784447077826
commerce 0.368667863317993
states 0.36404274276619314
state 0.3631170358867932
liberty 0.36074777573782213
foreign 0.3518555599625002
duties 0.3516476103913455
federal 0.35102676632961854
countrymen 0.3505416958618698
```
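Conceptually, this semantic search is a cosine-similarity lookup in the joint embedding space: normalize the word vectors, take dot products against the query word's vector, and return the top matches. A toy version with NumPy (random vectors stand in for the trained embeddings, so the scores are meaningless, only the mechanics matter):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["constitution", "citizen", "oath", "commerce", "liberty"]
word_vectors = rng.normal(size=(len(vocab), 16))

def similar_words(keyword, num_words=3):
    # Normalize rows so dot products are cosine similarities.
    unit = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    query = unit[vocab.index(keyword)]
    sims = unit @ query
    order = np.argsort(-sims)
    # Skip the keyword itself (self-similarity is 1.0).
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != keyword][:num_words]

for word, score in similar_words("constitution"):
    print(word, score)
```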
Search Documents by Keywords
```python
documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["government", "citizen"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()
```
Get the topic vector for each topic
```python
model.topic_vectors
```
Output:
```
array([[-0.03240328, -0.06166243, -0.02745233, ..., 0.04508633, -0.04117269, 0.04916515],
       [-0.0502777 , -0.06314228, -0.02798648, ..., 0.05642398, -0.00569224, 0.00679894]], dtype=float32)
```
Use the embedding model underlying the Top2Vec model to generate document embeddings for any section of text.
```python
embedding_vector = model.embed(["fellow citizens of the senate and of the house of representatives"])
embedding_vector.shape
```
Output:
```
TensorShape([1, 512])
```
Reduce the Number of Topics
```python
topic_mapping = model.hierarchical_topic_reduction(num_topics=1)
model.topic_words_reduced[0]
```
Output:
```
array(['citizens', 'republic', 'countrymen', 'constitutional', 'oath', 'nation', 'democracy', 'prosperity', 'citizen', 'freedom', 'constitution', 'against', 'liberty', 'equal', 'without', 'respect', 'honor', 'who', 'civil', 'ourselves', 'national', 'principles', 'whose', 'free', 'peace', 'necessary', 'powers', 'beyond', 'nations', 'nor', 'congress', 'government', 'governments', 'individual', 'even', 'every', 'rights', 'president', 'shall', 'states', 'each', 'under', 'principle', 'both', 'among', 'strength', 'greater', 'itself', 'between', 'duty'], dtype='<U14')
```
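Hierarchical topic reduction works by repeatedly merging the smallest topic into its most similar remaining topic until the target count is reached. A rough NumPy sketch of a single merge step, with made-up topic vectors and sizes (this is an illustration of the idea, not the library's actual implementation):

```python
import numpy as np

topic_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
topic_sizes = np.array([10, 3, 12])  # number of documents per topic

# Find the smallest topic and its most similar other topic (cosine similarity).
unit = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
small = int(np.argmin(topic_sizes))
sims = unit @ unit[small]
sims[small] = -np.inf          # don't merge a topic with itself
target = int(np.argmax(sims))

# Merge: size-weighted average of the two topic vectors.
merged = (topic_sizes[small] * topic_vectors[small]
          + topic_sizes[target] * topic_vectors[target]) / (topic_sizes[small] + topic_sizes[target])
print(f"merge topic {small} into topic {target}: {merged}")
```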
Save and load the Model
```python
model.save("inaug_speeches")
model = Top2Vec.load("inaug_speeches")
print(model.get_topics()[0])
```
Output:
```
array([['constitutional', 'citizens', 'republic', 'oath', 'countrymen', 'democracy', 'constitution', 'nation', 'citizen', 'prosperity', 'against', 'respect', 'civil', 'freedom', 'without', 'honor', 'equal', 'congress', 'government', 'whose', 'who', 'liberty', 'powers', 'principles', 'national', 'rights', 'states', 'ourselves', 'principle', 'necessary', 'governments', 'nor', 'authority', 'shall', 'among', 'duty', 'even', 'free', 'executive', 'administration', 'each', 'between', 'every', 'others', 'under', 'president', 'called', 'individual', 'both', 'of'],
       ['freedom', 'prosperity', 'nation', 'citizens', 'countrymen', 'liberty', 'democracy', 'republic', 'citizen', 'oath', 'equal', 'peace', 'against', 'nations', 'without', 'constitutional', 'ourselves', 'beyond', 'free', 'constitution', 'respect', 'honor', 'who', 'national', 'president', 'strength', 'necessary', 'america', 'individual', 'country', 'hope', 'every', 'greater', 'united', 'world', 'principles', 'civil', 'strong', 'only', 'live', 'itself', 'even', 'nor', 'whose', 'stand', 'war', 'our', 'shall', 'governments', 'americans']], dtype='<U14')
```
Check out my GitHub repo for the full code.
References:
- Github
- Python Package
- Paper
- Expose a trained and saved Top2Vec model with a REST API : Rest-API for Top2Vec
Hope you learned something new today. Happy learning!