Topic Modelling is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.
Why do we need it?
- Discovering hidden topical patterns that are present across the collection.
- Annotating documents according to these topics.
- Using these annotations to organize, search and summarize texts.
Now to our topic, Top2Vec: an algorithm designed for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document, and word vectors. The model requires no stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics from your data. Awesome, right?
Let’s see how Top2Vec achieves this.
- Generates embedding vectors for documents and words using Doc2Vec, the Universal Sentence Encoder, or a BERT sentence transformer.
- Performs dimensionality reduction on the vectors using UMAP, which helps in finding dense areas.
- Uses HDBSCAN to automatically find those dense areas of documents, clusters them, and assigns a topic to each cluster.
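The steps above can be sketched in miniature with NumPy alone. This is only a conceptual toy: random vectors stand in for the trained joint embeddings, and a hard-coded label array stands in for UMAP + HDBSCAN. The key Top2Vec ideas it illustrates are that a topic vector is the centroid of a dense document cluster, and topic words are the word vectors nearest to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for jointly embedded document and word vectors
# (in Top2Vec these come from Doc2Vec, USE, or a BERT sentence transformer).
doc_vectors = rng.normal(size=(20, 8))
word_vectors = rng.normal(size=(50, 8))
vocab = [f"word_{i}" for i in range(50)]

# Toy "clustering": pretend HDBSCAN found two dense areas of documents.
labels = np.array([0] * 10 + [1] * 10)

# A topic vector is the centroid of the document vectors in its cluster.
topic_vectors = np.stack([doc_vectors[labels == k].mean(axis=0) for k in (0, 1)])

def top_words(topic_vec, n=5):
    # Topic words: word vectors nearest the topic vector by cosine similarity.
    sims = word_vectors @ topic_vec / (
        np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(topic_vec)
    )
    return [vocab[i] for i in np.argsort(-sims)[:n]]

for k, tv in enumerate(topic_vectors):
    print(k, top_words(tv))
```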
Let's get into the practical implementation.
Dataset used: Presidential Inaugural Addresses
Columns: US president name, address, date, and speech text
Installation
```shell
pip install top2vec
pip install top2vec[sentence_transformers]
pip install top2vec[sentence_encoders]
```
Reading Data
```python
from top2vec import Top2Vec
import pandas as pd

df = pd.read_csv("inaug_speeches.csv", engine='python', encoding='latin1')
df.head()
```
Cleaning Noisy Data
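The cleaning code itself isn't shown here; a minimal sketch (the `min_words` threshold and function name are my own assumptions, not from the original) might collapse whitespace and drop empty or very short speeches before training:

```python
import re

def clean_speeches(texts, min_words=50):
    """Collapse whitespace and drop empty or very short speeches.

    Hypothetical helper: threshold and name are illustrative only.
    """
    cleaned = []
    for t in texts:
        t = re.sub(r"\s+", " ", str(t)).strip()
        if len(t.split()) >= min_words:
            cleaned.append(t)
    return cleaned

docs = clean_speeches(["  Fellow citizens of the Senate ...  ", ""], min_words=1)
```

Note that Top2Vec itself needs no stop-word removal, stemming, or lemmatization, so cleaning here is only about removing genuinely broken rows.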
Now, we can train our model
```python
model = Top2Vec(df.text.values, embedding_model='universal-sentence-encoder')

'''
Models available:
universal-sentence-encoder
universal-sentence-encoder-multilingual
distiluse-base-multilingual-cased
'''
```
Get number of topics
```python
model.get_num_topics()
```
Output:
```
2
```
Get keywords for each topic
```python
model.topic_words
```
Output:
```
array([['constitutional', 'citizens', 'republic', 'oath', 'countrymen', 'democracy', 'constitution', 'nation', 'citizen', 'prosperity', 'against', 'respect', 'civil', 'freedom', 'without', 'honor', 'equal', 'congress', 'government', 'whose', 'who', 'liberty', 'powers', 'principles', 'national', 'rights', 'states', 'ourselves', 'principle', 'necessary', 'governments', 'nor', 'authority', 'shall', 'among', 'duty', 'even', 'free', 'executive', 'administration', 'each', 'between', 'every', 'others', 'under', 'president', 'called', 'individual', 'both', 'of'],
       ['freedom', 'prosperity', 'nation', 'citizens', 'countrymen', 'liberty', 'democracy', 'republic', 'citizen', 'oath', 'equal', 'peace', 'against', 'nations', 'without', 'constitutional', 'ourselves', 'beyond', 'free', 'constitution', 'respect', 'honor', 'who', 'national', 'president', 'strength', 'necessary', 'america', 'individual', 'country', 'hope', 'every', 'greater', 'united', 'world', 'principles', 'civil', 'strong', 'only', 'live', 'itself', 'even', 'nor', 'whose', 'stand', 'war', 'our', 'shall', 'governments', 'americans']], dtype='<U14')
```
Generate a WordCloud
```python
model.generate_topic_wordcloud(0)
```
Search topics by keywords
```python
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["citizens"], num_topics=2)
print(f"No of topics : {len(topic_nums)} and words are {topic_words[0]}")
```
Output:
```
No of topics : 2 and words are ['constitutional' 'citizens' 'republic' 'oath' 'countrymen' 'democracy' 'constitution' 'nation' 'citizen' 'prosperity' 'against' 'respect' 'civil' 'freedom' 'without' 'honor' 'equal' 'congress' 'government' 'whose' 'who' 'liberty' 'powers' 'principles' 'national' 'rights' 'states' 'ourselves' 'principle' 'necessary' 'governments' 'nor' 'authority' 'shall' 'among' 'duty' 'even' 'free' 'executive' 'administration' 'each' 'between' 'every' 'others' 'under' 'president' 'called' 'individual' 'both' 'of']
```
Get the First Topic
```python
topic_words, word_scores, topic_nums = model.get_topics(1)
print(topic_words)
```
Output:
```
[['constitutional' 'citizens' 'republic' 'oath' 'countrymen' 'democracy' 'constitution' 'nation' 'citizen' 'prosperity' 'against' 'respect' 'civil' 'freedom' 'without' 'honor' 'equal' 'congress' 'government' 'whose' 'who' 'liberty' 'powers' 'principles' 'national' 'rights' 'states' 'ourselves' 'principle' 'necessary' 'governments' 'nor' 'authority' 'shall' 'among' 'duty' 'even' 'free' 'executive' 'administration' 'each' 'between' 'every' 'others' 'under' 'president' 'called' 'individual' 'both' 'of']]
```
Get Similar Words
```python
# Semantic search: find words similar to a keyword
words, word_scores = model.similar_words(keywords=["constitutional"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")
```
Output:
```
constitution 0.7288548891777089
citizen 0.4723290988168851
citizens 0.4502901978727408
oath 0.430887191231749
government 0.4204032301028167
law 0.4152112776782078
rights 0.41506158969869855
congress 0.3923969740586745
democracy 0.3910433332140063
laws 0.37440227253656755
american 0.37049017996101846
president 0.36987784447077826
commerce 0.368667863317993
states 0.36404274276619314
state 0.3631170358867932
liberty 0.36074777573782213
foreign 0.3518555599625002
duties 0.3516476103913455
federal 0.35102676632961854
countrymen 0.3505416958618698
```
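Conceptually, this semantic search is a cosine-similarity lookup in the joint embedding space: normalize the word vectors, take dot products against the query word's vector, and return the top matches. A toy version with NumPy (random vectors stand in for the trained embeddings, so the scores are meaningless, only the mechanics matter):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["constitution", "citizen", "oath", "commerce", "liberty"]
word_vectors = rng.normal(size=(len(vocab), 16))

def similar_words(keyword, num_words=3):
    # Normalize rows so dot products are cosine similarities.
    unit = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    query = unit[vocab.index(keyword)]
    sims = unit @ query
    order = np.argsort(-sims)
    # Skip the keyword itself (self-similarity is 1.0).
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != keyword][:num_words]

for word, score in similar_words("constitution"):
    print(word, score)
```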
Search Documents by Keywords
```python
documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["government", "citizen"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()
```
Get the topic vector for each topic
```python
model.topic_vectors
```
Output:
```
array([[-0.03240328, -0.06166243, -0.02745233, ..., 0.04508633, -0.04117269, 0.04916515],
       [-0.0502777 , -0.06314228, -0.02798648, ..., 0.05642398, -0.00569224, 0.00679894]], dtype=float32)
```
Use the embedding model underlying the Top2Vec model to generate document embeddings for any section of text.
```python
embedding_vector = model.embed(["fellow citizens of the senate and of the house of representatives"])
embedding_vector.shape
```
Output:
```
TensorShape([1, 512])
```
Reduce the Number of Topics
```python
topic_mapping = model.hierarchical_topic_reduction(num_topics=1)
model.topic_words_reduced[0]
```
Output:
```
array(['citizens', 'republic', 'countrymen', 'constitutional', 'oath', 'nation', 'democracy', 'prosperity', 'citizen', 'freedom', 'constitution', 'against', 'liberty', 'equal', 'without', 'respect', 'honor', 'who', 'civil', 'ourselves', 'national', 'principles', 'whose', 'free', 'peace', 'necessary', 'powers', 'beyond', 'nations', 'nor', 'congress', 'government', 'governments', 'individual', 'even', 'every', 'rights', 'president', 'shall', 'states', 'each', 'under', 'principle', 'both', 'among', 'strength', 'greater', 'itself', 'between', 'duty'], dtype='<U14')
```
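Hierarchical topic reduction works by repeatedly merging the smallest topic into its most similar remaining topic until the target count is reached. A rough NumPy sketch of a single merge step, with made-up topic vectors and sizes (this is an illustration of the idea, not the library's actual implementation):

```python
import numpy as np

topic_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
topic_sizes = np.array([10, 3, 12])  # number of documents per topic

# Find the smallest topic and its most similar other topic (cosine similarity).
unit = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
small = int(np.argmin(topic_sizes))
sims = unit @ unit[small]
sims[small] = -np.inf          # don't merge a topic with itself
target = int(np.argmax(sims))

# Merge: size-weighted average of the two topic vectors.
merged = (topic_sizes[small] * topic_vectors[small]
          + topic_sizes[target] * topic_vectors[target]) / (topic_sizes[small] + topic_sizes[target])
print(f"merge topic {small} into topic {target}: {merged}")
```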
Save and load the Model
```python
model.save("inaug_speeches")
model = Top2Vec.load("inaug_speeches")
print(model.get_topics()[0])
```
Output:
```
array([['constitutional', 'citizens', 'republic', 'oath', 'countrymen', 'democracy', 'constitution', 'nation', 'citizen', 'prosperity', 'against', 'respect', 'civil', 'freedom', 'without', 'honor', 'equal', 'congress', 'government', 'whose', 'who', 'liberty', 'powers', 'principles', 'national', 'rights', 'states', 'ourselves', 'principle', 'necessary', 'governments', 'nor', 'authority', 'shall', 'among', 'duty', 'even', 'free', 'executive', 'administration', 'each', 'between', 'every', 'others', 'under', 'president', 'called', 'individual', 'both', 'of'],
       ['freedom', 'prosperity', 'nation', 'citizens', 'countrymen', 'liberty', 'democracy', 'republic', 'citizen', 'oath', 'equal', 'peace', 'against', 'nations', 'without', 'constitutional', 'ourselves', 'beyond', 'free', 'constitution', 'respect', 'honor', 'who', 'national', 'president', 'strength', 'necessary', 'america', 'individual', 'country', 'hope', 'every', 'greater', 'united', 'world', 'principles', 'civil', 'strong', 'only', 'live', 'itself', 'even', 'nor', 'whose', 'stand', 'war', 'our', 'shall', 'governments', 'americans']], dtype='<U14')
```
Check out my GitHub repo for the full code.
References:
- Github
- Python Package
- Paper
- Expose a trained and saved Top2Vec model with a REST API : Rest-API for Top2Vec
Hope you learned something new today. Happy learning!