Topic Modelling and Semantic Search with Top2Vec

Amal
5 min read · Jun 16, 2022


Topic Modelling is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.

Why do we need it?

  • Discovering hidden topical patterns that are present across the collection.
  • Annotating documents according to these topics.
  • Using these annotations to organize, search and summarize texts.

Coming to our topic, Top2Vec is an algorithm designed specifically for topic modelling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document, and word vectors. The model requires no stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics in your data. Awesome, right?

Let’s see how Top2Vec achieves this. At a high level, the algorithm:

  • creates jointly embedded document and word vectors (with Doc2Vec, Universal Sentence Encoder, or a sentence-transformer model),
  • reduces the dimensionality of the document vectors with UMAP,
  • finds dense clusters of documents with HDBSCAN, and
  • takes the centroid of each dense cluster as a topic vector; the word vectors nearest to it become that topic’s words.
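A minimal sketch of the core idea, using made-up 2-D toy vectors rather than real embeddings: each topic vector is the centroid of a cluster of document vectors, and the word vectors closest to it (by cosine similarity) become the topic words.

```python
import numpy as np

# Toy hand-made 2-D embeddings standing in for jointly embedded vectors.
# In Top2Vec these would come from Doc2Vec or Universal Sentence Encoder.
doc_vectors = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.1, 0.9]])
word_vectors = {"government": np.array([0.85, 0.15]),
                "freedom":    np.array([0.2, 0.8]),
                "congress":   np.array([0.9, 0.1])}

# Suppose the clustering step put documents 0 and 1 in one dense cluster.
cluster = doc_vectors[[0, 1]]

# Topic vector = centroid (arithmetic mean) of the cluster's document vectors.
topic_vector = cluster.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Topic words = word vectors ranked by cosine similarity to the topic vector.
ranked = sorted(word_vectors,
                key=lambda w: cosine(topic_vector, word_vectors[w]),
                reverse=True)
print(ranked)  # → ['government', 'congress', 'freedom']
```

The real model does the same thing in a much higher-dimensional space, which is why it needs UMAP before clustering.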

Let’s get into the practical implementation.

Dataset used : Presidential Inaugural Addresses

Columns : US president name, address, date, and speech text

Installation

pip install top2vec
pip install top2vec[sentence_transformers]
pip install top2vec[sentence_encoders]

Reading Data

from top2vec import Top2Vec
import pandas as pd

df = pd.read_csv("inaug_speeches.csv", engine='python', encoding='latin1')
df.head()
Sample of data

Cleaning Noisy Data

An example of the noise in the raw text can be seen in the sample above.
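The article does not show the cleaning code itself, so here is a hedged sketch of one way the noisy speech text could be cleaned with Python’s standard library. The specific kinds of noise handled here (HTML entities, bracketed annotations, stray non-text characters) are assumptions about what the raw CSV contains, not the author’s exact cleaning steps.

```python
import html
import re

def clean_speech(text: str) -> str:
    """Remove common noise from raw speech text (a sketch, not the exact cleaning used)."""
    text = html.unescape(text)                            # e.g. '&amp;' -> '&'
    text = re.sub(r"\[[^\]]*\]", " ", text)               # bracketed annotations like [Applause]
    text = re.sub(r"[^A-Za-z0-9.,;:'\"?! -]", " ", text)  # stray non-text characters
    return re.sub(r"\s+", " ", text).strip()              # collapse runs of whitespace

sample = "Fellow-Citizens:\x97 the oath&amp; [Applause] I take\n today"
print(clean_speech(sample))  # → Fellow-Citizens: the oath I take today
```

Applied to the DataFrame, this would be something like `df["text"] = df["text"].apply(clean_speech)` before training.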

Now, we can train our model

model = Top2Vec(df.text.values, embedding_model='universal-sentence-encoder')

''' Models available:
universal-sentence-encoder
universal-sentence-encoder-multilingual
distiluse-base-multilingual-cased
'''

Get number of topics

model.get_num_topics()
Output : 2

Get keywords for each topic

model.topic_words

Output :
array([['constitutional', 'citizens', 'republic', 'oath', 'countrymen',
        'democracy', 'constitution', 'nation', 'citizen', 'prosperity',
        'against', 'respect', 'civil', 'freedom', 'without', 'honor',
        'equal', 'congress', 'government', 'whose', 'who', 'liberty',
        'powers', 'principles', 'national', 'rights', 'states',
        'ourselves', 'principle', 'necessary', 'governments', 'nor',
        'authority', 'shall', 'among', 'duty', 'even', 'free',
        'executive', 'administration', 'each', 'between', 'every',
        'others', 'under', 'president', 'called', 'individual', 'both',
        'of'],
       ['freedom', 'prosperity', 'nation', 'citizens', 'countrymen',
        'liberty', 'democracy', 'republic', 'citizen', 'oath', 'equal',
        'peace', 'against', 'nations', 'without', 'constitutional',
        'ourselves', 'beyond', 'free', 'constitution', 'respect',
        'honor', 'who', 'national', 'president', 'strength', 'necessary',
        'america', 'individual', 'country', 'hope', 'every', 'greater',
        'united', 'world', 'principles', 'civil', 'strong', 'only',
        'live', 'itself', 'even', 'nor', 'whose', 'stand', 'war', 'our',
        'shall', 'governments', 'americans']], dtype='<U14')

Generate a WordCloud

model.generate_topic_wordcloud(0)

Search topics by keywords

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["citizens"], num_topics=2)
print(f"No of topics : {len(topic_nums)} and words are {topic_words[0]}")

Output : No of topics : 2 and words are
['constitutional' 'citizens' 'republic' 'oath' 'countrymen' 'democracy'
 'constitution' 'nation' 'citizen' 'prosperity' 'against' 'respect'
 'civil' 'freedom' 'without' 'honor' 'equal' 'congress' 'government'
 'whose' 'who' 'liberty' 'powers' 'principles' 'national' 'rights'
 'states' 'ourselves' 'principle' 'necessary' 'governments' 'nor'
 'authority' 'shall' 'among' 'duty' 'even' 'free' 'executive'
 'administration' 'each' 'between' 'every' 'others' 'under' 'president'
 'called' 'individual' 'both' 'of']

Get the First Topic

topic_words, word_scores, topic_nums = model.get_topics(1)
print(topic_words)

Output :
[['constitutional' 'citizens' 'republic' 'oath' 'countrymen' 'democracy'
  'constitution' 'nation' 'citizen' 'prosperity' 'against' 'respect'
  'civil' 'freedom' 'without' 'honor' 'equal' 'congress' 'government'
  'whose' 'who' 'liberty' 'powers' 'principles' 'national' 'rights'
  'states' 'ourselves' 'principle' 'necessary' 'governments' 'nor'
  'authority' 'shall' 'among' 'duty' 'even' 'free' 'executive'
  'administration' 'each' 'between' 'every' 'others' 'under' 'president'
  'called' 'individual' 'both' 'of']]

Get Similar Words

# To get similar words : semantic search
words, word_scores = model.similar_words(keywords=["constitutional"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")

Output : constitution 0.7288548891777089
citizen 0.4723290988168851
citizens 0.4502901978727408
oath 0.430887191231749
government 0.4204032301028167
law 0.4152112776782078
rights 0.41506158969869855
congress 0.3923969740586745
democracy 0.3910433332140063
laws 0.37440227253656755
american 0.37049017996101846
president 0.36987784447077826
commerce 0.368667863317993
states 0.36404274276619314
state 0.3631170358867932
liberty 0.36074777573782213
foreign 0.3518555599625002
duties 0.3516476103913455
federal 0.35102676632961854
countrymen 0.3505416958618698

Search Documents by Keywords

documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["government", "citizen"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Get the Topic Vector for Each Topic

model.topic_vectors

Output :
array([[-0.03240328, -0.06166243, -0.02745233, ...,  0.04508633,
        -0.04117269,  0.04916515],
       [-0.0502777 , -0.06314228, -0.02798648, ...,  0.05642398,
        -0.00569224,  0.00679894]], dtype=float32)

Use the Top2Vec model’s underlying embedding model to generate document embeddings for any section of text

embedding_vector = model.embed(["fellow citizens of the senate and of the house of representatives"])
embedding_vector.shape

Output : TensorShape([1, 512])
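Once you can embed arbitrary text, semantic search reduces to a nearest-neighbour lookup in the embedding space: rank the stored document vectors by cosine similarity to the query vector. A minimal sketch with made-up 4-D toy vectors (the real ones would be the 512-D outputs of model.embed and the model’s stored document vectors):

```python
import numpy as np

# Toy stand-ins for the model's stored document embeddings.
document_vectors = np.array([[0.9, 0.1, 0.0, 0.1],   # doc 0
                             [0.1, 0.9, 0.1, 0.0],   # doc 1
                             [0.8, 0.2, 0.1, 0.0]])  # doc 2
# Pretend this is the embedding of a search query.
query_vector = np.array([0.85, 0.15, 0.05, 0.05])

# Cosine similarity between the query and every document vector.
norms = np.linalg.norm(document_vectors, axis=1) * np.linalg.norm(query_vector)
scores = document_vectors @ query_vector / norms

# Rank documents from most to least similar.
ranking = np.argsort(scores)[::-1]
print(ranking)  # → [0 2 1]: docs 0 and 2 point the same way as the query
```

This is the principle behind searching documents semantically; the library wraps it behind higher-level search methods so you rarely compute the similarities yourself.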

Reduce the Number of Topics

topic_mapping = model.hierarchical_topic_reduction(num_topics=1)
model.topic_words_reduced[0]

Output :
array(['citizens', 'republic', 'countrymen', 'constitutional', 'oath',
       'nation', 'democracy', 'prosperity', 'citizen', 'freedom',
       'constitution', 'against', 'liberty', 'equal', 'without',
       'respect', 'honor', 'who', 'civil', 'ourselves', 'national',
       'principles', 'whose', 'free', 'peace', 'necessary', 'powers',
       'beyond', 'nations', 'nor', 'congress', 'government',
       'governments', 'individual', 'even', 'every', 'rights',
       'president', 'shall', 'states', 'each', 'under', 'principle',
       'both', 'among', 'strength', 'greater', 'itself', 'between',
       'duty'], dtype='<U14')

Save and load the Model

model.save("inaug_speeches")
model = Top2Vec.load("inaug_speeches")
print(model.get_topics()[0])

Output :
array([['constitutional', 'citizens', 'republic', 'oath', 'countrymen',
        'democracy', 'constitution', 'nation', 'citizen', 'prosperity',
        'against', 'respect', 'civil', 'freedom', 'without', 'honor',
        'equal', 'congress', 'government', 'whose', 'who', 'liberty',
        'powers', 'principles', 'national', 'rights', 'states',
        'ourselves', 'principle', 'necessary', 'governments', 'nor',
        'authority', 'shall', 'among', 'duty', 'even', 'free',
        'executive', 'administration', 'each', 'between', 'every',
        'others', 'under', 'president', 'called', 'individual', 'both',
        'of'],
       ['freedom', 'prosperity', 'nation', 'citizens', 'countrymen',
        'liberty', 'democracy', 'republic', 'citizen', 'oath', 'equal',
        'peace', 'against', 'nations', 'without', 'constitutional',
        'ourselves', 'beyond', 'free', 'constitution', 'respect',
        'honor', 'who', 'national', 'president', 'strength', 'necessary',
        'america', 'individual', 'country', 'hope', 'every', 'greater',
        'united', 'world', 'principles', 'civil', 'strong', 'only',
        'live', 'itself', 'even', 'nor', 'whose', 'stand', 'war', 'our',
        'shall', 'governments', 'americans']], dtype='<U14')

Check out my GitHub repo for the full code.


Hope you learned something new today. Happy learning!
