CleanText is an open-source Python package for cleaning raw text data.
Features
- Remove extra white space
- Convert all text to lowercase
- Remove digits from the text
- Remove punctuation from the text
- Replace or remove parts of the text with a custom regex
- Select a language for stop words and remove the stop words
- Stem the words. Stemming reduces inflected or derived words to a common root form (for example, "running" and "runs" typically both become "run")
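To make the simpler operations in the list above concrete, here is a minimal sketch that approximates them with the Python standard library only. It is not CleanText's actual implementation: the function name `basic_clean` and the tiny `STOP_WORDS` set are hypothetical and exist only for illustration, and stemming is left out because it needs a real stemmer.

```python
import re
import string

# Hypothetical, deliberately tiny stop word set (real lists are much longer)
STOP_WORDS = frozenset({"a", "an", "and", "for", "of", "the"})

def basic_clean(text, stop_words=STOP_WORDS):
    """Approximate a few cleaning steps; illustrative only."""
    text = text.lower()                     # convert to lowercase
    text = re.sub(r"\d+", "", text)         # remove digits
    # remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses extra white space
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

print(basic_clean("  The quick, brown fox: 42 jumps!  "))
# prints: quick brown fox jumps
```

The order of the steps matters in this sketch: punctuation and digits are stripped before splitting, so tokens such as "fox:" are compared against the stop word set in their cleaned form.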
Installation
pip install cleantext
Example
To return the cleaned text as a string:
import cleantext
cleantext.clean("your_raw_text_here")
To return the cleaned text as a list of words:
import cleantext
cleantext.clean_words("your_raw_text_here")
To choose which cleaning operations to run, pass the corresponding arguments:
import cleantext
cleantext.clean("your_raw_text_here",
    clean_all=False,  # Execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String that replaces the text matched by reg
    stp_lang='english'  # Language for stop words
)
clean_text = "01040200 (BNF code for codeine phosphate 60mg tablets)"
cleantext.clean(clean_text, lowercase=True, numbers=True, stopwords=True, punct=True)
Output: "bnf code codeine phosphate mg tablets"
Using a regex to replace email addresses with a keyword:
text = "my id, name1@dom1.com and your, name2@dom2.in"
cleantext.clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)
Output: "my id, email and your, email"
Using a regex to replace a phone number with a keyword:
text = "My number is (123)-567-8912"
cleantext.clean(text, reg=r"(\d{10})|(\(\d{3}\)-\d{3}-\d{4})", reg_replace="<num>", clean_all=False)
Output: "My number is <num>"
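The two patterns above can be tried out on their own before passing them to `reg`. The sketch below uses only the standard `re` module and assumes the replacement behaves like a plain `re.sub`; the helper name `mask` is hypothetical and not part of CleanText.

```python
import re

# Patterns copied from the two examples above, written as raw strings
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")
PHONE_RE = re.compile(r"(\d{10})|(\(\d{3}\)-\d{3}-\d{4})")

def mask(text, pattern, replacement):
    """Replace every match of pattern in text with replacement."""
    return pattern.sub(replacement, text)

print(mask("my id, name1@dom1.com and your, name2@dom2.in", EMAIL_RE, "email"))
# prints: my id, email and your, email
print(mask("My number is (123)-567-8912", PHONE_RE, "<num>"))
# prints: My number is <num>
```

Note that the email pattern only contains lowercase character ranges, so an address containing uppercase letters may be missed or only partially matched unless the text is lowercased first or the pattern is compiled with `re.IGNORECASE`.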
Hope you learned something new today. Happy learning!