Easily Clean Raw Text Data using CleanText

Amal
2 min read · Dec 14, 2022

CleanText is an open-source Python package for cleaning raw text data.

Features

  • Remove extra white space
  • Convert all text to lowercase
  • Remove digits from the text
  • Remove punctuation from the text
  • Replace or remove parts of the text with a custom regex
  • Remove stop words, with a choice of stop-word language
  • Stem the words. Stemming reduces inflected or derived words to a common root form (for example, "cleaning" and "cleaned" both become "clean")
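To make the idea of stemming concrete, here is a minimal sketch of a toy suffix-stripping stemmer. It is not how CleanText stems words; the `SUFFIXES` rule set and the `toy_stem` helper are made up for illustration, and real stemmers such as the Porter stemmer apply many more rules.

```python
# Toy stemmer for illustration only; SUFFIXES and toy_stem are hypothetical names.
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connect", "connects", "connected", "connecting", "connection"]
stems = [toy_stem(w) for w in words]
print(stems)  # every word reduces to "connect"
```

All five word forms collapse to the same stem, which is why stemming is useful before counting or comparing words.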

Installation

pip install cleantext

Example

To return the text in a string format,

import cleantext
cleantext.clean("your_raw_text_here")

To return a list of words from the text,

import cleantext
cleantext.clean_words("your_raw_text_here")

To choose a specific set of cleaning operations,

import cleantext

cleantext.clean("your_raw_text_here",
    clean_all=False,  # Set to True to run all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Regex matching the parts of the text to remove or replace
    reg_replace='<replace_value>',  # Replacement string for the text matched by reg
    stp_lang='english'  # Language for stop words
)

clean_text = "01040200 (BNF code for codeine phosphate 60mg tablets)"

cleantext.clean(clean_text, lowercase=True, numbers=True, stopwords=True, punct=True)

Output: "bnf code codeine phosphate mg tablets"
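
To see what these four options do to the text, the same steps can be sketched with only the standard library. This is an approximation under stated assumptions, not CleanText's implementation: `simple_clean` is a hypothetical helper, and `STOP_WORDS` is a tiny made-up list standing in for a full English stop-word list.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "the", "to"}


def simple_clean(text, lowercase=False, numbers=False, punct=False, stopwords=False):
    """Apply the selected cleaning steps and return a single string."""
    if lowercase:
        text = text.lower()
    if numbers:
        text = re.sub(r"\d+", "", text)  # drop every run of digits
    if punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()  # splitting on white space also collapses extra spaces
    if stopwords:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return " ".join(words)


raw = "01040200 (BNF code for codeine phosphate 60mg tablets)"
result = simple_clean(raw, lowercase=True, numbers=True, punct=True, stopwords=True)
print(result)  # bnf code codeine phosphate mg tablets
```

Note that removing digits turns "60mg" into "mg", and the stop word "for" disappears, which matches the output shown above.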

Using a regex to replace email addresses with a keyword:

text = "my id, name1@dom1.com and your, name2@dom2.in"
cleantext.clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)

Output: "my id, email and your, email"
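
The pattern itself can be checked independently of CleanText with Python's built-in `re` module. The sketch below uses `re.sub` directly, which is only an assumption about what a regex replacement amounts to, not a description of the library's internals. The names `EMAIL_PATTERN` and `EMAIL_PATTERN_ANY_CASE` are illustrative.

```python
import re

# Same pattern as in the example above; it only matches lowercase addresses.
EMAIL_PATTERN = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")

text = "my id, name1@dom1.com and your, name2@dom2.in"
masked = EMAIL_PATTERN.sub("email", text)
print(masked)  # my id, email and your, email

# To also catch uppercase letters, compile the pattern with re.IGNORECASE.
EMAIL_PATTERN_ANY_CASE = re.compile(EMAIL_PATTERN.pattern, re.IGNORECASE)
masked_mixed = EMAIL_PATTERN_ANY_CASE.sub("email", "Contact Name@Dom.COM")
print(masked_mixed)  # Contact email
```

Because the character classes list only `a-z`, mixed-case input needs either the flag shown here or lowercasing before the replacement.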

Using a regex to replace a phone number with a keyword:

text = "My number is (123)-567-8912"

cleantext.clean(text, reg=r"(\d{10})|(\(\d{3}\)-\d{3}-\d{4})", reg_replace="<num>", clean_all=False)

Output: "My number is <num>"
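
It can help to test which phone formats the pattern actually covers before relying on it. The following sketch again uses plain `re.sub` rather than CleanText, and `mask_phone` is a hypothetical helper. The pattern is written as a raw string so that `\d` and `\(` reach the regex engine unchanged.

```python
import re

# Matches either ten consecutive digits or the form (123)-567-8912.
PHONE_PATTERN = re.compile(r"(\d{10})|(\(\d{3}\)-\d{3}-\d{4})")


def mask_phone(text: str) -> str:
    """Replace every phone number the pattern recognizes with <num>."""
    return PHONE_PATTERN.sub("<num>", text)


print(mask_phone("My number is (123)-567-8912"))  # My number is <num>
print(mask_phone("Call 1234567890"))              # Call <num>
print(mask_phone("Call 123-567-8912"))            # unchanged: format not covered
```

The last call shows a limitation: a number written as 123-567-8912 matches neither alternative, so the pattern would need another branch to cover it.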

Hope you learned something new today. Happy learning!
