Natural Language Processing

Day 3: Cleaning text, stopwords, predictions

Cleaning the text

Cleaning the text means removing special characters, numbers, and repeated whitespace from the text.
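As a rough illustration of what such a cleaning step can look like, here is a minimal sketch using Python's built-in re module. The function name clean_text and the sample sentence are hypothetical and are not taken from the course dataset or the course script. The sketch also assumes that only ASCII letters should be kept.

```python
import re

def clean_text(text):
    """Remove digits and special characters, then collapse repeated whitespace."""
    # Replace anything that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", letters_only).strip()

# Hypothetical review text, used only for illustration.
sample = "Great   food!!!  Will come back 2 more times :)"
print(clean_text(sample))  # Great food Will come back more times
```

Replacing unwanted characters with a space, rather than deleting them, keeps words such as "good,bad" from being glued together. The second substitution then removes the extra spaces this creates.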


Add a new script frame, click the copy text button, and paste the contents into the frame.

There is no output from this script.

Before cleaning

The text below still contains numbers, special characters, and extra spaces. Pay particular attention to reviews 73, 78, and 79, which originally contained extra spaces, numbers, and special characters.


Add a new script frame, click the copy text button, and paste the contents into the frame.

The output shows the dataset before it is cleaned.


After cleaning

Add a new script frame, click the copy text button, and paste the contents into the frame.

The output shows that numbers, special characters, and extra spaces have been removed.


Stop Words

Stop words are words that are very common but provide little useful information for most text analysis procedures. They help convey the structure of a sentence, but they do not help you understand the semantics of the sentences themselves.[1]
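To make the idea concrete, here is a minimal sketch of filtering stop words out of a sentence. The STOP_WORDS set is a small hand-written subset chosen for illustration, and the function name remove_stop_words is hypothetical. Libraries such as NLTK ship much longer lists.

```python
# Small illustrative subset; real stop word lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "was", "in"}

def remove_stop_words(text):
    """Return the words of `text` that are not in STOP_WORDS, in original order."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

# Hypothetical sentence, used only for illustration.
print(remove_stop_words("The food was great and the service is fast"))
# ['food', 'great', 'service', 'fast']
```

The words that remain carry most of the meaning of the sentence, which is why this step is commonly applied before further analysis.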

Here’s the code to generate a list of the most commonly used stop words.


Add a new script frame, click the copy text button, and paste the contents into the frame.

Your output should look like:


Stopwords script

The following script shows you how this function works. I made a new Python project and put the file into a dictionary format.
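One possible reading of "dictionary format" is a Python dict keyed by stop word, so that checking a word becomes a single lookup. The sketch below works under that assumption. The helper build_stop_word_dict, the short word list, and the sample sentence are all hypothetical and may differ from the course script.

```python
def build_stop_word_dict(words):
    """Map each stop word to True so membership tests are a dictionary lookup."""
    return {word.lower(): True for word in words}

# Hypothetical short list; a real script would load a fuller list.
stop_word_dict = build_stop_word_dict(["the", "is", "a", "and", "was"])

sentence = "The pizza was a bit cold"
# Keep only the words that are not present in the dictionary.
kept = [w for w in sentence.lower().split() if not stop_word_dict.get(w, False)]
print(kept)  # ['pizza', 'bit', 'cold']
```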

Start a new Python project, then click the copy text button and paste the contents into the frame. Save it as "stopWords" and run it.

Your output should look like:


Python provides a number of data structures, such as: