Text data can be messy in its own unique way. Imagine a city column with entries like 'New York', ' new york ', and 'new york'. To a computer, these are three different values, so any count or grouping by city would split one city into three and ruin the analysis.
Engagement Message
Why is it so important for identical items to be written in the exact same way?
To clean text, we need to use string methods. In Pandas, you access these by using the .str accessor on a Series. For example, to work on a city column, you would start with df['city'].str. This tells Pandas to apply a string method to every entry in the column, one entry at a time.
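As a minimal sketch of the accessor in action, the example below builds a small DataFrame and calls one string method through .str. The sample values are made up for illustration; only the city column name comes from the lesson.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"city": ["New York", " new york ", "new york"]})

# .str exposes string methods that run on every entry of the Series
lengths = df["city"].str.len()

print(lengths.tolist())  # [8, 10, 8]
```

The differing lengths show that the second entry carries hidden spaces, even though it looks similar to the others.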
Engagement Message
Why do you think Pandas requires this extra .str step?
A common issue is extra whitespace. The .str.strip() method removes whitespace (spaces, tabs, and newlines) from the beginning and end of a string, and leaves any spaces inside it untouched. So, ' chicago ' becomes 'chicago'. This is a crucial first step for cleaning up user-entered text.
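Here is a short sketch of .str.strip() on a Series. The entries are invented examples of typical data entry slips, not data from a real source.

```python
import pandas as pd

# Hypothetical entries with stray whitespace at the edges
cities = pd.Series([" chicago ", "chicago  ", "\tchicago\n", " new york "])

# strip() trims the edges only; the space inside 'new york' is kept
stripped = cities.str.strip()

print(stripped.tolist())  # ['chicago', 'chicago', 'chicago', 'new york']
```

Note that strip() returns a new Series, so the result has to be assigned back (for example to the original column) for the change to stick.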
Engagement Message
What kind of common data entry errors does .str.strip() help fix?
Next, we tackle inconsistent capitalization. The .str.lower() method converts every character in a string to lowercase. This ensures that 'Boston', 'boston', and 'BOSTON' are all treated as the same value: 'boston'.
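The sketch below puts both steps together, assuming a made-up city column: it trims whitespace first, then lowercases, and counts the distinct values before and after.

```python
import pandas as pd

# Hypothetical messy column for illustration
df = pd.DataFrame({"city": ["Boston", "boston", "BOSTON", " Boston "]})

before = df["city"].nunique()  # distinct values before cleaning

# Chain the two string methods and write the result back to the column
df["city"] = df["city"].str.strip().str.lower()

after = df["city"].nunique()  # distinct values after cleaning

print(before, after)  # 4 1
```

Each call in the chain needs its own .str, because every string method returns a new Series.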
