Preprocessing Data

In this post, we look at preprocessing data to get it ready for modeling. For this example, we use a wide-format dataset of average prices by zip code and narrow it down to Portland, Oregon.

2 Load the Data/Filtering for Chosen Zip-codes

First, we import the libraries we will need to load, process, and plot our data.

We load the data, ensure it loaded correctly, and take a quick look at .info() to see what we will be working with in this dataframe.

5 rows × 310 columns
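
As a rough sketch of this loading step, the snippet below reads a tiny, made-up CSV from a string instead of the real file, since the actual file name is not given here. The column names follow the ones mentioned in this post; the rows and the date-style column labels are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: three made-up rows in wide format,
# with one column per date (the date labels are an assumption).
csv_text = """RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2020-01-31,2020-02-29
1,10,97201,Zip,OR,OR,Portland,Metro A,County A,400000,402000
2,20,4101,Zip,ME,ME,Portland,Metro B,County B,300000,301000
3,30,97401,Zip,OR,OR,Eugene,Metro C,County C,350000,351000
"""

# Load the data (with a real file this would be pd.read_csv("<path to file>")).
df = pd.read_csv(io.StringIO(csv_text))

# Confirm it loaded correctly and inspect column types and non-null counts.
print(df.head())
df.info()
```

With the real file, `.head()` is where the "5 rows × 310 columns" summary comes from.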

Now that we have a basic idea of the data, we need to start filtering for Portland, Oregon. Since there are multiple Portlands in the United States, we will first drop any rows that are outside of Oregon.

5 rows × 310 columns
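
The state filter can be sketched with a boolean mask. This is a minimal example on made-up rows, and it assumes the State column stores two-letter abbreviations such as "OR".

```python
import pandas as pd

# Made-up rows for illustration; the real dataframe has many more columns.
df = pd.DataFrame({
    "RegionName": [97201, 4101, 97401],
    "City": ["Portland", "Portland", "Eugene"],
    "State": ["OR", "ME", "OR"],
    "2020-01-31": [400000, 300000, 350000],
})

# Keep only the rows located in Oregon (assumes two-letter state codes).
oregon_df = df[df["State"] == "OR"]
print(oregon_df)
```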

Now we can create our Portland dataframe.

5 rows × 310 columns
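
Creating the Portland dataframe might look like the sketch below, again on made-up rows. Calling `.copy()` and resetting the index are optional choices made here, not something the original steps require; the copy gives us an independent dataframe that is safe to modify later.

```python
import pandas as pd

# Made-up Oregon-only rows for illustration.
oregon_df = pd.DataFrame({
    "RegionName": [97201, 97202, 97401],
    "City": ["Portland", "Portland", "Eugene"],
    "State": ["OR", "OR", "OR"],
    "2020-01-31": [400000, 410000, 350000],
})

# Keep only Portland rows; copy so later edits do not touch oregon_df.
portland_df = oregon_df[oregon_df["City"] == "Portland"].copy()
portland_df = portland_df.reset_index(drop=True)
print(portland_df)
```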

We can drop the ['City', 'State', 'StateName', 'Metro', 'SizeRank', 'CountyName', 'RegionID', 'RegionType'] columns, since the data in them is either repetitive or irrelevant.
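
A small sketch of dropping those columns with `DataFrame.drop`. The row values are made up; only the column names come from the list above.

```python
import pandas as pd

# One made-up Portland row with the columns named in this post.
portland_df = pd.DataFrame({
    "RegionID": [1],
    "SizeRank": [10],
    "RegionName": [97201],
    "RegionType": ["Zip"],
    "StateName": ["OR"],
    "State": ["OR"],
    "City": ["Portland"],
    "Metro": ["Metro A"],
    "CountyName": ["County A"],
    "2020-01-31": [400000],
    "2020-02-29": [402000],
})

cols_to_drop = ["City", "State", "StateName", "Metro",
                "SizeRank", "CountyName", "RegionID", "RegionType"]

# Drop the repetitive or irrelevant columns; the zip code and dates remain.
portland_df = portland_df.drop(columns=cols_to_drop)
print(portland_df.columns.tolist())
```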

3 Data Preprocessing

We reshape the dataframe from wide to long format, so that the dates and average prices each become a single column instead of one column per date. Then we examine and process the dataframe.
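
One way to do the wide-to-long reshape is `DataFrame.melt`. In this sketch the data is made up, and the output column names `date` and `avg_price` are our own choices for illustration.

```python
import pandas as pd

# Made-up wide-format data: one row per zip code, one column per date.
portland_df = pd.DataFrame({
    "RegionName": [97201, 97202],
    "2020-01-31": [400000, 410000],
    "2020-02-29": [402000, 413000],
})

# Melt every date column into (date, avg_price) pairs, one row per pair.
long_df = portland_df.melt(
    id_vars="RegionName",    # column to keep as an identifier
    var_name="date",         # former column labels go here
    value_name="avg_price",  # former cell values go here
)
print(long_df)
long_df.info()
```

Two zip codes and two date columns give four rows in the long format.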

Here we see that the date column is not stored as a datetime, so we need to update its data type.
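
A minimal sketch of the conversion with `pd.to_datetime`, assuming the dates are strings in `YYYY-MM-DD` form (the real labels may use a different format, in which case the `format` argument would need to change).

```python
import pandas as pd

# Made-up long-format data with dates stored as plain strings.
long_df = pd.DataFrame({
    "RegionName": [97201, 97201],
    "date": ["2020-01-31", "2020-02-29"],
    "avg_price": [400000, 402000],
})

# Parse the strings into datetime64 values (format is an assumption).
long_df["date"] = pd.to_datetime(long_df["date"], format="%Y-%m-%d")
print(long_df.dtypes)
```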

RegionName is another name for zip code, which is a categorical variable rather than a numeric one, so we should convert its data type as well.
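
The sketch below converts RegionName to strings with `astype(str)`. Using strings is an assumption on our part; `astype("category")` would be another reasonable option for a categorical variable.

```python
import pandas as pd

# Made-up data where zip codes were read in as integers.
long_df = pd.DataFrame({
    "RegionName": [97201, 97202],
    "avg_price": [400000, 410000],
})

# Treat zip codes as labels, not numbers, so they are never averaged or summed.
long_df["RegionName"] = long_df["RegionName"].astype(str)
print(long_df["RegionName"].tolist())
```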

Let’s take another look to make sure our changes worked as planned.

Finally, we quickly check for null values, although based on the .info() output above there should not be any.
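
A quick null check could look like this, shown on made-up data that has no missing values.

```python
import pandas as pd

# Made-up long-format data without missing values.
long_df = pd.DataFrame({
    "RegionName": ["97201", "97202"],
    "date": pd.to_datetime(["2020-01-31", "2020-01-31"]),
    "avg_price": [400000, 410000],
})

# Count missing values per column, then reduce to a single yes/no answer.
null_counts = long_df.isna().sum()
has_nulls = bool(long_df.isna().any().any())
print(null_counts)
print("Any nulls:", has_nulls)
```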

Data analyst with experience in web scraping, SQL, data modeling, and machine learning.