Preprocessing Data

Cris
3 min readJul 26, 2021

In this blog, we are going to look at preprocessing data in order to get it ready for modeling. For this example we are going to use the Chicago Traffic Crash Database.

2 Load the Data/Filtering for Chosen Zip-codes

First we import libraries we will need to load, process, and plot our data.

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import pandas.tseries
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline

We load the data, ensure it loaded correctly, and take a quick look at .info() to see what we will be working with in this dataframe.

# load dataframe and print head to ensure it loaded correctly
df = pd.read_csv('zillow_data2.csv')
df.head()

5 rows × 310 columns

df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30329 entries, 0 to 30328
Columns: 310 entries, RegionID to 2021-01
dtypes: float64(301), int64(3), object(6)
memory usage: 71.7+ MB

Now that we have a basic idea of the data, we need to start filtering for Portland, Oregon. Since there are multiple Portlands in the United States, we will first drop any rows that are outside or Oregon.

df_or = df.copy()
df_or = df_or[df_or.State == 'OR']
df_or.head()

5 rows × 310 columns

Now we can create our Portland dataframe.

df_port = df_or.copy()
df_port = df_port[df_port.City == 'Portland']
df_port.head()

5 rows × 310 columns

We can drop [‘City’,’State’, ‘StateName’, ‘Metro’, ‘SizeRank’, ‘CountyName’, ‘RegionID’, ‘RegionType’] columns since all the data in those columns are either repetitive or irrelevant.

df_port.drop(labels=['City','State', 'StateName', 'Metro', 'SizeRank', 'CountyName','RegionID', 'RegionType'], axis=1, inplace=True)
df_port['RegionName'].nunique()
28

3 Data Preprocessing

Reshape the dataframe from Wide to Long Format so the dates and average prices become columns. Then we examine and process the dataframe

df_portmelt = df_port.melt(id_vars=['RegionName'], var_name='Date',
value_name='Avg_Price')
df_portmelt.head()
df_portmelt.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8428 entries, 0 to 8427
Data columns (total 3 columns):
RegionName 8428 non-null int64
Date 8428 non-null object
Avg_Price 8428 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 197.6+ KB

Here we see that we need to update the datatype of date.

# Convert Date to datetime
df_portmelt['Date'] = pd.to_datetime(df_portmelt['Date'], format='%Y-%m')

RegionName is another name for zip-code which is more of a categorical variable so we should convert the data type as well.

# Convert RegionName to string as it will be used categoricaldf_portmelt['RegionName'] = 'zip: ' + df_portmelt['RegionName'].astype(str)

Let’s take another look to make sure our changes worked as planned.

df_portmelt.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8428 entries, 0 to 8427
Data columns (total 3 columns):
RegionName 8428 non-null object
Date 8428 non-null datetime64[ns]
Avg_Price 8428 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 197.6+ KB
df_portmelt.head()

Quickly check for null values but from the examination of .info() above, there should not be any.

df_portmelt.isnull().sum()RegionName    0
Date 0
Avg_Price 0
dtype: int64

--

--

Cris
0 Followers

Data analyst with experience in web scraping, SQL, data modeling, and machine learning.