Categorical Values

Cris
Jul 10, 2021


We will start by downloading the Ames Housing dataset from https://www.kaggle.com/prevek18/ames-housing-dataset and saving it as ames.csv so we can load it into a pandas dataframe using read_csv().

# Import pandas so we can load the data
import pandas as pd

Load the CSV into a dataframe that will be used throughout this notebook.

df = pd.read_csv('ames.csv')

Now that we have it loaded and saved as a dataframe, let’s look at the first five rows to see what we will be working with:

# Inspect the first few rows
df.head()

5 rows × 81 columns

As we can see, there are more columns than will show up when printing the dataframe, so let’s take a look at all of the columns with .info().

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

From the above we can see that we have several data types to work with: 35 int64 columns, 3 float64 columns, and 43 object columns. The object columns hold the text-valued categorical features.
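As a rough sketch of how the columns could be split by type, the snippet below uses select_dtypes on a small made-up frame. The `toy` dataframe and its values are illustrative stand-ins, not rows from ames.csv; with the real data you would call the same methods on df.

```python
import pandas as pd

# Toy stand-in for the Ames data (made-up values) so the sketch runs without ames.csv
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, None],
    "MSZoning": ["RL", "RL", "RM"],
    "KitchenQual": ["Gd", "TA", "Gd"],
})

# Numeric columns (int64 and float64)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else; in this sketch those are the text-valued categorical columns
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Keep in mind that a numeric dtype does not guarantee a numeric meaning: a column of integer codes can still be categorical.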

Let’s take a look at what some of these columns mean.

LotArea: Size of the lot in square feet

MSZoning: Identifies the general zoning classification of the sale.

A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density

OverallCond: Rates the overall condition of the house

10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor

KitchenQual: Kitchen quality

Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor

YrSold: Year Sold (YYYY)

SalePrice: Sale price of the house in dollars
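Some of these codes have a natural order (KitchenQual runs from Po up to Ex), while others, like MSZoning, do not. As a hedged sketch, one way to record that order in pandas is an ordered Categorical. The sample values below are made up, and the `quality_order` list is an assumption taken from the description above.

```python
import pandas as pd

# Made-up sample of KitchenQual codes
kitchen = pd.Series(["Gd", "TA", "Ex", "Fa", "TA"], name="KitchenQual")

# Worst-to-best order, following the data description (assumed here)
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# An ordered categorical keeps the labels but knows their ranking
kitchen_ordered = pd.Categorical(kitchen, categories=quality_order, ordered=True)

# Integer position of each value within quality_order
codes = kitchen_ordered.codes.tolist()
print(codes)
```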

Let’s summarize the features using .describe(). By default it only covers the numeric columns, which is why the output below has 38 columns rather than 81.

# Use .describe()
df.describe()

8 rows × 38 columns

Plotting Categorical Data

We’ll pick six categorical variables and plot each one against SalePrice in its own bar graph.

First we need to import the necessary libraries.

import matplotlib.pyplot as plt
%matplotlib inline

Now we will create bar plots.
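A minimal sketch of one way to draw such plots, assuming each bar shows the mean SalePrice for one category. The `toy` dataframe holds made-up values, and the two columns in `categorical` are an illustrative choice, not necessarily the six variables used in this notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the Ames data (made-up values)
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "FV", "RM", "RL"],
    "KitchenQual": ["Gd", "TA", "Gd", "Ex", "TA", "TA"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Hypothetical selection of categorical columns to plot
categorical = ["MSZoning", "KitchenQual"]

# One subplot per categorical column
fig, axes = plt.subplots(nrows=1, ncols=len(categorical), figsize=(10, 4))

for col, ax in zip(categorical, axes):
    # Average sale price within each category of this column
    mean_price = toy.groupby(col)["SalePrice"].mean()
    ax.bar(mean_price.index, mean_price.values)
    ax.set_title(col)
    ax.set_ylabel("Mean SalePrice")

fig.tight_layout()
```

With the real data, replacing `toy` with df and extending `categorical` to six column names would produce one bar graph per variable.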

Creating dummy variables

Creating dummy variables is a way to use categorical variables in models that only accept numerical values. We will go more in depth with dummy values later, but the basic idea is it converts categorical variables into binary values.

# Create dummy variables for your six categorical features
# (categorical is the list holding the names of those six columns)
dummies = pd.get_dummies(df[categorical], prefix=categorical, drop_first=True)
dummies.head()
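To see what this call produces, here is a small sketch on made-up data. The `toy` dataframe and the two-column `categorical` list are illustrative assumptions; the get_dummies call itself mirrors the one above.

```python
import pandas as pd

# Toy stand-in for the Ames data (made-up values)
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "FV", "RL"],
    "KitchenQual": ["Gd", "TA", "Ex", "TA"],
})

# Hypothetical selection of categorical columns
categorical = ["MSZoning", "KitchenQual"]

# One binary column per category; drop_first=True drops the first
# category of each feature so it acts as the baseline
dummies = pd.get_dummies(toy[categorical], prefix=categorical, drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```

Each feature with three categories yields two dummy columns. A row whose dummies are all zero (the third row here, with FV and Ex) belongs to the dropped baseline category of each feature.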
