Dealing missing data is an important step. Missing data can cause issues when trying to run models, visualizing the data, calculating summary statistics, or even when trying to convert the data type.
In order to deal with missing data, we first need to know how to find it. There are a few different approaches to dealing with the missing data.
NaN, short for “not a number,” or null values are probably the easiest missing values to detect. Pandas has built in ways to check for NaNs.
The code above returns a matrix of boolean values where if a cell contains a NaN values it returns True, and other cell containing valid values returns False.
This can be hard to go through if you have a large dataframe. If you just want to check how many null values are in your dataframe, you can use
Since True returns a 1 and False returns a 0, .sum() can be used to count the number of null values in each column.
We’ll take a look at this using the Super Hero Dataset from Kaggle.
First we need to take a quick look at the the dataframe we will be using:
First we are just going to look at the boolean matrix.
Looking at this isn’t very helpful. It looks like there aren’t any null values, but that came be misleading so we should look at
Here we can see that there are null values in Publisher and weight.
Now that we know where the null values are, we need to look for nonsense placeholder values that could mess with analysis later on.
Just looking at this segment of the data, we can see that there is at least one character that has a weight of -99.0. This value does not make sense in the terms of weight so it warrants looking into.
We can see that there are quite a few characters with a weight of -99.0.
Dealing with the missing data
There are a few different ways you can deal with missing data. You can remove it, replace it, or keep it.
In the case of the missing publisher information, since there is only 15 missing publishers, we can just drop those rows.
Now we need to deal with the weight column.
First we are going to convert the -99.0 into NaNs.
Now we can decide how to deal with them. Since there are a lot of rows with NaNs now, we shouldn’t just drop those rows. One way to deal with this values is replace the missing values with the median weight.
Now we just need to check to make sure everything worked like we wanted.