A Look into Freaky Franchise’s Rotten Tomatoes Competition

Cris
7 min read · Jul 4, 2021

Podcasts are all the rage, especially in these weird times when many people have more free time than they used to. Today we are going to look at some data gathered from one particular podcast, Freaky Franchise, where the hosts “unmask horror movies based on quantity over quality.” I strongly suggest checking out Freaky Franchise if you are into horror movies.

In the first part of each episode, the two hosts have a friendly competition where they guess the Rotten Tomatoes scores of the movies they are discussing, with the loser having to sum up the movie in under a minute. In this post, we are going to look at the data surrounding this competition and clean the data set so it can be used in a model later.

First, we need to load the data set and see what we are working with:

import pandas as pd
ff_data = pd.read_csv('Freaky_Franchise_data.csv')
ff_data

Now let’s check whether there is any missing data.

ff_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 10 columns):
# 72 non-null object
Episode Title 75 non-null object
Date Aired 72 non-null object
Cordie 63 non-null object
Theo 63 non-null object
Difference in scores 62 non-null object
RT Score 62 non-null object
Goes First 60 non-null object
Winner 60 non-null object
Notes 9 non-null object
dtypes: object(10)
memory usage: 6.0+ KB

From this summary we can see a few things we will have to do to the DataFrame before we start using it to create statistical information.

  • First, we can set the index to the episode number (the # column).
  • It seems like there is a second table at the bottom that we should remove before continuing.
  • Cordie, Theo, Difference in scores, and RT Score are listed as objects, while we will need them as floats or integers.
  • In the same vein, we may want to convert Date Aired to datetime.
  • There are also some null values that we will have to deal with.
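
The Date Aired conversion from the list is not shown in this post. A minimal sketch of how it could be done with pd.to_datetime, assuming the dates are stored as strings (the sample values below are made up, and the real format in the CSV may differ):

```python
import pandas as pd

# Hypothetical sample values; the real 'Date Aired' format may differ.
dates = pd.Series(['2020-01-05', '2020-01-12', 'not a date'], name='Date Aired')

# errors='coerce' turns anything that cannot be parsed into NaT instead of raising.
aired = pd.to_datetime(dates, errors='coerce')

print(aired.dtype)
print(aired.isna().sum())
```

Checking how many values end up as NaT afterwards is a quick way to spot rows whose dates were entered in an unexpected format.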

Scrubbing the data for modeling

First we are going to drop the extra table on the bottom.

# Drop the rows without an episode number by keeping only the rows
# where the '#' column is not empty.
ff_data = ff_data[ff_data['#'].notna()]
ff_data.tail()
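
To see what the notna() filter is doing, here is a small sketch on a made-up frame (the titles and the stray "Totals" row are invented for illustration, not taken from the real data):

```python
import pandas as pd

# Made-up frame: three episode rows followed by a stray row with no episode number.
toy = pd.DataFrame({
    '#': ['1', '2', '3', None],
    'Episode Title': ['A', 'B', 'C', 'Totals'],
})

# notna() builds a boolean mask; indexing with it keeps only the rows marked True.
mask = toy['#'].notna()
cleaned = toy[mask]

print(mask.tolist())
print(len(cleaned))
```

The mask is False only for the row with the missing episode number, so that row is the only one dropped.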

Next, we can set the index to the episode number:

ff_data.set_index("#", inplace=True)
ff_data

Let’s look at the null values and decide what to do with them:

ff_data.isnull().sum()
Episode Title 0
Date Aired 0
Cordie 12
Theo 12
Difference in scores 12
RT Score 12
Goes First 12
Winner 12
Notes 63
dtype: int64

Six of these columns have the same number of null values. This could be a coincidence, or the null values could all be in the same rows. We should look deeper into that, since it could help us decide how to deal with the null values.

# First we are going to just look at rows that have null values
ff_data[ff_data.isnull().any(axis=1)]
# This produces more rows than we want, because most rows have no notes.
# We want to see whether the 12 nulls fall on the same rows.
# To check this we are going to create a new df without the Notes column
no_notes = ff_data.copy()
no_notes.drop(labels='Notes', axis=1, inplace=True)
no_notes.head()

Run the same code again with no_notes to see all rows with null values:

no_notes[no_notes.isnull().any(axis=1)]

We can see that, as we suspected, the 12 null values all fall on the same rows. These episodes are mostly retrospectives and specials, which we can guess (and I can confirm from listening to them) did not include the competition. Since the main thing we are looking at in this post is the Rotten Tomatoes competition, we can safely drop these rows without losing any relevant data.
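
Instead of checking by eye, the same conclusion could be confirmed in code. A rough sketch on a made-up stand-in for ff_data (the values and the shortened column list are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for ff_data with the same kind of gap pattern.
toy = pd.DataFrame({
    'Cordie': [50, np.nan, 30],
    'Theo': [40, np.nan, 35],
    'Winner': ['Cordie', None, 'Theo'],
    'Notes': [None, 'Retrospective', None],
})

cols = ['Cordie', 'Theo', 'Winner']
any_null = toy[cols].isnull().any(axis=1)  # at least one of the columns is null
all_null = toy[cols].isnull().all(axis=1)  # every one of the columns is null

# If the two masks are identical, the nulls always occur together on the same rows.
same_rows = any_null.equals(all_null)
print(same_rows)
```

If the two masks differed, some row would have only part of its competition data missing, and dropping by a single column would no longer be safe.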

We can use the same method we used to remove the extra table. Since the null values span the whole row, we only need to filter on one column.

# .copy() avoids a SettingWithCopyWarning when we assign to columns later
ff_data = ff_data[ff_data['Cordie'].notna()].copy()
ff_data.head()

Let’s look at ff_data.info() again to check that it worked:

ff_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 1 to 71
Data columns (total 9 columns):
Episode Title 60 non-null object
Date Aired 60 non-null object
Cordie 60 non-null object
Theo 60 non-null object
Difference in scores 60 non-null object
RT Score 60 non-null object
Goes First 60 non-null object
Winner 60 non-null object
Notes 9 non-null object
dtypes: object(9)
memory usage: 4.7+ KB

Now that we have the data we will be working with, we need to convert it into a format we can work with.

Using a for loop, we will convert the score columns to numeric types. First, create a list of the column names that we need to convert:

columns = ['Cordie', 'Theo', 'Difference in scores', 'RT Score']
# Loop through the columns and convert each one to a numeric dtype;
# values that cannot be parsed become NaN
for x in columns:
    ff_data[x] = pd.to_numeric(ff_data[x], errors='coerce')
ff_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 1 to 71
Data columns (total 9 columns):
Episode Title 60 non-null object
Date Aired 60 non-null object
Cordie 60 non-null int64
Theo 59 non-null float64
Difference in scores 59 non-null float64
RT Score 59 non-null float64
Goes First 60 non-null object
Winner 60 non-null object
Notes 9 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 4.7+ KB

Here we can see that ‘Theo’, ‘Difference in scores’, and ‘RT Score’ each have one fewer non-null value than before. That most likely means there was a non-numeric filler value, which was converted to a null when we coerced the errors. Given that, we need to check for null values again and decide what to do with them.
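
The effect of errors='coerce' is easy to see on a small made-up series. In this sketch the '-' is only a guess at what a filler value might look like; the actual filler in the spreadsheet is not shown here:

```python
import pandas as pd

# Made-up guesses; '-' stands in for whatever filler the spreadsheet used.
raw = pd.Series(['45', '60', '-', '12'])

converted = pd.to_numeric(raw, errors='coerce')

# Entries that were present before conversion but are NaN afterwards were coerced.
coerced = raw[raw.notna() & converted.isna()]

print(converted.tolist())
print(coerced.tolist())
```

Comparing the column before and after conversion like this shows exactly which original values were turned into nulls, which helps when deciding how to fill them.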

Checking again for nulls using the same method as above:

no_notes = ff_data.copy()
no_notes.drop(labels='Notes', axis=1, inplace=True)
no_notes[no_notes.isnull().any(axis=1)]

It looks like there is one episode where Theo’s guess is not listed, and thus the difference is not listed, and another episode where no Rotten Tomatoes score is listed. Both of these episodes have winners, so we shouldn’t drop them outright. Since it is just three null values, we are going to replace them with probable answers using the other data we have.

# Cordie won the Sleepaway Camp IV episode with a guess of zero, and a simple search
# shows that the movie does not have an RT score, so we will replace the null with 0

ff_data['RT Score'] = ff_data['RT Score'].fillna(0)

For Jason Lives, we know Theo won, so we will fill his guess with the RT Score. Then we fill the difference with the absolute difference between it and Cordie’s guess.

ff_data['Theo'] = ff_data['Theo'].fillna(ff_data['RT Score'])
ff_data['Difference in scores'] = ff_data['Difference in scores'].fillna(
    abs(ff_data['Cordie'] - ff_data['Theo']))
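
Passing a Series to fillna works here because pandas lines the values up by index, so each gap is filled from its own row. A small sketch with made-up numbers and episode labels:

```python
import pandas as pd
import numpy as np

# Made-up rows; the second one is missing Theo's guess and the difference.
toy = pd.DataFrame({
    'Cordie': [50, 30],
    'Theo': [40.0, np.nan],
    'Difference in scores': [10.0, np.nan],
    'RT Score': [45.0, 52.0],
}, index=['Ep A', 'Ep B'])

# fillna with a Series matches on the index, so only the gaps are touched
# and each gap takes the value from its own row.
toy['Theo'] = toy['Theo'].fillna(toy['RT Score'])
toy['Difference in scores'] = toy['Difference in scores'].fillna(
    (toy['Cordie'] - toy['Theo']).abs())

print(toy)
```

Rows that already had values keep them, which is why the order matters: Theo has to be filled first so the difference can be computed from it.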

Check for nulls once again:

ff_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 1 to 71
Data columns (total 9 columns):
Episode Title 60 non-null object
Date Aired 60 non-null object
Cordie 60 non-null int64
Theo 60 non-null float64
Difference in scores 60 non-null float64
RT Score 60 non-null float64
Goes First 60 non-null object
Winner 60 non-null object
Notes 9 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 4.7+ KB

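One last thing we will do before we start running tests and models is create boolean columns for who went first and who won, using one-hot encoding. To make the columns easier to work with, we first replace the spaces in the column names with underscores: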

ff_data.columns = ff_data.columns.str.replace(' ', '_')
ff_data.head()

We are going to keep only the columns with data that will affect the model, then one-hot encode the text columns:

feats = ['Cordie','Theo','Difference_in_scores','RT_Score','Goes_First', 'Winner']
ff_data = ff_data[feats]
ff_data = pd.get_dummies(ff_data, drop_first=True)
ff_data
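
As a rough illustration of what get_dummies with drop_first=True produces, here is a sketch on a few made-up rows (the values are invented, and the real columns may contain other categories):

```python
import pandas as pd

# Made-up rows with the two text columns from the post.
toy = pd.DataFrame({
    'Cordie': [50, 30, 70],
    'Goes_First': ['Cordie', 'Theo', 'Cordie'],
    'Winner': ['Theo', 'Theo', 'Cordie'],
})

# Each text column becomes one indicator column per category. drop_first=True
# drops the first category (in sorted order), which is implied when the others are 0.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

With only two hosts, each text column collapses to a single indicator column for Theo, and a 0 (or False) there means Cordie. Numeric columns pass through unchanged.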

Now the data is cleaned and ready to have models run on it. We will tackle running various models on it at a later date.
