For this exercise, we are going to use a Macbeth transcript from Project Gutenberg to learn how to tally repeated words within a string and create a plot that shows the 25 most used words in Macbeth. This will involve learning how to count and sort using Python.
We will pull the manuscript using the Python package requests. We will also be using numpy and matplotlib.pyplot, so we will import those as well.
import requests
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Now that we have imported the packages, we can pull the manuscript and save it as a variable.
macbeth = requests.get('http://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
Before we continue, we should know what we are working with.
print(type(macbeth))
print(len(macbeth))
As we can see, we are working with a string of 120,253 characters. Now let’s preview the manuscript. This will give us an idea of what we may need to do.
print(macbeth[:500])
It looks like there will be a bit of cleaning to be done before we can begin to count and sort the most used words.
The first thing we need to do to ensure an accurate count is remove everything that comes before the start of the actual play. To find exactly where in the manuscript this occurs, we need to look through the document itself.
print(macbeth[:20000])
It looks like the play starts after “David Reed.” Before we split the string at “David Reed,” we should make sure it doesn’t appear elsewhere in the manuscript. This can be done a few ways. We are going to do it by splitting the string into a list of words at the whitespace. Once we have the word list, we can create a dictionary to keep a tally of how many times each word is used.
First we need to make sure all the characters are lowercase. This will help to make sure we don’t accidentally miss any words that might have different capitalization.
# Convert the transcript to lowercase before splitting it into words
macbeth_lower = macbeth.lower()
Now we need to remove any punctuation for the same reason.
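# remove punctuation marks: 6 more 'the' when removed
# (raw string so the backslash is kept as a literal character)
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
# loop over the punctuation set rather than over every character of the text
for p in punctuations:
    macbeth_lower = macbeth_lower.replace(p, "")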
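The replacement loop above can also be written as a single pass with str.translate. The following is a minimal sketch of that alternative, not the method used in the rest of this walkthrough; the sample line is a hypothetical stand-in for the downloaded text.

```python
# Same punctuation set as in the walkthrough; the raw string keeps the backslash literal.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# Build a translation table whose third argument lists the characters to delete.
strip_table = str.maketrans('', '', punctuations)

# Hypothetical sample line, not taken from the downloaded file.
sample = "Fair is foul, and foul is fair: hover through the fog!"
cleaned = sample.lower().translate(strip_table)
print(cleaned)
```

Because translate walks the string once, it avoids creating a new copy of the whole text for every punctuation character.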
We can finally create our first word list and word_count dictionary.
word_list = macbeth_lower.split()
# Create a dictionary
word_count = {}
# Iterate through the text of Macbeth
for word in word_list:
    word_count[word] = word_count.get(word, 0) + 1  # Update word counts
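To see what the tally loop does, here is the same pattern on a short, hypothetical word list. The call get(word, 0) returns the current count, or 0 if the word has not been seen yet, so the first occurrence is stored as 1.

```python
# Hypothetical sample list, standing in for word_list.
words = ["the", "thane", "of", "cawdor", "the", "king"]

tally = {}
for word in words:
    # get() falls back to 0 for a word that is not yet a key
    tally[word] = tally.get(word, 0) + 1

print(tally)
```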
Now that we have the word_count, we can check to see how many times “david” appears.
print(word_count['david'])
1
Since ‘david’ only appears once in the manuscript, we can split the original string macbeth at “David Reed” to get rid of everything that comes before the actual play.
# split the string at "David Reed"
just_macbeth = macbeth.split('David Reed')
# we just need the second string in the list of strings we created above
just_macbeth = just_macbeth[1]
# check that just_macbeth is a string
print(type(just_macbeth))
# print the first 700 characters to be sure it removed all but the play
print(just_macbeth[:700])
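As a cross-check on the split above, the exact phrase can also be counted directly with str.count, which looks at the full two-word marker rather than the single word “david.” The sketch below assumes a short, hypothetical stand-in for the downloaded text and uses str.partition to keep everything after the marker.

```python
# Hypothetical stand-in for the downloaded text; the real file is much longer.
sample = ("Project Gutenberg header... prepared by David Reed\n\n"
          "Actus Primus. Scoena Prima.\nThunder and Lightning.")

marker = "David Reed"
occurrences = sample.count(marker)  # non-overlapping occurrences of the exact phrase

if occurrences == 1:
    # partition splits at the first occurrence and returns (before, marker, after)
    _, _, play_text = sample.partition(marker)
else:
    # leave the text untouched if the marker is missing or ambiguous
    play_text = sample

print(occurrences)
print(play_text.strip()[:30])
```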
It all looks good, so we can continue on to count the top 25 used words in Macbeth. We are going to use the same method we used above to get a dictionary of word_count. This time we will use the string “just_macbeth.”
# Split the transcript into words
# convert string to lowercase (using just_macbeth, not the full file)
macbeth_lower = just_macbeth.lower()
# remove punctuation marks
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
for p in punctuations:
    macbeth_lower = macbeth_lower.replace(p, "")
# create word_list
word_list = macbeth_lower.split()
# Create a dictionary
word_count = {}
# Iterate through the text of Macbeth
for word in word_list:
    word_count[word] = word_count.get(word, 0) + 1
Now we need to sort word_count by the number of times each word is used.
# Sort words by counts in descending order
word_count_sorted = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
From the sorted list we can separate the top 25.
top_25_words = word_count_sorted[:25]
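The counting, sorting, and slicing steps above can also be condensed with collections.Counter from the standard library. This is a sketch of that alternative on a hypothetical cleaned sample, not the approach used in this walkthrough; most_common(n) returns the n highest counts as (word, count) pairs.

```python
from collections import Counter

# Hypothetical cleaned sample, standing in for the lowercased, punctuation-free play text.
sample = "tomorrow and tomorrow and tomorrow creeps in this petty pace"

word_count = Counter(sample.split())  # tallies each word in one pass
top_3 = word_count.most_common(3)     # (word, count) pairs, highest count first
print(top_3)
```

On the real text, word_count.most_common(25) would play the role of top_25_words.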
Now that we have our sorted list, we can finally graph the top 25 used words in Macbeth.
# Create Bar Graph
y = [ys[1] for ys in top_25_words]
top_25_words = [xs[0] for xs in top_25_words]
x = np.arange(len(y))
# Include descriptive titles and labels
plt.figure(figsize=(12, 8))
# Plot vertical bars of fixed width by passing x and y values to .bar() function
plt.bar(x, y)
plt.xlabel('Word')
plt.ylabel('Number of Times Used')
plt.xticks(x, top_25_words, rotation=90)
# Give a title to the bar graph
plt.title("Top 25 Used Words in 'Macbeth'")
# Output the final plot
plt.show()
It looks like the most used word is “the” followed by “and.”