I've chosen to explore the TMDb movie data set for my project's analysis. As noted in the overview and notes, "This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue."
- Genre popularity
- Year
- Adjusted Budget
- Adjusted Revenue
- Director
#Import pyplot with alias 'plt' for creating visualizations
import matplotlib.pyplot as plt
#Import numpy for various 1d/2d mathematical functions and data structures
import numpy as np
#Import pandas for additional data structures, mathematical functions, and plotting helpers
import pandas as pd
#Import seaborn with alias 'sns' to enhance pyplot visualizations
import seaborn as sns
#Change the default pandas float formatting (for more legible entries in the budget/revenue columns of the TMDb data)
pd.options.display.float_format = '{:,.2f}'.format
#Magic word so matplotlib renders visualizations inline within the Jupyter notebook cells
%matplotlib inline
Here I used the built-in read_csv function of pandas to bring the TMDb csv file into a dataframe.
#Reading the tmdb csv file with pandas into a new dataframe named tmdb_df
tmdb_df = pd.read_csv('tmdb-movies.csv')
Taking a look at the newly formed dataframe:
# Calling the head function of the tmdb dataframe to return the first 3 lines of the dataframe
tmdb_df.head(3)
Below we call the pandas info function on the dataframe to see the data types of the columns. We can see there are:
- 4 columns with dtype float64
- 6 columns with dtype int64
- 11 columns with dtype object
Some of the columns have fewer than 10,866 non-null entries. This means these fields contain some NaN values, so some of their data may be missing or errant.
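As a quick complement to info(), the NaN counts can also be tallied directly. This is a minimal sketch on an invented two-column stand-in frame (the real tmdb_df has 10,866 rows and 21 columns):

```python
import pandas as pd

# Tiny stand-in frame with one missing imdb_id (values invented)
df = pd.DataFrame({'imdb_id': ['tt0369610', None, 'tt1392190'],
                   'budget': [150000000, 110000000, 0]})

# isnull().sum() gives the NaN count per column -- the complement of
# the non-null counts that info() reports
missing = df.isnull().sum()
print(missing)
```

The same one-liner on tmdb_df would show exactly which columns fall short of 10,866 non-null entries.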
#Info function on dataframe to see data types and structure of the dataframe
tmdb_df.info()
While cleaning I need to decide if the initial datatypes from the csv import are a good fit:
- id: int64 seems fine, though I could change it to an object, since mathematical operations on the id values yield nothing meaningful (e.g. adding two id values together)
- imdb_id, original_title, homepage, director, tagline, overview: object seems good since these contain a mix of letters and numbers
- popularity, vote_average, budget_adj, revenue_adj: float64 looks good since they're numbers with decimals
- budget, revenue, runtime, vote_count: int64 looks good since we may want to perform calculations on these numbers
- cast, genres, keywords, production_companies: object, but may need to be cleaned since there are multiple values separated by pipe characters
- release_date: object, which we will likely want to change to the datetime64 dtype
- release_year: int64 is likely fine since we will have release_date as a datetime dtype after cleaning

So initially, I will change the dtypes of the id and release_date fields.
#Fixing dtypes
#Here we assign the id field to be an object datatype
tmdb_df['id'] = tmdb_df.id.astype(object)
#Set release_date to be a pandas datetime dtype
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
#verifying dtype changes were applied
tmdb_df.loc[:,['id', 'release_date']].dtypes
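As a minimal illustration of what pd.to_datetime does here (sample month/day/year date strings invented):

```python
import pandas as pd

# Two dates as month/day/year strings, similar to the raw CSV values
s = pd.Series(['6/9/2015', '12/25/1977'])

# to_datetime parses the strings into a datetime64[ns] Series,
# which unlocks the .dt accessor for year/month/day components
dates = pd.to_datetime(s)
print(dates.dt.year.tolist())  # [2015, 1977]
```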
Below we create a function, clean_multivalue, to avoid repetitive code (rubric criteria, Code Functionality: "Does the project use good coding practices?").
#For using in dataframe apply function on multi-value fields
def clean_multivalue(s):
    #Split the pipe-delimited string into a list of values
    return s.split('|')
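A quick sanity check of the helper on a sample pipe-delimited genres string (the value is invented for illustration):

```python
def clean_multivalue(s):
    # Split a pipe-delimited field into a list of its values
    return s.split('|')

print(clean_multivalue('Action|Adventure|Science Fiction'))
# → ['Action', 'Adventure', 'Science Fiction']
```

Note that an empty string splits to [''], which is why the empty category is renamed to 'None' further down.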
#Looking over before apply
tmdb_df[['cast', 'genres', 'keywords', 'production_companies']].head(2)
From the info function we saw that the cast, genres, and keywords columns had some null entries, which could cause problems later. To alleviate this issue we first apply the fillna function to the columns and then apply the clean_multivalue function we just created.
#Use fillna to replace NaN values with an empty string, using inplace to avoid an assignment statement
tmdb_df.fillna(value="", inplace=True)
#applymap used to split the multi-valued fields into lists
tmdb_df[['cast', 'genres', 'keywords', 'production_companies']] = tmdb_df[['cast', 'genres', 'keywords', 'production_companies']].applymap(clean_multivalue)
#Check the updated structure
tmdb_df.info()
#Looking over after apply
tmdb_df[['cast', 'genres', 'keywords', 'production_companies']].head(2)
For the questions we're trying to answer, many of the data columns are not needed. We will trim unused columns and create a new dataframe that only includes the fields needed to answer its respective question.
# Creating dataframe with only genres and release year as columns
genre_yr_df = tmdb_df[['genres','release_year']]
genre_yr_df.head(2)
In order to analyze the categories we need to break down the lists from our applymap earlier. We will do this with the get_dummies function to create a new frame that has a column for each category, with a 1 or 0 indicating whether the category was present for that particular entry.
#New dataframe clean_cat to hold the decompressed categories
#groupby(level=0).sum() collapses the stacked dummies back to one row per movie
clean_cat = pd.get_dummies(genre_yr_df.genres.apply(pd.Series).stack()).groupby(level=0).sum()
The logic of this one-liner is explained in this Stack Overflow answer: https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies
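To see what the one-liner is doing, here is the same pipeline run on a tiny hand-made frame (genre lists invented):

```python
import pandas as pd

toy = pd.DataFrame({'genres': [['Drama', 'Comedy'], ['Drama']]})

# Step 1: expand each list into its own cells, then stack into a
# MultiIndexed Series keyed by (original row, position within list)
stacked = toy.genres.apply(pd.Series).stack()

# Step 2: one-hot encode every genre, then collapse back to one row
# per original movie by summing within the outer index level
dummies = pd.get_dummies(stacked).groupby(level=0).sum()
print(dummies.loc[0, 'Drama'], dummies.loc[1, 'Comedy'])  # 1 0
```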
#Renaming empty field to None
clean_cat.rename(columns = {'':'None'}, inplace=True)
#Taking a look at our new category dataframe
clean_cat.head()
# Verifying the number of rows is correct
clean_cat.shape[0]
# Taking a quick look at the counts of each category
clean_cat.sum().sort_values(ascending=False)
To make our graphs a bit more legible we'll take a look at only the top 5 categories overall, across the years:
#Forming a new data frame in case we want to go back and do something with the original clean_cat dataframe
rq2 = pd.DataFrame(clean_cat,copy=True)
Here we go ahead and add the year column back onto the categories
#inserting the release year column in the first position
rq2.insert(loc = 0, column = 'release_year', value = genre_yr_df['release_year'])
#verifying insert
rq2.head(2)
Since we want to look at the data on a year-to-year basis, we will perform a groupby on the dataframe by release_year with the sum function to get a total of occurrences of each category for each year.
#create new dataframe from groupby
rq3 = rq2.groupby('release_year').sum()
#verify new dataframe looks like we want it
rq3.head()
To make analysis easier we will create a dataframe which has the ratios of each category relative to its year's total number of categories.
This is to alleviate the issue of there being a dramatic increase in the number of movies made each year up until recent years.
#Here we create a series that has the total of each of the rows
row_totals = rq3.sum(axis=1)
#We then divide the data frame by that series to get relative percents of categories for each year
percentage_df = rq3.div(row_totals, axis='index')
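The same normalization in miniature, with invented counts for two years:

```python
import pandas as pd

counts = pd.DataFrame({'Drama': [4, 6], 'Comedy': [1, 4]},
                      index=[1990, 1991])

# Total categories per year (sum across the columns)...
totals = counts.sum(axis=1)

# ...then divide each row by its own total so every year's ratios sum to 1
ratios = counts.div(totals, axis='index')
print(ratios.loc[1990, 'Drama'])  # 0.8
```

The axis='index' argument is what aligns the yearly totals against the rows rather than the columns.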
#number of years with data
percentage_df.shape[0]
#Verifying the new dataframe looks good
percentage_df.tail()
This percentage_df now looks formatted for analysis, so we will make a graph to look at the trends over the years.
#Creating a plot for the top 5 Genres over the years
ax = percentage_df[['Drama','Comedy','Thriller','Action','Romance']].plot(title="Most Popular Genres by Year")
ax.set_ylabel("Percent of total films in year")
ax.set_xlabel("Release Year")
As we can see in the graph it appears that drama is the most popular genre most years, however there are a few years where Comedy is more popular. Below we take a look at which years Comedy surpasses Drama.
#Which years is Comedy more popular than Drama?
percentage_df[percentage_df.Comedy > percentage_df.Drama]
Here we take a look at a stacked bar chart to visualize the data another way. Note that the ratios won't add up to 1 since we've omitted genres outside the top 5.
ax = percentage_df[['Drama','Comedy','Thriller','Action','Romance']].plot.bar(stacked=True,
                                                                              title="Most Popular Genres by Year")
ax.set_ylabel("Ratio Of total films in year")
ax.set_xlabel("Release Year")
One more angle to look at the data from. Here we use a histogram to see the relative frequencies at which each category sits at a particular ratio of the yearly total.
Drama is clearly top dog. Interestingly, we can see here that even in Romance's peak year of popularity (1967) it still wasn't more popular than Comedy or Drama in their least popular years.
percentage_df[['Drama','Comedy','Thriller','Action','Romance']].plot.hist(title="Histogram of Ratio frequencies")
#Finding the exact year Romance was at its peak popularity
percentage_df[percentage_df.Romance > .1]
Genre popularity could be a good way to analyze whether a movie would be profitable, which led me to wonder: what percent of these movies end up turning a profit?
Below we will trim our original dataframe from the csv so that we only have the fields we need to answer this question.
To determine profit we'll use the budget_adj and revenue_adj columns. Additionally we'll bring in original_title to use as the index, so that we can look for anomalies in the data a bit more easily.
#creating a new dataframe to hold only the relevant fields
#(.copy() avoids a SettingWithCopyWarning when we modify this frame later)
budget_rev_df = tmdb_df[['original_title','budget_adj','revenue_adj']].copy()
#setting the index
budget_rev_df.set_index('original_title',inplace=True)
#Verifying the new data frame structure
budget_rev_df.info()
Above, we can see that the data doesn't have any null values, but it may have zero values that could skew the data.
#Checking for number of rows with zero values
budget_rev_df[(budget_rev_df.budget_adj == 0) |
(budget_rev_df.revenue_adj == 0)].shape[0]
Yikes! Unfortunately 7,011 of the 10,866 rows in the dataframe have a zero value in one or both of the columns. Luckily we'll still have 3,855 rows to work with.
#Update our dataframe to only have rows that do not contain zero values
budget_rev_df = budget_rev_df[(budget_rev_df.budget_adj != 0) &
(budget_rev_df.revenue_adj != 0)]
#verifying our update went through
budget_rev_df.info()
That should help with accuracy when analyzing the data. Now let's perform a describe function to look at some of the current statistics of the dataframe.
#Checking current state of dataframe
budget_rev_df.describe()
So it seems there are still some minor issues with the data. Notably, the 'min' values for budget and revenue seem way too low. We'll take a look at records under 1000 in either column to see if we can get to the bottom of this.
I'm looking at values below 1000 specifically because I think some of the values might be listed in 'number of millions'. Since, to my knowledge, no movie in the set has a budget or revenue above a billion dollars (1,000 million), I decided on 1000 as the threshold to use.
#Get a filtered look at the data frames lower values
budget_rev_df[(budget_rev_df.budget_adj < 1000) | (budget_rev_df.revenue_adj < 1000)]
It looks like some of the values here are being listed as 'number of millions' instead of the literal number. I can't be certain that's true in all the cases, but I know, for instance, that "The Karate Kid, Part II" did not have an inflation-adjusted budget of $224. I could either drop all these records or try to upscale them to what I think is the intended value. Since we've already lost over 7,000 records that contained zero entries, I will opt for the latter option.
To do this I will first create a function for converting to millions, and then use an applymap with that function on the dataframe to fix the relevant entries.
#conversion function to use in applymap
def convert_to_millions(x):
    #Values of 1000 or more are assumed to already be literal dollar amounts
    if x >= 1000:
        return x
    #Values under 1000 are assumed to be 'number of millions', so upscale
    return x*1000000
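A quick check of the conversion rule on both regimes (input values chosen arbitrarily):

```python
def convert_to_millions(x):
    # Values of 1000 or more are assumed to already be literal dollars;
    # anything below is assumed to be a 'number of millions'
    if x >= 1000:
        return x
    return x * 1000000

print(convert_to_millions(224))       # 224000000
print(convert_to_millions(25000000))  # 25000000 (unchanged)
```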
#Use applymap with our convert function we just defined to fix the data
budget_rev_df = budget_rev_df.applymap(convert_to_millions)
#Verify our applymap worked correctly
budget_rev_df[(budget_rev_df.budget_adj < 1000) | (budget_rev_df.revenue_adj < 1000)]
Good; as verified above, the offending entries were upscaled to what are hopefully their intended values. With the data fixed to my liking, we will now add a calculated profit column to the dataframe.
#Adding calculated field profit_adj to dataframe
budget_rev_df['profit_adj'] = budget_rev_df.revenue_adj.subtract(budget_rev_df.budget_adj)
#Verifying that our new frame looks good
budget_rev_df.sort_values('profit_adj', ascending=False).head()
We'll also generate another column that holds a truth value depending on whether the profit column is positive. I could define another function for the apply here, but it's so basic I think a lambda will suffice.
#Generate is_profit column with apply on the profit column
budget_rev_df['is_profit'] = budget_rev_df.profit_adj.apply(lambda x: x>0)
#Use value_counts with normalize to get the ratio of True to False in the 'is_profit' column
budget_rev_df.is_profit.value_counts(normalize=True)
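The value_counts(normalize=True) pattern in isolation, on an invented flag series:

```python
import pandas as pd

flags = pd.Series([True, True, True, False])

# normalize=True converts the raw counts into fractions of the whole,
# so the result sums to 1 across all observed values
ratios = flags.value_counts(normalize=True)
print(ratios[True])   # 0.75
print(ratios[False])  # 0.25
```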
As we see above, roughly 72% of the movies in the set are profitable; only 28% of the time do movies lose money. Now let's take a look at some visualizations.
#Bar chart to visualize whether a movie is profitable
budget_rev_df.is_profit.value_counts(normalize=True).plot(kind='bar',
                                                          title="Movie Profitability Bar Chart")
#creating a scatter plot to visualize the dataframe further
fig,ax = plt.subplots(1, 1, sharey=False, tight_layout=True, figsize=(5, 5))
ax.scatter(x=budget_rev_df.budget_adj,y=budget_rev_df.profit_adj)
ax.set_title("Scatter: Budget vs Profit")
ax.set_xlabel('Inflation Adjusted Budget (Billions)')
ax.set_ylabel('Inflation Adjusted Profit (Billions)')
Above we're able to see a fairly tight clustering with some potential outliers.
While we're looking at movie profits, I was wondering who the most profitable directors are. Below I quickly create a dataframe and graph to analyze that.
#Create a new dataframe with budgets and revenues over 1000
#(.copy() avoids a SettingWithCopyWarning when adding the profit column)
director_profit_df = tmdb_df[(tmdb_df.budget_adj > 1000) & (tmdb_df.revenue_adj > 1000)].copy()
#calculate profit column
director_profit_df['net_director_profit_df'] = director_profit_df.revenue_adj - director_profit_df.budget_adj
#Graph the data with a bar chart
grouped_director_df = director_profit_df.groupby('director').net_director_profit_df.sum()
ax = grouped_director_df.sort_values(ascending=False).head(10).rename('Total Profit').plot(
        title='Total Profit by Director',
        kind='bar',
        legend=True)
ax.set_ylabel("Inflation adjusted dollars in Billions")
ax.set_xlabel("Director")
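The groupby-then-sum step on its own, with invented profit numbers for two directors:

```python
import pandas as pd

toy = pd.DataFrame({'director': ['Spielberg', 'Spielberg', 'Cameron'],
                    'profit': [10, 20, 25]})

# Sum profit within each director's group, then order largest first
per_director = toy.groupby('director').profit.sum().sort_values(ascending=False)
print(per_director.index[0], per_director.iloc[0])  # Spielberg 30
```

On the real frame this is exactly the aggregation that feeds the bar chart above.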
Wow! Financially speaking, Steven Spielberg is in a league of his own, with more than double the total profits of second-place James Cameron.
From our analysis here we are able to gain insight on the TMDB data set and answer questions we've posed.
For the first question, "Which genres are most popular from year to year?", we were able to see that the top 5 genre categories were Drama, Comedy, Thriller, Action, and Romance.
Drama was the most popular genre in all but 8 of the 56 years for which there was data. The only genre ever more popular than Drama was Comedy. Notably, 4 of the 8 years in which Comedy was more popular than Drama were in the mid-to-late 80's (1985, 1987, 1988, 1989). Perhaps a particular set of comedic actors was popular during that time frame.
For the 2nd question, "How often does a movie's inflation-adjusted revenue not exceed its inflation-adjusted budget?", it was discovered that the vast majority of the time the movies from the TMDb data set do turn a profit: 72% of the movies on the list were profitable according to the data. Worth noting, there are some possible issues with the data, as a large number of entries were missing or had very low numbers for adjusted budget or adjusted revenue. To minimize errors from this, the dataset was groomed to remove entries with zero values, and for budget/revenue values less than 1000 it was assumed that the amounts listed were in 'number of millions'. With that assumption, those numbers below 1000 were multiplied by 1 million to get what seemed to be more accurate numbers.
We also found that, at least within this data set, the most profitable director is Steven Spielberg.
While a lot of interesting information could be gleaned from this data set, it became apparent during analysis that missing or incorrect values in a number of entries could limit the ability to draw accurate conclusions for various questions. I am sure that data sets like this one are in use by many movie studios to do predictive analysis on potential movie projects, determining ahead of time whether they will be profitable. I look forward to analyzing more data sets like this in the future and hope to continue finding answers to questions like the ones in this report.