Project Name: Movie Correlation Project | Python
- Igli Ferati
- Jul 11, 2023
Updated: Jul 28, 2023
Introduction
There are 6820 movies in the dataset (220 movies per year, 1986-2016). Each movie has the following attributes:
budget: the budget of a movie. Some movies don't have this, so it appears as 0
company: the production company
country: country of origin
director: the director
genre: main genre of the movie.
gross: revenue of the movie
name: name of the movie
rating: rating of the movie (R, PG, etc.)
released: release date (YYYY-MM-DD)
runtime: duration of the movie
score: IMDb user rating
votes: number of user votes
star: main actor/actress
writer: writer of the movie
year: year of release
Acknowledgements
This data was scraped from IMDb.
The project involves several steps, starting with the preparation of the data. The necessary libraries, including pandas, numpy, seaborn, and matplotlib, are imported. The dataset is then read from a CSV file into a pandas DataFrame. Initial data exploration is performed to check for missing values and data types.
Preparing the Data
We will import the libraries used in this project and load the CSV file from a local drive.
# First let's import the packages we will use in this project
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use('ggplot')
from matplotlib.pyplot import figure
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = [12.8, 6.4]
pd.options.mode.chained_assignment = None
# Now we need to read in the data
df = pd.read_csv(r'C:\Users\user\Desktop\DATA ANALYST MATERIALS\MOVIES\movies.csv')
# Let's look at the data
df.head()

# Let's see if there is any missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing)))
name - 0%
rating - 0%
genre - 0%
year - 0%
released - 0%
score - 0%
votes - 0%
director - 0%
writer - 0%
star - 0%
country - 0%
budget - 0%
gross - 0%
company - 0%
runtime - 0%
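One caveat about the loop above: `np.mean(df[col].isnull())` returns a fraction between 0 and 1, so `round()` reports 0 for any column with less than half of its values missing. A minimal sketch of a percentage-based check follows, run on a small made-up DataFrame. The `toy` data and the `missing_pct` name are illustrative assumptions, not part of the movies dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies DataFrame
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1.0e6, np.nan, 3.0e6, 4.0e6],
    "gross": [5.0e6, 6.0e6, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage before any rounding
missing_pct = toy.isnull().mean() * 100

for col, pct in missing_pct.items():
    print(f"{col} - {pct:.1f}%")
```

Applied to the real `df`, the same expression would show which columns actually contain missing values, which the rounded output can hide.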
# Data types for our columns
print(df.dtypes)
name         object
rating       object
genre        object
year          int64
released     object
score       float64
votes       float64
director     object
writer       object
star         object
country      object
budget      float64
gross       float64
company      object
runtime     float64
dtype: object
Next, the missing values in the budget and gross columns are filled with zeros, and the data types of these columns are changed to integers.
# We have some NA values, so we need to fill those rows before changing the data type.
df['budget'] = df['budget'].fillna(0).astype('int64')
df['gross'] = df['gross'].fillna(0).astype('Int64')
# Specify the columns whose data type we want to change.
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
df
# Now we create another column called 'yearcorrect', so we can have the correct year.
df['yearcorrect'] = df['released'].astype(str).str.split(' ').str[-3]
df
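Splitting on spaces and taking the third item from the end only works when every release string has the same number of words after the year. As a hedged alternative, assuming each release string contains a four-digit year somewhere, a regular expression can pull it out directly. The sample strings and the `year_extracted` name below are made up for illustration.

```python
import pandas as pd

# Hypothetical release strings in two plausible formats, plus a missing value
released = pd.Series([
    "June 13, 1980 (United States)",
    "1986-07-02",
    None,
])

# Capture the first run of four digits; rows without a match stay missing
year_extracted = released.str.extract(r"(\d{4})", expand=False)
print(year_extracted.tolist())
```

The result is still a string column, so it would need a numeric conversion before being compared with the `year` column.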

# Sort the data by the 'gross' column, highest first (sort_values returns a new DataFrame).
df=df.sort_values(by=['gross'], inplace=False, ascending=False)
# Display all rows of the table instead of a truncated view.
pd.set_option('display.max_rows', None)
# Drop duplicate rows (drop_duplicates returns a new DataFrame, so assign the result)
df = df.drop_duplicates()
# Scatter plot of gross earnings (x-axis) vs budget (y-axis)
plt.scatter(x=df['gross'], y=df['budget'])
plt.title('Budget vs Gross Earnings')
plt.xlabel('Gross Earnings')
plt.ylabel('Budget for Film')
plt.show()

Looking at the graph, the budget values are in the millions. The x-axis shows gross earnings and the y-axis shows the film budget.
# Plot budget vs gross using seaborn.
sns.regplot(x="gross", y="budget", data=df, scatter_kws={'color': 'red'}, line_kws={'color': 'blue'})

# Let's start looking at correlation (pandas supports several methods: pearson, kendall, spearman)
# Pearson is the default and is what we use for this analysis.
# numeric_only=True skips the string columns; pandas 2.0+ raises an error without it.
df.corr(method='pearson', numeric_only=True)

High correlation between budget and gross.
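To make the Pearson coefficient concrete, here is a minimal sketch that computes it by hand and compares it with the pandas result. The numbers are invented for illustration, and `pearson_r` is a hypothetical helper, not something from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical budget/gross figures (in millions), for illustration only
toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0, 50.0],
    "gross": [12.0, 35.0, 41.0, 90.0, 95.0],
})

def pearson_r(x, y):
    """Sum of co-deviations divided by the product of the deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual = pearson_r(toy["budget"], toy["gross"])
from_pandas = toy["budget"].corr(toy["gross"], method="pearson")
print(round(manual, 4), round(from_pandas, 4))
```

Values close to 1 mean the two columns rise together almost linearly, values close to -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.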
Further analysis involves calculating the correlation matrix using Pearson's method. The correlation matrix reveals the relationships between different numeric features of the movies. The correlations between variables such as year, score, votes, budget, gross, and runtime are calculated and visualized using a heatmap.
# Calculate the correlation matrix using Pearson's method
# numeric_only=True skips the string columns; pandas 2.0+ raises an error without it.
correlation_matrix = df.corr(method='pearson', numeric_only=True)
# Create a heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True)
# Set the title of the heatmap
plt.title('Correlation Matrix for Numeric Features')
# Set the x-axis label
plt.xlabel('Movie Features')
# Set the y-axis label
plt.ylabel('Movie Features')
# Display the heatmap
plt.show()
The heatmap visualizes the correlation matrix we calculated: lighter colors indicate higher correlation and darker colors indicate lower correlation.
To explore the relationship between the company column and other variables, the dataset is numerized by converting the categorical variables into numeric codes. The correlation matrix is recalculated, and a heatmap is generated to visualize the correlations.

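# We will look at the company
# Work on a copy so the original DataFrame keeps its string columns
df_numerized = df.copy()
for col_name in df_numerized.columns:
    if df_numerized[col_name].dtype == 'object':
        # Convert each string column to a category, then to its integer code
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name].cat.codes
df_numerized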
After running the loop above, every column in the table holds numbers instead of strings.
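To show what the category codes look like, here is a small sketch on made-up company names (the values are illustrative assumptions, not rows from the dataset).

```python
import pandas as pd

# Hypothetical company names, for illustration only
companies = pd.Series(["Warner Bros.", "Universal", "Warner Bros.", "Columbia"])

as_category = companies.astype("category")
codes = as_category.cat.codes

# Categories are stored in sorted order, so the codes follow alphabetical order
print(list(as_category.cat.categories))
print(codes.tolist())
```

Because the codes simply follow the sorted order of the labels, a Pearson correlation on a numerized column measures a relationship with that ordering rather than with any intrinsic property of the company, so it is best read with caution.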

# Calculate the correlation matrix using Pearson's method
correlation_matrix = df_numerized.corr(method='pearson')
# Create a heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True)
# Set the title of the heatmap
plt.title("Correlation matrix for Movies")
# Set the x-axis label
plt.xlabel("Movie features")
# Set the y-axis label
plt.ylabel("Movie features")
# Display the heatmap
plt.show()
The visualization is considerably larger now that we have numerized all the string columns.

# Let's look at the correlation table behind the visualization above.
df_numerized.corr()

# Calculate the correlation matrix
correlation_mat = df_numerized.corr()
# Extract the correlation pairs
corr_pairs = correlation_mat.unstack()
corr_pairs
name     name        1.000000
         rating     -0.008069
         genre       0.016355
         year        0.011453
         released   -0.011311
                       ...
runtime  country    -0.078412
         budget      0.320447
         gross       0.245216
         company     0.034402
         runtime     1.000000
Length: 225, dtype: float64
# Sort the correlation pairs
sorted_pairs = corr_pairs.sort_values()
sorted_pairs
budget   genre     -0.356564
genre    budget    -0.356564
         gross     -0.235650
gross    genre     -0.235650
rating   budget    -0.176002
                      ...
year     year       1.000000
genre    genre      1.000000
rating   rating     1.000000
company  company    1.000000
runtime  runtime    1.000000
Length: 225, dtype: float64
# Filter the correlation pairs for high correlation coefficients
high_corr = sorted_pairs[(sorted_pairs) > 0.5]
high_corr
gross     votes       0.630757
votes     gross       0.630757
budget    gross       0.740395
gross     budget      0.740395
name      name        1.000000
director  director    1.000000
gross     gross       1.000000
budget    budget      1.000000
country   country     1.000000
star      star        1.000000
writer    writer      1.000000
votes     votes       1.000000
score     score       1.000000
released  released    1.000000
year      year        1.000000
genre     genre       1.000000
rating    rating      1.000000
company   company     1.000000
runtime   runtime     1.000000
dtype: float64
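Most of the rows in this filtered output are each column correlated with itself (always 1.0), and every real pair appears twice. One possible way to keep each pair once is to mask everything except the upper triangle of the matrix before stacking. This is a sketch on invented numbers, and the `upper`, `pairs`, and `strong_pairs` names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, for illustration only
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "gross": [12, 35, 41, 90, 95],
    "runtime": [100, 90, 120, 95, 110],
})

corr = toy.corr(method="pearson")

# True only strictly above the diagonal: removes self-pairs and mirrored duplicates
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked cells become NaN; dropna() removes them after stacking
pairs = corr.where(upper).stack().dropna()

# Keep only the strongly correlated pairs, strongest first
strong_pairs = pairs[pairs > 0.5].sort_values(ascending=False)
print(strong_pairs)
```

With three columns there are three unique pairs, and only the budget and gross pair passes the 0.5 threshold in this toy data.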
Conclusion
The project concludes by identifying the variables most strongly correlated with gross earnings. Budget (0.74) and votes (0.63) have the highest correlations with gross earnings, while the production company has a low correlation with gross earnings.
In summary, the Movie Correlation Project explores the relationships between various movie attributes and their impact on gross earnings. The analysis provides insights into the factors that influence a movie's financial success and can be valuable for decision-making in the film industry.