Correlation in the Movie Industry

For this project, the first questions that came to my mind were: Is the movie industry dying? Is Netflix the new entertainment king? The best way to answer them is to analyze a dataset covering four decades of movies using Pandas, Matplotlib and Seaborn, which also helps us understand other factors at play in this industry, such as actors, genres, user ratings and more.

Import Libraries

import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
plt.style.use('ggplot')

Loading data

df = pd.read_csv('movies.csv')
df_head=df.head()
df_head.to_html()
name rating genre year released score votes director writer star country budget gross company runtime
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000.0 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000.0 46998772.0 Warner Bros. 146.0
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000.0 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000.0 58853106.0 Columbia Pictures 104.0
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000.0 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000.0 538375067.0 Lucasfilm 124.0
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000.0 Jim Abrahams Jim Abrahams Robert Hays United States 3500000.0 83453539.0 Paramount Pictures 88.0
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000.0 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000.0 39846344.0 Orion Pictures 98.0

Cleaning (Preprocessing)

df.isna().sum()
name           0
rating        77
genre          0
year           0
released       2
score          3
votes          3
director       0
writer         3
star           1
country        3
budget      2171
gross        189
company       17
runtime        4
dtype: int64
df.dropna(inplace=True)
df.isna().sum()
name        0
rating      0
genre       0
year        0
released    0
score       0
votes       0
director    0
writer      0
star        0
country     0
budget      0
gross       0
company     0
runtime     0
dtype: int64
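Since budget alone accounts for 2171 of the missing values, calling dropna() without arguments removes a large share of the rows. Before dropping, it can be worth checking how much each column contributes and whether a narrower subset would be enough. The snippet below is only a sketch on a hypothetical four-row frame (the `toy` frame and its values are made up for illustration, not taken from movies.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": ["R", None, "PG", "R"],
    "budget": [1e6, 2e6, np.nan, np.nan],
    "gross": [5e6, 1e6, 3e6, np.nan],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Drop every row that has at least one missing value (what the post does).
all_clean = toy.dropna()

# Alternative: only require the columns needed for the money analysis.
money_clean = toy.dropna(subset=["budget", "gross"])

print(missing_share)
print(len(toy), len(all_clean), len(money_clean))
```

Which variant is appropriate depends on the columns used later. This post keeps the stricter dropna() so that every column is complete.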
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      5421 non-null   object 
 1   rating    5421 non-null   object 
 2   genre     5421 non-null   object 
 3   year      5421 non-null   int64  
 4   released  5421 non-null   object 
 5   score     5421 non-null   float64
 6   votes     5421 non-null   float64
 7   director  5421 non-null   object 
 8   writer    5421 non-null   object 
 9   star      5421 non-null   object 
 10  country   5421 non-null   object 
 11  budget    5421 non-null   float64
 12  gross     5421 non-null   float64
 13  company   5421 non-null   object 
 14  runtime   5421 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 677.6+ KB
df['votes']=df['votes'].astype('int64')
df['budget']=df['budget'].astype('int64')
df['gross']=df['gross'].astype('int64')
df['released']=df['released'].astype('string')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      5421 non-null   object 
 1   rating    5421 non-null   object 
 2   genre     5421 non-null   object 
 3   year      5421 non-null   int64  
 4   released  5421 non-null   string 
 5   score     5421 non-null   float64
 6   votes     5421 non-null   int64  
 7   director  5421 non-null   object 
 8   writer    5421 non-null   object 
 9   star      5421 non-null   object 
 10  country   5421 non-null   object 
 11  budget    5421 non-null   int64  
 12  gross     5421 non-null   int64  
 13  company   5421 non-null   object 
 14  runtime   5421 non-null   float64
dtypes: float64(2), int64(4), object(8), string(1)
memory usage: 677.6+ KB
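As a side note, the four astype calls above can probably be written as a single call by passing a dictionary that maps column names to dtypes. A minimal sketch on a hypothetical two-row frame that reuses the column names and the first two rows of the table shown earlier:

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the dataset.
toy = pd.DataFrame({
    "votes": [927000.0, 65000.0],
    "budget": [19000000.0, 4500000.0],
    "gross": [46998772.0, 58853106.0],
    "released": ["June 13, 1980 (United States)", "July 2, 1980 (United States)"],
})

# One astype call with a {column: dtype} mapping instead of four assignments.
toy = toy.astype({
    "votes": "int64",
    "budget": "int64",
    "gross": "int64",
    "released": "string",
})

print(toy.dtypes)
```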
df_sort=df.sort_values(by=['gross'],inplace=False,ascending=False)
df_sort_head=df_sort.head()
df_sort_head.to_html()
name rating genre year released score votes director writer star country budget gross company runtime
5445 Avatar PG-13 Action 2009 December 18, 2009 (United States) 7.8 1100000 James Cameron James Cameron Sam Worthington United States 237000000 2847246203 Twentieth Century Fox 162.0
7445 Avengers: Endgame PG-13 Action 2019 April 26, 2019 (United States) 8.4 903000 Anthony Russo Christopher Markus Robert Downey Jr. United States 356000000 2797501328 Marvel Studios 181.0
3045 Titanic PG-13 Drama 1997 December 19, 1997 (United States) 7.8 1100000 James Cameron James Cameron Leonardo DiCaprio United States 200000000 2201647264 Twentieth Century Fox 194.0
6663 Star Wars: Episode VII - The Force Awakens PG-13 Action 2015 December 18, 2015 (United States) 7.8 876000 J.J. Abrams Lawrence Kasdan Daisy Ridley United States 245000000 2069521700 Lucasfilm 138.0
7244 Avengers: Infinity War PG-13 Action 2018 April 27, 2018 (United States) 8.4 897000 Anthony Russo Christopher Markus Robert Downey Jr. United States 321000000 2048359754 Marvel Studios 149.0

Feature Engineering

released_df=df['released'].str.split(",",n = 1,expand = True) 
released_df.head()
         0                      1
0  June 13   1980 (United States)
1   July 2   1980 (United States)
2  June 20   1980 (United States)
3   July 2   1980 (United States)
4  July 25   1980 (United States)
df['Day_Month']=released_df[0]
df['Year_Correct']=released_df[1].str[:5]
df['Country_Correct']=released_df[1].str[5:]
df['Country_Correct']=df['Country_Correct'].str[2:-1]
df_head=df.head()
df_head.to_html()
name rating genre year released score votes director writer star country budget gross company runtime Day_Month Year_Correct Country_Correct
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000 46998772 Warner Bros. 146.0 June 13 1980 United States
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000 58853106 Columbia Pictures 104.0 July 2 1980 United States
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000 538375067 Lucasfilm 124.0 June 20 1980 United States
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays United States 3500000 83453539 Paramount Pictures 88.0 July 2 1980 United States
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000 39846344 Orion Pictures 98.0 July 25 1980 United States

day_month=df['Day_Month'].str.split(" ",n = 1,expand = True) 
df['Month']=day_month[0]
df['Day']=day_month[1]
df_head=df.head()
df_head.to_html()
name rating genre year released score votes director writer star country budget gross company runtime Day_Month Year_Correct Country_Correct Month Day
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000 46998772 Warner Bros. 146.0 June 13 1980 United States June 13
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000 58853106 Columbia Pictures 104.0 July 2 1980 United States July 2
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000 538375067 Lucasfilm 124.0 June 20 1980 United States June 20
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays United States 3500000 83453539 Paramount Pictures 88.0 July 2 1980 United States July 2
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000 39846344 Orion Pictures 98.0 July 25 1980 United States July 25
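The splitting and slicing above relies on every value of released having exactly the shape "Month Day, Year (Country)". A possible alternative is a single regular expression with named groups, which leaves rows that do not match as missing values instead of producing shifted pieces. This is only a sketch: the pattern and the third sample value ("1982 (Japan)") are my own assumptions about how irregular entries might look, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical sample mirroring the format of the 'released' column.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
    "1982 (Japan)",  # assumed irregular entry: no month/day, so no match
])

# Named groups become the column names of the resulting DataFrame.
pattern = (
    r"^(?P<month>[A-Za-z]+) (?P<day>\d{1,2}), "
    r"(?P<year>\d{4}) \((?P<country>.+)\)$"
)
parts = released.str.extract(pattern)

# Convert the numeric pieces; unmatched rows stay NaN.
parts["day"] = pd.to_numeric(parts["day"])
parts["year"] = pd.to_numeric(parts["year"])

print(parts)
```

Rows that fail to match can then be inspected with parts[parts["year"].isna()] before deciding how to handle them.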

df.drop(['year','released','Day_Month','country'],axis=1,inplace=True)
df_head=df.head()
df_head.to_html()
name rating genre score votes director writer star budget gross company runtime Year_Correct Country_Correct Month Day
0 The Shining R Drama 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson 19000000 46998772 Warner Bros. 146.0 1980 United States June 13
1 The Blue Lagoon R Adventure 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields 4500000 58853106 Columbia Pictures 104.0 1980 United States July 2
2 Star Wars: Episode V - The Empire Strikes Back PG Action 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill 18000000 538375067 Lucasfilm 124.0 1980 United States June 20
3 Airplane! PG Comedy 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays 3500000 83453539 Paramount Pictures 88.0 1980 United States July 2
4 Caddyshack R Comedy 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase 6000000 39846344 Orion Pictures 98.0 1980 United States July 25

df.rename(columns={"Year_Correct": "year", "Country_Correct": "country",'Month':'month','Day':'day'},inplace=True)
df_head=df.head()
df_head.to_html()
name rating genre score votes director writer star budget gross company runtime year country month day
0 The Shining R Drama 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson 19000000 46998772 Warner Bros. 146.0 1980 United States June 13
1 The Blue Lagoon R Adventure 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields 4500000 58853106 Columbia Pictures 104.0 1980 United States July 2
2 Star Wars: Episode V - The Empire Strikes Back PG Action 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill 18000000 538375067 Lucasfilm 124.0 1980 United States June 20
3 Airplane! PG Comedy 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays 3500000 83453539 Paramount Pictures 88.0 1980 United States July 2
4 Caddyshack R Comedy 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase 6000000 39846344 Orion Pictures 98.0 1980 United States July 25

df_sorted = df.sort_values(by=['gross'],inplace=False,ascending=False)
df_sorted_head=df_sorted.head()
df_sorted_head.to_html()
name rating genre score votes director writer star budget gross company runtime year country month day
5445 Avatar PG-13 Action 7.8 1100000 James Cameron James Cameron Sam Worthington 237000000 2847246203 Twentieth Century Fox 162.0 2009 United States December 18
7445 Avengers: Endgame PG-13 Action 8.4 903000 Anthony Russo Christopher Markus Robert Downey Jr. 356000000 2797501328 Marvel Studios 181.0 2019 United States April 26
3045 Titanic PG-13 Drama 7.8 1100000 James Cameron James Cameron Leonardo DiCaprio 200000000 2201647264 Twentieth Century Fox 194.0 1997 United States December 19
6663 Star Wars: Episode VII - The Force Awakens PG-13 Action 7.8 876000 J.J. Abrams Lawrence Kasdan Daisy Ridley 245000000 2069521700 Lucasfilm 138.0 2015 United States December 18
7244 Avengers: Infinity War PG-13 Action 8.4 897000 Anthony Russo Christopher Markus Robert Downey Jr. 321000000 2048359754 Marvel Studios 149.0 2018 United States April 27

df_drop=df.drop_duplicates().head()
df_drop.to_html()
name rating genre score votes director writer star budget gross company runtime year country month day
0 The Shining R Drama 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson 19000000 46998772 Warner Bros. 146.0 1980 United States June 13
1 The Blue Lagoon R Adventure 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields 4500000 58853106 Columbia Pictures 104.0 1980 United States July 2
2 Star Wars: Episode V - The Empire Strikes Back PG Action 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill 18000000 538375067 Lucasfilm 124.0 1980 United States June 20
3 Airplane! PG Comedy 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays 3500000 83453539 Paramount Pictures 88.0 1980 United States July 2
4 Caddyshack R Comedy 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase 6000000 39846344 Orion Pictures 98.0 1980 United States July 25

df['year']=pd.to_numeric(df['year'])

Hypothesis

We are now going to form hypotheses about correlations in our dataframe. First, let's assume that the columns "gross" and "budget" are positively correlated and see whether this is true.

Correlation between Gross and Budget

sns.regplot(data=df,x='gross',y='budget',color="b",line_kws={"color":"red"});
plt.title("Gross Earnings vs. Budget");
plt.xlabel("Gross Earnings");
plt.ylabel("Budget for Film");
plt.show()
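The regression plot gives a visual impression. The strength of the linear relationship can also be expressed as a single number, the Pearson correlation coefficient. The sketch below computes it for just the five movies shown in the first table of this post, so the value it prints is not the result for the whole dataset (the correlation matrix further down reports about 0.74 for budget and gross on the cleaned data). The names `sample`, `r` and `r_manual` are only illustrative.

```python
import numpy as np
import pandas as pd

# Budget and gross of the five movies shown in the first table above.
sample = pd.DataFrame({
    "budget": [19000000, 4500000, 18000000, 3500000, 6000000],
    "gross": [46998772, 58853106, 538375067, 83453539, 39846344],
})

# Pearson correlation coefficient via pandas (the default method).
r = sample["gross"].corr(sample["budget"])

# Same coefficient from its definition: cov(x, y) / (std(x) * std(y)).
x = sample["gross"].to_numpy(dtype=float)
y = sample["budget"].to_numpy(dtype=float)
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 3), round(r_manual, 3))
```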

Heat Map

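# numeric_only=True restricts the matrix to numeric columns (required by recent pandas versions)
corr_matrix=df.corr(numeric_only=True)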
sns.heatmap(corr_matrix,annot=True);
plt.title("Correlation Matrix between Numeric Features");
plt.xlabel("Movies Features");
plt.ylabel("Movies Features");
plt.show()

Correlation between Gross and Number of Votes

sns.regplot(data=df,x='gross',y='votes',color="b",line_kws={"color":"red"});
plt.title("Gross Vs Number of Votes");
plt.xlabel("Gross Earnings");
plt.ylabel("Votes");
plt.show()

Chi Square Test

contigency=pd.crosstab(df['rating'],df['genre'])
cont=contigency.head()
cont.to_html()
genre      Action  Adventure  Animation  Biography  Comedy  Crime  Drama  Family  Fantasy  Horror  Mystery  Romance  Sci-Fi  Thriller  Western
rating
Approved        0          1          0          0       0      0      0       0        0       0        0        0       0         0        0
G               0         13         82          1      11      0      3       1        0       0        0        0       0         0        0
NC-17           0          0          0          0       2      3      6       0        0       1        0        0       0         0        0
Not Rated       5          0          1          2       6      4     20       0        2       4        0        0       0         0        0
PG            146        166        175         46     275      5     77       3        3       5        0        3       1         1        1

from scipy.stats import chi2_contingency
c, p, dof, expected = chi2_contingency(contigency)
p
0.0
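A p-value this small says that rating and genre are very unlikely to be independent, but it says nothing about how strong the association is, and the chi-square approximation becomes less reliable when many expected counts are small (the table above contains many zeros). A common complement is Cramér's V together with a check of the expected counts. The following is a sketch on a hypothetical 3x3 table whose counts are invented for illustration. It is not the contingency table of this dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, not taken from movies.csv.
table = pd.DataFrame(
    {"Action": [5, 150, 300], "Comedy": [40, 280, 250], "Drama": [10, 80, 400]},
    index=["G", "PG", "R"],
)

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: effect size between 0 (no association) and 1 (perfect association).
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Share of cells with an expected count below 5 (a common rule-of-thumb check).
share_small = (expected < 5).mean()

print(p, dof, cramers_v, share_small)
```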

Extra

Here we will look for correlations between the categorical variables and some of the numerical ones.

df1 = df.copy()
df1['rating']=df1['rating'].astype('category')
df1['genre']=df1['genre'].astype('category')
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   name      5421 non-null   object  
 1   rating    5421 non-null   category
 2   genre     5421 non-null   category
 3   score     5421 non-null   float64 
 4   votes     5421 non-null   int64   
 5   director  5421 non-null   object  
 6   writer    5421 non-null   object  
 7   star      5421 non-null   object  
 8   budget    5421 non-null   int64   
 9   gross     5421 non-null   int64   
 10  company   5421 non-null   object  
 11  runtime   5421 non-null   float64 
 12  year      5407 non-null   float64 
 13  country   5407 non-null   string  
 14  month     5421 non-null   string  
 15  day       5421 non-null   string  
dtypes: category(2), float64(3), int64(3), object(5), string(3)
memory usage: 646.9+ KB
df1['rating_cat']=df1['rating'].cat.codes
df1['genre_cat']=df1['genre'].cat.codes
df1_head=df1.head()
df1_head.to_html()
name rating genre score votes director writer star budget gross company runtime year country month day rating_cat genre_cat
0 The Shining R Drama 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson 19000000 46998772 Warner Bros. 146.0 1980.0 United States June 13 6 6
1 The Blue Lagoon R Adventure 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields 4500000 58853106 Columbia Pictures 104.0 1980.0 United States July 2 6 1
2 Star Wars: Episode V - The Empire Strikes Back PG Action 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill 18000000 538375067 Lucasfilm 124.0 1980.0 United States June 20 4 0
3 Airplane! PG Comedy 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays 3500000 83453539 Paramount Pictures 88.0 1980.0 United States July 2 4 4
4 Caddyshack R Comedy 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase 6000000 39846344 Orion Pictures 98.0 1980.0 United States July 25 6 4

df1_corr=df1.corr(numeric_only=True)
df1_corr.to_html()
                 score       votes      budget       gross     runtime        year  rating_cat   genre_cat
score         1.000000    0.474256    0.072001    0.222556    0.414068    0.061443    0.065983    0.035106
votes         0.474256    1.000000    0.439675    0.614751    0.352303    0.202215    0.006031   -0.135990
budget        0.072001    0.439675    1.000000    0.740247    0.318695    0.319669   -0.203946   -0.368523
gross         0.222556    0.614751    0.740247    1.000000    0.275796    0.268141   -0.181906   -0.244101
runtime       0.414068    0.352303    0.318695    0.275796    1.000000    0.075183    0.140792   -0.059237
year          0.061443    0.202215    0.319669    0.268141    0.075183    1.000000    0.022089   -0.066049
rating_cat    0.065983    0.006031   -0.203946   -0.181906    0.140792    0.022089    1.000000    0.147796
genre_cat     0.035106   -0.135990   -0.368523   -0.244101   -0.059237   -0.066049    0.147796    1.000000

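# Note: .head() keeps only the first five rows, so the correlations below are computed on five movies
df2=pd.get_dummies(df1,columns=['rating','genre']).head()
df2_matrix=df2.corr(numeric_only=True)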
corr_pairs = df2_matrix.unstack()
sorted_pairs = corr_pairs.sort_values()
sorted_pairs.dropna()
rating_R         rating_PG      -1.000000
rating_PG        rating_cat     -1.000000
                 rating_R       -1.000000
rating_cat       rating_PG      -1.000000
genre_Adventure  score          -0.873726
                                   ...   
gross            gross           1.000000
budget           budget          1.000000
votes            votes           1.000000
genre_Comedy     genre_Comedy    1.000000
genre_Drama      genre_Drama     1.000000
Length: 169, dtype: float64
high_corr = sorted_pairs[(sorted_pairs)>0.7]
high_corr
votes            gross              0.729639
gross            votes              0.729639
votes            genre_Action       0.744107
genre_Action     votes              0.744107
budget           score              0.760846
score            budget             0.760846
runtime          votes              0.798509
votes            runtime            0.798509
runtime          genre_Drama        0.822495
genre_Drama      runtime            0.822495
votes            score              0.833480
score            votes              0.833480
runtime          budget             0.932112
budget           runtime            0.932112
                 votes              0.952681
votes            budget             0.952681
gross            genre_Action       0.997050
genre_Action     gross              0.997050
score            score              1.000000
genre_cat        genre_cat          1.000000
genre_Adventure  genre_Adventure    1.000000
genre_Action     genre_Action       1.000000
rating_R         rating_R           1.000000
                 rating_cat         1.000000
rating_PG        rating_PG          1.000000
rating_cat       rating_R           1.000000
                 rating_cat         1.000000
runtime          runtime            1.000000
gross            gross              1.000000
budget           budget             1.000000
votes            votes              1.000000
genre_Comedy     genre_Comedy       1.000000
genre_Drama      genre_Drama        1.000000
dtype: float64
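The list above contains every pair twice (A/B and B/A) plus the trivial self-correlations of 1.0. One way to keep each pair only once is to mask everything except the upper triangle of the matrix before unstacking. The sketch below shows the idea on a hypothetical eight-row frame (the `frame` values are invented). The same steps should carry over to a frame built from the whole dataset, without the .head() call.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
frame = pd.DataFrame({
    "genre": ["Action", "Comedy", "Drama", "Action",
              "Comedy", "Drama", "Action", "Drama"],
    "gross": [500, 80, 40, 620, 95, 55, 450, 30],
    "runtime": [124, 88, 146, 130, 98, 150, 118, 141],
})

# One-hot encode the categorical column as 0.0/1.0 floats.
encoded = pd.get_dummies(frame, columns=["genre"], dtype=float)
corr = encoded.corr()

# Keep only the strict upper triangle: drops the diagonal and mirrored pairs.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

# Strong relationships in either direction.
strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Filtering on the absolute value also keeps strong negative correlations, which the `> 0.7` filter used above would miss.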

Conclusions

From the correlations above, which are computed on only the first five rows because of the .head() call, the Action genre shows a very high correlation with gross (about 0.997), so for those looking for profit when planning a movie it looks like an excellent genre to consider. The Drama genre is the genre with the highest correlation with runtime (about 0.82), which suggests that drama audiences tend to favor longer films over shorter ones. Both observations should be confirmed on the full dataset before being taken as firm conclusions.

Thank you!