For this project, the first questions that came to my mind were: Is the movie industry dying? Is Netflix the new entertainment king? The best way to answer them is to analyze a dataset covering four decades of films using Pandas, Matplotlib and Seaborn, and along the way to understand other factors at play in this industry, such as actors, genres, user ratings and more.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
plt.style.use('ggplot')
df = pd.read_csv('movies.csv')
df_head=df.head()
df_head.to_html()

| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 |
df.isna().sum()
name 0
rating 77
genre 0
year 0
released 2
score 3
votes 3
director 0
writer 3
star 1
country 3
budget 2171
gross 189
company 17
runtime 4
dtype: int64
df.dropna(inplace=True)
df.isna().sum()
name 0
rating 0
genre 0
year 0
released 0
score 0
votes 0
director 0
writer 0
star 0
country 0
budget 0
gross 0
company 0
runtime 0
dtype: int64
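Dropping every row that has any missing value removes a large share of the films, mostly because `budget` is missing in 2171 rows. As a minimal sketch on a small made-up frame (the values are illustrative and not taken from `movies.csv`), this is one way to measure what a blanket `dropna` costs and to compare it with dropping only on the columns a given question needs:

```python
import numpy as np
import pandas as pd

# Toy data standing in for movies.csv; every value here is made up.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [10.0, np.nan, 30.0, np.nan],
    "gross": [50.0, 20.0, np.nan, 80.0],
    "rating": ["R", "PG", None, "R"],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Blanket drop: a single missing value anywhere removes the row.
dropped_all = toy.dropna()

# Targeted drop: only require the column needed for a gross-focused question.
dropped_subset = toy.dropna(subset=["gross"])

print(missing_share)
print(len(dropped_all), len(dropped_subset))
```

Whether the targeted drop is appropriate depends on the question being asked; for the budget versus gross comparison both columns would have to be present.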
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5421 non-null object
1 rating 5421 non-null object
2 genre 5421 non-null object
3 year 5421 non-null int64
4 released 5421 non-null object
5 score 5421 non-null float64
6 votes 5421 non-null float64
7 director 5421 non-null object
8 writer 5421 non-null object
9 star 5421 non-null object
10 country 5421 non-null object
11 budget 5421 non-null float64
12 gross 5421 non-null float64
13 company 5421 non-null object
14 runtime 5421 non-null float64
dtypes: float64(5), int64(1), object(9)
memory usage: 677.6+ KB
df['votes']=df['votes'].astype('int64')
df['budget']=df['budget'].astype('int64')
df['gross']=df['gross'].astype('int64')
df['released']=df['released'].astype('string')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5421 non-null object
1 rating 5421 non-null object
2 genre 5421 non-null object
3 year 5421 non-null int64
4 released 5421 non-null string
5 score 5421 non-null float64
6 votes 5421 non-null int64
7 director 5421 non-null object
8 writer 5421 non-null object
9 star 5421 non-null object
10 country 5421 non-null object
11 budget 5421 non-null int64
12 gross 5421 non-null int64
13 company 5421 non-null object
14 runtime 5421 non-null float64
dtypes: float64(2), int64(4), object(8), string(1)
memory usage: 677.6+ KB
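These casts succeed only because `dropna` has already removed every missing value: a plain `int64` column cannot hold `NaN`. If rows with gaps had been kept, one possible alternative (an assumption on my side, not what this notebook does) is pandas' nullable `Int64` dtype. A small sketch with made-up vote counts:

```python
import numpy as np
import pandas as pd

# Made-up vote counts with one missing entry.
votes = pd.Series([927000.0, np.nan, 65000.0])

# A plain int64 cast fails because NaN has no integer representation.
try:
    votes.astype("int64")
    plain_cast_ok = True
except ValueError:
    plain_cast_ok = False

# The nullable extension dtype stores integers and keeps the gap as <NA>.
votes_nullable = votes.astype("Int64")

print(plain_cast_ok)
print(votes_nullable)
```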
df_sort=df.sort_values(by=['gross'],inplace=False,ascending=False)
df_sort_head=df_sort.head()
df_sort_head.to_html()

| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5445 | Avatar | PG-13 | Action | 2009 | December 18, 2009 (United States) | 7.8 | 1100000 | James Cameron | James Cameron | Sam Worthington | United States | 237000000 | 2847246203 | Twentieth Century Fox | 162.0 |
| 7445 | Avengers: Endgame | PG-13 | Action | 2019 | April 26, 2019 (United States) | 8.4 | 903000 | Anthony Russo | Christopher Markus | Robert Downey Jr. | United States | 356000000 | 2797501328 | Marvel Studios | 181.0 |
| 3045 | Titanic | PG-13 | Drama | 1997 | December 19, 1997 (United States) | 7.8 | 1100000 | James Cameron | James Cameron | Leonardo DiCaprio | United States | 200000000 | 2201647264 | Twentieth Century Fox | 194.0 |
| 6663 | Star Wars: Episode VII - The Force Awakens | PG-13 | Action | 2015 | December 18, 2015 (United States) | 7.8 | 876000 | J.J. Abrams | Lawrence Kasdan | Daisy Ridley | United States | 245000000 | 2069521700 | Lucasfilm | 138.0 |
| 7244 | Avengers: Infinity War | PG-13 | Action | 2018 | April 27, 2018 (United States) | 8.4 | 897000 | Anthony Russo | Christopher Markus | Robert Downey Jr. | United States | 321000000 | 2048359754 | Marvel Studios | 149.0 |
released_df=df['released'].str.split(",",n = 1,expand = True)
released_df.head()
0 1
0 June 13 1980 (United States)
1 July 2 1980 (United States)
2 June 20 1980 (United States)
3 July 2 1980 (United States)
4 July 25 1980 (United States)
df['Day_Month']=released_df[0]
df['Year_Correct']=released_df[1].str[:5]
df['Country_Correct']=released_df[1].str[5:]
df['Country_Correct']=df['Country_Correct'].str[2:-1]
df_head=df.head()
df_head.to_html()

| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | Day_Month | Year_Correct | Country_Correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000 | 46998772 | Warner Bros. | 146.0 | June 13 | 1980 | United States |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000 | 58853106 | Columbia Pictures | 104.0 | July 2 | 1980 | United States |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000 | 538375067 | Lucasfilm | 124.0 | June 20 | 1980 | United States |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000 | 83453539 | Paramount Pictures | 88.0 | July 2 | 1980 | United States |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000 | 39846344 | Orion Pictures | 98.0 | July 25 | 1980 | United States |
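The slicing used to build `Year_Correct` and `Country_Correct` relies on fixed character positions, so any `released` value that does not follow the "Month day, year (Country)" shape ends up with shifted or missing pieces. As a hedged alternative, assuming that shape is the intended one, a regular expression with named groups and `Series.str.extract` pulls all four parts at once and leaves non-matching rows empty. The sample strings are illustrative, and the last one is deliberately malformed:

```python
import pandas as pd

# Sample strings shaped like the `released` column; the last one has no
# month or day and is included to show what a non-match looks like.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
    "1985 (Japan)",
])

# Named groups become the column names of the extracted frame.
pattern = (
    r"^(?P<month>[A-Za-z]+) (?P<day>\d{1,2}), "
    r"(?P<year>\d{4}) \((?P<country>.+)\)$"
)
parts = released.str.extract(pattern)

# Convert the numeric pieces; rows that did not match stay missing.
parts["year"] = pd.to_numeric(parts["year"])
parts["day"] = pd.to_numeric(parts["day"])

print(parts)
```

Rows that come back empty can then be inspected separately instead of silently receiving a wrong year or country.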
day_month=df['Day_Month'].str.split(" ",n = 1,expand = True)
df['Month']=day_month[0]
df['Day']=day_month[1]
df_head=df.head()
df_head.to_html()

| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime | Day_Month | Year_Correct | Country_Correct | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000 | 46998772 | Warner Bros. | 146.0 | June 13 | 1980 | United States | June | 13 |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000 | 58853106 | Columbia Pictures | 104.0 | July 2 | 1980 | United States | July | 2 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000 | 538375067 | Lucasfilm | 124.0 | June 20 | 1980 | United States | June | 20 |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000 | 83453539 | Paramount Pictures | 88.0 | July 2 | 1980 | United States | July | 2 |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000 | 39846344 | Orion Pictures | 98.0 | July 25 | 1980 | United States | July | 25 |
df.drop(['year','released','Day_Month','country'],axis=1,inplace=True)
df_head=df.head()
df_head.to_html()

| | name | rating | genre | score | votes | director | writer | star | budget | gross | company | runtime | Year_Correct | Country_Correct | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980 | United States | June | 13 |
| 1 | The Blue Lagoon | R | Adventure | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980 | United States | July | 2 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980 | United States | June | 20 |
| 3 | Airplane! | PG | Comedy | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980 | United States | July | 2 |
| 4 | Caddyshack | R | Comedy | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980 | United States | July | 25 |
df.rename(columns={"Year_Correct": "year", "Country_Correct": "country",'Month':'month','Day':'day'},inplace=True)
df_head=df.head()
df_head.to_html()

| | name | rating | genre | score | votes | director | writer | star | budget | gross | company | runtime | year | country | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980 | United States | June | 13 |
| 1 | The Blue Lagoon | R | Adventure | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980 | United States | July | 2 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980 | United States | June | 20 |
| 3 | Airplane! | PG | Comedy | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980 | United States | July | 2 |
| 4 | Caddyshack | R | Comedy | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980 | United States | July | 25 |
df_sorted = df.sort_values(by=['gross'],inplace=False,ascending=False)
df_sorted_head=df_sorted.head()
df_sorted_head.to_html()

| | name | rating | genre | score | votes | director | writer | star | budget | gross | company | runtime | year | country | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5445 | Avatar | PG-13 | Action | 7.8 | 1100000 | James Cameron | James Cameron | Sam Worthington | 237000000 | 2847246203 | Twentieth Century Fox | 162.0 | 2009 | United States | December | 18 |
| 7445 | Avengers: Endgame | PG-13 | Action | 8.4 | 903000 | Anthony Russo | Christopher Markus | Robert Downey Jr. | 356000000 | 2797501328 | Marvel Studios | 181.0 | 2019 | United States | April | 26 |
| 3045 | Titanic | PG-13 | Drama | 7.8 | 1100000 | James Cameron | James Cameron | Leonardo DiCaprio | 200000000 | 2201647264 | Twentieth Century Fox | 194.0 | 1997 | United States | December | 19 |
| 6663 | Star Wars: Episode VII - The Force Awakens | PG-13 | Action | 7.8 | 876000 | J.J. Abrams | Lawrence Kasdan | Daisy Ridley | 245000000 | 2069521700 | Lucasfilm | 138.0 | 2015 | United States | December | 18 |
| 7244 | Avengers: Infinity War | PG-13 | Action | 8.4 | 897000 | Anthony Russo | Christopher Markus | Robert Downey Jr. | 321000000 | 2048359754 | Marvel Studios | 149.0 | 2018 | United States | April | 27 |
df_drop=df.drop_duplicates().head()
df_drop.to_html()

| | name | rating | genre | score | votes | director | writer | star | budget | gross | company | runtime | year | country | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980 | United States | June | 13 |
| 1 | The Blue Lagoon | R | Adventure | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980 | United States | July | 2 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980 | United States | June | 20 |
| 3 | Airplane! | PG | Comedy | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980 | United States | July | 2 |
| 4 | Caddyshack | R | Comedy | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980 | United States | July | 25 |
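Note that `df.drop_duplicates().head()` only previews the result: `drop_duplicates` returns a new frame and `df` itself is left unchanged. A small sketch with made-up rows shows how one might count the duplicates first and then keep the deduplicated frame by assigning it:

```python
import pandas as pd

# Made-up rows; the first two are exact duplicates of each other.
toy = pd.DataFrame({
    "name": ["Film A", "Film A", "Film B"],
    "gross": [100, 100, 250],
})

# Number of rows that repeat an earlier row exactly.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates returns a new frame, so the result has to be assigned.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(toy), len(deduped))
```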
df['year']=pd.to_numeric(df['year'])
We are going to start making hypotheses about correlations in our dataframe. First, let's assume that the columns "gross" and "budget" are positively correlated, and let's see if this is true.
sns.regplot(data=df,x='gross',y='budget',color="b",line_kws={"color":"red"});
plt.title("Gross Vs Budget Earnings");
plt.xlabel("Gross Earnings");
plt.ylabel("Budget for Film");
plt.show()
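The regression plot shows the trend visually; the strength of a linear relationship can also be expressed as a single number, the Pearson correlation coefficient. The sketch below uses made-up values (not the movie data) and computes the coefficient twice, once with `Series.corr` and once from the definition, to make clear what the number means:

```python
import math

import pandas as pd

# Made-up budget and gross figures, only to illustrate the computation.
toy = pd.DataFrame({
    "budget": [1, 2, 3, 4, 5],
    "gross": [2, 1, 4, 3, 5],
})

# Pearson correlation as pandas computes it (the default method).
r_pandas = toy["budget"].corr(toy["gross"])

# The same coefficient from its definition:
# covariance divided by the product of the standard deviations.
x = toy["budget"].tolist()
y = toy["gross"].tolist()
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
var_x = sum((a - mean_x) ** 2 for a in x)
var_y = sum((b - mean_y) ** 2 for b in y)
r_manual = cov / math.sqrt(var_x * var_y)

print(r_pandas, r_manual)
```

On the real data the equivalent call would be `df['gross'].corr(df['budget'])`.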
corr_matrix=df.corr()
sns.heatmap(corr_matrix,annot=True);
plt.title("Correlation Matrix between Numeric Features");
plt.xlabel("Movies Features");
plt.ylabel("Movies Features");
plt.show()
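On recent pandas releases (2.0 and later, as far as I know) `df.corr()` no longer skips text columns silently and raises an error on them instead. A version-independent sketch, using a made-up frame, is to select the numeric columns explicitly before computing the matrix:

```python
import pandas as pd

# Made-up frame mixing a text column with numeric ones.
toy = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "budget": [1.0, 2.0, 3.0],
    "gross": [2.0, 4.0, 7.0],
})

# Keep only numeric columns, then compute the correlation matrix.
numeric_only = toy.select_dtypes(include="number")
corr = numeric_only.corr()

print(corr)
```

The resulting matrix can be passed to `sns.heatmap` exactly as before.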
sns.regplot(data=df,x='gross',y='votes',color="b",line_kws={"color":"red"});
plt.title("Gross Vs Number of Votes");
plt.xlabel("Gross Earnings");
plt.ylabel("Votes");
plt.show()
contigency=pd.crosstab(df['rating'],df['genre'])
cont=contigency.head()
cont.to_html()

| rating | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | Horror | Mystery | Romance | Sci-Fi | Thriller | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Approved | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 13 | 82 | 1 | 11 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NC-17 | 0 | 0 | 0 | 0 | 2 | 3 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Not Rated | 5 | 0 | 1 | 2 | 6 | 4 | 20 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 |
| PG | 146 | 166 | 175 | 46 | 275 | 5 | 77 | 3 | 3 | 5 | 0 | 3 | 1 | 1 | 1 |
from scipy.stats import chi2_contingency
c, p, dof, expected = chi2_contingency(contigency)
p
0.0
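A p-value printed as `0.0` most likely means the value is too small to be represented as a float, not that it is exactly zero. It says that rating and genre are not independent, but not how strong the association is. As a sketch on a small made-up table, the chi-square statistic can be built from expected counts by hand, and an effect size such as Cramér's V (my addition, not part of the original analysis) gives a value between 0 and 1:

```python
import math

# Made-up 2x3 contingency table (rows: two ratings, columns: three genres).
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom and Cramér's V as an effect size.
dof = (len(row_totals) - 1) * (len(col_totals) - 1)
cramers_v = math.sqrt(chi2 / (n * (min(len(row_totals), len(col_totals)) - 1)))

print(chi2, dof, cramers_v)
```

For the real table, `c` returned by `chi2_contingency` plays the role of `chi2` here, and `n` would be the total number of films in the crosstab.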
Here we will look for correlations between the categorical variables and some of the numerical ones.
df1 = df.copy()
df1['rating']=df1['rating'].astype('category')
df1['genre']=df1['genre'].astype('category')
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5421 non-null object
1 rating 5421 non-null category
2 genre 5421 non-null category
3 score 5421 non-null float64
4 votes 5421 non-null int64
5 director 5421 non-null object
6 writer 5421 non-null object
7 star 5421 non-null object
8 budget 5421 non-null int64
9 gross 5421 non-null int64
10 company 5421 non-null object
11 runtime 5421 non-null float64
12 year 5407 non-null float64
13 country 5407 non-null string
14 month 5421 non-null string
15 day 5421 non-null string
dtypes: category(2), float64(3), int64(3), object(5), string(3)
memory usage: 646.9+ KB
df1['rating_cat']=df1['rating'].cat.codes
df1['genre_cat']=df1['genre'].cat.codes
df1_head=df1.head()
df1_head.to_html()

| | name | rating | genre | score | votes | director | writer | star | budget | gross | company | runtime | year | country | month | day | rating_cat | genre_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 8.4 | 927000 | Stanley Kubrick | Stephen King | Jack Nicholson | 19000000 | 46998772 | Warner Bros. | 146.0 | 1980.0 | United States | June | 13 | 6 | 6 |
| 1 | The Blue Lagoon | R | Adventure | 5.8 | 65000 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | 4500000 | 58853106 | Columbia Pictures | 104.0 | 1980.0 | United States | July | 2 | 6 | 1 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 8.7 | 1200000 | Irvin Kershner | Leigh Brackett | Mark Hamill | 18000000 | 538375067 | Lucasfilm | 124.0 | 1980.0 | United States | June | 20 | 4 | 0 |
| 3 | Airplane! | PG | Comedy | 7.7 | 221000 | Jim Abrahams | Jim Abrahams | Robert Hays | 3500000 | 83453539 | Paramount Pictures | 88.0 | 1980.0 | United States | July | 2 | 4 | 4 |
| 4 | Caddyshack | R | Comedy | 7.3 | 108000 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | 6000000 | 39846344 | Orion Pictures | 98.0 | 1980.0 | United States | July | 25 | 6 | 4 |
df1_corr=df1.corr()
df1_corr.to_html()

| | score | votes | budget | gross | runtime | year | rating_cat | genre_cat |
|---|---|---|---|---|---|---|---|---|
| score | 1.000000 | 0.474256 | 0.072001 | 0.222556 | 0.414068 | 0.061443 | 0.065983 | 0.035106 |
| votes | 0.474256 | 1.000000 | 0.439675 | 0.614751 | 0.352303 | 0.202215 | 0.006031 | -0.135990 |
| budget | 0.072001 | 0.439675 | 1.000000 | 0.740247 | 0.318695 | 0.319669 | -0.203946 | -0.368523 |
| gross | 0.222556 | 0.614751 | 0.740247 | 1.000000 | 0.275796 | 0.268141 | -0.181906 | -0.244101 |
| runtime | 0.414068 | 0.352303 | 0.318695 | 0.275796 | 1.000000 | 0.075183 | 0.140792 | -0.059237 |
| year | 0.061443 | 0.202215 | 0.319669 | 0.268141 | 0.075183 | 1.000000 | 0.022089 | -0.066049 |
| rating_cat | 0.065983 | 0.006031 | -0.203946 | -0.181906 | 0.140792 | 0.022089 | 1.000000 | 0.147796 |
| genre_cat | 0.035106 | -0.135990 | -0.368523 | -0.244101 | -0.059237 | -0.066049 | 0.147796 | 1.000000 |
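The codes produced by `cat.codes` simply number the categories in sorted order, so a Pearson coefficient against `rating_cat` or `genre_cat` depends on alphabetical position and is hard to interpret. A complementary view, sketched here on made-up numbers, is to compare a numeric column across the categories directly with `groupby`:

```python
import pandas as pd

# Made-up genres and gross figures for illustration only.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Comedy", "Comedy", "Drama"],
    "gross": [100, 500, 40, 60, 30],
})

# Median gross per genre, highest first.
by_genre = toy.groupby("genre")["gross"].median().sort_values(ascending=False)

print(by_genre)
```

The median is used here as an assumption on my side, because box-office figures tend to be skewed by a few very large hits; the mean would work the same way.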
df2=pd.get_dummies(df1,columns=['rating','genre']).head()
df2_matrix=df2.corr()
corr_pairs = df2_matrix.unstack()
sorted_pairs = corr_pairs.sort_values()
sorted_pairs.dropna()
rating_R rating_PG -1.000000
rating_PG rating_cat -1.000000
rating_R -1.000000
rating_cat rating_PG -1.000000
genre_Adventure score -0.873726
...
gross gross 1.000000
budget budget 1.000000
votes votes 1.000000
genre_Comedy genre_Comedy 1.000000
genre_Drama genre_Drama 1.000000
Length: 169, dtype: float64
high_corr = sorted_pairs[(sorted_pairs)>0.7]
high_corr
votes gross 0.729639
gross votes 0.729639
votes genre_Action 0.744107
genre_Action votes 0.744107
budget score 0.760846
score budget 0.760846
runtime votes 0.798509
votes runtime 0.798509
runtime genre_Drama 0.822495
genre_Drama runtime 0.822495
votes score 0.833480
score votes 0.833480
runtime budget 0.932112
budget runtime 0.932112
votes 0.952681
votes budget 0.952681
gross genre_Action 0.997050
genre_Action gross 0.997050
score score 1.000000
genre_cat genre_cat 1.000000
genre_Adventure genre_Adventure 1.000000
genre_Action genre_Action 1.000000
rating_R rating_R 1.000000
rating_cat 1.000000
rating_PG rating_PG 1.000000
rating_cat rating_R 1.000000
rating_cat 1.000000
runtime runtime 1.000000
gross gross 1.000000
budget budget 1.000000
votes votes 1.000000
genre_Comedy genre_Comedy 1.000000
genre_Drama genre_Drama 1.000000
dtype: float64
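The unstacked listing contains every pair twice (once in each order) plus each variable paired with itself, which always gives 1.0. A sketch of one way to keep each pair only once, shown on a made-up frame, is to iterate over column combinations. It also filters on the absolute value, which is a choice of mine so that strong negative correlations are kept as well:

```python
from itertools import combinations

import pandas as pd

# Made-up numeric frame: a and b move together, c is unrelated to a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 9],
    "c": [2, 4, 1, 3],
})
corr = toy.corr()

# One entry per unordered pair of distinct columns.
pairs = pd.Series(
    {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}
)

# Keep only strong relationships, positive or negative.
strong = pairs[pairs.abs() > 0.7].sort_values()

print(strong)
```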
From the correlations above, the Action genre has a very high correlation with gross, so for those looking for profit when planning a movie, this looks like an excellent genre to consider. The Drama genre is the one with the highest correlation with runtime, which suggests that shorter films are less well accepted by those who love drama films. Note that `df2` was built with `.head()`, so these coefficients come from the first five films only and should be confirmed on the full dataset before drawing firm conclusions.
Thank you!