Netflix Content Based Recommendation and EDA

By Niharika P, Mon 11 May 2020, in category Eda

netflix firsts python eda visuals

Netflix Visualization and Content-Based Recommendation system.

Netflix is an application that keeps growing up the graph with its popularity, shows and content. This is an EDA or a story telling through its data along with a content-based recommendation system.

netflix gif

Please Upvote if you like the notebook and share possible improvements in the comments.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 



# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

#import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
 #   for filename in filenames:
  #      print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)import seaborn as sns
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt

Loading the dataset

In [3]:
netflix_overall=pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")
netflix_overall.head()
Out[3]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...

Therefore, it is clear that the dataset contains 13 columns for exploratory analysis.

In [4]:
netflix_overall.count()
Out[4]:
show_id         6234
type            6234
title           6234
director        4265
cast            5664
country         5758
date_added      6223
release_year    6234
rating          6224
duration        6234
listed_in       6234
description     6234
dtype: int64
In [5]:
netflix_shows=netflix_overall[netflix_overall['type']=='TV Show']
In [6]:
netflix_movies=netflix_overall[netflix_overall['type']=='Movie']

Analysis of Movies vs TV Shows.

In [7]:
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=netflix_overall, palette="Set2")

It is evident that there are more Movies on Netflix than TV shows.

If a producer wants to release some content, which month must he do so?( Month when least amount of content is added)

In [8]:
netflix_date = netflix_shows[['date_added']].dropna()
netflix_date['year'] = netflix_date['date_added'].apply(lambda x : x.split(', ')[-1])
netflix_date['month'] = netflix_date['date_added'].apply(lambda x : x.lstrip().split(' ')[0])

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'][::-1]
df = netflix_date.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(df, cmap='afmhot_r', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index, fontsize=7, fontfamily='serif')

plt.title('Netflix Contents Update', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()

cbar.ax.tick_params(labelsize=8) 
cbar.ax.minorticks_on()
plt.show()

If the latest year 2019 is considered, January and December were the months when comparatively much less content was released.Therefore, these months may be a good choice for the success of a new release!

Movie ratings analysis

In [9]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="rating", data=netflix_movies, palette="Set2", order=netflix_movies['rating'].value_counts().index[0:15])

The largest count of movies are made with the 'TV-MA' rating."TV-MA" is a rating assigned by the TV Parental Guidelines to a television program that was designed for mature audiences only.

Second largest is the 'TV-14' stands for content that may be inappropriate for children younger than 14 years of age.

Third largest is the very popular 'R' rating.An R-rated film is a film that has been assessed as having material which may be unsuitable for children under the age of 17 by the Motion Picture Association of America; the MPAA writes "Under 17 requires accompanying parent or adult guardian".

Analysing IMDB ratings to get top rated movies on Netflix

In [10]:
imdb_ratings=pd.read_csv('/kaggle/input/imdb-extensive-dataset/IMDb ratings.csv',usecols=['weighted_average_vote'])
imdb_titles=pd.read_csv('/kaggle/input/imdb-extensive-dataset/IMDb movies.csv', usecols=['title','year','genre'])
ratings = pd.DataFrame({'Title':imdb_titles.title,
                    'Release Year':imdb_titles.year,
                    'Rating': imdb_ratings.weighted_average_vote,
                    'Genre':imdb_titles.genre})
ratings.drop_duplicates(subset=['Title','Release Year','Rating'], inplace=True)
ratings.head()
Out[10]:
Title Release Year Rating Genre
0 The Story of the Kelly Gang 1906 6.1 Biography, Crime, Drama
1 Den sorte drøm 1911 5.9 Drama
2 Cleopatra 1912 5.2 Drama, History
3 L'Inferno 1911 7.0 Adventure, Drama, Fantasy
4 From the Manger to the Cross; or, Jesus of Naz... 1912 5.7 Biography, Drama

Performing inner join on the ratings dataset and netflix dataset to get the content that has both ratings on IMDB and are available on Netflix.

In [11]:
ratings.dropna()
joint_data=ratings.merge(netflix_overall,left_on='Title',right_on='title',how='inner')
joint_data=joint_data.sort_values(by='Rating', ascending=False)

Top rated 10 movies on Netflix are:

In [12]:
import plotly.express as px
top_rated=joint_data[0:10]
fig =px.sunburst(
    top_rated,
    path=['title','country'],
    values='Rating',
    color='Rating')
fig.show()

Countries with highest rated content.

In [13]:
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries
Out[13]:
country
United States 1063
India 619
United Kingdom 135
Canada 60
United Kingdom, United States 47
Spain 44
Turkey 41
France 40
Philippines 40
South Korea 38
Australia 35
In [14]:
import plotly.express as px
data = dict(
    number=[1063,619,135,60,44,41,40,40,38,35],
    country=["United States", "India", "United Kingdom", "Canada", "Spain",'Turkey','Philippines','France','South Korea','Australia'])
fig = px.funnel(data, x='number', y='country')
fig.show()

Year wise analysis

In [15]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=netflix_movies, palette="Set2", order=netflix_movies['release_year'].value_counts().index[0:15])

So, 2017 was the year when most of the movies were released.

In [16]:
countries={}
netflix_movies['country']=netflix_movies['country'].fillna('Unknown')
cou=list(netflix_movies['country'])
for i in cou:
    #print(i)
    i=list(i.split(','))
    if len(i)==1:
        if i in list(countries.keys()):
            countries[i]+=1
        else:
            countries[i[0]]=1
    else:
        for j in i:
            if j in list(countries.keys()):
                countries[j]+=1
            else:
                countries[j]=1
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [17]:
countries_fin={}
for country,no in countries.items():
    country=country.replace(' ','')
    if country in list(countries_fin.keys()):
        countries_fin[country]+=no
    else:
        countries_fin[country]=no
        
countries_fin={k: v for k, v in sorted(countries_fin.items(), key=lambda item: item[1], reverse= True)}

TOP 10 MOVIE CONTENT CREATING COUNTRIES

In [18]:
plt.figure(figsize=(8,8))
ax = sns.barplot(x=list(countries_fin.keys())[0:10],y=list(countries_fin.values())[0:10])
ax.set_xticklabels(list(countries_fin.keys())[0:10],rotation = 90)
Out[18]:
[Text(0, 0, 'UnitedStates'),
 Text(0, 0, 'France'),
 Text(0, 0, 'UnitedKingdom'),
 Text(0, 0, 'Canada'),
 Text(0, 0, 'Germany'),
 Text(0, 0, 'Belgium'),
 Text(0, 0, 'China'),
 Text(0, 0, 'Spain'),
 Text(0, 0, 'India'),
 Text(0, 0, 'Australia')]

Analysis of duration of movies

In [19]:
netflix_movies['duration']=netflix_movies['duration'].str.replace(' min','')
netflix_movies['duration']=netflix_movies['duration'].astype(str).astype(int)
netflix_movies['duration']
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[19]:
0        90
1        94
4        99
6       110
7        60
       ... 
5577     70
5578    102
5579     88
5580    109
6231     60
Name: duration, Length: 4265, dtype: int64
In [20]:
sns.set(style="darkgrid")
sns.kdeplot(data=netflix_movies['duration'], shade=True)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa73f441710>

So, a good amount of movies on Netflix are among the duration of 75-120 mins. It is acceptable considering the fact that a fair amount of the audience cannot watch a 3 hour movie in one sitting. Can you? :p

In [21]:
from collections import Counter

genres=list(netflix_movies['listed_in'])
gen=[]

for i in genres:
    i=list(i.split(','))
    for j in i:
        gen.append(j.replace(' ',""))
g=Counter(gen)

WordCloud for Genres.

In [22]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

text = list(set(gen))
plt.rcParams['figure.figsize'] = (13, 13)
wordcloud = WordCloud(max_font_size=50, max_words=100,background_color="white").generate(str(text))

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

Lollipop plot of Genres vs their count on Netflix

In [23]:
g={k: v for k, v in sorted(g.items(), key=lambda item: item[1], reverse= True)}


fig, ax = plt.subplots()

fig = plt.figure(figsize = (10, 10))
x=list(g.keys())
y=list(g.values())
ax.vlines(x, ymin=0, ymax=y, color='green')
ax.plot(x,y, "o", color='maroon')
ax.set_xticklabels(x, rotation = 90)
ax.set_ylabel("Count of movies")
# set a title
ax.set_title("Genres");
<Figure size 720x720 with 0 Axes>

Therefore, it is clear that international movies, dramas and comedies are the top three genres that have the highest amount of content on Netflix.

Analysis of TV SERIES on Netflix

In [24]:
countries1={}
netflix_shows['country']=netflix_shows['country'].fillna('Unknown')
cou1=list(netflix_shows['country'])
for i in cou1:
    #print(i)
    i=list(i.split(','))
    if len(i)==1:
        if i in list(countries1.keys()):
            countries1[i]+=1
        else:
            countries1[i[0]]=1
    else:
        for j in i:
            if j in list(countries1.keys()):
                countries1[j]+=1
            else:
                countries1[j]=1
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [25]:
countries_fin1={}
for country,no in countries1.items():
    country=country.replace(' ','')
    if country in list(countries_fin1.keys()):
        countries_fin1[country]+=no
    else:
        countries_fin1[country]=no
        
countries_fin1={k: v for k, v in sorted(countries_fin1.items(), key=lambda item: item[1], reverse= True)}

Most content creating countries

In [26]:
# Set the width and height of the figure
plt.figure(figsize=(15,15))

# Add title
plt.title("Content creating countries")

# Bar chart showing average arrival delay for Spirit Airlines flights by month
sns.barplot(y=list(countries_fin1.keys()), x=list(countries_fin1.values()))

# Add label for vertical axis
plt.ylabel("Arrival delay (in minutes)")
Out[26]:
Text(0, 0.5, 'Arrival delay (in minutes)')

Naturally, United States has the most content that is created on netflix in the tv series category.

In [27]:
features=['title','duration']
durations= netflix_shows[features]

durations['no_of_seasons']=durations['duration'].str.replace(' Season','')

#durations['no_of_seasons']=durations['no_of_seasons'].astype(str).astype(int)
durations['no_of_seasons']=durations['no_of_seasons'].str.replace('s','')
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [28]:
durations['no_of_seasons']=durations['no_of_seasons'].astype(str).astype(int)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

TV shows with largest number of seasons

In [29]:
t=['title','no_of_seasons']
top=durations[t]

top=top.sort_values(by='no_of_seasons', ascending=False)
In [30]:
top20=top[0:20]
top20.plot(kind='bar',x='title',y='no_of_seasons', color='red')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa7508177d0>

Thus, NCIS, Grey's Anatomy and Supernatural are amongst the tv series that have highest number of seasons.

Lowest number of seasons.

In [31]:
bottom=top.sort_values(by='no_of_seasons')
bottom=bottom[20:50]

import plotly.graph_objects as go

fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'No of seasons']),
                 cells=dict(values=[bottom['title'],bottom['no_of_seasons']],fill_color='lavender'))
                     ])
fig.show()

These are some binge-worthy shows that are short and have only one season.

In [32]:
genres=list(netflix_shows['listed_in'])
gen=[]

for i in genres:
    i=list(i.split(','))
    for j in i:
        gen.append(j.replace(' ',""))
g=Counter(gen)

Word Cloud for Genres

In [33]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

text = list(set(gen))

wordcloud = WordCloud(max_font_size=50, max_words=100,background_color="black").generate(str(text))
plt.rcParams['figure.figsize'] = (13, 13)
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()
In [34]:
us_series_data=netflix_shows[netflix_shows['country']=='United States']
In [35]:
oldest_us_series=us_series_data.sort_values(by='release_year')[0:20]
In [36]:
fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'Release Year']),
                 cells=dict(values=[oldest_us_series['title'],oldest_us_series['release_year']]))
                     ])
fig.show()

Above table shows the oldest US tv shows on Netflix.

In [37]:
newest_us_series=us_series_data.sort_values(by='release_year', ascending=False)[0:50]
In [38]:
fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'Release Year']),
                 cells=dict(values=[newest_us_series['title'],newest_us_series['release_year']]))
                     ])
fig.show()

The above are latest released US television shows!

Content in France

In [39]:
netflix_fr=netflix_overall[netflix_overall['country']=='France']
nannef=netflix_fr.dropna()
import plotly.express as px
fig = px.treemap(nannef, path=['country','director'],
                  color='director', hover_data=['director','title'],color_continuous_scale='Purples')
fig.show()
In [40]:
newest_fr_series=netflix_fr.sort_values(by='release_year', ascending=False)[0:20]
In [41]:
newest_fr_series
Out[41]:
show_id type title director cast country date_added release_year rating duration listed_in description
3472 81074060 TV Show Until Dawn NaN Ahmed Sylla, Alban Ivanov, Ornella Fleury, Nat... France January 10, 2020 2020 TV-MA 1 Season International TV Shows, Reality TV, TV Comedies France’s funniest comics carry out ghastly tas...
39 80178151 TV Show The Spy NaN Sacha Baron Cohen, Noah Emmerich, Hadar Ratzon... France September 6, 2019 2019 TV-MA 1 Season International TV Shows, TV Dramas, TV Thrillers In the 1960s, Israeli clerk-turned-secret agen...
1014 81079723 Movie Paradise Beach Xavier Durringer Sami Bouajila, Tewfik Jallab, Mélanie Doutey, ... France November 8, 2019 2019 TV-MA 94 min Action & Adventure, Dramas, International Movies Mehdi gets out of prison, planning to settle o...
1791 81012340 Movie Shéhérazade Jean-Bernard Marlin Dylan Robert, Kenza Fortas, Idir Azougli, Lisa... France May 11, 2019 2019 TV-MA 111 min Dramas, Independent Movies, International Movies Fresh out of juvenile detention in Marseille, ...
2516 81010818 TV Show Family Business NaN Jonathan Cohen, Gérard Darmon, Julia Piaton, L... France June 28, 2019 2019 TV-MA 1 Season International TV Shows, TV Comedies, TV Dramas After learning France is about to legalize pot...
2587 81027187 Movie The Wolf's Call Antonin Baudry François Civil, Omar Sy, Mathieu Kassovitz, Re... France June 20, 2019 2019 TV-14 116 min Dramas, International Movies, Thrillers With nuclear war looming, a military expert in...
1309 81096745 Movie Fadily Camara : La plus drôle de tes copines Gautier & Leduc Fadily Camara France November 14, 2019 2019 TV-MA 54 min Stand-Up Comedy Irrepressible French comedian Fadily Camara we...
3858 81027190 Movie Paris Is Us Elisabeth Vogler Noémie Schmidt, Grégoire Isvarine, Marie Motte... France February 22, 2019 2019 TV-MA 84 min Dramas, Independent Movies, International Movies Amid a turbulent romance and rising tensions i...
1192 80222157 TV Show Who Killed Little Gregory? NaN NaN France November 20, 2019 2019 TV-MA 1 Season Crime TV Shows, Docuseries, International TV S... When their 4-year-old son is murdered, a young...
1152 80241539 TV Show Mortel NaN Carl Malapa, Nemo Schiffman, Manon Bresch, Cor... France November 21, 2019 2019 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Dramas After making a deal with a supernatural figure...
1135 80198859 Movie Brother Julien Abraham MHD, Darren Muselet, Aïssa Maïga, Jalil Lesper... France November 22, 2019 2019 TV-MA 97 min Dramas, Independent Movies, International Movies Thrust from a violent home into a brutal custo...
1092 81018979 TV Show Mythomaniac NaN Marina Hands, Mathieu Demy, Marie Drion, Jérém... France November 28, 2019 2019 TV-MA 1 Season International TV Shows, TV Comedies, TV Dramas Burned out and taken for granted, a working mo...
1945 81024044 Movie Lady J Emmanuel Mouret Cécile De France, Edouard Baer, Alice Isaaz, N... France March 8, 2019 2019 TV-14 111 min Dramas, Independent Movies, International Movies When her romance with a lustful marquis takes ...
4339 80213020 TV Show The Bonfire of Destiny NaN Audrey Fleurot, Julie de Bona, Camille Lou, Gi... France December 26, 2019 2019 TV-MA 1 Season International TV Shows, TV Dramas After a devastating fire in 1897 Paris, three ...
1079 81120982 Movie I Lost My Body Jérémy Clapin Hakim Faris, Victoire Du Bois, Patrick d'Assum... France November 29, 2019 2019 TV-MA 81 min Dramas, Independent Movies, International Movies Romance, mystery and adventure intertwine as a...
4440 80989919 TV Show Twice Upon A Time NaN Gaspard Ulliel, Freya Mavor France December 19, 2019 2019 TV-MA 1 Season International TV Shows, Romantic TV Shows, TV ... Months after a crushing breakup, a man receive...
534 81061828 TV Show Nailed It! France NaN Artus, Noémie Honiat France October 25, 2019 2019 TV-14 1 Season International TV Shows, Reality TV On this fun and funny competition show, home b...
272 80217779 TV Show Marianne NaN Victoire Du Bois, Lucie Boujenah, Tiphaine Dav... France September 13, 2019 2019 TV-MA 1 Season International TV Shows, TV Dramas, TV Horror Lured back to her hometown, a famous horror wr...
732 81034012 Movie Street Flow Leïla Sy, Kery James Kery James, Jammeh Diangana, Chloé Jouannet, B... France October 13, 2019 2019 TV-MA 96 min Dramas, International Movies Three brothers – a gangster, a scholar and an ...
5692 80190086 TV Show The Hook Up Plan NaN Marc Ruchmann, Zita Hanrot, Sabrina Ouazani, J... France October 11, 2019 2019 TV-14 2 Seasons International TV Shows, Romantic TV Shows, TV ... When Parisian Elsa gets hung up on her ex, her...
In [42]:
fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'Release Year']),
                 cells=dict(values=[newest_fr_series['title'],newest_fr_series['release_year']]))
                     ])
fig.show()

Recommendation System (Content Based)

The TF-IDF(Term Frequency-Inverse Document Frequency (TF-IDF) ) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that occur frequently in plot overviews and therefore, their significance in computing the final similarity score.

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [44]:
#removing stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
netflix_overall['description'] = netflix_overall['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(netflix_overall['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape
Out[44]:
(6234, 16151)

There are about 16151 words described for the 6234 movies in this dataset.

Here, The Cosine similarity score is used since it is independent of magnitude and is relatively easy and fast to calculate.

In [45]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
In [46]:
indices = pd.Series(netflix_overall.index, index=netflix_overall['title']).drop_duplicates()
In [47]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return netflix_overall['title'].iloc[movie_indices]

This recommendation is just based on the Plot.

In [48]:
get_recommendations('Peaky Blinders')
Out[48]:
296                     Our Godfather
4491                              Don
2015                         The Fear
4852    Jonathan Strange & Mr Norrell
1231                       The Prison
3737                Power Rangers Zeo
5986                       The Tudors
1753      Once Upon a Time in Mumbaai
5494     The Legend of Michael Mishra
1142                  Shelby American
Name: title, dtype: object
In [49]:
get_recommendations('Mortel')
Out[49]:
3016                  PILI Fantasy: War of Dragons
5688         Edgar Rice Burroughs' Tarzan and Jane
4401                             Figures of Speech
2001                                     FirstBorn
3310    My Entire High School Sinking Into the Sea
5307                                 Psychokinesis
4332                                At First Light
3955                          The Umbrella Academy
1761                                     Chamatkar
5421                              Maharakshak Devi
Name: title, dtype: object

It is seen that the model performs well, but is not very accurate.Therefore, more metrics are added to the model to improve performance.

* Content based filtering on multiple metrics

Content based filtering on the following factors:

  • Title
  • Cast
  • Director
  • Listed in
  • Plot

Filling null values with empty string.

In [50]:
filledna=netflix_overall.fillna('')
filledna.head(2)
Out[50]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...

Cleaning the data - making all the words lower case

In [51]:
def clean_data(x):
        return str.lower(x.replace(" ", ""))

Identifying features on which the model is to be filtered.

In [52]:
features=['title','director','cast','listed_in','description']
filledna=filledna[features]
In [53]:
for feature in features:
    filledna[feature] = filledna[feature].apply(clean_data)
    
filledna.head(2)
Out[53]:
title director cast listed_in description
0 normofthenorth:kingsizedadventure richardfinn,timmaltby alanmarriott,andrewtoth,briandobson,colehoward... children&familymovies,comedies beforeplanninganawesomeweddingforhisgrandfathe...
1 jandino:whateverittakes jandinoasporaat stand-upcomedy jandinoasporaatriffsonthechallengesofraisingki...

Creating a "soup" or a "bag of words" for all rows.

In [54]:
def create_soup(x):
    return x['title']+ ' ' + x['director'] + ' ' + x['cast'] + ' ' +x['listed_in']+' '+ x['description']
In [55]:
filledna['soup'] = filledna.apply(create_soup, axis=1)

From here on, the code is basically similar to the upper model except the fact that count vectorizer is used instead of tfidf.

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(filledna['soup'])

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
In [57]:
filledna=filledna.reset_index()
indices = pd.Series(filledna.index, index=filledna['title'])
In [58]:
def get_recommendations_new(title, cosine_sim=cosine_sim):
    title=title.replace(' ','').lower()
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return netflix_overall['title'].iloc[movie_indices]
In [59]:
get_recommendations_new('PK', cosine_sim2)
Out[59]:
5054                           3 Idiots
5494       The Legend of Michael Mishra
3093                  Anthony Kaun Hai?
2196                             Haapus
691                               Sanju
4110                   Taare Zameen Par
1449                    Chance Pe Dance
2194                    Chal Dhar Pakad
1746    EMI: Liya Hai To Chukana Padega
4920                   Khosla Ka Ghosla
Name: title, dtype: object
In [60]:
get_recommendations_new('Peaky Blinders', cosine_sim2)
Out[60]:
3465                                   Giri / Haji
6050                   The Frankenstein Chronicles
2018                         The Murder Detectives
5529                                        Loaded
550                                      Bodyguard
2505                                 Kiss Me First
5859                                  Happy Valley
233     How to Live Mortgage Free with Sarah Beeny
522                          Terrorism Close Calls
1605                                Killer Ratings
Name: title, dtype: object
In [61]:
get_recommendations_new('The Hook Up Plan', cosine_sim2)
Out[61]:
2576                     Melodies of Life - Born This Way
5273                                       Dancing Angels
5708                                        Little Things
2210                                           Rishta.com
5441    Club Friday To Be Continued - My Beautiful Tomboy
1155                                          Oh My Ghost
625                                  Accidentally in Love
789                                       College Romance
2030                                  แผนร้ายนายเจ้าเล่ห์
2480                                    Bangkok Bachelors
Name: title, dtype: object

Hence, more accurate predictions are obtained.

Please upvote if you liked the kernel! 😀

counter free