By Niharika P, Mon 11 May 2020, in category EDA
Netflix is a platform whose popularity, catalog of shows and overall content keep growing. This notebook is an EDA, a story told through the data, along with a content-based recommendation system.
Please upvote if you like the notebook, and share possible improvements in the comments.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
netflix_overall=pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")
netflix_overall.head()
The preview shows that the dataset contains 12 columns (show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in and description) for exploratory analysis.
netflix_overall.count()
netflix_shows=netflix_overall[netflix_overall['type']=='TV Show']
netflix_movies=netflix_overall[netflix_overall['type']=='Movie']
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=netflix_overall, palette="Set2")
It is evident that there are more movies on Netflix than TV shows.
netflix_date = netflix_shows[['date_added']].dropna()
netflix_date['year'] = netflix_date['date_added'].apply(lambda x: x.split(', ')[-1])    # "August 14, 2020" -> "2020"
netflix_date['month'] = netflix_date['date_added'].apply(lambda x: x.lstrip().split(' ')[0])  # -> "August"
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'][::-1]
df = netflix_date.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(df, cmap='afmhot_r', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index, fontsize=7, fontfamily='serif')
plt.title('Netflix Contents Update', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()
cbar.ax.tick_params(labelsize=8)
cbar.ax.minorticks_on()
plt.show()
Considering the latest year, 2019, January and December were the months in which comparatively little content was added. These months may therefore be a good choice for a new release, since there is less competition!
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="rating", data=netflix_movies, palette="Set2", order=netflix_movies['rating'].value_counts().index[0:15])
The largest number of movies carry the 'TV-MA' rating, which the TV Parental Guidelines assign to programs designed for mature audiences only.
The second largest is 'TV-14', which flags content that may be inappropriate for children younger than 14 years of age.
The third largest is the familiar 'R' rating, which the Motion Picture Association of America assigns to films with material that may be unsuitable for children under 17; the MPAA writes "Under 17 requires accompanying parent or adult guardian".
imdb_ratings=pd.read_csv('/kaggle/input/imdb-extensive-dataset/IMDb ratings.csv',usecols=['weighted_average_vote'])
imdb_titles=pd.read_csv('/kaggle/input/imdb-extensive-dataset/IMDb movies.csv', usecols=['title','year','genre'])
ratings = pd.DataFrame({'Title': imdb_titles.title,
                        'Release Year': imdb_titles.year,
                        'Rating': imdb_ratings.weighted_average_vote,
                        'Genre': imdb_titles.genre})
ratings.drop_duplicates(subset=['Title','Release Year','Rating'], inplace=True)
ratings.head()
An inner join of the ratings dataset and the Netflix dataset keeps only the titles that both have an IMDb rating and are available on Netflix.
ratings.dropna(inplace=True)  # drop rows with missing values
joint_data=ratings.merge(netflix_overall,left_on='Title',right_on='title',how='inner')
joint_data=joint_data.sort_values(by='Rating', ascending=False)
The top 10 highest-rated movies on Netflix are:
import plotly.express as px
top_rated=joint_data[0:10]
fig = px.sunburst(
    top_rated,
    path=['title', 'country'],
    values='Rating',
    color='Rating')
fig.show()
Countries with the highest-rated content:
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries
import plotly.express as px
data = dict(
    number=[1063, 619, 135, 60, 44, 41, 40, 40, 38, 35],  # counts taken from the topcountries table above
    country=["United States", "India", "United Kingdom", "Canada", "Spain", 'Turkey', 'Philippines', 'France', 'South Korea', 'Australia'])
fig = px.funnel(data, x='number', y='country')
fig.show()
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=netflix_movies, palette="Set2", order=netflix_movies['release_year'].value_counts().index[0:15])
So 2017 is the year in which most of the movies in the catalog were released.
countries={}
netflix_movies['country']=netflix_movies['country'].fillna('Unknown')
cou=list(netflix_movies['country'])
for i in cou:
    # a title may list several comma-separated production countries
    for j in i.split(','):
        if j in countries:
            countries[j]+=1
        else:
            countries[j]=1
countries_fin={}
for country,no in countries.items():
    country=country.strip()  # strip stray spaces so ' United States' and 'United States' merge
    if country in countries_fin:
        countries_fin[country]+=no
    else:
        countries_fin[country]=no
countries_fin={k: v for k, v in sorted(countries_fin.items(), key=lambda item: item[1], reverse=True)}
plt.figure(figsize=(8,8))
ax = sns.barplot(x=list(countries_fin.keys())[0:10],y=list(countries_fin.values())[0:10])
ax.set_xticklabels(list(countries_fin.keys())[0:10],rotation = 90)
netflix_movies['duration']=netflix_movies['duration'].str.replace(' min','').astype(int)
netflix_movies['duration']
sns.set(style="darkgrid")
sns.kdeplot(data=netflix_movies['duration'], shade=True)
So a large share of movies on Netflix run between 75 and 120 minutes. That makes sense, considering that a fair amount of the audience cannot watch a 3-hour movie in one sitting. Can you? :p
from collections import Counter
genres=list(netflix_movies['listed_in'])
gen=[]
for i in genres:
    # a title may be listed under several comma-separated genres
    for j in i.split(','):
        gen.append(j.strip())
g=Counter(gen)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
text = list(set(gen))
plt.rcParams['figure.figsize'] = (13, 13)
wordcloud = WordCloud(max_font_size=50, max_words=100,background_color="white").generate(str(text))
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()
g={k: v for k, v in sorted(g.items(), key=lambda item: item[1], reverse= True)}
fig, ax = plt.subplots(figsize=(10, 10))  # create figure and axes together so the size is applied
x=list(g.keys())
y=list(g.values())
ax.vlines(x, ymin=0, ymax=y, color='green')
ax.plot(x,y, "o", color='maroon')
ax.set_xticklabels(x, rotation = 90)
ax.set_ylabel("Count of movies")
# set a title
ax.set_title("Genres");
Therefore, it is clear that international movies, dramas and comedies are the three genres with the most content on Netflix.
countries1={}
netflix_shows['country']=netflix_shows['country'].fillna('Unknown')
cou1=list(netflix_shows['country'])
for i in cou1:
    # again, a show may list several comma-separated countries
    for j in i.split(','):
        if j in countries1:
            countries1[j]+=1
        else:
            countries1[j]=1
countries_fin1={}
for country,no in countries1.items():
    country=country.strip()  # strip stray spaces so duplicate country names merge
    if country in countries_fin1:
        countries_fin1[country]+=no
    else:
        countries_fin1[country]=no
countries_fin1={k: v for k, v in sorted(countries_fin1.items(), key=lambda item: item[1], reverse=True)}
# Set the width and height of the figure
plt.figure(figsize=(15,15))
# Add title
plt.title("Content creating countries")
# Bar chart showing the number of TV shows produced by each country
sns.barplot(y=list(countries_fin1.keys()), x=list(countries_fin1.values()))
# Add label for the horizontal axis
plt.xlabel("Number of TV shows")
Naturally, the United States has created the most Netflix content in the TV series category.
features=['title','duration']
durations=netflix_shows[features].copy()  # copy to avoid modifying a slice of the original frame
# '1 Season' / '5 Seasons' -> '1' / '5'
durations['no_of_seasons']=durations['duration'].str.replace(' Season','').str.replace('s','')
durations['no_of_seasons']=durations['no_of_seasons'].astype(int)
t=['title','no_of_seasons']
top=durations[t]
top=top.sort_values(by='no_of_seasons', ascending=False)
top20=top[0:20]
top20.plot(kind='bar',x='title',y='no_of_seasons', color='red')
Thus NCIS, Grey's Anatomy and Supernatural are among the TV series with the highest number of seasons.
bottom=top.sort_values(by='no_of_seasons')
bottom=bottom[20:50]
import plotly.graph_objects as go
fig = go.Figure(data=[go.Table(
    header=dict(values=['Title', 'No of seasons']),
    cells=dict(values=[bottom['title'], bottom['no_of_seasons']], fill_color='lavender'))
])
fig.show()
These are some binge-worthy shows that are short and have only one season.
genres=list(netflix_shows['listed_in'])
gen=[]
for i in genres:
    for j in i.split(','):
        gen.append(j.strip())
g=Counter(gen)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
text = list(set(gen))
wordcloud = WordCloud(max_font_size=50, max_words=100,background_color="black").generate(str(text))
plt.rcParams['figure.figsize'] = (13, 13)
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()
us_series_data=netflix_shows[netflix_shows['country']=='United States']
oldest_us_series=us_series_data.sort_values(by='release_year')[0:20]
fig = go.Figure(data=[go.Table(
    header=dict(values=['Title', 'Release Year']),
    cells=dict(values=[oldest_us_series['title'], oldest_us_series['release_year']]))
])
fig.show()
The table above shows the oldest US TV shows on Netflix.
newest_us_series=us_series_data.sort_values(by='release_year', ascending=False)[0:50]
fig = go.Figure(data=[go.Table(
    header=dict(values=['Title', 'Release Year']),
    cells=dict(values=[newest_us_series['title'], newest_us_series['release_year']]))
])
fig.show()
These are the most recently released US television shows!
netflix_fr=netflix_overall[netflix_overall['country']=='France']
nannef=netflix_fr.dropna()
import plotly.express as px
fig = px.treemap(nannef, path=['country', 'director'],
                 color='director', hover_data=['director', 'title'],
                 color_continuous_scale='Purples')
fig.show()
newest_fr_series=netflix_fr.sort_values(by='release_year', ascending=False)[0:20]
newest_fr_series
fig = go.Figure(data=[go.Table(
    header=dict(values=['Title', 'Release Year']),
    cells=dict(values=[newest_fr_series['title'], newest_fr_series['release_year']]))
])
fig.show()
The TF-IDF (Term Frequency-Inverse Document Frequency) score is the frequency of a word in a document, down-weighted by the number of documents in which the word occurs. This reduces the importance of words that appear frequently across plot descriptions, and therefore their influence on the final similarity score.
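For reference, in scikit-learn's smoothed variant (the default for the TfidfVectorizer used below), the score of a term t in a document d is

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1,

with tf(t, d) the count of t in d, n the total number of documents and df(t) the number of documents containing t.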
from sklearn.feature_extraction.text import TfidfVectorizer
#removing stopwords
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
netflix_overall['description'] = netflix_overall['description'].fillna('')
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(netflix_overall['description'])
#Output the shape of tfidf_matrix
tfidf_matrix.shape
So roughly 16,151 distinct words are used to describe the 6,234 titles in this dataset.
Here the cosine similarity score is used, since it is independent of magnitude and is relatively easy and fast to calculate.
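For two TF-IDF row vectors u and v, the score is

cos(u, v) = (u · v) / (‖u‖ ‖v‖).

Because TfidfVectorizer L2-normalizes each row by default, the denominator is already 1, so the plain dot product computed by linear_kernel below is exactly the cosine similarity, just faster to compute.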
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(netflix_overall.index, index=netflix_overall['title']).drop_duplicates()
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies (index 0 is the title itself)
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return netflix_overall['title'].iloc[movie_indices]
This recommendation is based only on the plot description.
get_recommendations('Peaky Blinders')
get_recommendations('Mortel')
The model performs reasonably well, but is not very accurate. Therefore, more features are added to the model to improve performance.
Content-based filtering on the following features: title, director, cast, genre (listed_in) and plot description.
Filling null values with an empty string:
filledna=netflix_overall.fillna('')
filledna.head(2)
Cleaning the data: lower-casing all words and removing spaces, so that multi-word names become single tokens and, for example, two different directors who share a first name do not partially match.
def clean_data(x):
    return str.lower(x.replace(" ", ""))
Selecting the features on which the model is built:
features=['title','director','cast','listed_in','description']
filledna=filledna[features]
for feature in features:
    filledna[feature] = filledna[feature].apply(clean_data)
filledna.head(2)
Creating a "soup" or a "bag of words" for all rows.
def create_soup(x):
    return x['title'] + ' ' + x['director'] + ' ' + x['cast'] + ' ' + x['listed_in'] + ' ' + x['description']
filledna['soup'] = filledna.apply(create_soup, axis=1)
From here on, the code is essentially the same as for the first model, except that CountVectorizer is used instead of TfidfVectorizer.
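To see the difference on a tiny invented corpus (a minimal illustrative sketch, not part of the pipeline; the three documents below are made up):

# CountVectorizer keeps raw term counts, while TfidfVectorizer down-weights
# terms that occur in many of the documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["dark crime drama", "crime comedy", "romantic comedy drama"]
print(CountVectorizer().fit_transform(docs).toarray())   # integer counts
print(TfidfVectorizer().fit_transform(docs).toarray())   # L2-normalized TF-IDF weights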
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(filledna['soup'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
filledna=filledna.reset_index()
indices = pd.Series(filledna.index, index=filledna['title'])
def get_recommendations_new(title, cosine_sim=cosine_sim):
    title = title.replace(' ', '').lower()  # apply the same cleaning as the training data
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return netflix_overall['title'].iloc[movie_indices]
get_recommendations_new('PK', cosine_sim2)
get_recommendations_new('Peaky Blinders', cosine_sim2)
get_recommendations_new('The Hook Up Plan', cosine_sim2)
Hence, more relevant recommendations are obtained.
Please upvote if you liked the kernel! 😀