By Niharika P, Tue 26 May 2020, in category Recommendation-system
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
The entire Netflix notebook can be found here: Netflix Visualizations, Recommendations, EDA
There are a lot of movie recommendation engines around nowadays, but far fewer book recommendation engines. Here, a book recommendation system performing both content-based and collaborative filtering is developed to generate recommendations for users. This could essentially power a Netflix-style application, but for books.
The notebook contains visualizations, analysis, and both content-based and collaborative filtering recommendation systems built on the goodreads books dataset.
Upvote if you like the kernel! 😃
In this digital world, one form of entertainment remains constant and always will: books. Movies and TV shows produce a sense of instant gratification, releasing dopamine, the neurotransmitter associated with pleasure and reward. That not only makes us lazy but also reliant on these digital forms of entertainment. When you read a book, by contrast, you have to immerse yourself in it fully, and the entertainment is not instant. It keeps your mind running and facilitates your thinking. Read books!
books = pd.read_csv('/kaggle/input/goodbooks-10k/books.csv', error_bad_lines=False)
books.head()
books.shape
ratings = pd.read_csv('/kaggle/input/goodbooks-10k/ratings.csv')
ratings.head()
tags = pd.read_csv('/kaggle/input/goodbooks-10k/book_tags.csv')
tags.tail()
btags = pd.read_csv('/kaggle/input/goodbooks-10k/tags.csv')
btags.tail()
The data provided by the dataset is unclean, and the authors mention this clearly in the dataset description. If you wish to skip the cleaning process, head over to the clean data linked in the description of the goodbooks dataset. If you like a good challenge, use this one!
ratings=ratings.sort_values("user_id")
ratings.shape
ratings.drop_duplicates(subset=["user_id", "book_id"], keep=False, inplace=True)
ratings.shape
Therefore, 4487 duplicate rows were present in the ratings data and have been removed.
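If you want to verify that figure yourself, here is a small check of my own (not part of the original pipeline, and it has to be run before the drop_duplicates call above):
dup_mask = ratings.duplicated(subset=["user_id", "book_id"], keep=False)
# keep=False flags every row belonging to a duplicate (user_id, book_id)
# pair, so the sum matches what drop_duplicates(keep=False) removes.
print(dup_mask.sum())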
Let's check the books dataset as well.
print(books.shape)
books.drop_duplicates(subset='original_title',keep=False,inplace=True)
print(books.shape)
849 rows removed.
print(btags.shape)
btags.drop_duplicates(subset='tag_id',keep=False,inplace=True)
print(btags.shape)
Cool, so there are no duplicate tag IDs in the tags lookup table (btags).
print(tags.shape)
tags.drop_duplicates(subset=['tag_id','goodreads_book_id'],keep=False,inplace=True)
print(tags.shape)
joint_tags = pd.merge(tags, btags, on='tag_id', how='inner')
Top 10 rated books
top_rated = books.sort_values('average_rating', ascending=False)
top10 = top_rated.head(10)
f = ['title', 'small_image_url']
displ = top10[f]
displ.set_index('title', inplace=True)
from IPython.display import Image, HTML
def path_to_image_html(path):
    '''
    Convert an image URL into an '<img src="' + path + '"/>' HTML tag.
    Formatting adjustments to control the height, aspect ratio, size
    etc. can be added inside the tag.
    '''
    return '<img src="' + path + '"/>'
HTML(displ.to_html(escape=False ,formatters=dict(small_image_url=path_to_image_html),justify='center'))
Top 10 most popular books
pop10 = books.sort_values(by='ratings_count', ascending=False)
f = ['title', 'small_image_url']
pop10 = pop10.head(10)
pop10 = pop10[f]
pop10 = pop10.set_index('title')
HTML(pop10.to_html(escape=False ,formatters=dict(small_image_url=path_to_image_html),justify='center'))
Most Common Rating Values
plt.figure(figsize=(16,8))
sns.distplot(a=books['average_rating'], kde=True, color='r')
Therefore, the most common average rating lies somewhere between 3.5 and 4.
no_of_ratings_per_book=ratings.groupby('book_id').count()
no_of_ratings_per_book
plt.figure(figsize=(16,8))
sns.distplot(a=no_of_ratings_per_book['rating'], color='g')
It is seen that most books have more than 80 ratings each. That is quite an audience.
Highly rated authors
books.head(2)
f=['authors', 'average_rating']
top_authors=top_rated[f]
top_authors=top_authors.head(20)
fig = px.bar(top_authors, x='authors', y='average_rating', color ='average_rating')
fig.show()
The bar plot above shows the top-rated authors. Bill Watterson is at the top with a whopping rating of 4.82!
Finding popular genres and the books available for them.
Tags are added by users, and we don't have any keywords to classify the books into genres, so I have hard-coded a list of genres and checked whether the tags contain those values. Credit for this approach goes to @philispp on Kaggle.
p=joint_tags.groupby('tag_name').count()
p=p.sort_values(by='count', ascending=False)
p
Hardcoding some basic genres
genres=["Art", "Biography", "Business", "Chick Lit", "Children's", "Christian", "Classics", "Comics", "Contemporary", "Cookbooks", "Crime", "Ebooks", "Fantasy", "Fiction", "Gay and Lesbian", "Graphic Novels", "Historical Fiction", "History", "Horror", "Humor and Comedy", "Manga", "Memoir", "Music", "Mystery", "Nonfiction", "Paranormal", "Philosophy", "Poetry", "Psychology", "Religion", "Romance", "Science", "Science Fiction", "Self Help", "Suspense", "Spirituality", "Sports", "Thriller", "Travel", "Young Adult"]
for i in range(len(genres)):
    genres[i] = genres[i].lower()
new_tags=p[p.index.isin(genres)]
import plotly.graph_objects as go
fig = go.Figure(go.Bar(
    x=new_tags['count'],
    y=new_tags.index,
    orientation='h'))
fig.show()
There's a lot of fiction present, but not a lot of cookbooks! Makes sense.
Relation between the number of editions and ratings
The books_count column is available in the dataset, and even though the dataset description says it is the number of editions, I was unsure what that really meant. After some research, it became clear that it counts all editions, translations and formats available for a book (Kindle, paperback, hard copy etc.).
fig = px.line(books, y="books_count", x="average_rating", title='Book Count VS Average Rating')
fig.show()
Interestingly, average_rating tends to rise with the number of editions at first, but falls off once the count passes roughly 2500: a very large number of editions does not translate into a higher average rating.
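To see that non-monotonic trend more directly than the raw line chart allows, one option (a quick check of my own, with arbitrarily chosen bin edges) is to bin the edition counts and compare mean ratings per bin:
# Bin books by edition count and compare the mean rating per bin.
# Bin edges are arbitrary; adjust them to taste.
count_bins = pd.cut(books['books_count'], bins=[0, 50, 200, 1000, 4000])
print(books.groupby(count_bins)['average_rating'].mean())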
dropna = books.dropna()
fig = px.treemap(dropna, path=['original_publication_year','language_code', "average_rating"],
color='average_rating')
fig.show()
Thus, a lot of books were published in the year 2011, and most of them are in English.
Do readers prefer short titles or long titles?
books['length-title']=books['original_title'].str.len()
plt.figure(figsize=(16,8))
sns.regplot(x=books['length-title'], y=books['average_rating'])
So, the highly rated books have rather short titles. The regression line is roughly flat, which very approximately suggests that as the length of the title increases, the rating stays constant at around 4.
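To back up that flat-line reading with a number (again a check of my own, not in the original notebook), the linear correlation can be computed:
# A correlation near zero supports the "roughly flat" regression line.
# dropna() guards against missing titles, whose lengths are NaN.
print(books[['length-title', 'average_rating']].dropna().corr())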
Word Cloud for tags used by readers.
from wordcloud import WordCloud, STOPWORDS
text = new_tags.index.values
wordcloud = WordCloud().generate(str(text))
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
A recommendation engine filters the data using different algorithms and recommends the most relevant items to users. There are two main types of recommendation systems: content-based filtering and collaborative filtering.
We start with content-based filtering, based on the following features: original_title, authors and average_rating.
fillnabooks= books.fillna('')
Cleaning the data: lower-casing all the words and removing spaces.
def clean_data(x):
    return str.lower(x.replace(" ", ""))
features=['original_title','authors','average_rating']
fillednabooks=fillnabooks[features]
fillednabooks = fillednabooks.astype(str)
fillednabooks.dtypes
for feature in features:
    fillednabooks[feature] = fillednabooks[feature].apply(clean_data)
fillednabooks.head(2)
Creating a "soup" or a "bag of words" for all rows.
def create_soup(x):
    return x['original_title'] + ' ' + x['authors'] + ' ' + x['average_rating']
fillednabooks['soup'] = fillednabooks.apply(create_soup, axis=1)
Importing count vectorizer for term frequencies.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(fillednabooks['soup'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
fillednabooks=fillednabooks.reset_index()
indices = pd.Series(fillednabooks.index, index=fillednabooks['original_title'])
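To make concrete what the count matrix and cosine similarity are doing, here is a tiny standalone sketch on made-up soups (toy strings of my own, not rows from the dataset; CountVectorizer and cosine_similarity were imported above):
# Three toy "soups": the first two share the author token, the third
# shares nothing, so its similarity to the others is 0.
toy = ['thehobbit jrrtolkien', 'thesilmarillion jrrtolkien', 'gonegirl gillianflynn']
toy_matrix = CountVectorizer().fit_transform(toy)
print(cosine_similarity(toy_matrix))  # 3x3 matrix of pairwise similarities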
def get_recommendations_new(title, cosine_sim=cosine_sim2):
    # Clean the input title the same way the data was cleaned
    title = title.replace(' ', '').lower()
    idx = indices[title]
    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar books (skipping the book itself)
    sim_scores = sim_scores[1:11]
    # Get the book indices
    book_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar books
    return list(books['original_title'].iloc[book_indices])
l=get_recommendations_new('The Hobbit', cosine_sim2)
fig = go.Figure(data=[go.Table(header=dict(values=l, fill_color='orange'))])
fig.show()
l=get_recommendations_new('Harry Potter and The Chamber of Secrets', cosine_sim2)
fig = go.Figure(data=[go.Table(header=dict(values=l, fill_color='orange'))])
fig.show()
While I was learning about collaborative recommendation systems, I noticed that a lot of kernels here on Kaggle are really just content-based recommendation systems, even though they are titled as collaborative.
To explain collaborative filtering in simple words, consider two users, User A and User B. They are considered similar users because they often bought the same or similar books in the past. Now, User A buys the Deep Learning and Neural Networks books. When User B browses for books, he will therefore be recommended Deep Learning and Neural Networks, because User A (with whom User B shares interests) bought those.
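As a minimal sketch of that idea (toy numbers of my own, not drawn from the dataset), user similarity can be measured with cosine similarity over the users' rating vectors:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users A, B, C; columns are five books; 0 means "not rated".
toy_ratings = np.array([
    [5, 4, 0, 0, 3],   # User A
    [5, 5, 0, 0, 4],   # User B: rated the same books, with similar scores
    [0, 0, 4, 5, 0],   # User C: completely different taste
])
print(cosine_similarity(toy_ratings))
# A and B come out highly similar, so books A liked can be suggested to B.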
This data is very messy, so dropping null values is crucial.
usecols=['book_id', 'original_title']
books_col=books[usecols]
books_col = books_col.dropna()
Creating Compressed sparse row matrix
from scipy.sparse import csr_matrix
# pivot ratings into book features
df_book_features = ratings.pivot(index='book_id',columns='user_id',values='rating').fillna(0)
mat_book_features = csr_matrix(df_book_features.values)
df_book_features.head()
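A CSR matrix stores only the non-zero entries, which is exactly what we want here since most user-book pairs are unrated. A quick check of how sparse the matrix is (an addition of mine, not from the original notebook):
# Fraction of cells holding an actual rating; the rest are fillna(0)
# placeholders that the CSR format does not need to store.
n_rows, n_cols = mat_book_features.shape
density = mat_book_features.nnz / (n_rows * n_cols)
print('matrix density: {:.4%}'.format(density))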
Here, the k-nearest-neighbours algorithm is used to find, for a given book, the books at the smallest cosine distance from it.
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
num_users = len(ratings.user_id.unique())
num_items = len(ratings.book_id.unique())
print('There are {} unique users and {} unique books in this data set'.format(num_users, num_items))
ratings=ratings.dropna()
df_ratings_cnt_tmp = pd.DataFrame(ratings.groupby('rating').size(), columns=['count'])
df_ratings_cnt_tmp.head(10)
total_cnt = num_users * num_items
rating_zero_cnt = total_cnt - ratings.shape[0]
df_ratings_cnt = df_ratings_cnt_tmp.append(
    pd.DataFrame({'count': rating_zero_cnt}, index=[0.0]),
    verify_integrity=True,
).sort_index()
df_ratings_cnt
After counting all the ratings, it is observed that a huge number of user-book pairs are unrated (counted here as rating 0). These need to go!
df_ratings_cnt['log_count'] = np.log(df_ratings_cnt['count'])
df_ratings_cnt
%matplotlib inline
ax = df_ratings_cnt[['count']].reset_index().rename(columns={'index': 'rating score'}).plot(
x='rating score',
y='count',
kind='bar',
figsize=(12, 8),
title='Count for Each Rating Score (in Log Scale)',
logy=True,
fontsize=12,color='black'
)
ax.set_xlabel("book rating score")
ax.set_ylabel("number of ratings")
The graph clearly shows that a lot of the data is irrelevant and can be removed.
df_books_cnt = pd.DataFrame(ratings.groupby('book_id').size(), columns=['count'])
df_books_cnt.head()
# keep only books that have been rated at least 60 times, so each book has enough user reactions
popularity_thres = 60
popular_books = list(set(df_books_cnt.query('count >= @popularity_thres').index))
df_ratings_drop = ratings[ratings.book_id.isin(popular_books)]
print('shape of original ratings data: ', ratings.shape)
print('shape of ratings data after dropping unpopular books: ', df_ratings_drop.shape)
# get number of ratings given by every user
df_users_cnt = pd.DataFrame(df_ratings_drop.groupby('user_id').size(), columns=['count'])
df_users_cnt.head()
Dropping users who have rated fewer than 50 times.
ratings_thres = 50
active_users = list(set(df_users_cnt.query('count >= @ratings_thres').index))
df_ratings_drop_users = df_ratings_drop[df_ratings_drop.user_id.isin(active_users)]
print('shape of original ratings data: ', ratings.shape)
print('shape of ratings data after dropping both unpopular books and inactive users: ', df_ratings_drop_users.shape)
book_user_mat = df_ratings_drop_users.pivot(index='book_id', columns='user_id', values='rating').fillna(0)
book_user_mat
book_user_mat_sparse = csr_matrix(book_user_mat.values)
book_user_mat_sparse
Fitting the model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# fit
model_knn.fit(book_user_mat_sparse)
Using fuzzy string matching to map the user's input title to the closest matching title in the data.
from fuzzywuzzy import fuzz
def fuzzy_matching(mapper, fav_book, verbose=True):
    """
    Return the index of the closest matching title via fuzzy ratio.
    Parameters
    ----------
    mapper: pandas Series, maps book title to the index of the book in the data
    fav_book: str, name of the user's input book
    verbose: bool, print log if True
    Return
    ------
    index of the closest match
    """
    match_tuple = []
    # collect every title whose fuzzy ratio against the input is high enough
    for title, idx in mapper.items():
        ratio = fuzz.ratio(title.lower(), fav_book.lower())
        if ratio >= 60:
            match_tuple.append((title, idx, ratio))
    # sort by ratio, best match first
    match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
    if not match_tuple:
        print('Oops! No match is found')
        return
    if verbose:
        print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
    return match_tuple[0][1]
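For intuition about the threshold of 60 used above, fuzz.ratio scores string similarity on a 0-100 scale; a quick illustration with example strings of my own:
# fuzz.ratio returns an integer similarity score between 0 and 100.
print(fuzz.ratio('the hobbit', 'the hobbit'))  # 100: exact match
print(fuzz.ratio('the hobit', 'the hobbit'))   # high: a single typo
print(fuzz.ratio('gone girl', 'the hobbit'))   # low: unrelated titles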
Writing the recommendation function.
def make_recommendation(model_knn, data, mapper, fav_book, n_recommendations):
    """
    Return top n similar book recommendations based on the user's input book.
    Parameters
    ----------
    model_knn: sklearn model, knn model
    data: book-user matrix
    mapper: pandas Series, maps book title to the index of the book in the data
    fav_book: str, name of the user's input book
    n_recommendations: int, top n recommendations
    Return
    ------
    list of top n similar book recommendations
    """
    # fit
    model_knn.fit(data)
    # get the index of the input book
    print('You have input book:', fav_book)
    idx = fuzzy_matching(mapper, fav_book, verbose=True)
    print('Recommendation system starting to make inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations + 1)
    # sort neighbours by ascending distance, then reverse and drop the
    # closest match (the input book itself) via the [:0:-1] slice
    raw_recommends = sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper (index -> title)
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations, skipping indices with no known title
    print('Recommendations for {}:'.format(fav_book))
    rec = []
    for i, (idx, dist) in enumerate(raw_recommends):
        if idx not in reverse_mapper.keys():
            continue
        print('{0}: {1}, with distance of {2}'.format(i + 1, reverse_mapper[idx], dist))
        rec.append(reverse_mapper[idx])
    return rec
Time to check!
my_favorite = 'To Kill a Mockingbird'
indices = pd.Series(books_col.index, index=books_col['original_title'])
make_recommendation(
model_knn=model_knn,
data=book_user_mat_sparse,
fav_book=my_favorite,
mapper=indices,
n_recommendations=10)
make_recommendation(
model_knn=model_knn,
data=book_user_mat_sparse,
fav_book='Harry Potter and the Chamber of Secrets',
mapper=indices,
n_recommendations=10)
rec=make_recommendation(
model_knn=model_knn,
data=book_user_mat_sparse,
fav_book='Gone Girl',
mapper=indices,
n_recommendations=10)
rec=make_recommendation(
model_knn=model_knn,
data=book_user_mat_sparse,
fav_book='Divergent',
mapper=indices,
n_recommendations=10)
rec=make_recommendation(
model_knn=model_knn,
data=book_user_mat_sparse,
fav_book='Kafka on the Shore',
mapper=indices,
n_recommendations=10)
The above is a case where the recommender could not find some of the neighbouring titles in the data: they may have been dropped during cleaning because of null values. In such cases the recommender skips that book and moves on, which is why there are only 8 recommendations for 'Kafka on the Shore'.
netflix=pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
netflix.head()
netflix.shape
books['original_title']=books['original_title'].str.lower()
netflix['title']=netflix['title'].str.lower()
t=netflix.merge(books, left_on='title', right_on='original_title', how="inner")
t.shape
193 out of 6234 Netflix shows are based on books.
labels = ['Shows from books','Shows not from books']
values = [193, 6234 - 193]  # shows based on books vs. the rest
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()
Thus, the model looks and works perfectly well.
Both content-based and collaborative filtering recommendation systems have been implemented. I am also building a plotly-dash interface for them, which will be available in the next post. :)