Book Recommendations and EDA

By Niharika P, Tue 26 May 2020, in category Recommendation-system

books data analysis python

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/goodbooks-10k/to_read.csv
/kaggle/input/goodbooks-10k/tags.csv
/kaggle/input/goodbooks-10k/book_tags.csv
/kaggle/input/goodbooks-10k/sample_book.xml
/kaggle/input/goodbooks-10k/ratings.csv
/kaggle/input/goodbooks-10k/books.csv
/kaggle/input/netflix-shows/netflix_titles.csv

The entire Netflix notebook can be found here : Netflix Visualizations, Recommendations, EDA

Problem Statement

There are many movie recommendation engines nowadays, but book recommendation engines are far less common. The goal is to develop a book recommendation system that performs both content-based and collaborative filtering to generate recommendations for users. It could essentially be used in an application like Netflix, but for books.

Books

The notebook contains visualizations, analysis, and content-based and collaborative filtering recommendation systems built on the Goodreads books dataset.

Upvote if you like the kernel! 😃

In this digital world, one form of entertainment remains constant and always will: books. Movies and TV shows produce a sense of instant gratification by releasing dopamine, a neurotransmitter associated with reward and pleasure. This not only makes us lazy but also reliant on these digital forms of entertainment. When you read a book, by contrast, you have to immerse yourself in it fully, and instant entertainment is not available. It keeps your mind running and facilitates your thinking. Read books!

Importing Data

In [2]:
# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False instead
books = pd.read_csv('/kaggle/input/goodbooks-10k/books.csv', on_bad_lines='skip')
In [3]:
books.head()
Out[3]:
id book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title ... ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
0 1 2767052 2767052 2792775 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games ... 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s...
1 2 3 3 4640799 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone ... 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s...
2 3 41865 41865 3212258 226 316015849 9.780316e+12 Stephenie Meyer 2005.0 Twilight ... 3866839 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m... https://images.gr-assets.com/books/1361039443s...
3 4 2657 2657 3275794 487 61120081 9.780061e+12 Harper Lee 1960.0 To Kill a Mockingbird ... 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m... https://images.gr-assets.com/books/1361975680s...
4 5 4671 4671 245494 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925.0 The Great Gatsby ... 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m... https://images.gr-assets.com/books/1490528560s...

5 rows × 23 columns

In [4]:
books.shape
Out[4]:
(10000, 23)
In [5]:
ratings = pd.read_csv('/kaggle/input/goodbooks-10k/ratings.csv')

ratings.head()
Out[5]:
book_id user_id rating
0 1 314 5
1 1 439 3
2 1 588 5
3 1 1169 4
4 1 1185 4
In [6]:
tags = pd.read_csv('/kaggle/input/goodbooks-10k/book_tags.csv')
tags.tail()
Out[6]:
goodreads_book_id tag_id count
999907 33288638 21303 7
999908 33288638 17271 7
999909 33288638 1126 7
999910 33288638 11478 7
999911 33288638 27939 7
In [7]:
btags = pd.read_csv('/kaggle/input/goodbooks-10k/tags.csv')
btags.tail()
Out[7]:
tag_id tag_name
34247 34247 Childrens
34248 34248 Favorites
34249 34249 Manga
34250 34250 SERIES
34251 34251 favourites

Cleaning the data, removing duplicates

The data provided by the dataset is unclean, and the dataset description says so clearly. If you wish to skip the cleaning process, head over to the clean data linked in the description of the goodbooks dataset. If you like a good challenge, use this one!

In [8]:
ratings=ratings.sort_values("user_id")
ratings.shape
Out[8]:
(981756, 3)
In [9]:
ratings.drop_duplicates(subset =["user_id","book_id"], 
                     keep = False, inplace = True) 
ratings.shape
Out[9]:
(977269, 3)

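Therefore, 4487 rows belonging to duplicated (user_id, book_id) pairs were present in the data and have been removed. Note that keep=False drops every copy of a duplicated pair, not just the extra ones.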
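
The difference between keep=False and keep='first' is easy to miss, so here is a minimal sketch on a made-up table (the toy values are hypothetical, not taken from the dataset). It only illustrates how the two settings behave; which one is appropriate depends on whether conflicting ratings should be discarded entirely.

```python
import pandas as pd

# Toy ratings: user 1 rated book 10 twice (a duplicated pair).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "book_id": [10, 10, 10, 20],
    "rating":  [5, 3, 4, 2],
})

# keep=False drops every row that belongs to a duplicated pair.
dropped_all = toy.drop_duplicates(subset=["user_id", "book_id"], keep=False)

# keep="first" retains one row per pair and drops only the extra copies.
kept_first = toy.drop_duplicates(subset=["user_id", "book_id"], keep="first")

print(len(toy), len(dropped_all), len(kept_first))
```

With keep=False both of user 1's ratings for book 10 disappear, whereas keep='first' keeps one of them.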

Let's check the books dataset as well.

In [10]:
print(books.shape)
books.drop_duplicates(subset='original_title',keep=False,inplace=True)
print(books.shape)
(10000, 23)
(9151, 23)

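849 rows removed. Because keep=False is used, every row whose original_title is repeated is dropped, and since pandas treats missing values as duplicates of each other, rows with a missing original_title are dropped as well.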

In [11]:
print(btags.shape)
btags.drop_duplicates(subset='tag_id',keep=False,inplace=True)
print(btags.shape)
(34252, 2)
(34252, 2)

Cool, so there are no duplicates in the tags dataset (tags.csv, loaded here as btags).

In [12]:
print(tags.shape)
tags.drop_duplicates(subset=['tag_id','goodreads_book_id'],keep=False,inplace=True)
print(tags.shape)
(999912, 3)
(999896, 3)

Visualizing data

In [13]:
joint_tags=pd.merge(tags,btags,left_on='tag_id',right_on='tag_id',how='inner')

Top 10 rated books

In [14]:
top_rated=books.sort_values('average_rating', ascending=False)
top10=top_rated.head(10)
f=['title','small_image_url']
displ=(top10[f])
displ.set_index('title', inplace=True)
In [15]:
from IPython.display import Image, HTML

def path_to_image_html(path):
    '''
    Convert an image URL to the '<img src="..."/>' format.
    Formatting adjustments (height, aspect ratio, size, etc.)
    can be added inside the tag.
    '''

    return '<img src="' + path + '"/>'

HTML(displ.to_html(escape=False ,formatters=dict(small_image_url=path_to_image_html),justify='center'))
Out[15]:
small_image_url
title
The Complete Calvin and Hobbes
Words of Radiance (The Stormlight Archive, #2)
Mark of the Lion Trilogy
It's a Magical World: A Calvin and Hobbes Collection
There's Treasure Everywhere: A Calvin and Hobbes Collection
Harry Potter Boxset (Harry Potter, #1-7)
Harry Potter Collection (Harry Potter, #1-6)
The Indispensable Calvin and Hobbes
The Authoritative Calvin and Hobbes: A Calvin and Hobbes Treasury
Attack of the Deranged Mutant Killer Monster Snow Goons

Top 10 most popular books

In [16]:
pop10=books.sort_values(by='ratings_count', ascending=False)
f=['title','small_image_url']
pop10=pop10.head(10)

pop10=(pop10[f])
pop10=pop10.set_index('title')
In [17]:
HTML(pop10.to_html(escape=False ,formatters=dict(small_image_url=path_to_image_html),justify='center'))
Out[17]:
small_image_url
title
The Hunger Games (The Hunger Games, #1)
Harry Potter and the Sorcerer's Stone (Harry Potter, #1)
To Kill a Mockingbird
The Great Gatsby
The Fault in Our Stars
The Hobbit
The Catcher in the Rye
Pride and Prejudice
Angels & Demons (Robert Langdon, #1)
The Diary of a Young Girl

Most Common Rating Values

In [18]:
import seaborn as sns
plt.figure(figsize=(16,8))
sns.distplot(a=books['average_rating'], kde=True, color='r')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ad26d4c90>

Therefore, the most common average rating lies somewhere between 3.5 and 4.

In [19]:
no_of_ratings_per_book=ratings.groupby('book_id').count()
In [20]:
no_of_ratings_per_book
Out[20]:
user_id rating
book_id
1 100 100
2 100 100
3 100 100
4 100 100
5 100 100
... ... ...
9996 96 96
9997 89 89
9998 95 95
9999 99 99
10000 96 96

10000 rows × 2 columns

In [21]:
plt.figure(figsize=(16,8))
sns.distplot(a=no_of_ratings_per_book['rating'], color='g')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ad2462350>

Most books in this dataset have more than 80 ratings. That is a lot of readers per book.

Highly rated authors

In [22]:
books.head(2)
Out[22]:
id book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title ... ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
0 1 2767052 2767052 2792775 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games ... 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s...
1 2 3 3 4640799 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone ... 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s...

2 rows × 23 columns

In [23]:
f=['authors', 'average_rating']
top_authors=top_rated[f]
top_authors=top_authors.head(20)
In [24]:
fig = px.bar(top_authors, x='authors', y='average_rating', color ='average_rating')
fig.show()

The bar plot above shows the authors of the top-rated books. Bill Watterson is at the top with a whopping rating of 4.82!

Finding popular genres and books available for those.

Tags are added by users, and there are no keywords that classify the books into genres. I have therefore hard-coded a list of genres and checked whether the tags contain those values. Credit for this approach goes to @philispp on Kaggle.

In [25]:
p=joint_tags.groupby('tag_name').count()
In [26]:
p=p.sort_values(by='count', ascending=False)
p
Out[26]:
goodreads_book_id tag_id count
tag_name
to-read 9983 9983 9983
favorites 9881 9881 9881
owned 9856 9856 9856
books-i-own 9799 9799 9799
currently-reading 9776 9776 9776
... ... ... ...
hs 1 1 1
hrabal 1 1 1
hq-manga 1 1 1
hq-e-mangá 1 1 1
favourites 1 1 1

34251 rows × 3 columns

Hardcoding some basic genres

In [27]:
genres=["Art", "Biography", "Business", "Chick Lit", "Children's", "Christian", "Classics", "Comics", "Contemporary", "Cookbooks", "Crime", "Ebooks", "Fantasy", "Fiction", "Gay and Lesbian", "Graphic Novels", "Historical Fiction", "History", "Horror", "Humor and Comedy", "Manga", "Memoir", "Music", "Mystery", "Nonfiction", "Paranormal", "Philosophy", "Poetry", "Psychology", "Religion", "Romance", "Science", "Science Fiction", "Self Help", "Suspense", "Spirituality", "Sports", "Thriller", "Travel", "Young Adult"]
for i in range(len(genres)):
    genres[i]=genres[i].lower()
In [28]:
new_tags=p[p.index.isin(genres)]
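
One thing to watch for: the tag names listed above use hyphens rather than spaces (for example to-read and currently-reading), so multi-word genres such as "science fiction" may not match any tag after lower-casing alone. A minimal sketch of one possible normalization, assuming hyphenated tag names; the normalize helper and the sample tags are hypothetical:

```python
def normalize(name):
    """Lower-case a genre name and replace spaces with hyphens."""
    return name.strip().lower().replace(" ", "-")

genres = ["Fantasy", "Science Fiction", "Young Adult"]
# Hypothetical tag names in the hyphenated style of the tags shown above.
sample_tags = ["fantasy", "science-fiction", "to-read", "young-adult"]

wanted = {normalize(g) for g in genres}
matched = [t for t in sample_tags if t in wanted]
print(matched)
```

Whether this catches more genres on the real data depends on how users actually spelled the tags, so it is worth inspecting the unmatched genres before relying on it.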
In [29]:
import plotly.graph_objects as go

fig = go.Figure(go.Bar(
            x=new_tags['count'],
            y=new_tags.index,
            orientation='h'))

fig.show()

There's a lot of fiction present, but not a lot of cookbooks! Makes sense.

Relation between no of editions and ratings

The books_count column is also available in the dataset. Even though the dataset description says it is the number of editions, I was unsure what that really meant. After some research, it became clear that this is the count of all editions, translations and formats available for a book (Kindle, paperback, hardcover, etc.).

In [30]:
fig = px.line(books, y="books_count", x="average_rating", title='Book Count VS Average Rating')
fig.show()

Oddly, the average_rating appears to increase with the number of editions up to a point, but decreases once the count reaches about 2500. So beyond that point, the more editions a book has, the lower its average_rating tends to be.

In [31]:
dropna= books.dropna()
fig = px.treemap(dropna, path=['original_publication_year','language_code', "average_rating"],
                  color='average_rating')
fig.show()

Thus, a lot of books were published in the year 2011, and most of them were in English.

Do readers prefer short titles or long titles?

In [32]:
books['length-title']=books['original_title'].str.len()
In [33]:
plt.figure(figsize=(16,8))
sns.regplot(x=books['length-title'], y=books['average_rating'])
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ad1632650>

So, the highly rated books tend to have rather short titles. However, the fitted regression line is nearly flat: as the length of the title increases, the average rating stays roughly constant (at around 4), so the relationship is very approximate at best.

Word Cloud for tags used by readers.

In [34]:
from wordcloud import WordCloud, STOPWORDS 
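text = new_tags.index.values

# Join the tag names into one string; str(text) would also pass the array's brackets and quotes
wordcloud = WordCloud().generate(' '.join(text))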
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Recommendation System

A recommendation engine filters the data using different algorithms and recommends the most relevant items to users. There are two main types of recommendation systems:

  • Content Based : This approach analyzes the available content, finds similarities between items, and then recommends the items that have a high similarity.
  • Collaborative : This approach analyzes information about users who prefer the same content and recommends the items that similar users prefer.

1. Content Based

Content based filtering on the following factors:

  1. Title
  2. Authors
  3. Average rating

Content based diagram

In [35]:
fillnabooks= books.fillna('')

Cleaning the data - removing spaces and making all the words lower case

In [36]:
def clean_data(x):
    return x.replace(" ", "").lower()
In [37]:
features=['original_title','authors','average_rating']
fillednabooks=fillnabooks[features]
In [38]:
fillednabooks = fillednabooks.astype(str)
fillednabooks.dtypes
Out[38]:
original_title    object
authors           object
average_rating    object
dtype: object
In [39]:
for feature in features:
    fillednabooks[feature] = fillednabooks[feature].apply(clean_data)
    
fillednabooks.head(2)
Out[39]:
original_title authors average_rating
0 thehungergames suzannecollins 4.34
1 harrypotterandthephilosopher'sstone j.k.rowling,marygrandpré 4.44

Creating a "soup" or a "bag of words" for all rows.

In [40]:
def create_soup(x):
    return x['original_title']+ ' ' + x['authors'] + ' ' + x['average_rating']
In [41]:
fillednabooks['soup'] = fillednabooks.apply(create_soup, axis=1)

Importing count vectorizer for term frequencies.

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(fillednabooks['soup'])

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
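
To make the idea behind the count matrix and cosine similarity concrete, here is a minimal pure-Python sketch. It is a simplification, not what scikit-learn does internally: it splits on whitespace only, whereas CountVectorizer applies its own tokenization and stop-word removal. The soups and names below are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical soups in the "title author rating" form used above.
soups = {
    "book_a": "firsttitle authorx 4.1",
    "book_b": "secondtitle authorx 3.9",
    "book_c": "thirdtitle authory 4.0",
}
counts = {k: Counter(v.split()) for k, v in soups.items()}

sim_ab = cosine(counts["book_a"], counts["book_b"])  # shared author token
sim_ac = cosine(counts["book_a"], counts["book_c"])  # no shared tokens
print(round(sim_ab, 3), round(sim_ac, 3))
```

Two books that share an author token get a non-zero similarity, while books with no tokens in common score zero. This also shows why squashing each title into a single token means titles only match when they are identical.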
In [43]:
fillednabooks=fillednabooks.reset_index()
indices = pd.Series(fillednabooks.index, index=fillednabooks['original_title'])
In [44]:
def get_recommendations_new(title, cosine_sim=cosine_sim2):
    title=title.replace(' ','').lower()
    idx = indices[title]

    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar books, skipping the first entry (normally the book itself)
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the titles of the 10 most similar books
    return list(books['original_title'].iloc[book_indices])
In [45]:
l=get_recommendations_new('The Hobbit', cosine_sim2)
fig = go.Figure(data=[go.Table(header=dict(values=l,fill_color='orange'))
                     ])
fig.show()
In [46]:
l=get_recommendations_new('Harry Potter and The Chamber of Secrets', cosine_sim2)
fig = go.Figure(data=[go.Table(header=dict(values=l,fill_color='orange'))
                     ])
fig.show()

2. Collaborative Filtering

While I was learning about collaborative recommendation systems, I noticed that a lot of kernels here on Kaggle are really just content-based recommendation systems but are titled as collaborative.

Collaborative Filtering

To explain collaborative filtering in simple words, consider two users, User A and User B, as in the illustration above. They are considered similar users because they often bought similar or the same books in the past. Now, User A bought the Deep Learning and Neural Networks books. Therefore, when User B browses for books, he will be recommended Deep Learning and Neural Networks because User A (with whom User B has common interests) bought those.
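
The User A / User B example can be sketched in a few lines. This is only an illustration of the idea using set overlap (Jaccard similarity) on made-up purchase histories; it is not the method used later in this notebook, which works on a book-by-user rating matrix with cosine distance.

```python
# Hypothetical purchase histories; names follow the example in the text.
purchases = {
    "user_a": {"Python Basics", "Statistics 101", "Deep Learning", "Neural Networks"},
    "user_b": {"Python Basics", "Statistics 101"},
    "user_c": {"Gardening at Home"},
}

def jaccard(x, y):
    """Overlap between two sets: |intersection| / |union|."""
    return len(x & y) / len(x | y) if x | y else 0.0

def recommend(target, data):
    """Suggest books owned by the most similar other user but not by target."""
    others = [u for u in data if u != target]
    nearest = max(others, key=lambda u: jaccard(data[target], data[u]))
    return sorted(data[nearest] - data[target])

print(recommend("user_b", purchases))
```

Here user_b shares two books with user_a and none with user_c, so user_a's remaining books are recommended.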

This data is very messy so dropping null values is crucial.

In [47]:
usecols=['book_id', 'original_title']
books_col=books[usecols]
In [48]:
# dropna() returns a new DataFrame, so assign the result back
books_col = books_col.dropna()
books_col
Out[48]:
book_id original_title
0 2767052 The Hunger Games
1 3 Harry Potter and the Philosopher's Stone
3 2657 To Kill a Mockingbird
4 4671 The Great Gatsby
5 11870085 The Fault in Our Stars
... ... ...
9995 7130616 Bayou Moon
9996 208324 Means of Ascent
9997 77431 The Mauritius Command
9998 8565083 Cinderella Ate My Daughter: Dispatches from th...
9999 8914 The First World War

9151 rows × 2 columns

Creating Compressed sparse row matrix

In [49]:
from scipy.sparse import csr_matrix
# pivot ratings into book features
df_book_features = ratings.pivot(index='book_id',columns='user_id',values='rating').fillna(0)
mat_book_features = csr_matrix(df_book_features.values)
In [50]:
df_book_features.head()
Out[50]:
user_id 1 2 3 4 5 6 7 8 9 10 ... 53415 53416 53417 53418 53419 53420 53421 53422 53423 53424
book_id
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 53380 columns

Here, the K nearest neighbors algorithm is used to find the books whose rating vectors are closest to a given book, using cosine distance.
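
As a rough picture of what a brute-force cosine nearest-neighbour search does, here is a minimal NumPy sketch on a made-up matrix. The function name nearest_by_cosine and the toy ratings are hypothetical; scikit-learn's NearestNeighbors is what the notebook actually uses.

```python
import numpy as np

def nearest_by_cosine(matrix, row, k):
    """Return indices of the k rows closest to `row` by cosine distance."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms[:, None]
    distances = 1.0 - unit @ unit[row]  # cosine distance to every row
    order = np.argsort(distances)
    return [int(i) for i in order if i != row][:k]

# Hypothetical book-by-user rating matrix (rows: books, columns: users).
toy = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 5.0],
])
print(nearest_by_cosine(toy, row=0, k=1))
```

Books 0 and 1 were rated by the same users with similar scores, so they are each other's nearest neighbour, while book 2 shares no raters with them.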

In [51]:
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)


num_users = len(ratings.user_id.unique())
num_items = len(ratings.book_id.unique())
print('There are {} unique users and {} unique books in this data set'.format(num_users, num_items))
There are 53380 unique users and 10000 unique books in this data set
In [52]:
ratings=ratings.dropna()
In [53]:
df_ratings_cnt_tmp = pd.DataFrame(ratings.groupby('rating').size(), columns=['count'])
df_ratings_cnt_tmp.head(10)
Out[53]:
count
rating
1 19485
2 63010
3 247698
4 355878
5 291198
In [54]:
total_cnt = num_users * num_items
rating_zero_cnt = total_cnt - ratings.shape[0]

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_ratings_cnt = pd.concat(
    [df_ratings_cnt_tmp, pd.DataFrame({'count': rating_zero_cnt}, index=[0.0])],
    verify_integrity=True,
).sort_index()
df_ratings_cnt
Out[54]:
count
0.0 532822731
1.0 19485
2.0 63010
3.0 247698
4.0 355878
5.0 291198

After counting all ratings, it is observed that the vast majority of user-book pairs have no rating at all, which appears as 0 in the pivoted matrix. These need to go!
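
The size of the problem follows directly from the numbers printed above. A small arithmetic sketch, using only the counts already shown in this notebook (the variable names are mine):

```python
num_users = 53380
num_items = 10000
num_ratings = 977269  # rows in the de-duplicated ratings table

total_cells = num_users * num_items      # every possible user-book pair
zero_cells = total_cells - num_ratings   # pairs with no rating
density = num_ratings / total_cells      # fraction of pairs that are rated

print(zero_cells)
print(f"{density:.4%} of cells hold a rating")
```

In other words, fewer than 2 in every 1000 cells of the full matrix contain a rating, which is why a sparse representation and some filtering are worthwhile.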

In [55]:
df_ratings_cnt['log_count'] = np.log(df_ratings_cnt['count'])
df_ratings_cnt

%matplotlib inline
ax = df_ratings_cnt[['count']].reset_index().rename(columns={'index': 'rating score'}).plot(
    x='rating score',
    y='count',
    kind='bar',
    figsize=(12, 8),
    title='Count for Each Rating Score (in Log Scale)',
    logy=True,
    fontsize=12,color='black'
)
ax.set_xlabel("book rating score")
ax.set_ylabel("number of ratings")
Out[55]:
Text(0, 0.5, 'number of ratings')

The graph clearly shows that most of the data consists of unrated entries, which are irrelevant and can be removed.

In [56]:
df_books_cnt = pd.DataFrame(ratings.groupby('book_id').size(), columns=['count'])
df_books_cnt.head()
Out[56]:
count
book_id
1 100
2 100
3 100
4 100
5 100
In [57]:
# keep only books that have been rated at least 60 times, so there is enough signal about how users react to them

popularity_thres = 60
popular_books = list(set(df_books_cnt.query('count >= @popularity_thres').index))
df_ratings_drop = ratings[ratings.book_id.isin(popular_books)]
print('shape of original ratings data: ', ratings.shape)
print('shape of ratings data after dropping unpopular books: ', df_ratings_drop.shape)
shape of original ratings data:  (977269, 3)
shape of ratings data after dropping unpopular books:  (975605, 3)
In [58]:
# get number of ratings given by every user
df_users_cnt = pd.DataFrame(df_ratings_drop.groupby('user_id').size(), columns=['count'])
df_users_cnt.head()
Out[58]:
count
user_id
1 3
2 3
3 2
4 3
5 5

Dropping users who have rated fewer than 50 books

In [59]:
ratings_thres = 50
active_users = list(set(df_users_cnt.query('count >= @ratings_thres').index))
df_ratings_drop_users = df_ratings_drop[df_ratings_drop.user_id.isin(active_users)]
print('shape of original ratings data: ', ratings.shape)
print('shape of ratings data after dropping both unpopular books and inactive users: ', df_ratings_drop_users.shape)
shape of original ratings data:  (977269, 3)
shape of ratings data after dropping both unpopular books and inactive users:  (417687, 3)
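
The two filters are applied in sequence: users are counted only after the unpopular books have been removed. A minimal pure-Python sketch of that order on made-up rows (thresholds and data are hypothetical and much smaller than the real ones):

```python
from collections import Counter

# Hypothetical (book_id, user_id, rating) rows.
rows = [(1, 10, 5), (1, 11, 4), (1, 12, 3), (2, 10, 2), (3, 11, 5), (3, 12, 4)]
min_ratings_per_book = 2  # stands in for popularity_thres
min_ratings_per_user = 2  # stands in for ratings_thres

# Step 1: keep books with enough ratings.
book_counts = Counter(book_id for book_id, _, _ in rows)
popular = {b for b, n in book_counts.items() if n >= min_ratings_per_book}
after_books = [r for r in rows if r[0] in popular]

# Step 2: count users on the already filtered rows, then keep active users.
user_counts = Counter(user_id for _, user_id, _ in after_books)
active = {u for u, n in user_counts.items() if n >= min_ratings_per_user}
final = [r for r in after_books if r[1] in active]

print(len(rows), len(after_books), len(final))
```

User 10 rated two books in the raw data but only one of them survives the book filter, so that user ends up below the threshold. Note also that after the user filter some books may again fall below the book threshold; the notebook does not re-check this.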
In [60]:
book_user_mat = df_ratings_drop_users.pivot(index='book_id', columns='user_id', values='rating').fillna(0)
book_user_mat
Out[60]:
user_id 7 35 41 75 119 143 145 153 158 173 ... 53245 53279 53281 53292 53293 53318 53352 53366 53373 53381
book_id
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9996 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

9886 rows × 4892 columns

In [61]:
book_user_mat_sparse = csr_matrix(book_user_mat.values)
In [62]:
book_user_mat_sparse
Out[62]:
<9886x4892 sparse matrix of type '<class 'numpy.float64'>'
	with 417687 stored elements in Compressed Sparse Row format>

Fitting the model

In [63]:
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# fit
model_knn.fit(book_user_mat_sparse)
Out[63]:
NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=-1, n_neighbors=20, p=2,
                 radius=1.0)

Using fuzzy string matching to find the closest title.

In [64]:
from fuzzywuzzy import fuzz


def fuzzy_matching(mapper, fav_book, verbose=True):
    """
    Return the closest match via fuzzy ratio.

    Parameters
    ----------
    mapper: dict, map book title to index of the book in data
    fav_book: str, name of user input book
    verbose: bool, print log if True

    Return
    ------
    index of the closest match
    """
    match_tuple = []
    # get match
    for title, idx in mapper.items():
        ratio = fuzz.ratio(title.lower(), fav_book.lower())
        if ratio >= 60:
            match_tuple.append((title, idx, ratio))
    # sort
    match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
    if not match_tuple:
        print('Oops! No match is found')
        return
    if verbose:
        print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
    return match_tuple[0][1]
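
For readers without fuzzywuzzy installed, the same thresholded matching idea can be sketched with the standard library. This uses difflib.SequenceMatcher as a stand-in, so the scores are not guaranteed to equal those of fuzz.ratio; the best_match helper and the title list are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(query, titles, threshold=60):
    """Return the title most similar to query, or None if below threshold."""
    scored = [
        (round(100 * SequenceMatcher(None, t.lower(), query.lower()).ratio()), t)
        for t in titles
    ]
    scored = [s for s in scored if s[0] >= threshold]
    return max(scored)[1] if scored else None

titles = ["To Kill a Mockingbird", "Mockingbird", "The Great Gatsby"]
print(best_match("to kill a mockingbird", titles))
print(best_match("zzzz", titles))
```

Returning None for an unmatched query makes the "no match" case explicit, so the caller can stop instead of passing a missing index on to the model.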

Writing the recommendation function.

In [65]:
def make_recommendation(model_knn, data, mapper, fav_book, n_recommendations):
    """
    return top n similar book recommendations based on user's input book
    Parameters
    ----------
    model_knn: sklearn model, knn model
    data: book-user matrix
    mapper: dict, map book title name to index of the book in data
    fav_book: str, name of user input book
    n_recommendations: int, top n recommendations
    Return
    ------
    list of top n similar book recommendations
    """
    # fit
    model_knn.fit(data)
    # get input book index
    print('You have input book:', fav_book)
    idx = fuzzy_matching(mapper, fav_book, verbose=True)
    
    print('Recommendation system starting to make inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    
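    # sort neighbours by distance, drop the closest one (the input book itself),
    # and reverse so the list runs from farthest to closest
    raw_recommends = sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]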
    # get reverse mapper
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_book))
    rec=[]
    for i, (idx, dist) in enumerate(raw_recommends):
        if idx not in reverse_mapper.keys():
            continue
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))
        rec.append(reverse_mapper[idx])
    return rec

Time to check!

In [66]:
my_favorite = 'To Kill a Mockingbird'
indices = pd.Series(books_col.index, index=books_col['original_title'])
In [67]:
make_recommendation(
    model_knn=model_knn,
    data=book_user_mat_sparse,
    fav_book=my_favorite,
    mapper=indices,
    n_recommendations=10)
You have input book: To Kill a Mockingbird
Found possible matches in our database: ['To Kill a Mockingbird', 'Mockingbird', 'Stolen Songbird']

Recommendation system starting to make inference
......

Recommendations for To Kill a Mockingbird:
1: Lord of the Flies , with distance of 0.45598309432313877
2: Little Women, with distance of 0.4526896099993938
3: Nineteen Eighty-Four, with distance of 0.4396460119625992
4: Memoirs of a Geisha, with distance of 0.43283216907946764
5: Animal Farm: A Fairy Story, with distance of 0.4252435075403517
6: Pride and Prejudice, with distance of 0.4251608152166305
7: Of Mice and Men , with distance of 0.4204446294803902
8: Harry Potter and the Philosopher's Stone, with distance of 0.3892592020883805
9: The Catcher in the Rye, with distance of 0.3699905318987523
10: The Great Gatsby, with distance of 0.2966652339964868
Out[67]:
['Lord of the Flies ',
 'Little Women',
 'Nineteen Eighty-Four',
 'Memoirs of a Geisha',
 'Animal Farm: A Fairy Story',
 'Pride and Prejudice',
 'Of Mice and Men ',
 "Harry Potter and the Philosopher's Stone",
 'The Catcher in the Rye',
 'The Great Gatsby']
In [68]:
make_recommendation(
    model_knn=model_knn,
    data=book_user_mat_sparse,
    fav_book='Harry Potter and the Chamber of Secrets',
    mapper=indices,
    n_recommendations=10)
You have input book: Harry Potter and the Chamber of Secrets
Found possible matches in our database: ['Harry Potter and the Chamber of Secrets', 'Harry Potter and the Goblet of Fire', 'Harry Potter and the Half-Blood Prince', 'Harry Potter and the Order of the Phoenix', 'Harry Potter and the Chamber of Secrets: Sheet Music for Flute with C.D', 'Gregor and the Marks of Secret', 'Harry Potter and the Prisoner of Azkaban', "Harry Potter and the Philosopher's Stone", 'Harry Potter and the Deathly Hallows', 'Haroun and the Sea of Stories', "James Potter and the Hall of Elders' Crossing ", 'Harry Potter and the Cursed Child, Parts One and Two', 'Peter and the Shadow Thieves']

Recommendation system starting to make inference
......

Recommendations for Harry Potter and the Chamber of Secrets:
1: The Return of the King, with distance of 0.5137453857083071
2: Mockingjay, with distance of 0.484811069871498
3: The Da Vinci Code, with distance of 0.48437188831920774
4: Catching Fire, with distance of 0.46678667832629206
5: Harry Potter and the Philosopher's Stone, with distance of 0.4454417431428892
6: Harry Potter and the Deathly Hallows, with distance of 0.2774345523014743
7: Harry Potter and the Half-Blood Prince, with distance of 0.21458444953407796
8: Harry Potter and the Order of the Phoenix, with distance of 0.17345094201226208
9: Harry Potter and the Goblet of Fire, with distance of 0.1489778170737216
10: Harry Potter and the Prisoner of Azkaban, with distance of 0.1395682125920943
Out[68]:
['The Return of the King',
 'Mockingjay',
 'The Da Vinci Code',
 'Catching Fire',
 "Harry Potter and the Philosopher's Stone",
 'Harry Potter and the Deathly Hallows',
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Goblet of Fire',
 'Harry Potter and the Prisoner of Azkaban']
In [69]:
rec=make_recommendation(
    model_knn=model_knn,
    data=book_user_mat_sparse,
    fav_book='Gone Girl',
    mapper=indices,
    n_recommendations=10)
You have input book: Gone Girl
Found possible matches in our database: ['Gone Girl', 'Gone ', 'Roller Girl', 'Nature Girl', 'Gossip Girl', 'The Goose Girl', 'Ghostgirl', 'Vinegar Girl', 'The Good Girl', 'Twenties Girl', 'Candy Girl', 'Metro Girl', 'Funny Girl', 'Agnes Grey', 'Boy Meets Girl', 'Love, Stargirl', 'The House Girl', 'Silver Girl', 'Wintergirls']

Recommendation system starting to make inference
......

Recommendations for Gone Girl:
1: The Giver, with distance of 0.6158197177191762
2: Memoirs of a Geisha, with distance of 0.6111717525594234
3: Eat, pray, love: one woman's search for everything across Italy, India and Indonesia, with distance of 0.6074993104114255
4: The Night Circus, with distance of 0.6055880373834424
5: O Alquimista, with distance of 0.6029134027522959
6: The Fault in Our Stars, with distance of 0.6004559014025481
7: The Devil in the White City: Murder, Magic, and Madness at the Fair that Changed America, with distance of 0.5960388200819615
8: A Game of Thrones, with distance of 0.5951336904470853
9: Divergent, with distance of 0.5627069805246723
10: The Book Thief, with distance of 0.5089278920328403
In [70]:
rec=make_recommendation(
    model_knn=model_knn,
    data=book_user_mat_sparse,
    fav_book='Divergent',
    mapper=indices,
    n_recommendations=10)
You have input book: Divergent
Found possible matches in our database: ['Divergent', 'Evergreen', 'Driven', 'Insurgent', 'Descent', 'Deliverance']

Recommendation system starting to make inference
......

Recommendations for Divergent:
1: Mockingjay, with distance of 0.6368451672358991
2: Catching Fire, with distance of 0.6317072362567975
3: Insurgent, with distance of 0.6143238462674122
4: The Giver, with distance of 0.608207539785535
6: The Perks of Being a Wallflower, with distance of 0.6017219498552195
7: The Book Thief, with distance of 0.5813217330602747
8: The Lightning Thief, with distance of 0.5660509401292786
9: Gone Girl, with distance of 0.5627069805246723
10: The Fault in Our Stars, with distance of 0.41485257342792325
In [71]:
rec=make_recommendation(
    model_knn=model_knn,
    data=book_user_mat_sparse,
    fav_book='Kafka on the Shore',
    mapper=indices,
    n_recommendations=10)
You have input book: Kafka on the Shore
Found possible matches in our database: ['And the Shofar Blew', 'A Fraction of the Whole', 'Halfway to the Grave', 'Salt to the Sea', 'The Farthest Shore', 'A Map of the World', 'In Her Shoes']

Recommendation system starting to make inference
......

Recommendations for Kafka on the Shore:
1: Morgawr , with distance of 0.6555631323101632
2: The Woman Upstairs, with distance of 0.6127869510454468
3: Porno, with distance of 0.43176022797613023
4: The Hurricane Sisters , with distance of 0.4050860398093705
5: The Endurance: Shackleton's legendary Antarctic expedition, with distance of 0.393458565796727
6: Chill Factor, with distance of 0.288231688746155
8: Darkness, Be My Friend, with distance of 0.17858194650445824

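The above is a case where the recommender could not find the book title in the data; it may have been dropped during data cleaning because of a null value. None of the fuzzy matches is 'Kafka on the Shore', so the recommendations are presumably built from the closest fuzzy match instead. In such cases the recommender skips the books it cannot resolve and moves on, which is why only 7 recommendations (numbered up to 8) are printed for 'Kafka on the Shore' rather than the 10 requested.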
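
One way to catch this situation before asking for recommendations is to check whether the title really exists and only accept close matches above a similarity threshold. The following is a minimal sketch using the standard library's difflib, not the matcher used inside make_recommendation; the helper name find_title, the toy title list and the cutoff of 0.8 are assumptions made for illustration.

```python
import difflib

def find_title(fav_book, titles, cutoff=0.8):
    """Return the exact title if present, else sufficiently close matches.

    Hypothetical helper: an empty list means the title is effectively
    missing from the data and no recommendations should be attempted.
    """
    lowered = {t.lower(): t for t in titles}
    key = fav_book.lower()
    if key in lowered:
        return [lowered[key]]
    # Keep only candidates whose similarity ratio is at least `cutoff`
    close = difflib.get_close_matches(key, list(lowered), n=5, cutoff=cutoff)
    return [lowered[c] for c in close]

# Toy title list standing in for the keys of the notebook's `indices` mapper
titles = ['Divergent', 'Insurgent', 'The Farthest Shore', 'Salt to the Sea']

print(find_title('divergent', titles))           # ['Divergent']
print(find_title('Kafka on the Shore', titles))  # []
```

With a strict cutoff, a missing title returns an empty list instead of silently falling back to an unrelated book, so the caller can report that the book is not in the dataset.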

How many Netflix shows/movies are based on books?

In [73]:
netflix=pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
netflix.head()
Out[73]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...
In [74]:
netflix.shape
Out[74]:
(6234, 12)
In [75]:
books['original_title']=books['original_title'].str.lower()
netflix['title']=netflix['title'].str.lower()
In [76]:
t=netflix.merge(books, left_on='title', right_on='original_title', how="inner")
In [77]:
t.shape
Out[77]:
(193, 36)

193 rows out of 6234 Netflix titles match a book's original title exactly (after lowercasing), which suggests those shows are made from books.
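
Note that an inner merge on title returns one row per matching pair, so a title that appears more than once on either side is counted more than once. The sketch below uses made-up toy frames (netflix_toy and books_toy are hypothetical stand-ins for the notebook's netflix and books) to show the difference between the merged row count and the number of distinct matched titles.

```python
import pandas as pd

# Made-up rows: 'dracula' appears twice on the Netflix side
netflix_toy = pd.DataFrame({'title': ['dracula', 'dracula', 'ozark']})
books_toy = pd.DataFrame({'original_title': ['dracula', 'emma']})

merged = netflix_toy.merge(
    books_toy, left_on='title', right_on='original_title', how='inner')

n_rows = len(merged)                  # one row per matching pair
n_unique = merged['title'].nunique()  # each matched title counted once

print(n_rows, n_unique)  # 2 1
```

On the real data, comparing t.shape[0] with t['title'].nunique() would show whether the count of 193 includes such repeats.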

In [78]:
import plotly.graph_objects as go

labels = ['Shows from books','Shows not from books']
values = [193, 6234 - 193]  # matched titles vs. the remaining titles

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()

Thus, the model works well for books that are present in the data.

Both content-based and collaborative filtering recommendation systems are implemented. I am also building a Plotly Dash interface for them, which will be available in the next post. :)
