Recommendation systems

You, recommendation model

Recommendation models are used everywhere on the internet. In e-commerce, streaming services, and social media, these systems leverage algorithms that analyze user preferences, behaviors, and historical data to predict and suggest items that align with individual tastes. They play an integral role in enhancing user experience by delivering tailored content and products. As machine learning continues to evolve, recommendation systems not only contribute to the efficiency of online platforms but also shape the way individuals discover and engage with information, products, and services in our dynamic and interconnected society.

There are many types of recommendation systems. One common architecture consists of the following components ^1:

[Figure: common architecture of a recommendation system]

Content-based Recommendation

We will build a content-based recommendation model to recommend movies. This type of model uses information about what the user likes in order to recommend similar items; it requires knowledge of the user's previous actions and knowledge about the items (features). Unlike collaborative filtering, which recommends items based on the preferences and behaviors of users similar to the target user, a content-based model doesn't use any information about other users. Some level of hand-engineering is therefore required to build the feature representations for the model; in our case we have information about the movies to build these features.

This type of model can capture the specific interests of a user and can recommend niche items that very few other users are interested in. This is well suited for movie recommendations, since users can have very specific tastes. Also, since the model doesn't need any data about other users, it is easier to scale to a large number of users, which is our case. For these reasons we chose this approach.
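To make the idea concrete, here is a minimal sketch of how a content-based score can be computed (the feature values are made up for illustration and are not taken from the dataset used below): the movie is described by a binary genre vector, the user by their preferences over those same genres, and the affinity score is simply the dot product of the two.

import numpy as np

#Hypothetical genre features for one movie (order: Action, Comedy, Drama, Animation)
movie_features = np.array([0, 1, 0, 1])

#Hypothetical user profile: fraction of watched movies in each of those genres
user_profile = np.array([0.1, 0.3, 0.1, 0.5])

#Content-based affinity score: higher means the movie matches the user's tastes better
score = movie_features @ user_profile
print(score)  # 0.8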

In this tutorial, we will cover the entire data science pipeline: data acquisition, preprocessing, visualization, feature extraction, model building, evaluation, and interpretation. The complete Jupyter notebook can be found here.

The framework of our model is given by:

  1. Preparation: We install the necessary libraries, download the data, and preprocess it
  2. Visualization: We plot our data to get a sense of its statistics and behavior
  3. Model building: We create the movie and user features that will be used in our content-based model
  4. Model training: We train our model using a gradient boosting machine
  5. Evaluation: We evaluate our model using a metric suited to our problem (precision at top-k)

1 Preparation

1.1 Install and import necessary libraries

#Install libraries
!pip install pandas
!pip install numpy
!pip install lightgbm
!pip install matplotlib

#Import libraries
import ast
import pandas as pd
import numpy as np
import lightgbm as lgbm
import matplotlib.pyplot as plt

1.2 Download and extract Movie dataset
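The files used below (ratings.csv and movies_metadata.csv) can be obtained, for example, from the "The Movies Dataset" collection on Kaggle. The sketch below is one possible way to fetch them and assumes a configured Kaggle API token; any source that provides the same two CSV files works equally well.

#Download and extract the data (one possible way, assuming a configured Kaggle API token)
!pip install kaggle
!kaggle datasets download -d rounakbanik/the-movies-dataset
!unzip -o the-movies-dataset.zip ratings.csv movies_metadata.csv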

1.3 Preprocess the data

Here we create the relevant DataFrames from our data, selecting the information that we will be using: movie title, year, genres, and ratings from users.

#Load files
df_ratings = pd.read_csv('ratings.csv')
df_movies_metadata = pd.read_csv('movies_metadata.csv')


#df_movies_metadata[~(df_movies_metadata['budget']=='0')]
# The budget is missing for a lot of movies, so we won't use it in our model

# As mentioned before, we also won't use information like overview and tagline, since that would require processing text with NLP techniques
df_movies_metadata = df_movies_metadata[['genres','release_date','title']]

#Get the release year
df_movies_metadata['year'] = df_movies_metadata['release_date'].str[:4]
df_movies_metadata = df_movies_metadata[~df_movies_metadata['year'].isna()]
df_movies_metadata = df_movies_metadata.astype({'year':'int'})

# We will create a table for movie genres
genre_df_list = []
# Iterate over each row in the original DataFrame
for movie_index, row in df_movies_metadata.iterrows():
    # Parse the genres column from a string into a list of dictionaries
    # (ast.literal_eval is a safer alternative to eval for parsing literals)
    genres_list = ast.literal_eval(row['genres'])

    # Append one row per (movie, genre) pair to the list
    for genre_dict in genres_list:
        genre_df_list.append({'movieId': movie_index, 'genre': genre_dict['name']})
genres_df = pd.DataFrame(genre_df_list)
# genres_df.index = genres_df.index + 1

movies_df = df_movies_metadata[['year','title']]
movies_df = movies_df[~movies_df.isna()['title']]
movies_df = movies_df.reset_index(names='movieId')
# movies_df = movies_df[movies_df.index.isin(genres_df.groupby('movieId').sum().index)] #get only movies we have genre information

# ratings_df = df_ratings[df_ratings['movieId'].isin(genres_df.groupby('movieId').sum().index)] #get only ratings we have genre information
df_ratings['movieId'] -= 1  # shift to 0-based ids so they match the index-based movieId in movies_df
ratings_df = df_ratings[df_ratings['movieId'].isin(movies_df['movieId'])] #keep only ratings for movies we have information about

2 Visualization

We visualize some statistics of our data, starting with the distribution of the number of ratings per user and per movie.

plt.hist(np.log10(ratings_df['userId'].value_counts().to_numpy()), bins=30)
plt.xlim(0)
plt.xlabel('log(# of ratings)')
plt.ylabel('# of users')
plt.title("Histogram of ratings per user")
print("Mean number of ratings: {} \nMedian number of ratings: {}".format(ratings_df['userId'].value_counts().mean(),ratings_df['userId'].value_counts().median()))

Mean number of ratings: 81.91753051608356
Median number of ratings: 26.0

[Figure: histogram of ratings per user]
plt.hist(np.log10(ratings_df['movieId'].value_counts().to_numpy()), bins=30)
plt.xlim(0)
plt.yscale('log')
plt.xlabel('log(# of ratings)')
plt.ylabel('# of movies (log scale)')
plt.title("Histogram of ratings per movie")
print("Mean number of ratings: {} \nMedian number of ratings: {}".format(ratings_df['movieId'].value_counts().mean(),ratings_df['movieId'].value_counts().median()))

Mean number of ratings: 2003.3782778332275
Median number of ratings: 222.0

[Figure: histogram of ratings per movie]

3 Model building

3.1 Temporal train-valid split per user

We split the dataset into training and validation sets per user, putting each user's 80% oldest ratings in the training set and their 20% newest ratings in the validation set.

#Limit the size of the ratings dataset: sample 15000 users that have more than 35 ratings
user_35 = (ratings_df['userId'].value_counts()>35)
user_sample = user_35[user_35].sample(n=15000).index
ratings_df = ratings_df[ratings_df['userId'].isin(user_sample)]

#Create training mask: for each user, ratings older than their 80% timestamp quantile go to training
train_th = ratings_df.groupby('userId')['timestamp'].quantile(0.8)
train_th = pd.merge(ratings_df[['userId']], train_th, how='left',
                        left_on='userId', right_index=True)['timestamp']
mask_train = ratings_df['timestamp'] < train_th
print(mask_train.mean())  # fraction of ratings in the training set, should be close to 0.8

#Split feature(X) and target(Y)
y = ratings_df['rating'].copy()
x = ratings_df.drop(columns=['rating'])

# Split training
# mask_train_likes = (mask_train & (y >= 4.)).to_numpy()
# x_train = x.loc[mask_train]

3.2 Craft movies features

Since we are using a content-based recommendation system, we need to create features for our movies. We do this based on the genres of each movie, assigning a value of 1 if the movie belongs to a given genre and 0 if it doesn't. We also keep the release year and add a popularity feature (the number of training ratings each movie received).


genres_dummies = pd.get_dummies(genres_df)
genres_dummies = genres_dummies.groupby('movieId').sum().reset_index()

x_movies = pd.merge(movies_df[['year', 'movieId']], genres_dummies, on = 'movieId', how='left').fillna(0)

# Popularity feature: number of training ratings each movie received
movie_counts = x[mask_train]['movieId'].value_counts()
x_movies['popularity'] = x_movies['movieId'].map(movie_counts).fillna(0)

# Attach the movie features to every rating row
x_movie_feat = pd.merge(x, x_movies, how='left', on='movieId').fillna(0)


# Cumulative share of ratings covered by each genre (most-rated genres first)
ratings_genres = x_movie_feat.sum().filter(like='genre').sort_values(ascending=False)
ratings_genres = ratings_genres.cumsum()/ratings_genres.sum()*100

# Create a bar plot with cumulative values
ratings_genres.plot(kind='bar')

# Set plot labels and title
plt.xlabel('Genres')
plt.ylabel('Cumulative sum of ratings by genre (%)')
plt.title('Cumulative share of ratings by genre')
plt.show()

n_top_genres = (ratings_genres.reset_index(drop=True)>99).idxmax()+1
print('More than 99% of ratings belong to the first {} most watched genres'.format(n_top_genres))
top_ratings_genres = ratings_genres.iloc[:n_top_genres].index
not_top_ratings_genres = ratings_genres.iloc[n_top_genres:].index

3.3 Craft user features

To create our user features, we take, for each user, the fraction of watched movies in each genre: a value of 0.8 in the column "user_genre_Animation" means that 80% of the movies watched by that user are animations. We also compute quantiles of the year and popularity of the movies watched by each user: a "year_0.25" of 1957 means that 25% of the movies watched by this user were released before 1957, and a "year_0.9" of 2001 means that 90% of them were released before 2001.


# Drop rarely rated genres and keep only the training rows for building the user profiles
x_movie_feat = x_movie_feat.drop(columns = not_top_ratings_genres)
x_train = x_movie_feat.loc[mask_train.reset_index(drop=True)]

# Add "average liked genres":
# cols = [col for col in x.columns if 'genre' in col]
cols = [col for col in top_ratings_genres if 'genre' in col]
x_users = x_train[['userId'] + cols].groupby('userId').mean()
x_users = x_users.rename(columns={col: f'user_{col}' for col in cols})
x_users /= x_users.sum(axis=1).to_numpy().reshape(-1, 1)  # normalize so each user's genre shares sum to 1
x_users = x_users.reset_index()
x_users

# Add quantiles of the year and popularity of the movies watched by each user:
x_train['year'] = pd.to_numeric(x_train['year'], errors='coerce')
def add_quantiles(x_users, col, qs):
    # Per-user quantiles of `col`, pivoted into one column per quantile (e.g. 'year_0.25')
    quantiles = x_train.groupby('userId')[col].quantile(qs)
    df = quantiles.reset_index().pivot(index='userId', columns='level_1', values=col)
    return pd.merge(
        x_users,
        df.reset_index().rename(columns={q: f'{col}_{q}' for q in qs}),
        validate='1:1'
    )
x_users = add_quantiles(x_users, 'year', [0.1, 0.25, 0.5, 0.75, 0.9])
x_users = add_quantiles(x_users, 'popularity', [0.1, 0.25, 0.5, 0.75, 0.9])

assert 'userId' in x_users
# Attach the user profile features to every rating row
x_user_feat = pd.merge(x_movie_feat, x_users, on='userId', how='left', validate='m:1').fillna(0)
x_user_feat

4 Train a model

Since we are training a content-based model, it should not learn anything about the user id or movie id directly, so we remove the columns containing this information (as well as the timestamp) from our model input X. We then train a gradient boosting machine (LightGBM) to predict the rating a user would give a movie from the combined movie and user features.

x_CB = x_user_feat.drop(columns=['userId', 'movieId', 'timestamp'])

#Split train and valid datasets
mask_train_reset = mask_train.reset_index(drop=True)
X_train = x_CB.loc[mask_train_reset]
X_valid = x_CB.loc[~mask_train_reset]
y_train = y[mask_train].to_numpy()
y_valid = y[~mask_train].to_numpy()

model = lgbm.LGBMRegressor(
    num_leaves=80,
    max_depth=-1,
    learning_rate=0.4,
    n_estimators=150,
    reg_alpha=0.2,
    # reg_lambda=0.3
)

# Optionally subsample the training set (1.0 keeps all rows)
subsampling_fraction = 1.0
mask = np.random.rand(len(X_train)) < subsampling_fraction
X_train_subsample = X_train.iloc[mask]
y_train_subsample = y_train[mask]
model.fit(X_train_subsample, y_train_subsample)

5 Recommend movies for user and evaluate

For each user, we calculate the fraction of that user's top-k highest-rated validation movies that also appear among the top-k recommendations given by our model, and then average this fraction over all users. This metric is called precision at top-k; here we use k = 10.

# Predict ratings for the validation set
preds = model.predict(X_valid)

Xy_valid_pred = x_user_feat.loc[~mask_train_reset].copy()
Xy_valid_pred['predictions'] = preds
Xy_valid_pred = Xy_valid_pred[['userId','movieId','predictions']]
Xy_valid_pred['ratings'] = y_valid
# Xy_valid_pred.groupby('userId').apply(lambda group: group.nlargest(5, 'predictions')).reset_index(drop=True)

top_k = 10
# For each user, take their top-k predicted and top-k actually rated movies
top_K_pred = Xy_valid_pred.groupby('userId')['predictions'].nlargest(top_k).reset_index(level=0, drop=True)
top_K_pred = Xy_valid_pred[Xy_valid_pred.index.isin(top_K_pred.index)]
top_K_rating = Xy_valid_pred.groupby('userId')['ratings'].nlargest(top_k).reset_index(level=0, drop=True)
top_K_rating = Xy_valid_pred[Xy_valid_pred.index.isin(top_K_rating.index)]

# For each user, compute the fraction of their top-k rated movies that our model also placed in its top-k
precision = {}
for pred_group, rating_group in zip(top_K_pred.groupby('userId'), top_K_rating.groupby('userId')):
    precision[pred_group[0]] = rating_group[1]['movieId'].isin(pred_group[1]['movieId']).sum()/len(rating_group[1]['movieId'])
global_precision = sum(precision.values()) / len(precision)
print('Global precision at top-10:', global_precision)

Global precision at top-10: 0.62241

So we see that among the top 10 recommended movies given by our model, 62% also belong to the top 10 highest-rated movies of that user. Our model is therefore capable of giving good suggestions that are validated against the real user data.
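To turn these scores into actual recommendations for a given user, we can simply look up the titles of that user's highest-scored movies. Below is a minimal sketch using the validation predictions and the movies_df table built earlier; the user is picked arbitrarily for illustration.

# Pick an arbitrary user from the validation set and list their top-10 recommendations
example_user = Xy_valid_pred['userId'].iloc[0]
user_preds = Xy_valid_pred[Xy_valid_pred['userId'] == example_user]
top_10 = user_preds.nlargest(10, 'predictions')

# Join the movie titles back in for display
top_10 = pd.merge(top_10, movies_df[['movieId', 'title']], on='movieId', how='left')
print(top_10[['title', 'predictions']])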

Conclusion

We have successfully built a content-based recommendation system, following the entire framework of building a machine learning model: data acquisition, preprocessing, visualization, model building, training, and validation. We have seen that the gradient boosting machine was able to learn from the engineered features of our users and movies and to recommend movies that were shown to be of interest to the user.


© Pietro Tanure.