10 1016@j Simpat 2020 102198
Journal Pre-proof
PII: S1569-190X(20)30137-4
DOI: https://doi.org/10.1016/j.simpat.2020.102198
Reference: SIMPAT 102198
Please cite this article as: AFOUDI Yassine, LAZAAR Mohamed, Intelligent recommender system
based on unsupervised machine learning and demographic attributes, Simulation Modelling Practice
and Theory (2020), doi: https://doi.org/10.1016/j.simpat.2020.102198
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
Abstract
Recommendation systems aim to predict users' interests and recommend the items most likely
to interest them. In this paper, we propose a new intelligent recommender system that
combines collaborative filtering (CF) with the popular unsupervised machine learning
algorithm K-means clustering. We also use certain user demographic attributes, such as
gender and age, to create segmented user profiles, while items (movies) are clustered by
genre attributes using K-means and users are classified based on their item preferences
and the genres they prefer to watch. To recommend items to an active user, the
collaborative filtering approach is then applied to the cluster to which the user belongs.
Experiments on well-known movies show that the proposed system
satisfies the predictability of the CF algorithm in GroupLens. In addition, our proposed
system improves the performance and response time of both the traditional collaborative
filtering technique and the content-based technique.
1. Introduction
Today, the internet allows people to access abundant online resources.
Amazon, for example, has a huge collection of products. Although the
amount of available information is increasing day by day, this creates a new
problem for people, and especially for customers: they find it difficult to choose
what they really want to see or buy. This is where the recommendation system
comes in.
Recommendation systems help users find and choose items (books, movies,
restaurants, music, etc.) that may interest them, from the large number of
discuss the results. Finally, Section 4 summarizes the conclusion.
2. BACKGROUND
2.1. Recommendation Tasks
Recommendation systems rely on many algorithms that aim to provide the
most relevant and accurate products to the user by filtering useful information
from a huge pool of data. In general, recommendation engines are based on
three types of approaches: collaborative filtering, content-based filtering and
hybrid recommendation. In this article, we are interested in collaborative
filtering.
to use the complete dataset every time. This approach potentially offers the benefits of
both speed and scalability. In this paper, we use a model-based technique based on
matrix factorization, namely Singular Value Decomposition (SVD).
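As a rough illustration of model-based collaborative filtering by matrix factorization, the sketch below fits latent user and item factors with stochastic gradient descent. This is a common SVD-style approximation used in recommender systems, not an exact SVD, and the paper does not specify its own SVD implementation. The function names (`train_mf`, `predict`, `rmse`), the hyperparameters and the toy ratings are all assumptions made for this example.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit rating ~ mu + b_u + b_i + p_u . q_i by SGD (hypothetical helper)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi, P, Q = {}, {}, {}, {}
    for u, i, _ in ratings:
        if u not in P:
            bu[u] = 0.0
            P[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in Q:
            bi[i] = 0.0
            Q[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(p * q for p, q in zip(P[u], Q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Regularized gradient steps on biases and latent factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pf, qf = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qf - reg * pf)
                Q[i][f] += lr * (err * pf - reg * qf)
    return {"mu": mu, "bu": bu, "bi": bi, "P": P, "Q": Q}


def predict(model, u, i):
    """Predicted rating clipped to the 1-5 scale; unknown ids use the mean."""
    est = model["mu"] + model["bu"].get(u, 0.0) + model["bi"].get(i, 0.0)
    if u in model["P"] and i in model["Q"]:
        est += sum(p * q for p, q in zip(model["P"][u], model["Q"][i]))
    return min(5.0, max(1.0, est))


def rmse(model, ratings):
    """Root mean squared error of the model on (user, item, rating) triples."""
    se = sum((predict(model, u, i) - r) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


# Toy (user, item, rating) triples, invented for illustration only.
toy = [("u1", "m1", 5), ("u1", "m2", 4), ("u1", "m3", 2),
       ("u2", "m1", 4), ("u2", "m2", 5), ("u2", "m4", 1),
       ("u3", "m4", 5), ("u3", "m3", 4), ("u3", "m1", 2),
       ("u4", "m4", 4), ("u4", "m3", 5), ("u4", "m2", 1)]
model = train_mf(toy)
# Baseline that always predicts the global mean rating.
baseline = {"mu": model["mu"], "bu": {}, "bi": {}, "P": {}, "Q": {}}
train_rmse = rmse(model, toy)
baseline_rmse = rmse(baseline, toy)
```

Once the factors are learned, a prediction only needs a dot product between one user vector and one item vector, which is where the speed and scalability benefit of a model-based approach comes from.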
Figure 1: Adopted architecture
tion module, k-means clustering module, user profile module and finally recommendation
module.
Figure 2: The Collaborative RMSE results using all features and PCA features on 4
random users
using 19 genre features because the mean error has been reduced. To decide which dataset
to use, we repeated our test several times; each time, we found
that the dataset containing PCA features gives much better performance. That is why we
continue our work with only the 10 PCA movie features, which also minimizes the memory
occupied by our system.
With the above in place, the next stage is to find the value of
K. The most common method for choosing the number of clusters is to run K-means
with different values of K and calculate the variance of the resulting clusters. The
variance is the sum of the distances between each cluster centroid and the items
in that cluster. Thus, we seek a number of clusters K such that
the selected clusters minimize the distance between their centroids and the items in the
same cluster. Generally, when the variance is plotted as a function of the number of
clusters K, the elbow point marks the number of clusters beyond which the variance no
longer decreases significantly (Elbow approach).
In this work, we pass the 10 components from the previous module to the K-means
approach and obtain as output our movie dataset partitioned into a specific number of
clusters.
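The Elbow approach described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses Lloyd's algorithm with random initialization, measures the "variance" as the sum of squared point-to-centroid distances (the usual K-means objective), and runs on invented 2-D toy points rather than the 10 PCA movie components. The names `kmeans` and `elbow_curve` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


def elbow_curve(points, k_values, n_init=5):
    """Best inertia over a few random restarts for each candidate k."""
    return {k: min(kmeans(points, k, seed=s)[2] for s in range(n_init))
            for k in k_values}


# Toy 2-D points forming three loose groups (illustrative, not movie data).
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (10.0, 0.0), (10.4, 0.3), (9.8, 0.5),
          (0.0, 10.0), (0.3, 10.5), (0.6, 9.7)]
curve = elbow_curve(points, range(1, 6))
for k, inertia in curve.items():
    print(k, round(inertia, 2))
```

Plotting the printed values against k and looking for the point where the curve flattens gives the elbow; the choice remains a visual judgment rather than an exact rule.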
4. RESULT AND DISCUSSION
4.1. The dataset
We chose the MovieLens 100k dataset [9] to evaluate our experiments because it is
widely used and publicly available. MovieLens datasets were collected by the GroupLens
Research Project at the University of Minnesota. We use this dataset for a study whose
goal is to generate movie recommendations for users. The dataset contains 100,000
ratings (on a 1 to 5 scale) from 943 users on 1682 movies; each user has rated at least
20 movies, and some demographic information is available for the users, such as age,
gender, occupation and zip code.
4.2. Implementation
4.2.1. Experimental steps
The first step is to import the dataset into our project; we then split the whole
user-interaction dataset into a training set (75%) and a test set (25%).
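A minimal sketch of this first step is given below. It assumes the ratings arrive as tab-separated lines of user id, item id, rating and timestamp, which is the layout of the `u.data` file in the MovieLens 100k distribution; the helper names, the random seed and the sample lines are illustrative assumptions, and the paper does not say how its split was drawn.

```python
import random


def parse_ratings(lines):
    """Parse 'user<TAB>item<TAB>rating<TAB>timestamp' lines into triples."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, item_id, rating, _timestamp = line.split("\t")
        ratings.append((int(user_id), int(item_id), float(rating)))
    return ratings


def train_test_split(ratings, test_fraction=0.25, seed=42):
    """Shuffle a copy of the interactions and hold out a fraction for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Invented sample lines in the assumed format (not real dataset rows).
sample = [
    "1\t10\t5\t100", "1\t11\t3\t101", "2\t10\t4\t102", "2\t12\t2\t103",
    "3\t11\t1\t104", "3\t12\t5\t105", "4\t10\t4\t106", "4\t11\t3\t107",
]
ratings = parse_ratings(sample)
train, test = train_test_split(ratings)
print(len(train), len(test))
```

A random split over interactions is only one option; a per-user split would guarantee that every user appears in both sets, which matters when recommendations are evaluated user by user.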
The second step is to use PCA feature extraction to reduce the genre attributes in the
movie dataset from 19 genres to 10 components. We then use K-means clustering
to group the whole dataset into a specific number of clusters. By plotting the variance
for 10 different numbers of clusters, following the Elbow approach
explained above, and observing Figure 3, we choose k=6 as the number of clusters for
K-means.
After the movies are grouped into 6 clusters, we assign each movie its cluster class
number and create a profile for every user according to the user's preferred viewed class.
We then take an active user, search for all movies in that user's favorite class and all
users of the same gender, and finally apply the SVD collaborative filtering approach to make
recommendations, returning the result to the user after removing all previously
viewed movies.
We test and evaluate our recommendations by checking whether the movies recommended
to a user on the basis of the training set appear in the list of items that the user has
seen and rated in the test set.
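The profile and filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the preferred class is taken to be the cluster containing most of the movies a user has rated, and the user ids, movie ids, cluster numbers and genders are invented toy values. The function names are hypothetical, and the collaborative filtering model that would then rank the candidates is left out.

```python
from collections import Counter


def preferred_class(user_ratings, movie_cluster):
    """Cluster that contains most of the movies this user has rated."""
    counts = Counter(movie_cluster[m] for m in user_ratings)
    return counts.most_common(1)[0][0]


def candidate_set(active_user, ratings, movie_cluster, gender):
    """Favorite class, same-gender users, and unseen movies of that class."""
    fav = preferred_class(ratings[active_user], movie_cluster)
    peers = [u for u in ratings if gender[u] == gender[active_user]]
    seen = set(ratings[active_user])
    movies = sorted(m for m, c in movie_cluster.items()
                    if c == fav and m not in seen)
    return fav, peers, movies


# Toy data, invented for illustration only.
movie_cluster = {"m1": 2, "m2": 2, "m3": 2, "m4": 3, "m5": 3}
ratings = {
    "u1": {"m1": 5, "m2": 4, "m4": 3},
    "u2": {"m3": 4, "m1": 5},
    "u3": {"m4": 5, "m5": 4},
}
gender = {"u1": "M", "u2": "M", "u3": "F"}

fav, peers, movies = candidate_set("u1", ratings, movie_cluster, gender)
print(fav, peers, movies)
```

Restricting the model to `peers` and `movies` is what shrinks the rating matrix handed to the collaborative filtering step.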
Figure 3: The variance of 10 numbers of clusters on our dataset (Elbow approach)
4.2.4. Evaluation
In the area of recommendation systems, the user wishes to receive the N best
recommended items. The user then views some recommended movies, ranked from
best to worst. In some cases, however, the user does not care much about the exact order
of the list; a few good recommendations are enough.
As a result, many evaluation techniques are used in this area, classified as online and
offline techniques. Online techniques require real users to give their opinion and feedback
on the recommendations given. In this work, we use offline techniques based on the
movies listed for each user in the test set.
Among the many offline approaches, we choose to use RMSE and Precision-Recall
at k. Root Mean Squared Error (RMSE) is a popular technique for
evaluating the accuracy of a recommender system based on rating data. After using
the training set to build our model, we predict the movie ratings for a given user in the
test set and compare the predicted ratings with the true values. We then compute the
square root of the mean squared error over the whole test set to obtain the RMSE value,
as shown in the formula below, where P is the predicted rating and R the true rating.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{\text{ratings}} (P - R)^2}{\#\text{ratings}}} \qquad (4)$$
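For concreteness, the RMSE computation can be written as a short function. This is a minimal sketch; the function name and the example ratings are assumptions chosen for illustration.

```python
import math


def rmse(predicted, actual):
    """Square root of the mean squared difference between P and R."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty rating lists of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented example: three predicted ratings against three true ratings.
example = rmse([4, 3, 5], [5, 3, 3])  # sqrt((1 + 0 + 4) / 3)
print(round(example, 4))
```

Because the errors are squared before averaging, one large miss weighs more than several small ones.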
Table 1: The top 3 movies recommended by our system for some random users.

User id   Preferred movie class   Gender   Top three recommended movies
5         class 3                 Female   The Sting; Young Frankenstein; A Fish Called Wanda
17        class 2                 Male     Fargo; Twelve Monkeys; Lone Star
9         class 2                 Male     Twelve Monkeys; Fargo; The Silence of the Lambs

$$\mathrm{Recall@}k = R_i / T_r \qquad (5)$$
$$\mathrm{Precision@}k = R_i / N_r \qquad (6)$$
where $R_i$ is the number of recommended movies at k that are relevant, $T_r$ is the total
number of relevant items and $N_r$ is the number of recommended items at k. After
calculating the recall and precision at k, we combine them into a single normalized score,
the F-measure:
$$F\text{-measure@}k = \frac{2 \times P@k \times R@k}{P@k + R@k} \qquad (7)$$
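The three ranking metrics can be computed together, as in the sketch below. It uses the usual definitions, in which precision divides the number of relevant recommended movies by the number of recommended items and recall divides it by the total number of relevant items. The function name and the toy lists are assumptions for illustration.

```python
def precision_recall_f_at_k(recommended, relevant, k):
    """Precision@k, Recall@k and F-measure@k for one user."""
    top_k = recommended[:k]
    hits = sum(1 for movie in top_k if movie in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: 5 ranked recommendations, 4 relevant test-set movies.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x", "y"}
p, r, f = precision_recall_f_at_k(recommended, relevant, k=5)
print(p, r, round(f, 4))
```

Here two of the five recommendations are relevant, so precision is 2/5 and recall is 2/4. Averaging the per-user F-measure over all users gives the kind of mean score reported for k=5 and k=10.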
After building our system, we want to evaluate which model gives the strongest
recommendations and the best accuracy on the MovieLens 100k dataset. As explained above,
we use the precision-recall method. Figure 4 plots the mean F-measure over all users at
k=5 and k=10 recommended movies, Figure 5 shows a scatter plot of the F-measure
at k=5 for the first 10 users, and Figure 6 shows the precision curve at k=10 for
the first 20 users, while Figure 7 shows the time in seconds taken by all the models to
recommend 20 movies to the same three users.
Finally, it should be mentioned that recommendation systems need some memory and
run time to produce suggestions because of the large number of features and parameters.
The results suggest that combining K-means clustering, the popular
unsupervised machine learning technique, with collaborative filtering and
demographic attributes gives much better performance and accuracy, as well as a faster
response time for movie recommendation, than models based on traditional collaborative
filtering or content-based approaches. However, our approach still has a limitation: it
depends on the attributes associated with the items. If an item has no descriptive
attributes, it cannot be used in the K-means clustering step to find out which cluster
contains this item.
To show how well our approach performs, we also test our model on the IMDB
dataset, which contains 671 users with at least 20 interactions each, 42,262 unique movies
and 100,004 user/item interactions. We split the rating data into 75% for the training set
and 25% for the test set. Reading the results given by the precision-recall approach,
shown in Figure 8, we find that our approach consistently gives better results than the
state-of-the-art approaches in the field of recommendation. As for the
generality of our approach: given an item dataset with many
genre attributes and a user dataset with demographic attributes, our approach can work
well and is a good choice for offering customers what they want.
Figure 4: The mean F-measure plot of six models at K=5 and K=10 using Movielens
dataset
The main managerial insight of our approach is that it gives powerful recommendations
in a short time. It also helps to mitigate the cold-start problem and to give
suggestions to a new user even if we don’t have much information about user transaction
Figure 5: The F-measure scatter plot of 10 users
Figure 7: The time in seconds taken by all the models to recommend movies to three users
Figure 8: Global F-measure plot at K=5 and K=10 for 671 users using IMDB dataset
ACKNOWLEDGMENT
The authors would like to thank the Smart System Lab, our research laboratory, and
Al Borchers for cleaning up this data.
References
[1] Afoudi, Y., Lazaar, M., Al Achhab, M., 2019. Collaborative Filtering Recommender
System. In: Advanced Intelligent Systems for Sustainable Development (AI2SD2018).
Advances in Intelligent Systems and Computing. Springer International Publishing,
Cham, pp. 332–345.
[2] Afoudi, Y., Lazaar, M., Al Achhab, M., 2019. Impact of Feature selection on
content-based recommendation system. In: 2019 International Conference on Wireless
Technologies, Embedded and Intelligent Systems (WITS). pp. 1–6.
[3] Cami, B. R., Hassanpour, H., Mashayekhi, H., 2017. A content-based movie
recommender system based on temporal user preferences. In: 2017 3rd Iranian Conference
on Intelligent Systems and Signal Processing (ICSPIS). IEEE, Shahrood, pp. 121–125.
[4] Do, H.-Q., Le, T.-H., Yoon, B., ???? Dynamic weighted hybrid recommender systems.
In: 2020 22nd International Conference on Advanced Communication Technology
(ICACT). pp. 644–650.
[5] Du, Z., Zhang, T., Chen, Y., Ai, L., Wang, X., 2011. A content and user-oblivious
video-recommendation algorithm. Simulation Modelling Practice and Theory 19 (9),
1895–1912.
[6] Duwairi, R., Abu-Rahmeh, M., 2015. A novel approach for initializing the spherical
K-means clustering algorithm. Simulation Modelling Practice and Theory 54, 49–63.
[7] Duwairi, R., Ammari, H., 2016. An enhanced CBAR algorithm for improving
recommendation systems accuracy. Simulation Modelling Practice and Theory 60, 54–68.
[8] Graef, G., Schaefer, C., 2002. Application of ART2 Networks and Self-Organizing
Maps to Collaborative Filtering. In: Hypermedia: Openness, Structural Awareness,
and Adaptivity. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg,
pp. 296–309.
[9] Harper, F. M., Konstan, J. A., 2016. The MovieLens Datasets: History and Context.
ACM Transactions on Interactive Intelligent Systems 5 (4), 1–19.
[10] Jakomin, M., Curk, T., Bosnić, Z., 2018. Generating inter-dependent data streams for
recommender systems. Simulation Modelling Practice and Theory 88, 1–16.
[11] Katarya, R., 2018. Movie recommender system with metaheuristic artificial bee.
Neural Computing and Applications 30 (6), 1983–1990.
[12] Lee, M., Choi, P., Woo, Y., 2002. A Hybrid Recommender System Combining
Collaborative Filtering with Neural Network. In: Adaptive Hypermedia and Adaptive
Web-Based Systems. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg,
pp. 531–534.
[13] Ma, Z., Yang, Y., Wang, F., Li, C., Li, L., 2014. The SOM Based Improved K-Means
Clustering Collaborative Filtering Algorithm in TV Recommendation System. In:
2014 Second International Conference on Advanced Cloud and Big Data. pp. 288–295.
[14] Nayak, R., Mirajkar, A., Rokade, J., Wadhwa, G., 2018. Hybrid Recommendation
System For Movies 05 (03), 4.
[15] Ponnam, L. T., Deepak Punyasamudram, S., Nallagulla, S. N., Yellamati, S., 2016.
Movie recommender system using item based collaborative filtering technique. In:
2016 International Conference on Emerging Trends in Engineering, Technology and
Science (ICETETS). IEEE, Pudukkottai, India, pp. 1–5.
[17] Reddy, S., Nalluri, S., Kunisetti, S., Ashok, S., Venkatesh, B., 2019. Content-Based
Movie Recommendation System Using Genre Correlation. In: Smart Intelligent
Computing and Applications. Smart Innovation, Systems and Technologies. Springer,
Singapore, pp. 391–397.
[18] Vembu, S., Baumann, S., 2005. A Self-Organizing Map Based Knowledge Discovery
for Music Recommendation Systems. In: Computer Music Modeling and Retrieval.
Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 119–129.
[19] Vozalis, M. G., Margaritis, K. G., 2007. Using SVD and demographic data for the
enhancement of generalized collaborative filtering. Information Sciences 177 (15),
3017–3037.
[20] Wadia, K., Gupta, P., 2011. Movie recommendation system based on self-organizing
maps. The University of Texas at Austin, Austin, Texas.
[21] Walek, B., Fojtik, V., 2020. A hybrid recommender system for recommending relevant
movies using an expert system 158, 113452.
[22] Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., Leskovec, J., 2018.
Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In:
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. KDD ’18. Association for Computing Machinery, London,
United Kingdom, pp. 974–983.
[23] Zhou, Y., Wilkinson, D., Schreiber, R., Pan, R., 2008. Large-Scale Parallel
Collaborative Filtering for the Netflix Prize. In: Algorithmic Aspects in Information and
Management. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp.
337–348.