\documentclass[nojss]{jss} %\documentclass[fleqn, a4paper]{article} %\usepackage{a4wide} %\usepackage[round,longnamesfirst]{natbib} %\usepackage{graphicx,keyval,thumbpdf,url} %\usepackage{hyperref} %\usepackage{fancyvrb} %\usepackage{Sweave} %\SweaveOpts{strip.white=true} %\AtBeginDocument{\setkeys{Gin}{width=0.6\textwidth}} \usepackage[utf8]{inputenc} \usepackage[english]{babel} %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{amsmath} \usepackage{amsfonts} %\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\class}[1]{\mbox{\textsf{#1}}} \newcommand{\func}[1]{\mbox{\texttt{#1()}}} %\newcommand{\code}[1]{\mbox{\texttt{#1}}} %\newcommand{\pkg}[1]{\strong{#1}} %\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'} %\newcommand{\proglang}[1]{\textsf{#1}} \newcommand{\set}[1]{\mathcal{#1}} %\newcommand{\sQuote}[1]{`{#1}'} %\newcommand{\dQuote}[1]{``{#1}''} %\newcommand\R{{\mathbb{R}}} \newcommand{\mat}[1]{{\mathbf{#1}}} %\renewcommand{\vec}[1]{{\mathbf{#1}}} % \sloppy %% \VignetteIndexEntry{An introduction to the R package recommenderlab} \author{Michael Hahsler} \title{\pkg{recommenderlab}: An R Framework for Developing and Testing Recommendation Algorithms} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Michael Hahsler} %% comma-separated \Plaintitle{recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms} %% without formatting \Shorttitle{Developing and Testing Recommendation Algorithms with R} %% a short title (if necessary) \Abstract{Algorithms that create recommendations based on observed data have significant commercial value for online retailers and many other industries. Recommender systems has a significant research community and studying such systems is part of most modern data science curricula. While there is an abundance of software that implements recommendation algorithms, there is little in terms of supporting recommender system research and education. This paper describes the open-source software~\pkg{recommenderlab} which was created with supporting research and education in mind. The package can be directly installed in R or downloaded from \url{https://github.com/mhahsler/recommenderlab}.} \Keywords{recommender systems, collaborative filtering, evaluation} \Plainkeywords{recommender systems, collaborative filtering, evaluation} %% without formatting \Address{ Michael Hahsler\\ Computer Science\\ Lyle School of Engineering\\ Southern Methodist University\\ P.O. Box 750122 \\ Dallas, TX 75275-0122\\ E-mail: \email{mhahsler@lyle.smu.edu}\\ URL: \url{michael@hahsler.net} } \begin{document} %\title{\pkg{recommenderlab}: A Framework for Developing and Testing Recommendation Algorithms\footnote{This research was funded in part by the NSF Industry/University Cooperative Research Center for Net-Centric Software \& Systems.} %} %\author{Michael Hahsler} %\maketitle %\sloppy % \abstract{ % Algorithms that create recommendations based on observed data have % significant commercial value for online retailers and many other industries. % Recommender systems has a significant research % community and studying such systems is part of most modern data science % curricula. While there is an abundance of software that implements recommendation algorithms, there is little in terms of supporting recommender system research and education. This paper describes the open-source software~\pkg{recommenderlab} which was created with supporting research and education in mind. % } <>= options(scipen=3, digits=4, prompt="R> ", eps=FALSE, width=75) ### for sampling set.seed(1234) @ \section{Introduction} Recommender systems apply statistical and knowledge discovery techniques to the problem of making product recommendations based on previously recorded usage data~\citep{recommender:Sarwar:2000}. Creating such automatically generated personalized recommendations for products including books, songs, TV shows and movies using collaborative filtering have come a long way since Information Lense, the first system using \emph{social filtering} was created more than 30 years ago~\citep{recommender:Malone:1987}. Today, recommender systems are a successful technology used by market leaders in several industries (e.g., by Amazon, %\footnote{\url{http://www.amazon.com}}, Netflix, and %\footnote{\url{http://www.netflix.com}} and Pandora). %\footnote{\url{http://www.pandora.com}}). In retail, such recommendations can improve conversion rates by helping the customer to find products she/he wants to buy faster, promote cross-selling by suggesting additional products and can improve customer loyalty through creating a value-added relationship~\citep{recommender:Schafer:2001}. Even after 30 years, recommender systems still have a very active research community. It is often not clear which of the many available algorithms is appropriate for a particular application and new approaches are constantly proposed. Many commercially available software applications implement recommender algorithms, however, this paper focuses on software support for recommender systems research which includes rapid prototyping algorithms and thorough evaluation and comparison of algorithms. For this purpose, access to the source code is paramount. Many open-source projects implementing recommender algorithms have been initiated over the years. Table~\ref{tab:software} provides links to several popular open source implementations which provide code which can be used by researchers. The extent of (currently available) functionality as well as the target usage of the available software packages vary greatly and many projects have been abandoned over the years. A comprehensive list of recommender systems software is maintained by~\cite{Jenson:2019}. %Crab, MyMediaLite and Vogoo PHP LIB aim at providing simple %recommender systems to be easily integrated into web sites. %SVDFeature focuses only on matrix factorization.. %LensKit is a relatively new software package with the aim %to provide reference implementations for common collaborative %filtering algorithms. %Finally, Apache Mahout, a machine learning %library aimed to be scalable to large data sets incorporated %collaborative filtering algorithms formerly developed under the %name Taste. \begin{table} \begin{tabular}{|p{1.9cm}|p{3cm}|l|l|} \hline {\bf Software} & {\bf Description} & {\bf Language} & {\bf URL} \\ \hline Apache Mahout & Machine learning library includes collaborative filtering &Java& \small\url{http://mahout.apache.org/}\\ \hline Crab & Components to create recommender systems & Python & \small\url{https://github.com/muricoca/crab}\\ \hline LensKit & Collaborative filtering algorithms from GroupLens Research & Python & \small\url{http://lenskit.grouplens.org/}\\ \hline MyMediaLite & Recommender system algorithms. & C\#/Mono & \small\url{http://www.mymedialite.net/}\\ \hline %RACOFI: A rule-applying collaborative filtering system & Java & \url{http://lemire.me/fr/abstracts/COLA2003.html}\\ %\hrule SVDFeature & Toolkit for feature-based matrix factorization& C++ & \small\url{https://www.jmlr.org/papers/v13/chen12a.html}\\ \hline Vogoo PHP LIB & Collaborative filtering engine for personalizing web sites & PHP & \small\url{http://sourceforge.net/projects/vogoo/} \\ \hline \end{tabular} \caption{Recommender System Software freely available for research.} \label{tab:software} \end{table} Most available software focuses on creating recommender applications for deployment as a production system or they implement a single method as part of a research project. The R extension package~\pkg{recommenderlab} described in this paper was designed for a completely different purpose. It aims at providing a comprehensive research infrastructure for recommender systems. The focus is on consistent and efficient data handling, easy incorporation of existing and new algorithms, experiment set up and evaluation of the results. The open-source programming language R, a popular software environment for statistical computing and data scientists~\citep{R:2018}, %(\url{https://www.r-project.org/}), is used as the platform since it easily allows the researcher to either implement and integrate algorithms written in a wide range of programming languages including R, Python, Java, C/C++ and already provides all needed statistical tools which is important to provide a useful research environment. Although developed to support the author's own research and teaching needs in 2010, there proved to be a need for research-focused software and the package \pkg{recommenderlab} turned out to be quite popular. A healthy community of users who file bug reports, suggest improvements and contribute their own algorithms has grown around the package and is coordinated using the software's GitHub page\footnote{\url{https://github.com/mhahsler/recommenderlab}}. Authors not related to the developer have written a textbook about how to use the package~\citep{Suresh:2015}. The package is used in several university courses to demonstrate the basics of recommender system development. Finally, the package was employed by several researchers to develop and test their own algorithms~\citep[e.g.,][]{Chen:2013, Buhl:2016, Beel:2016, Lombardi:2017}. Package~\pkg{recommenderlab} focuses on collaborative filtering which is based on the idea that given rating data by many users for many items (e.g., 1 to 5 stars), %for movies elicited directly from the users), one can predict a user's rating for an item not known to her or him~\citep[see, e.g.,][]{recommender:Goldberg:1992} or create for each user a so called top-$N$ lists of recommended items~\citep[see, e.g.,][]{recommender:Sarwar:2001,recommender:Deshpande:2004}. The premise is that users who agreed on the rating for some items typically also tend to agree on the rating for other items. \pkg{recommenderlab} provides implementations of many popular algorithms, including the following. \begin{itemize} \item {\bf User-based collaborative filtering (UBCF)} predicts ratings by aggregating the ratings of users who have a similar rating history to the active user~\citep{recommender:Goldberg:1992,Resnick:1994,recommender:Shardanand:1995}. \item {\bf Item-based collaborative filtering (IBCF)} uses item-to-item similarity based on user ratings to find items that are similar to the items the active user likes~\citep{Kitts:2000,recommender:Sarwar:2001, recommender:Linden:2003,recommender:Deshpande:2004}. \item {\bf Latent factor models} use singular value decomposition (SVD) to estimate missing ratings using methods like SVD with column-mean imputation, Funk SVD or alternating least squares~\citep{Hu:2008,recommender:Koren:2009}. \item {\bf Association rule-based recommender (AR)} uses association rules to find recommended items~\citep{Fu:2000,Mobasher:2001,Geyer-Schulz:2002,Lin:2002,Demiriz:2004}. \item {\bf Popular items (POPULAR)} is a non-personalized algorithm which recommends to all users the most popular items they have not rated yet. \item {\bf Randomly chosen items (RANDOM)} creates random recommendations which can be used as a baseline for recommender algorithm evaluation. \item {\bf Re-recommend liked items (RERECOMMEND)} recommends items which the user has rated highly in the past. These recommendations can be useful for items that are typically consumed more than once (e.g., listening to songs or buying groceries). \item {\bf Hybrid recommendations (HybridRecommender)} aggregates the recommendations of several algorithms~\citep{Cano:2017}. \end{itemize} We will discuss some of these algorithms in the rest of the paper. Detailed information can be found in the survey book by~\cite{recommender:Desrosiers:2011}. This rest of this paper is structured as follows. Section~\ref{sec:CF} introduces collaborative filtering and some of its popular algorithms. In section~\ref{sec:evaluation} we discuss the evaluation of recommender algorithms. We introduce the infrastructure provided by \pkg{recommenderlab} in section~\ref{sec:infrastructure}. In section~\ref{sec:examples} we illustrate the capabilities on the package to create and evaluate recommender algorithms. We conclude with section~\ref{sec:conclusion}. \section{Collaborative Filtering} \label{sec:CF} To understand the use of the software, a few formal definitions are necessary. We will often give examples for a movie recommender, but the examples generalize to other types of items as well. Let $\set{U} = \{u_1, u_2, \ldots, u_m\}$ be the set of users and $\set{I} = \{i_1, i_2, \ldots, i_n\}$ the set of items. Ratings are stored in a $m \times n$ user-item rating matrix $\mat{R} = (r_{jl})$, where each row represents a user $u_j$ with $1 \le j\le m$ and columns represent items $i_l$ with $1 \le l\le n$. We use $\mathbf{r}_j$ to denote the row vector of $\mat{R}$ with the ratings of user $u_j$. Ratings use a specific rating scale. For example, Netflix uses 1 to 5 stars. and estimated ratings are allowed to be within an interval of matching range (e.g., $[1,5]$). Typically, only a small fraction of ratings are known and for many cells in $\mat{R}$, the values are missing. Missing values represent movies that the user has not rated and potentially also not seen yet. Collaborative filtering aims to create recommendations for a user called the active user $u_a \in \set{U}$. We define the set of items unknown to user $u_a$ as $\set{I}_a = \set{I} \setminus \{i_l \in \set{I}\;|\;r_{al}\ \textrm{is not missing}\}$. The two typical tasks are to predict ratings for all items in $\set{I}_a$ or to create a list containing the best $N$ recommended items from $\set{I}_a$ (i.e., a top-$N$ recommendation list) for $u_a$. Predicting all missing ratings means completing the row of the rating matrix %$\hat{r}_{a\cdot}$ where the missing values for items in $\set{I}_a$ are replaced by ratings estimated from other data in $\mat{R}$. %The estimated ratings are in the same range as the original rating %(e.g., in the range $[1,5]$ for a five star rating scheme). From this point of view, recommender systems are related to matrix completion problem. %\marginpar{cite} Creating a top-$N$ list can be seen as a second step after predicting ratings for all unknown items in $\set{I}_a$ and then taking the $N$ items with the highest predicted ratings. Some algorithms skip predicting ratings first and are able to find the top $N$ items directly. A list of top-$N$ recommendations for a user $u_a$ is an partially ordered set $T_N = (\set{X}, \ge)$, where $\set{X} \subset \set{I}_a$ and $|\set{X}| \le N$ ($|\cdot|$ denotes the cardinality of the set). Note that there may exist cases where top-$N$ lists contain less than $N$ items. This can happen if $|\set{I}_a| < N$ or if the CF algorithm is unable to identify $N$ items to recommend. The binary relation $\ge$ is defined as $x\ge y$ if and only if $\hat{r}_{ax} \ge \hat{r}_{ay}$ for all $x,y \in \set{X}$. Furthermore we require that $\forall_{x\in \set{X}} \forall_{y\in \set{I}_a} \quad \hat{r}_{ax} \ge \hat{r}_{ay}$ to ensure that the top-$N$ list contains only the items with the highest estimated rating. Typically we deal with a very large number of items with unknown ratings which makes first predicting rating values for all of them computationally expensive. Some approaches (e.g., rule based approaches) can predict the top-$N$ list directly without considering all unknown items first. Collaborative filtering algorithms are typically divided into two groups, \emph{memory-based CF} and \emph{model-based CF} algorithms~\citep{recommender:Breese:1998}. Memory-based CF use the whole (or at least a large sample of the) user database to create recommendations. The most prominent algorithm is user-based collaborative filtering. The disadvantages of this approach is scalability since the whole user database has to be processed online for creating recommendations. Model-based algorithms use the user database to learn a more compact model (e.g, clusters with users of similar preferences) that is later used to create recommendations. In the following we will present the basics of well known memory and model-based collaborative filtering algorithms. Further information about these algorithms can be found in the recent survey book chapter by~\cite{recommender:Desrosiers:2011}. \subsection{User-based Collaborative Filtering} User-based CF~\citep{recommender:Goldberg:1992,Resnick:1994,recommender:Shardanand:1995} is a memory-based algorithm which tries to mimics word-of-mouth by analyzing rating data from many individuals. The assumption is that users with similar preferences will rate items similarly. Thus missing ratings for a user can be predicted by first finding a \emph{neighborhood} of similar users and then aggregate the ratings of these users to form a prediction. The neighborhood is defined in terms of similarity between users, either by taking a given number of most similar users ($k$ nearest neighbors) or all users within a given similarity threshold. Popular similarity measures for CF are the \emph{Pearson correlation coefficient} and the \emph{Cosine similarity}. These measures are defined between two vectors with ratings $\mathbf{x}$ and $\mathbf{y}$. \begin{equation} \mathrm{m_{Pearson}}(\mathbf{x},\mathbf{y}) = \frac{1}{n - 1} \sum_{l = 1}^n \left( \frac{x_l - \bar{\mathbf{x}}}{s_x} \right) \left( \frac{y_l - \bar{\mathbf{y}}}{s_y} \right) \label{equ:pearson} \end{equation} and \begin{equation} \mathrm{m_{Cosine}}(\mathbf{x},\mathbf{y}) = \frac{\mathbf{x}\cdot\mathbf{y}} {\|\mathbf{x}\| \|\mathbf{y}\|}, \label{equ:cosine} \end{equation} where $n$ is the number of elements in the rating vectors and missing ratings are skipped in the calculation. $s_x$ and $s_y$ is the standard deviation and $\|\cdot\|$ is the $l^2$-norm of a vector. To calculate the measure between two users, $u_x$ and $u_y$, $\mathbf{x} = \mathbf{r}_{x}$ and $\mathbf{y} = \mathbf{r}_{y}$ represent the row vectors in $\mat{R}$ with the two users' profile vectors are used. For calculating similarity using rating data only the dimensions (items) are used which were rated by both users. Note that cosine and correlation are in the range $[-1, 1]$, but similarity measures need to be in the range of $[0, 1]$. We, use the transformation $$s = \frac{m + 1}{2}.$$ Using the similarity, the neighborhood for the active user $\set{N}(a) \subset \set{U}$ can be selected by either a threshold on the similarity or by taking the $k$ nearest neighbors. Once the users in the neighborhood are found, their ratings are aggregated to form the predicted rating for the active user $u_a$. The easiest form is to just average the ratings in the neighborhood. For item $i_l$ this is \begin{equation} \hat{r}_{al}=\frac{1}{|\set{N}(a)|} \sum_{i\in\set{N}(a)} r_{il}. \label{equ:aggregation1} \end{equation} \begin{figure} \centerline{\includegraphics[width=13cm]{user-based}} \caption{User-based collaborative filtering example with (a) rating matrix $R$ and estimated ratings for the active user, (b), similarites between the active user and the other users $s_a$ (Euclidean distance converted to similarities), and (b) the user neighborhood formation. } \label{fig:UBCF} \end{figure} An example of the process of creating recommendations by user-based CF is shown in Figure~\ref{fig:UBCF}. To the left is the rating matrix $\mat{R}$ with 6 users and 8 items and ratings in the range 1 to 5 (stars). We want to create recommendations for the active user $u_a$ shown at the bottom of the matrix. To find the $k$-neighborhood (i.e., the $k$ nearest neighbors) we calculate the similarity between the active user and all other users based on their ratings in the database and then select the $k$ users with the highest similarity. To the right in Figure~\ref{fig:UBCF} we see a $2$-dimensional representation of the similarities (users with higher similarity are displayed closer) with the active user in the center. The $k=3$ nearest neighbors ($u_1, u_2$ and $u_3$) are selected and marked in the database to the left. To generate an aggregated estimated rating, we compute the average ratings in the neighborhood for each item not rated by the active user. To create a top-$N$ recommendation list, the items are ordered by predicted rating. In the small example in Figure~\ref{fig:UBCF} the order in the top-$N$ list (with $N \ge 4$) is $i_2, i_1, i_7$ and $i_5$. However, for a real application we probably would not recommend items $i_7$ and $i_5$ because of their low ratings. The fact that some users in the neighborhood are more similar to the active user than others (see Figure~\ref{fig:UBCF} (b)) can be incorporated as weights into Equation~(\ref{equ:aggregation1}). \begin{equation} \hat{r}_{al}=\frac{1}{\sum_{i\in\set{N}(a)} s_{ai}} \sum_{i\in\set{N}(a)} s_{ai} r_{il} \label{equ:aggregation2} \end{equation} $s_{ai}$ is the similarity between the active user~$u_a$ and user $u_i$ in the neighborhood. For rating data, the performance of the recommender algorithm can be improved by removing user rating bias where some users tend to always use higher ratings while others tend to use lower ratings. This can be done by normalizing the rating data before applying the recommender algorithm. Any normalization function $h: \mathbb{R}^{n \times m} \mapsto \mathbb{R}^{n \times m}$ can be used for preprocessing. Such a function needs to be reversible by $h^{-1}$ to map the predicted rating on the normalized scale back to the original rating scale. Normalization is used to remove individual rating bias by users who consistently always use lower or higher ratings than other users. The most popular method is to center the rows of the user-item rating matrix by $$h(r_{jl}) = r_{jl} - \bar{\mathbf{r}}_j,$$ where $\bar{r}_j$ is the mean of all available ratings in row~$j$ of the user-item rating matrix $\mat{R}$. This means that ratings are now measured for each user by how much they are above or below the user's average rating. The inverse is simply, $$h^{-1}(h(r_{jl})) = h(r_{jl}) + \bar{\mathbf{r}}_j = r_{jl}.$$ Other methods like Z-score normalization which also takes rating variance into account can be found in the literature \citep[see, e.g.,][]{recommender:Desrosiers:2011}. The two main problems of user-based CF are that the whole user database has to be kept in memory and that expensive similarity computation between the active user and all other users in the database has to be performed. \subsection{Item-based Collaborative Filtering} Item-based CF~\citep{Kitts:2000,recommender:Sarwar:2001, recommender:Linden:2003,recommender:Deshpande:2004} is a model-based approach which produces recommendations based on the relationship between items inferred from the rating matrix. The assumption behind this approach is that users will prefer items that are similar to other items they like. The model-building step consists of calculating a similarity matrix containing all item-to-item similarities using a given similarity measure. Popular are again Pearson correlation and Cosine similarity and Equations~\ref{equ:pearson} and \ref{equ:cosine} are used. Here the rating vectors $x$ and $y$ are columns of $\mat{R}$ representing the ratings for two items. %% FIXME for real data %For item-based CF, \cite{recommender:Deshpande:2004} proposed %a \emph{Conditional probability-based %similarity} defined as: % $$\mathrm{sim_{Conditional}}(x,y) = % \frac{\mathrm{Freq}(xy) } % {\mathrm{Freq}(x)} = \hat{P}(y|x),$$ %where $x, y \in \set{I}$ are two items and %$\mathrm{Freq(\cdot)}$ is the number of users with the given items in their %profile. This similarity is in fact an estimate of the conditional %probability to see item $y$ in a profile given that the profile contains %items $x$. This similarity is equivalent to the confidence measure used %for association rules (see section~\ref{sec:AR} below). A drawback %of this similarity measure is its sensitivity to the frequency of $x$ with %rare $x$ producing high similarities. %To reduce the sensitivity, \cite{recommender:Deshpande:2004} propose a %normalized version of the similarity measure: % $$\mathrm{sim_{Karypis}}(x,y) = % \frac{\sum_{\forall i b_{i,x}} b_{i,y} } % {\mathrm{Freq}(x) \mathrm{Freq}(y)^\alpha}$$ %where $\mathbf{B} = (b_{i,j})$ %is a normalized rating matrix where all rows sum up to $1$. %$\mathrm{Freq}(y)^\alpha$ reduces the problem with rare $x$. % All pairwise similarities are stored in a $n \times n$ similarity matrix $\mat{S}$. %%, %%which is again normalized such that rows sum up to $1$. To reduce the model size to $n \times k$ with $k \ll n$, for each item only a list of the $k$ most similar items and their similarity values are stored. The $k$ items which are most similar to item $i_l$ is denoted by the set $\set{S}(l)$ which can be seen as the neighborhood of size $k$ of the item. Retaining only $k$ similarities per item improves the space and time complexity significantly but potentially sacrifices some recommendation quality~\citep{recommender:Sarwar:2001}. To make a recommendation based on the model we use the similarities to calculate a weighted sum of the user's ratings for related items. \begin{figure} \centerline{\includegraphics[width=9cm]{item-based2}} \caption{Item-based collaborative filtering} \label{fig:IBCF} \end{figure} \begin{equation} \hat{r}_{al} = \frac{1}{\sum_{i \in \set{S}(l)}{s_{li}}} \sum_{i \in \set{S}(l)}{s_{li} r_{ai}} \end{equation} Figure~\ref{fig:IBCF} shows an example for $n=8$ items with $k=3$. For the similarity matrix $\mat{S}$ only the $k=3$ largest entries are stored per row (these entries are marked using bold face). For the example we assume that we have ratings for the active user for items $i_1, i_5$ and $i_8$. The rows corresponding to these items are highlighted in the item similarity matrix. We can now compute the weighted sum using the similarities (only the reduced matrix with the $k=3$ highest ratings is used) and the user's ratings. The result (below the matrix) shows that $i_3$ has the highest estimated rating for the active user. Similar to user-based recommender algorithms, user-bias can be reduced by first normalizing the user-item rating matrix before computing the item-to-item similarity matrix. Item-based CF is more efficient than user-based CF since the model (reduced similarity matrix) is relatively small ($N \times k$) and can be fully precomputed. Item-based CF is known to only produce slightly inferior results compared to user-based CF and higher order models which take the joint distribution of sets of items into account are possible~\citep{recommender:Deshpande:2004}. Furthermore, item-based CF is successfully applied in large scale recommender systems (e.g., by Amazon.com). \subsection{User and Item-Based CF using 0-1 Data} Less research is available for situations where no large amount of detailed directly elicited rating data is available. %~\citep[some notable exceptions are][]{recommender:Weiss:2001,recommender:Mild:2003,recommender:Lee:2005,recommender:Pan:2008}. However, this is a common situation and occurs when users do not want to directly reveal their preferences by rating an item (e.g., because it is to time consuming). In this case preferences can only be inferred by analyzing usage behavior. For example, we can easily record in a supermarket setting what items a customer purchases. However, we do not know why other products were not purchased. The reason might be one of the following. \begin{itemize} \item The customer does not need the product right now. \item The customer does not know about the product. Such a product is a good candidate for recommendation. \item The customer does not like the product. Such a product should obviously not be recommended. \end{itemize} \cite{recommender:Mild:2003} and \cite{recommender:Lee:2005} present and evaluate recommender algorithms for this setting. The same reasoning is true for recommending pages of a web site given click-stream data. Here we only have information about which pages were viewed but not why some pages were not viewed. This situation leads to binary data or more exactly to 0-1 data where 1 means that we inferred that the user has a preference for an item and 0 means that either the user does not like the item or does not know about it. \cite{recommender:Pan:2008} call this type of data in the context of collaborative filtering analogous to similar situations for classifiers \emph{one-class data} since only the 1-class is pure and contains only positive examples. The 0-class is a mixture of positive and negative examples. In the 0-1 case with $r_{jl} \in {0,1}$ where we define: \begin{equation} r_{jl}= \begin{cases} 1& \text{user $u_j$ is known to have a preference for item $i_l$} \\ 0& \text{otherwise.} \end{cases} \end{equation} Two strategies to deal with one-class data is to assume all missing ratings (zeros) are negative examples or to assume that all missing ratings are unknown. %Here will will follow mostly the first strategy based on the assumption that %users typically favor only a small fraction of the items and thus most %items with no rating will be indeed negative examples. However, it has to be %said that In addition, \cite{recommender:Pan:2008} propose strategies which represent a trade-off between the two extreme strategies based on wighted low rank approximations of the rating matrix and on negative example sampling which might improve results across all recommender algorithms. %This is however outside %the scope of this paper. If we assume that users typically favor only a small fraction of the items and thus most items with no rating will be indeed negative examples. then we have no missing values and can use the approaches described above for real valued rating data. However, if we assume all zeroes are missing values, then this lead to the problem that we cannot compute similarities using Pearson correlation or Cosine similarity since the not missing parts of the vectors only contains ones. A similarity measure which only focuses on matching ones and thus prevents the problem with zeroes is the \emph{Jaccard index}: \begin{equation} \mathrm{sim_{Jaccard}}(\set{X},\set{Y}) = \frac{|\set{X}\cap \set{Y}|}{|\set{X}\cup \set{Y}|}, \end{equation} where $\set{X}$ and $\set{Y}$ are the sets of the items with a 1 in user profiles $u_a$ and $u_b$, respectively. The Jaccard index can be used between users for user-based filtering and between items for item-based filtering as described above. %For 0-1 data we suggest, following \cite{recommender:Weiss:2001}, %to calculate the following score. %$$s_{aj} = \sum_{i\in\set{N}(a)}{w_{ai}\, r_{ij}}$$ %This score is not normalized but can be easily used to find the %top-$N$ items with the highest score. %\marginpar{More on this!} \subsection{Recommendations for 0-1 Data Based on Association Rules} \label{sec:AR} Recommender systems using association rules produce recommendations based on a dependency model for items given by a set of association rules~\citep{Fu:2000,Mobasher:2001,Geyer-Schulz:2002,Lin:2002,Demiriz:2004}. The binary profile matrix $\mat{R}$ is seen as a database where each user is treated as a transaction that contains the subset of items in $\set{I}$ with a rating of 1. Hence transaction $k$ is defined as $\set{T}_k = \{i_j \in \set{I} | r_{jk} = 1\}$ and the whole transaction data base is $\set{D} = \{\set{T}_1, \set{T}_2, \ldots, \set{T}_U\}$ where $U$ is the number of users. To build the dependency model, a set of association rules $\set{R}$ is mined from $\mat{R}$. Association rules are rules of the form $\set{X} \rightarrow \set{Y}$ where $\set{X}, \set{Y} \subseteq \set{I}$ and $\set{X} \cap \set{Y} = \emptyset$. For the model we only use association rules with a single item in the right-hand-side of the rule ($|\set{Y}| = 1$). To select a set of useful association rules, thresholds on measures of significance and interestingness are used. Two widely applied measures are: \begin{equation*} \mathrm{support}(\set{X} \rightarrow \set{Y}) = \mathrm{support}(\set{X} \cup \set{Y}) = \mathrm{Freq}(\set{X} \cup \set{Y}) / |\set{D}| = \hat{P}(E_\set{X} \cap E_\set{Y}) \end{equation*} \begin{equation*} \mathrm{confidence}(\set{X} \rightarrow \set{Y}) = \mathrm{support}(\set{X} \cup \set{Y}) / \mathrm{support}(\set{X}) = \hat{P}(E_\set{Y}|E_\set{X}) \end{equation*} $\mathrm{Freq}(\set{I})$ gives the number of transactions in the data base $\set{D}$ that contains all items in $\set{I}$. $E_\set{I}$ is the event that the itemset $\set{I}$ is contained in a transaction. We now require $\mathrm{support}(\set{X} \rightarrow \set{Y}) > s$ and $\mathrm{confidence}(\set{X} \rightarrow \set{Y}) > c$ and also include a length constraint $|\set{X} \cup \set{Y}|\leq l$. The set of rules $\set{R}$ that satisfy these constraints form the dependency model. Although finding all association rules given thresholds on support and confidence is a hard problem (the model grows in the worse case exponential with the number of items), algorithms that efficiently find all rules in most cases are available~\citep[e.g.,][]{arules:Agrawal:1994,arules:Zaki:2000,arules:Han:2004}. Also model size can be controlled by $l$, $s$ and $c$. To make a recommendation for an active user $u_a$ given the set of items $\set{T}_a$ the user likes and the set of association rules $\set{R}$ (dependency model), the following steps are necessary: \begin{enumerate} \item Find all matching rules $\set{X} \rightarrow \set{Y}$ for which $\set{X} \subseteq \set{T}_a$ in $\set{R}$. \item Recommend $N$ unique right-hand-sides ($\set{Y}$) of the matching rules with the highest confidence (or another measure of interestingness). \end{enumerate} The dependency model is very similar to item-based CF with conditional probability-based similarity~\citep{recommender:Deshpande:2004}. It can be fully precomputed and rules with more than one items in the left-hand-side ($\set{X}$), it incorporates higher order effects between more than two items. %Mbasher et al [Mobasher et al. 2000] presented an algorithm for recommending %additional webpages to be visited by a user based on association rules. In this %approach, the historical information about users and their web-access patterns %were mined using a frequent itemset discovery algorithm and were used to %generate a set of high con- fidence association rules. The recommendations were %computed as the union of the consequent of the rules that were supported by the %pages visited by the user. Lin et al [Lin et al. 2000] used a similar approach %but they developed an algorithm that is guaranteed to find association rules %for all the items in the database. Finally, within the context of using %association rules to derive top-N recommendations, Demiriz [Demiriz 2001] %studied the problem of how to weight the different rules that are supported %by the active user. He presented a method that computes the similarity %between a rule and the active user's basket as the product of the confi- %dence of the rule and the Euclidean distance between items in the %antecedent of the association rule and the items in the user¿s basket. He %compared this approach both with the item-based scheme described in Section %4 (based on our preliminary work presented in [Karypis 2001]) and the %dependency network-based algorithm [Heckman et al. 2000]. \subsection{Other collaborative filtering methods} Over time several other model-based approaches have been developed. A popular simple item-based approach is the \emph{Slope One} algorithm \citep{recommender:Lemire:2005}. Another family of algorithms is based on latent factors approach using matrix decomposition~\citep{recommender:Koren:2009}. More recently, deep learning has become a very popular method for flexible matrix completion, matrix factorization and collaborative ranking. A comprehensive survey is presented by~\cite{recommender:Zhang:2019}. These algorithms are outside the scope of this introductory paper. %% FIXME: Expand here \section{Evaluation of Recommender Algorithms} \label{sec:evaluation} Evaluation of recommender systems is an important topic and reviews were presented by \cite{recommender:Herlocker:2004} and \cite{recommender:Gunawardana:2009}. Typically, given a rating matrix $\mat{R}$, recommender algorithms are evaluated by first partitioning the users (rows) in $\mat{R}$ into two sets $\set{U}_\mathit{train}\, \cup\, \set{U}_\mathit{test} = \set{U}$. The rows of $\mat{R}$ corresponding to the training users $U_\mathit{train}$ are used to learn the recommender model. Then each user $u_a \in \set{U}_\mathit{test}$ is seen as an active user. Before creating recommendations some items are withheld from the profile $r_{u_a\cdot}$ and it measured either how well the predicted rating matches the withheld value or, for top-$N$ algorithms, if the items in the recommended list are rated highly by the user. Finally, the evaluation measures calculated for all test users are averaged. %It is assumed that if %a recommender algorithm performed better in predicting the withheld items, %it will also perform better in finding good recommendations for unknown items. To determine how to split $\set{U}$ into $\set{U}_\mathit{train}$ and $\set{U}_\mathit{test}$ we can use several approaches~\citep{recommender:Kohavi:1995}. \begin{itemize} \item {\bf Splitting:} We can randomly assign a predefined proportion of the users to the training set and all others to the test set. \item {\bf Bootstrap sampling:} We can sample from $\set{U}_\mathit{test}$ with replacement to create the training set and then use the users not in the training set as the test set. This procedure has the advantage that for smaller data sets we can create larger training sets and still have users left for testing. \item {\bf $k$-fold cross-validation:} Here we split $\set{U}$ into $k$ sets (called folds) of approximately the same size. Then we evaluate $k$ times, always using one fold for testing and all other folds for leaning. The $k$ results can be averaged. This approach makes sure that each user is at least once in the test set and the averaging produces more robust results and error estimates. \end{itemize} The items withheld in the test data are randomly chosen. \cite{recommender:Breese:1998} introduced the four experimental protocols called {\em Given 2}, {\em Given 5}, {\em Given 10} and {\em All-but-1}. For the Given $x$ protocols for each user $x$ randomly chosen items are given to the recommender algorithm and the remaining items are withheld for evaluation. For All but $x$ the algorithm gets all but $x$ withheld items. In the following we discuss the evaluation of predicted ratings and then of top-$N$ recommendation lists. \subsection{Evaluation of predicted ratings} A typical way to evaluate a prediction is to compute the deviation of the prediction from the true value. This is the basis for the \emph{Mean Average Error (MAE)} \begin{equation} \mathrm{MAE} = \frac{1}{|\set{K}|} \sum_{(j,l)\in \set{K}} |r_{jl} - \hat{r}_{jl}|, \end{equation} where $\set{K}$ is the set of all user-item pairings $(j,l)$ for which we have a predicted rating $\hat{r}_{jl}$ and a known rating $r_{jl}$ which was not used to learn the recommendation model. Another popular measure is the \emph{Root Mean Square Error (RMSE)}. \begin{equation} \mathrm{RMSE} = \sqrt{\frac{\sum_{(j,l)\in \set{K}} (r_{jl} - \hat{r}_{jl})^2}{|\set{K}|}} \end{equation} RMSE penalizes larger errors stronger than MAE and thus is suitable for situations where small prediction errors are not very important. \subsection{Evaluation Top-$N$ recommendations} The items in the predicted top-$N$ lists and the withheld items liked by the user (typically determined by a simple threshold on the actual rating) for all test users $\set{U}_{test}$ can be aggregated into a so called {\em confusion matrix} depicted in table~\ref{tab_confusion} (see \cite{recommender:Kohavi:1998}) which corresponds exactly to the outcomes of a classical statistical experiment. The confusion matrix shows how many of the items recommended in the top-$N$ lists (column predicted positive; $d+b$) were withheld items and thus correct recommendations (cell $d$) and how many where potentially incorrect (cell $b$). The matrix also shows how many of the not recommended items (column predicted negative; $a+c$) should have actually been recommended since they represent withheld items (cell $c$). \begin{table}[tbp] \caption{2x2 confusion matrix \label{tab_confusion} } \center \begin{tabular}{|c||c|c|} \hline {\bf actual / predicted} & {\bf negative} & {\bf positive} \\ \hline \hline {\bf negative} & $a$ & $b$ \\ \hline {\bf positive} & $c$ & $d$ \\ \hline \end{tabular} \end{table} From the confusion matrix several performance measures can be derived. For the data mining task of a recommender system the performance of an algorithm depends on its ability to learn significant patterns in the data set. Performance measures used to evaluate these algorithms have their root in machine learning. A commonly used measure is {\em accuracy,} the fraction of correct recommendations to total possible recommendations. \begin{equation} \mathit{Accuracy} = \frac{\mathit{correct\ recommendations}}{\mathit{total\ possible\ recommendations}} = \frac{a+d}{a+b+c+d} \label{accur} \end{equation} A common error measure is the {\em mean absolute error} ({\em MAE}, also called {\em mean absolute deviation} or {\em MAD}). \begin{equation} \mathit{MAE} = \frac{1}{N}\sum_{i=1}^N{|\epsilon_i|} = \frac{b+c}{a+b+c+d}, \label{mae} \end{equation} where $N = a+b+c+d$ is the total number of items which can be recommended and $|\epsilon_i|$ is the absolute error of each item. Since we deal with 0-1 data, $|\epsilon_i|$ can only be zero (in cells $a$ and $d$ in the confusion matrix) or one (in cells $b$ and $c$). For evaluation recommender algorithms for rating data, the root mean square error is often used. For 0-1 data it reduces to the square root of MAE. Recommender systems help to find items of interest from the set of all available items. This can be seen as a retrieval task known from information retrieval. Therefore, standard information retrieval performance measures are frequently used to evaluate recommender performance. {\em Precision} and {\em recall} are the best known measures used in information retrieval \citep{recommender:Salton:1983,recommender:Rijsbergen:1979}. \begin{equation} \mathit{Precision} = \frac{\mathit{correctly\ recommended\ items}}{\mathit{total\ recommended\ items}} = \frac{d}{b+d} \label{prec} \end{equation} \begin{equation} \mathit{Recall} = \frac{\mathit{correctly\ recommended\ items}}{\mathit{total\ useful\ recommendations}} = \frac{d}{c+d} \label{rec} \end{equation} %% fallout, generality Often the number of {\em total\ useful\ recommendations} needed for recall is unknown since the whole collection would have to be inspected. However, instead of the actual {\em total\ useful\ recommendations} often the total number of known useful recommendations is used. Precision and recall are conflicting properties, high precision means low recall and vice versa. To find an optimal trade-off between precision and recall a single-valued measure like the {\em E-measure} \citep{recommender:Rijsbergen:1979} can be used. The parameter $\alpha$ controls the trade-off between precision and recall. \begin{equation} \mbox{\textit{E-measure}} = \frac{1}{\alpha(1/\mathit{Precision}) + (1-\alpha)(1/\mathit{Recall})} \label{e-measure} \end{equation} A popular single-valued measure is the {\em F-measure.} It is defined as the harmonic mean of precision and recall. \begin{equation} \mbox{\textit{F-measure}} = \frac{2\, \mathit{Precision}\, \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} = \frac{2}{1/\mathit{Precision} + 1/\mathit{Recall}} \label{f-measure} \end{equation} It is a special case of the E-measure with $\alpha=.5$ which places the same weight on both, precision and recall. In the recommender evaluation literature the F-measure is often referred to as the measure {\em F1.} Another method used in the literature to compare two classifiers at different parameter settings is the {\em Receiver Operating Characteristic (ROC)}. The method was developed for signal detection and goes back to the Swets model \citep{recommender:Rijsbergen:1979}. The ROC-curve is a plot of the system's {\em probability of detection} (also called $\mathit{sensitivity}$ or true positive rate TPR which is equivalent to recall as defined in formula \ref{rec}) by the {\em probability of false alarm} (also called false positive rate FPR or $1-\mathit{specificity}$, where $\mathit{specificity} = \frac{a}{a+b}$) with regard to model parameters. A possible way to compare the efficiency of two systems is by comparing the size of the area under the ROC-curve, where a bigger area indicates better performance. %% \cite{recommender:Schein2005} for CROC \section{Recommenderlab Infrastructure} \label{sec:infrastructure} \pkg{recommenderlab} is implemented using formal classes in the \proglang{S4} class system. Figure~\ref{fig:class_diagram} shows the main classes and their relationships. \begin{figure} \centerline{\includegraphics[width=1\textwidth, trim={0 23cm 0 0}, clip]{recommenderlab_classes}} \caption{UML class diagram for package~\pkg{recommenderlab}~\citep{misc:Fowler:2004}.} \label{fig:class_diagram} \end{figure} The package uses the abstract \class{ratingMatrix} to provide a common interface for rating data. \class{ratingMatrix} implements many methods typically available for matrix-like objects. For example, \func{dim}, \func{dimnames}, \func{colCounts}, \func{rowCounts}, \func{colMeans}, \func{rowMeans}, \func{colSums} and \func{rowSums}. Additionally \func{sample} can be used to sample from users (rows) and \func{image} produces an image plot. For \class{ratingMatrix} we provide two concrete implementations \class{realRatingMatrix} and \class{binaryRatingMatrix} to represent different types of rating matrices $\mat{R}$. \class{realRatingMatrix} implements a rating matrix with real valued ratings stored in sparse format defined in package \pkg{Matrix}. Sparse matrices in \pkg{Matrix} typically do not store 0s explicitly, however for \class{realRatingMatrix} we use these sparse matrices such that instead of 0s, NAs are not explicitly stored. \class{binaryRatingMatrix} implements a 0-1 rating matrix using the implementation of \class{itemMatrix} defined in package~\pkg{arules}. \class{itemMatrix} stores only the ones and internally uses a sparse representation from package \pkg{Matrix}. With this class structure \pkg{recommenderlab} can be easily extended to other forms of rating matrices with different concepts for efficient storage in the future. Class \class{Recommender} implements the data structure to store recommendation models. The creator method \begin{center} \code{Recommender(data, method, parameter = NULL)} \end{center} takes data as a \class{ratingMatrix}, a method name and some optional parameters for the method and returns a \class{Recommender} object. Once we have a recommender object, we can predict top-$N$ recommendations for active users using \begin{center} \code{predict(object, newdata, n=10, type=c("topNList", "ratings", "ratingMatrix"), ...)}. \end{center} Predict can return either top-$N$ lists (default setting) or predicted ratings. \code{object} is the recommender object, \code{newdata} is the data for the active users. For top-$N$ lists \code{n} is the maximal number of recommended items in each list and \func{predict} will return an objects of class \class{topNList} which contains one top-$N$ list for each active user. For \code{"ratings"} and \code{"ratingMatrix"}, \code{n} is ignored and an object of \class{realRatingMatrix} is returned. Each row contains the predicted ratings for one active user. The difference is, that for \code{"ratings"}, the items for which a rating exists in \code{newdata} have a \code{NA} instead of a predicted/actual ratings. The actual implementations for the recommendation algorithms are managed using the registry mechanism provided by package \pkg{registry}. The registry called \code{recommenderRegistry} and stores recommendation method names and a short description. %uses the four fields %\code{method} (name of the recommendation algorithm), \code{dataType} (name of %the concrete implementation of \class{ratingMatrix} the algorithm works on), %\code{fun} (function to create a recommender model as an instance of class %\class{Recommender}), and \code{description} (verbal description of the %algorithm) to describe a recommender algorithm. Generally, the registry mechanism is hidden from the user and the creator function \func{Recommender} uses it in the background to map a recommender method name to its implementation. However, the registry can be directly queried by \begin{center} \code{recommenderRegistry\$get\_entries()} \end{center} and new recommender algorithms can be added by the user. We will give and example for this feature in the examples section of this paper. To evaluate recommender algorithms package~\pkg{recommenderlab} provides the infrastructure to create and maintain evaluation schemes stored as an object of class \class{evaluationScheme} from rating data. The creator function \begin{center} \code{evaluationScheme(data, method="split", train=0.9, k=10, given=3)} \end{center} creates the evaluation scheme from a data set using a method (e.g., simple split, bootstrap sampling, $k$-fold cross validation). Testing is perfomed by withholding items (parameter \code{given}). \cite{recommender:Breese:1998} introduced the four experimental witholding protocols called \emph{Given 2, Given 5, Given 10} and \emph{All-but-1.} During testing, the \emph{Given $x$} protocol presents the algorithm with only $x$ randomly chosen items for the test user, and the algorithm is evaluated by how well it is able to predict the withheld items. For \emph{All-but-$x$}, a generalization of \emph{All-but-1}, the algorithm sees all but $x$ withheld ratings for the test user. \code{given} controls x in the evaluations scheme. Positive integers result in a Given $x$ protocol, while negative values produce a All-but-$x$ protocol. The function \func{evaluate} is then used to evaluate several recommender algorithms using an evaluation scheme resulting in a evaluation result list (class~\class{evaluationResultList}) with one entry (class~\class{evaluationResult}) per algorithm. Each object of \class{evaluationResult} contains one or several object of \class{confusionMatrix} depending on the number of evaluations specified in the \class{evaluationScheme} (e.g., $k$ for $k$-fold cross validation). With this infrastructure several recommender algorithms can be compared on a data set with a single line of code. In the following, we will illustrate the usage of \pkg{recommenderlab} with several examples. \section{Examples} \label{sec:examples} This fist few example shows how to manage data in recommender lab and then we create and evaluate recommenders. First, we load the package. <<>>= library("recommenderlab") @ \subsection{Coercion to and from rating matrices} For this example we create a small artificial data set as a matrix. <<>>= m <- matrix(sample(c(as.numeric(0:5), NA), 50, replace=TRUE, prob=c(rep(.4/6,6),.6)), ncol=10, dimnames=list(user=paste("u", 1:5, sep=''), item=paste("i", 1:10, sep=''))) m @ With coercion, the matrix can be easily converted into a realRatingMatrix object which stores the data in sparse format (only non-NA values are stored explicitly; NA values are represented by a dot). <<>>= r <- as(m, "realRatingMatrix") r getRatingMatrix(r) @ The realRatingMatrix can be coerced back into a matrix which is identical to the original matrix. <<>>= identical(as(r, "matrix"),m) @ It can also be coerced into a list of users with their ratings for closer inspection or into a data.frame with user/item/rating tuples. <<>>= as(r, "list") head(as(r, "data.frame")) @ The data.frame version is especially suited for writing rating data to a file (e.g., by \func{write.csv}). Coercion from data.frame (user/item/rating tuples) and list into a sparse rating matrix is also provided. This way, external rating data can easily be imported into \pkg{recommenderlab}. \subsection{Normalization} An important operation for rating matrices is to normalize the entries to, e.g., centering to remove rating bias by subtracting the row mean from all ratings in the row. This is can be easily done using \code{normalize()}. <<>>= r_m <- normalize(r) r_m getRatingMatrix(r_m) @ Normalization can be reversed using \code{denormalize()}. <<>>= denormalize(r_m) @ Small portions of rating matrices can be visually inspected using \code{image()}. <>= image(r, main = "Raw Ratings") image(r_m, main = "Normalized Ratings") @ <>= print(image(r, main = "Raw Ratings")) @ <>= print(image(r_m, main = "Normalized Ratings")) @ \begin{figure} \begin{minipage}[b]{.48\linewidth} \centerline{\includegraphics[width=\linewidth]{recommenderlab-image1}} \end{minipage} \begin{minipage}[b]{.48\linewidth} \centerline{\includegraphics[width=\linewidth]{recommenderlab-image2}} \end{minipage} \caption{Image plot the artificial rating data before and after normalization.} \label{fig:image1} \end{figure} Figure~\ref{fig:image1} shows the resulting plots. \subsection{Binarization of data} A matrix with real valued ratings can be transformed into a 0-1 matrix with \code{binarize()} and a user specified threshold (\code{min\_ratings}) on the raw or normalized ratings. In the following only items with a rating of 4 or higher will become a positive rating in the new binary rating matrix. <<>>= r_b <- binarize(r, minRating=4) r_b as(r_b, "matrix") @ \subsection{Inspection of data set properties} We will use the data set Jester5k for the rest of this section. This data set comes with \pkg{recommenderlab} and contains a sample of 5000 users from the anonymous ratings data from the Jester Online Joke Recommender System collected between April 1999 and May 2003~\citep{recommender:Goldberg:2001}. The data set contains ratings for 100 jokes on a scale from $-10$ to $+10$. All users in the data set have rated 36 or more jokes. <<>>= data(Jester5k) Jester5k @ Jester5k contains \Sexpr{nratings(Jester5k)}~ratings. For the following examples we use only a subset of the data containing a sample of 1000 users (we set the random number generator seed for reproducibility). For random sampling \func{sample} is provided for rating matrices. <<>>= set.seed(1234) r <- sample(Jester5k, 1000) r @ This subset still contains \Sexpr{nratings(r)}~ratings. Next, we inspect the ratings for the first user. We can select an individual user with the extraction operator. <<>>= rowCounts(r[1,]) as(r[1,], "list") rowMeans(r[1,]) @ The user has rated \Sexpr{rowCounts(r[1,])} jokes, the list shows the ratings and the user's rating average is \Sexpr{rowMeans(r[1,])} . Next, we look at several distributions to understand the data better. \code{getRatings()} extracts a vector with all non-missing ratings from a rating matrix. <>= hist(getRatings(r), breaks=100) @ \begin{figure} \centerline{\includegraphics[width=.5\linewidth]{recommenderlab-hist1}} \caption{Raw rating distribution for as sample of Jester.} \label{fig:hist1} \end{figure} In the histogram in Figure~\ref{fig:hist1} shoes an interesting distribution where all negative values occur with a almost identical frequency and the positive ratings more frequent with a steady decline towards the rating 10. Since this distribution can be the result of users with strong rating bias, we look next at the rating distribution after normalization. <>= hist(getRatings(normalize(r)), breaks=100) @ <>= hist(getRatings(normalize(r, method="Z-score")), breaks=100) @ \begin{figure} \begin{minipage}[b]{.48\linewidth} \centerline{\includegraphics[width=\linewidth]{recommenderlab-hist2}} \end{minipage} \begin{minipage}[b]{.48\linewidth} \centerline{\includegraphics[width=\linewidth]{recommenderlab-hist3}} \end{minipage} \caption{Histogram of normalized ratings using row centering (left) and Z-score normalization (right).} \label{fig:hist2} \end{figure} Figure~\ref{fig:hist2} shows that the distribution of ratings ins closer to a normal distribution after row centering and Z-score normalization additionally reduces the variance to a range of roughly $-3$ to $+3$ standard deviations. It is interesting to see that there is a pronounced peak of ratings between zero and two. Finally, we look at how many jokes each user has rated and what the mean rating for each Joke is. <>= hist(rowCounts(r), breaks=50) @ <>= hist(colMeans(r), breaks=20) @ \begin{figure} \begin{minipage}[b]{.48\linewidth} \centerline{\includegraphics[width=\linewidth]{recommenderlab-hist4}} \end{minipage} \begin{minipage}[b]{.48\linewidth} \centerline{\includegraphics[width=\linewidth]{recommenderlab-hist5}} \end{minipage} \caption{Distribution of the number of rated items per user (left) and of the average ratings per joke (right).} \label{fig:hist3} \end{figure} Figure~\ref{fig:hist3} shows that there are unusually many users with ratings around 70 and users who have rated all jokes. The average ratings per joke look closer to a normal distribution with a mean above 0. \subsection{Creating a recommender} A recommender is created using the creator function \func{Recommender}. Available recommendation methods are stored in a registry. The registry can be queried. Here we are only interested in methods for real-valued rating data. <<>>= recommenderRegistry$get_entries(dataType = "realRatingMatrix") @ Next, we create a recommender which generates recommendations solely on the popularity of items (the number of users who have the item in their profile). We create a recommender from the first 1000 users in the Jester5k data set. <<>>= r <- Recommender(Jester5k[1:1000], method = "POPULAR") r @ The model can be obtained from a recommender using \func{getModel}. <<>>= names(getModel(r)) getModel(r)$topN @ In this case the model has a top-$N$ list to store the popularity order and further elements (average ratings, if it used normalization and the used aggregation function). Recommendations are generated by \func{predict} (consistent with its use for other types of models in \proglang{R}). The result are recommendations in the form of an object of class~\class{TopNList}. Here we create top-5 recommendation lists for two users who were not used to learn the model. <<>>= recom <- predict(r, Jester5k[1001:1002], n=5) recom @ The result contains two ordered top-$N$ recommendation lists, one for each user. The recommended items can be inspected as a list. <<>>= as(recom, "list") @ Since the top-$N$ lists are ordered, we can extract sublists of the best items in the top-$N$. For example, we can get the best 3 recommendations for each list using \func{bestN}. <<>>= recom3 <- bestN(recom, n = 3) recom3 as(recom3, "list") @ Many recommender algorithms can also predict ratings. This is also implemented using \func{predict} with the parameter \code{type} set to \code{"ratings"}. <<>>= recom <- predict(r, Jester5k[1001:1002], type="ratings") recom as(recom, "matrix")[,1:10] @ Predicted ratings are returned as an object of \class{realRatingMatrix}. The prediction contains \code{NA} for the items rated by the active users. In the example we show the predicted ratings for the first 10 items for the two active users. Alternatively, we can also request the complete rating matrix which includes the original ratings by the user. <<>>= recom <- predict(r, Jester5k[1001:1002], type="ratingMatrix") recom as(recom, "matrix")[,1:10] @ \subsection{Evaluation of predicted ratings} Next, we will look at the evaluation of recommender algorithms. \pkg{recommenderlab} implements several standard evaluation methods for recommender systems. Evaluation starts with creating an evaluation scheme that determines what and how data is used for training and testing. Here we create an evaluation scheme which splits the first 1000 users in Jester5k into a training set (90\%) and a test set (10\%). For the test set 15 items will be given to the recommender algorithm and the other items will be held out for computing the error. <<>>= e <- evaluationScheme(Jester5k[1:1000], method="split", train=0.9, given=15, goodRating=5) e @ We create two recommenders (user-based and item-based collaborative filtering) using the training data. <<>>= r1 <- Recommender(getData(e, "train"), "UBCF") r1 r2 <- Recommender(getData(e, "train"), "IBCF") r2 @ Next, we compute predicted ratings for the known part of the test data (15 items for each user) using the two algorithms. <<>>= p1 <- predict(r1, getData(e, "known"), type="ratings") p1 p2 <- predict(r2, getData(e, "known"), type="ratings") p2 @ Finally, we can calculate the error between the prediction and the unknown part of the test data. <<>>= error <- rbind( UBCF = calcPredictionAccuracy(p1, getData(e, "unknown")), IBCF = calcPredictionAccuracy(p2, getData(e, "unknown")) ) error @ In this example user-based collaborative filtering produces a smaller prediction error. \subsection{Evaluation of a top-$N$ recommender algorithm} For this example we create a $4$-fold cross validation scheme with the the Given-3 protocol, i.e., for the test users all but three randomly selected items are withheld for evaluation. <<>>= scheme <- evaluationScheme(Jester5k[1:1000], method="cross", k=4, given=3, goodRating=5) scheme @ Next we use the created evaluation scheme to evaluate the recommender method popular. We evaluate top-1, top-3, top-5, top-10, top-15 and top-20 recommendation lists. <<>>= results <- evaluate(scheme, method="POPULAR", type = "topNList", n=c(1,3,5,10,15,20)) results @ The result is an object of class~\class{EvaluationResult} which contains several confusion matrices. \func{getConfusionMatrix} will return the confusion matrices for the 4 runs (we used 4-fold cross evaluation) as a list. In the following we look at the first element of the list which represents the first of the 4 runs. <<>>= getConfusionMatrix(results)[[1]] @ For the first run we have 6 confusion matrices represented by rows, one for each of the six different top-$N$ lists we used for evaluation. $n$ is the number of recommendations per list. TP, FP, FN and TN are the entries for true positives, false positives, false negatives and true negatives in the confusion matrix. The remaining columns contain precomputed performance measures. The average for all runs can be obtained from the evaluation results directly using \func{avg}. <<>>= avg(results) @ Evaluation results can be plotted using \func{plot}. The default plot is the ROC curve which plots the true positive rate (TPR) against the false positive rate (FPR). <>= plot(results, annotate=TRUE) @ \begin{figure} \centerline{\includegraphics[scale=1]{recommenderlab-roc1}} \caption{ROC curve for recommender method POPULAR.} \label{fig:roc1} \end{figure} For the plot where we annotated the curve with the size of the top-$N$ list is shown in Figure~\ref{fig:roc1}. By using \code{"prec/rec"} as the second argument, a precision-recall plot is produced (see Figure~\ref{fig:precrec1}). <>= plot(results, "prec/rec", annotate=TRUE) @ \begin{figure} \centerline{\includegraphics[scale=1]{recommenderlab-precrec1}} \caption{Precision-recall plot for method POPULAR.} \label{fig:precrec1} \end{figure} \subsection{Comparing recommender algorithms} \subsubsection{Comparing top-$N$ recommendations} The comparison of several recommender algorithms is one of the main functions of \pkg{recommenderlab}. For comparison also \func{evaluate} is used. The only change is to use \func{evaluate} with a list of algorithms together with their parameters instead of a single method name. In the following we use the evaluation scheme created above to compare the five recommender algorithms: random items, popular items, user-based CF, item-based CF, and SVD approximation. Note that when running the following code, the CF based algorithms are very slow. For the evaluation we use a ``all-but-5'' scheme. This is indicated by a negative number for \code{given}. <<>>= set.seed(2016) scheme <- evaluationScheme(Jester5k[1:1000], method="split", train = .9, given=-5, goodRating=5) scheme algorithms <- list( "random items" = list(name="RANDOM", param=NULL), "popular items" = list(name="POPULAR", param=NULL), "user-based CF" = list(name="UBCF", param=list(nn=50)), "item-based CF" = list(name="IBCF", param=list(k=50)), "SVD approximation" = list(name="SVD", param=list(k = 50)) ) ## run algorithms results <- evaluate(scheme, algorithms, type = "topNList", n=c(1, 3, 5, 10, 15, 20)) @ The result is an object of class~\class{evaluationResultList} for the five recommender algorithms. <<>>= results @ Individual results can be accessed by list subsetting using an index or the name specified when calling \func{evaluate}. <<>>= names(results) results[["user-based CF"]] @ Again \func{plot} can be used to create ROC and precision-recall plots (see Figures~\ref{fig:roc2} and \ref{fig:precrec2}). Plot accepts most of the usual graphical parameters like pch, type, lty, etc. In addition annotate can be used to annotate the points on selected curves with the list length. <>= plot(results, annotate=c(1,3), legend="bottomright") @ \begin{figure} \centerline{\includegraphics[scale=1]{recommenderlab-roc2}} \caption{Comparison of ROC curves for several recommender methods for the given-3 evaluation scheme.} \label{fig:roc2} \end{figure} <>= plot(results, "prec/rec", annotate=3, legend="topleft") @ \begin{figure} \centerline{\includegraphics[scale=1]{recommenderlab-precrec2}} \caption{Comparison of precision-recall curves for several recommender methods for the given-3 evaluation scheme.} \label{fig:precrec2} \end{figure} For this data set and the given evaluation scheme popular items and the user-based CF methods clearly outperform the other methods. In Figure~\ref{fig:roc2} we see that they dominate (almost completely) the other method since for each length of top-$N$ list they provide a better combination of TPR and FPR. \subsubsection{Comparing ratings} Next, we evaluate not top-$N$ recommendations, but how well the algorithms can predict ratings. <<>>= ## run algorithms results <- evaluate(scheme, algorithms, type = "ratings") @ The result is again an object of class~\class{evaluationResultList} for the five recommender algorithms. <<>>= results @ <>= plot(results, ylim = c(0,100)) @ \begin{figure} \centerline{\includegraphics[scale=1]{recommenderlab-real}} \caption{Comparison of RMSE, MSE, and MAE for recommender methods for the given-3 evaluation scheme.} \label{fig:real} \end{figure} Plotting the results shows a barplot with the root mean square error, the mean square error and the mean absolute error (see Figures~\ref{fig:real}). \subsubsection{Using a 0-1 data set} For comparison we will check how the algorithms compare given less information. We convert the data set into 0-1 data and instead of a all-but-5 we use the given-3 scheme. <<>>= Jester_binary <- binarize(Jester5k, minRating=5) Jester_binary <- Jester_binary[rowCounts(Jester_binary)>20] Jester_binary scheme_binary <- evaluationScheme(Jester_binary[1:1000], method="split", train=.9, k=1, given=3) scheme_binary @ <<>>= results_binary <- evaluate(scheme_binary, algorithms, type = "topNList", n=c(1,3,5,10,15,20)) @ Note that SVD does not implement a method for binary data and is thus skipped. <>= plot(results_binary, annotate=c(1,3), legend="topright") @ \begin{figure} \centerline{\includegraphics[scale=1]{recommenderlab-roc3}} \caption{Comparison of ROC curves for several recommender methods for the given-3 evaluation scheme.} \label{fig:roc3} \end{figure} From Figure~\ref{fig:roc3} we see that given less information, the performance of user-based CF suffers the most and the simple popularity based recommender performs almost a well as item-based CF. Similar to the examples presented here, it is easy to compare different recommender algorithms for different data sets or to compare different algorithm settings (e.g., the influence of neighborhood formation using different distance measures or different neighborhood sizes). %\marginpar{getting data in missing!} %\subsection{Compare rankings} % %Top-$N$ recommendation lists can be considered rankings and can be represented %as $N$-ary order relations ($\leq$) on the set of all items in the lists (items %which do not occur in all lists are added to the end of the corresponding %order). % %{\bf Average distance between methods:} %E.g., the cardinality of the symmetric difference of two relations (the % number of tuples contained in exactly one of two relations) %{\small\citep{Hornik:2008}} % %Total number of tuples in relations: 8649 (for 93 items) % % \begin{verbatim} % UB IB AR Pop N % UB 0 2555 2634 3317 % IB 2555 0 1611 2052 % AR 2634 1611 0 2636 % Pop N 3317 2052 2636 0 % \end{verbatim} % %\small %\begin{verbatim} %u_a = {Toy Story (1995), Air Force One (1997), % Cop Land (1997), Michael (1996), Blade Runner (1982)} % % UB IB AR Top N Consensus* % Contact (1997) 1 4 25 4 4 % Liar Liar (1997) 2 18 32 10 8 % Conspiracy Theory (1997) 3 NA NA NA NA % Star Wars (1977) 4 1 1 1 1 % Return of the Jedi (1983) 5 2 4 2 2 % English Patient, The (1996) 6 26 41 6 12 % Saint, The (1997) 7 NA NA NA NA % Fargo (1996) 8 3 13 3 3 % Mr. Holland's Opus (1995) 9 34 30 39 17 % Scream (1996) 10 20 35 7 10 % \end{verbatim} % % * Consensus method: Condorcet (minimizes the weighted sum % of symmetric difference distance) % % %\subsection{ROCR interface} \subsection{Implementing a new recommender algorithm} Adding a new recommender algorithm to \pkg{recommenderlab} is straight forward since it uses a registry mechanism to manage the algorithms. To implement the actual recommender algorithm we need to implement a creator function which takes a training data set, trains a model and provides a predict function which uses the model to create recommendations for new data. The model and the predict function are both encapsulated in an object of class~\class{Recommender}. For examples look at the files starting with \code{RECOM} in the packages \code{R} directory. A good examples is in \code{RECOM_POPULAR.R}. \section{Conclusion} \label{sec:conclusion} %% FIXME %Being able to use automated recommender systems %offers a big advantage for online %retailers and for many other applications. But %often no extensive data base of rating data is available and %it makes sense to think about using 0-1 data, which is in many %cases easier to obtain, instead. %Unfortunately there is only limited research on collaborative filtering %based recommender systems using 0-1 data available. % In this paper we described the \proglang{R} extension package~\pkg{recommenderlab} which is especially geared towards developing and testing recommender algorithms. %for 0-1 data. The package allows to create evaluation schemes following accepted methods and then use them to evaluate and compare recommender algorithms. \pkg{recommenderlab} currently includes several standard algorithms and adding new recommender algorithms to the package is facilitated by the built in registry mechanism to manage algorithms. % %\pkg{recommenderlab} currently includes several algorithms for 0-1 data, %however, the infrastructure is flexible enough to also %extend to the more conventional non-binary rating data and its algorithms. In the future we will add more and more of these algorithms to the package and we hope that some algorithms will also be contributed by other researchers. %Applying collaborative filtering methods to binary and %especially to 0-1 data is difficult because %the sparsity of data has an adverse effect on similarity %calculation/neighborhood formation. %Also the meaning of $0$ in the data can be either of ``unknown'' or %``dislike'' and thus it is not clear how to best handle it. %Measures like the Jaccard index can be used to mitigate this %problem since they focus mainly on $1$s. %Overall, the quality of recommendations is extremely data set dependent. %Also evaluation is biased since the complete set of ``liked'' items %is unknown. Measured precision is only a lower bound to real precision. %\begin{itemize} %\item Many companies face heterogeneous data sources which can best be %aggregated into binary data. %\end{itemize} %\item Model ``dislike'' in binary data {\small\citep{Mobasher:2001}}. %\item Account for fast changing products % (e.g., vintages differ often significantly and change every year). %\item Investigate recommender engines based on consensus rankings. \section*{Acknowledgments} This research was funded in part by the NSF Industry/University Cooperative Research Center for Net-Centric Software \& Systems. % %\bibliographystyle{abbrvnat} %\bibliography{recommenderlab} % %\bibliographystyle{abbrvnat} \bibliography{recommender,recom_talk,data,stuff,association_rules,recommenderlab_usage} \end{document}