\documentclass[nojss]{jss}

%\documentclass[fleqn, a4paper]{article}
%\usepackage{a4wide}
%\usepackage[round,longnamesfirst]{natbib}
%\usepackage{graphicx,keyval,thumbpdf,url}
%\usepackage{hyperref}
%\usepackage{fancyvrb}
%\usepackage{Sweave}
%\SweaveOpts{strip.white=true}
%\AtBeginDocument{\setkeys{Gin}{width=0.6\textwidth}}

\usepackage[utf8]{inputenc}
\usepackage[english]{babel}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{amsmath}
\usepackage{amsfonts}


%\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\class}[1]{\mbox{\textsf{#1}}}
\newcommand{\func}[1]{\mbox{\texttt{#1()}}}
%\newcommand{\code}[1]{\mbox{\texttt{#1}}}
%\newcommand{\pkg}[1]{\strong{#1}}
%\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'}
%\newcommand{\proglang}[1]{\textsf{#1}}
\newcommand{\set}[1]{\mathcal{#1}}
%\newcommand{\sQuote}[1]{`{#1}'}
%\newcommand{\dQuote}[1]{``{#1}''}
%\newcommand\R{{\mathbb{R}}}
\newcommand{\mat}[1]{{\mathbf{#1}}}
%\renewcommand{\vec}[1]{{\mathbf{#1}}}
%
\sloppy

%% \VignetteIndexEntry{An introduction to the R package recommenderlab}

\author{Michael Hahsler}

\title{\pkg{recommenderlab}: An R Framework for Developing and Testing Recommendation Algorithms}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Michael Hahsler} %% comma-separated
\Plaintitle{recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms} %% without formatting
\Shorttitle{Developing and Testing Recommendation Algorithms with R} %% a short title (if necessary)

\Abstract{Algorithms that create recommendations based on observed data have
significant commercial value for online retailers and many other industries.
Recommender systems has a significant research
community and studying such systems is part of most modern data science
curricula. While there is an abundance of software that implements recommendation algorithms, there is little in terms of supporting recommender system research and education. This paper describes the open-source software~\pkg{recommenderlab} which was created with supporting research and education in mind.
The package can be directly installed in R or downloaded from
\url{https://github.com/mhahsler/recommenderlab}.}

\Keywords{recommender systems, collaborative filtering, evaluation}
\Plainkeywords{recommender systems, collaborative filtering, evaluation} %% without formatting

\Address{
  Michael Hahsler\\
	Computer Science\\
	Lyle School of Engineering\\
	Southern Methodist University\\
	P.O. Box 750122 \\
	Dallas, TX 75275-0122\\
	E-mail: \email{mhahsler@lyle.smu.edu}\\
	URL: \url{michael@hahsler.net}
}


\begin{document}

%\title{\pkg{recommenderlab}: A Framework for Developing and Testing Recommendation Algorithms\footnote{This research was funded in part by the NSF Industry/University Cooperative Research Center for Net-Centric Software \& Systems.}
%}
%\author{Michael Hahsler}
%\maketitle
%\sloppy


% \abstract{
% Algorithms that create recommendations based on observed data have
% significant commercial value for online retailers and many other industries.
% Recommender systems has a significant research
% community and studying such systems is part of most modern data science
% curricula. While there is an abundance of software that implements recommendation algorithms, there is little in terms of supporting recommender system research and education. This paper describes the open-source software~\pkg{recommenderlab} which was created with supporting research and education in mind.
% }

<<echo=FALSE>>=
options(scipen=3, digits=4, prompt="R> ", eps=FALSE, width=75)
### for sampling
set.seed(1234)
@

\section{Introduction}
Recommender
systems apply statistical and knowledge discovery techniques to the problem of
making product recommendations based on previously recorded
usage data~\citep{recommender:Sarwar:2000}.
Creating such automatically generated personalized recommendations for
products including
books, songs, TV shows and movies using collaborative filtering
have come a long way since
Information Lense,
the first system using
\emph{social filtering} was created
more than 30 years ago~\citep{recommender:Malone:1987}.
Today, recommender systems
are a successful technology used by market
leaders in several industries (e.g., by
Amazon, %\footnote{\url{http://www.amazon.com}},
Netflix, and %\footnote{\url{http://www.netflix.com}} and
Pandora). %\footnote{\url{http://www.pandora.com}}).
In retail, such
recommendations can improve conversion rates by helping
the customer to find
products she/he wants to buy faster, promote cross-selling by suggesting
additional products and can improve customer loyalty through creating a
value-added relationship~\citep{recommender:Schafer:2001}.

Even after 30 years, recommender systems still have a very active research
community.
It is often not clear which of the many available algorithms is appropriate for a particular application and new approaches are constantly proposed. Many commercially available software applications implement recommender
algorithms, however,
this paper focuses on software support for recommender systems
research which includes rapid prototyping algorithms and thorough
evaluation and comparison of algorithms.
For this purpose, access to the source code is paramount.
Many open-source projects implementing recommender algorithms have been
initiated over the years. Table~\ref{tab:software} provides links to several popular
open source implementations
which provide code which can be used by researchers.
The extent of (currently available) functionality as well as the target
usage of the available software packages vary greatly and many projects have been abandoned over the years. A comprehensive list of recommender systems software is maintained by~\cite{Jenson:2019}.

%Crab, MyMediaLite and Vogoo PHP LIB aim at providing simple
%recommender systems to be easily integrated into web sites.
%SVDFeature focuses only on matrix factorization..
%LensKit is a relatively new software package with the aim
%to provide reference implementations for common collaborative
%filtering algorithms.
%Finally, Apache Mahout, a machine learning
%library aimed to be scalable to large data sets incorporated
%collaborative filtering algorithms formerly developed under the
%name Taste.

\begin{table}
\begin{tabular}{|p{1.9cm}|p{3cm}|l|l|}
\hline
{\bf Software} & {\bf Description} & {\bf Language} & {\bf URL} \\
\hline
Apache Mahout & Machine learning library includes collaborative filtering &Java& \small\url{http://mahout.apache.org/}\\
\hline
Crab & Components to create recommender systems & Python & \small\url{https://github.com/muricoca/crab}\\
\hline
LensKit & Collaborative filtering algorithms from GroupLens Research & Python & \small\url{http://lenskit.grouplens.org/}\\
\hline
MyMediaLite & Recommender system algorithms. & C\#/Mono & \small\url{http://www.mymedialite.net/}\\
\hline
%RACOFI: A rule-applying collaborative filtering system & Java &  \url{http://lemire.me/fr/abstracts/COLA2003.html}\\
%\hrule
SVDFeature & Toolkit for feature-based matrix factorization& C++ & \small\url{https://www.jmlr.org/papers/v13/chen12a.html}\\
\hline
Vogoo PHP LIB & Collaborative filtering engine for personalizing web sites & PHP & \small\url{http://sourceforge.net/projects/vogoo/} \\

\hline
\end{tabular}
\caption{Recommender System Software freely available for research.}
\label{tab:software}
\end{table}

Most available software focuses on creating
recommender applications for deployment
as a production system or they implement a single method as part of a research
project.
The R extension package~\pkg{recommenderlab}
described in this paper
was designed for a completely different purpose.
It aims at providing a comprehensive research infrastructure for
recommender systems.
The focus is on consistent and efficient data handling,
easy incorporation of existing and new algorithms,
experiment set up and evaluation of the results.
The open-source programming language R, a popular software
environment for statistical computing and data scientists~\citep{R:2018},
%(\url{https://www.r-project.org/}),
is used as the platform since it easily allows the researcher to
either implement and integrate algorithms written in a wide range of programming languages including R, Python, Java, C/C++
and already provides all needed statistical tools
which is important to provide a useful research environment.

Although
developed to support the author's own research and teaching needs
in 2010,
there proved to be a need for research-focused software and the
package \pkg{recommenderlab} turned out to be quite popular.
A healthy community of users who file bug reports, suggest improvements and contribute their own algorithms has grown around the package and is coordinated using the software's GitHub page\footnote{\url{https://github.com/mhahsler/recommenderlab}}.
Authors not related to the developer have written a textbook about how to use the package~\citep{Suresh:2015}. The package is used in several university courses to demonstrate the basics of recommender system development.
Finally, the package was employed by several researchers to develop and test their own algorithms~\citep[e.g.,][]{Chen:2013, Buhl:2016, Beel:2016, Lombardi:2017}.


Package~\pkg{recommenderlab} focuses on collaborative filtering
which is based on the idea that given rating data by many users for many items
(e.g., 1 to 5 stars),
%for movies elicited directly from the users),
one can predict a user's rating for an item
not known to her or him~\citep[see, e.g.,][]{recommender:Goldberg:1992}
or create
for each user a so called top-$N$ lists of recommended
items~\citep[see, e.g.,][]{recommender:Sarwar:2001,recommender:Deshpande:2004}.
The premise is that users who agreed on the rating for some items
typically also tend to agree on the rating for other items.

\pkg{recommenderlab} provides implementations of many popular algorithms, including the following.

\begin{itemize}
    \item {\bf User-based collaborative filtering (UBCF)} predicts ratings by aggregating the ratings of users who have a similar rating history to the active user~\citep{recommender:Goldberg:1992,Resnick:1994,recommender:Shardanand:1995}.
    \item {\bf Item-based collaborative filtering (IBCF)} uses item-to-item similarity based on user ratings to find items that are similar to the items the active user likes~\citep{Kitts:2000,recommender:Sarwar:2001,
recommender:Linden:2003,recommender:Deshpande:2004}.
    \item {\bf Latent factor models} use singular value decomposition (SVD) to
        estimate missing ratings using methods like SVD with column-mean imputation, Funk SVD or alternating least squares~\citep{Hu:2008,recommender:Koren:2009}.
    \item {\bf Association rule-based recommender (AR)} uses association rules to find recommended items~\citep{Fu:2000,Mobasher:2001,Geyer-Schulz:2002,Lin:2002,Demiriz:2004}.
    \item {\bf Popular items (POPULAR)} is a non-personalized algorithm which recommends to all users the most popular items they have not rated yet.
    \item {\bf Randomly chosen items (RANDOM)} creates random recommendations which can be used as a baseline for recommender algorithm evaluation.
    \item {\bf Re-recommend liked items (RERECOMMEND)} recommends items which the user has rated highly in the past. These recommendations can be useful for items that are typically consumed more than once (e.g., listening to songs or buying groceries).
    \item {\bf Hybrid recommendations (HybridRecommender)} aggregates the recommendations of several algorithms~\citep{Cano:2017}.
\end{itemize}

We will discuss some of these algorithms in the rest of the paper. Detailed information
can be found in the
survey book by~\cite{recommender:Desrosiers:2011}.

This rest of this paper is structured as follows. Section~\ref{sec:CF} introduces
collaborative filtering and
some of its popular algorithms.
In section~\ref{sec:evaluation} we discuss the evaluation of recommender
algorithms.
We introduce the infrastructure provided by \pkg{recommenderlab}
in section~\ref{sec:infrastructure}. In section~\ref{sec:examples} we
illustrate the capabilities on the package to create and evaluate
recommender algorithms. We conclude with section~\ref{sec:conclusion}.

\section{Collaborative Filtering}
\label{sec:CF}

To understand the use of the software, a few formal definitions are
necessary.
We will often give examples for a movie recommender, but the examples
generalize to other types of items as well.
Let $\set{U} = \{u_1, u_2, \ldots, u_m\}$ be the set of users
and $\set{I} = \{i_1, i_2, \ldots, i_n\}$ the set of items.
Ratings are stored in a $m \times n$ user-item rating matrix $\mat{R} = (r_{jl})$, where
each row represents a user $u_j$ with $1 \le j\le m$ and columns represent
items $i_l$ with $1 \le l\le n$.
We use $\mathbf{r}_j$ to denote the row vector of $\mat{R}$
with the ratings of user $u_j$.
Ratings use a specific rating scale. For example, Netflix uses 1 to 5 stars.
and estimated ratings are allowed to be within an interval of matching range
(e.g., $[1,5]$).
Typically, only a small fraction of ratings are
known and for many cells in $\mat{R}$, the values are missing.
Missing values represent movies that the user has not rated and potentially also not seen yet.

Collaborative filtering aims to create recommendations for a user
called the active user $u_a \in \set{U}$.
We define the set of items unknown to user $u_a$ as
$\set{I}_a = \set{I} \setminus \{i_l \in \set{I}\;|\;r_{al}\ \textrm{is not missing}\}$.
The two typical tasks are to predict
ratings for all items in $\set{I}_a$ or to create a list containing the best $N$ recommended items from $\set{I}_a$
(i.e., a top-$N$ recommendation list) for
$u_a$.
Predicting all missing ratings means completing the row of the
rating matrix
%$\hat{r}_{a\cdot}$
where the missing values
for items in $\set{I}_a$
are replaced by ratings estimated from other data in $\mat{R}$.
%The estimated ratings are in the same range as the original rating
%(e.g., in the range $[1,5]$ for a five star rating scheme).
From this point of view, recommender systems are related to matrix completion
problem.
%\marginpar{cite}
Creating a top-$N$ list can
be seen as a second step after predicting ratings for all
unknown items in $\set{I}_a$
and then taking the $N$ items with the highest predicted ratings.
Some algorithms skip predicting ratings first and are able to
find the top $N$ items directly.
A list
of top-$N$ recommendations for a user $u_a$ is an partially ordered set
$T_N = (\set{X}, \ge)$, where
$\set{X} \subset \set{I}_a$ and
$|\set{X}| \le N$ ($|\cdot|$ denotes the cardinality of the set).
Note that there may exist cases
where top-$N$ lists contain less than $N$ items. This can happen if
$|\set{I}_a| < N$ or if the CF algorithm is unable to
identify $N$ items to recommend.
The binary relation $\ge$ is defined as
$x\ge y$ if and only if
$\hat{r}_{ax} \ge
\hat{r}_{ay}$ for all $x,y \in \set{X}$. Furthermore we
require that $\forall_{x\in \set{X}} \forall_{y\in \set{I}_a} \quad \hat{r}_{ax} \ge \hat{r}_{ay}$ to ensure that the top-$N$ list contains
only the items with the highest estimated rating.

Typically we deal with a very large number of items
with unknown ratings
which makes first predicting rating values for all of them computationally
expensive.
Some approaches (e.g., rule based approaches) can
predict the top-$N$ list directly without considering all unknown items first.

Collaborative filtering algorithms are typically divided into two groups,
\emph{memory-based CF} and \emph{model-based CF}
algorithms~\citep{recommender:Breese:1998}. Memory-based CF
use the whole (or at least a large sample of the) user database to create
recommendations. The most prominent algorithm is
user-based collaborative filtering.
The disadvantages of this approach is scalability since the whole
user database has to be processed online for creating recommendations.
Model-based algorithms
use the user database to learn a more compact model (e.g, clusters
with users of similar preferences) that is later used to create
recommendations.

In the following we will present the basics of well known memory and model-based
collaborative filtering algorithms. Further information about
these algorithms can be found in the recent
survey book chapter by~\cite{recommender:Desrosiers:2011}.

\subsection{User-based Collaborative Filtering}

User-based CF~\citep{recommender:Goldberg:1992,Resnick:1994,recommender:Shardanand:1995} is
a memory-based algorithm which tries to mimics word-of-mouth by analyzing
rating data from many individuals. The assumption is that
users with similar preferences will rate items similarly. Thus
missing ratings for a user can be predicted by first
finding a \emph{neighborhood} of similar users and then aggregate the
ratings of these users to form a prediction.

The neighborhood is defined in terms of similarity between users,
either by taking a given number of most similar users ($k$ nearest neighbors)
or all users within
a given similarity threshold.
Popular similarity measures for CF are
the \emph{Pearson correlation coefficient} and
the \emph{Cosine similarity}. These measures are defined
between two vectors with ratings $\mathbf{x}$ and $\mathbf{y}$.

\begin{equation}
    \mathrm{m_{Pearson}}(\mathbf{x},\mathbf{y}) =
    \frac{1}{n - 1}
    \sum_{l = 1}^n
    \left( \frac{x_l - \bar{\mathbf{x}}}{s_x} \right)
    \left( \frac{y_l - \bar{\mathbf{y}}}{s_y} \right)
    \label{equ:pearson}
\end{equation}
and
\begin{equation}
    \mathrm{m_{Cosine}}(\mathbf{x},\mathbf{y}) =
	    \frac{\mathbf{x}\cdot\mathbf{y}}
		    {\|\mathbf{x}\| \|\mathbf{y}\|},
		    \label{equ:cosine}
\end{equation}
where $n$ is the number of elements in the rating vectors and missing ratings are skipped in the calculation. $s_x$ and $s_y$  is the standard deviation and
$\|\cdot\|$ is the $l^2$-norm of a vector. To calculate the measure between two users, $u_x$ and $u_y$,
$\mathbf{x} = \mathbf{r}_{x}$ and
$\mathbf{y} = \mathbf{r}_{y}$ represent the
row vectors in $\mat{R}$ with the
two users' profile vectors are used.
For calculating similarity using rating data only the dimensions (items)
are used which were rated by both users. Note that cosine and
correlation are in the range $[-1, 1]$, but similarity measures need to
be in the range of $[0, 1]$. We, use the transformation

$$s = \frac{m + 1}{2}.$$

Using the similarity,
the neighborhood for the active user $\set{N}(a) \subset \set{U}$
can be selected
by either a threshold on the similarity or by taking the
$k$ nearest neighbors.
Once the users in the neighborhood are found, their ratings are
aggregated to form the predicted rating for the active user $u_a$.
The easiest form is to just average the ratings in the neighborhood.
For item $i_l$ this is

\begin{equation}
\hat{r}_{al}=\frac{1}{|\set{N}(a)|} \sum_{i\in\set{N}(a)} r_{il}.
\label{equ:aggregation1}
\end{equation}

\begin{figure}
\centerline{\includegraphics[width=13cm]{user-based}}
\caption{User-based collaborative filtering example with (a) rating matrix $R$
and estimated ratings for the active user, (b), similarites between the active user and the other users $s_a$ (Euclidean distance converted to similarities),
    and
(b) the user neighborhood formation.
}
\label{fig:UBCF}
\end{figure}

An example of the process of creating recommendations by
user-based CF is shown in Figure~\ref{fig:UBCF}. To the left is the rating
matrix $\mat{R}$ with 6 users and 8 items and ratings in the
range 1 to 5 (stars).
We want to create recommendations for the active user $u_a$
shown at the bottom of the matrix.  To find the
$k$-neighborhood (i.e., the $k$ nearest neighbors) we calculate the similarity
between the active user and all other users
based on their ratings
in the database and then select the
$k$ users with the highest similarity. To the right in Figure~\ref{fig:UBCF}
we see a $2$-dimensional representation of the similarities (users with higher
similarity are displayed closer) with the active user in the center.
The $k=3$ nearest
neighbors ($u_1, u_2$ and $u_3$) are selected and marked in the
database to the left. To generate an
aggregated estimated rating, we
compute the average ratings in the neighborhood for each item not
rated by the active user.
To create a top-$N$ recommendation list, the items are ordered by
predicted rating.
In the small example in
Figure~\ref{fig:UBCF} the order in the top-$N$ list (with $N \ge 4$)
is $i_2, i_1, i_7$ and $i_5$. However, for a real application we
probably would
not recommend items $i_7$ and $i_5$ because of their low ratings.

The fact that some users in the neighborhood are more similar to
the active user than others (see Figure~\ref{fig:UBCF} (b)) can be
incorporated as weights into
Equation~(\ref{equ:aggregation1}).

\begin{equation}
\hat{r}_{al}=\frac{1}{\sum_{i\in\set{N}(a)} s_{ai}} \sum_{i\in\set{N}(a)} s_{ai} r_{il}
\label{equ:aggregation2}
\end{equation}
$s_{ai}$ is the similarity between the active user~$u_a$ and user
$u_i$ in the neighborhood.

For rating data, the performance of the recommender algorithm can
be improved by removing user rating bias where some users tend to always use higher ratings while others tend to use lower ratings.
This can be done by normalizing the rating data before applying the
recommender algorithm.
Any normalization
function $h: \mathbb{R}^{n \times m} \mapsto \mathbb{R}^{n \times m}$
can be used for preprocessing. Such a function needs to be reversible by
$h^{-1}$
to map the predicted rating on the normalized scale back
to the original rating scale.
Normalization is used to remove individual rating bias by
users who consistently always use lower or higher
ratings than other users.
The most popular method is to center the rows of the user-item rating matrix
by
$$h(r_{jl}) = r_{jl} - \bar{\mathbf{r}}_j,$$
where $\bar{r}_j$ is the mean of all available ratings in row~$j$ of
the user-item rating matrix $\mat{R}$.
This means that ratings are now measured for each user by how much they are
above or below the user's average rating.
The inverse is simply,

$$h^{-1}(h(r_{jl})) = h(r_{jl}) + \bar{\mathbf{r}}_j = r_{jl}.$$

Other methods like Z-score normalization which also takes rating variance
into account can be found in the
literature \citep[see, e.g.,][]{recommender:Desrosiers:2011}.


The two main problems of user-based CF are that the whole
user database has to be kept in memory and that
expensive similarity computation between the active user and
all other users in the database has to be performed.


\subsection{Item-based Collaborative Filtering}
Item-based CF~\citep{Kitts:2000,recommender:Sarwar:2001,
recommender:Linden:2003,recommender:Deshpande:2004}
is a model-based approach which produces recommendations
based on the relationship between items inferred from the rating
matrix. The assumption behind this approach is that users will prefer items
that are similar to other items they like.

The model-building step consists of calculating a similarity matrix
containing all item-to-item similarities using a given similarity measure.
Popular are again Pearson correlation and Cosine similarity and
Equations~\ref{equ:pearson} and \ref{equ:cosine} are used. Here the rating vectors
$x$ and $y$ are columns of $\mat{R}$ representing the ratings for two items.

%% FIXME for real data
%For item-based CF, \cite{recommender:Deshpande:2004} proposed
%a \emph{Conditional probability-based
%similarity} defined as:
%	$$\mathrm{sim_{Conditional}}(x,y) =
%	    \frac{\mathrm{Freq}(xy) }
%		    {\mathrm{Freq}(x)} = \hat{P}(y|x),$$
%where $x, y \in \set{I}$ are two items and
%$\mathrm{Freq(\cdot)}$ is the number of users with the given items in their
%profile. This similarity is in fact an estimate of the conditional
%probability to see item $y$ in a profile given that the profile contains
%items $x$. This similarity is equivalent to the confidence measure used
%for association rules (see section~\ref{sec:AR} below). A drawback
%of this similarity measure is its sensitivity to the frequency of $x$ with
%rare $x$ producing high similarities.
%To reduce the sensitivity, \cite{recommender:Deshpande:2004} propose a
%normalized version of the similarity measure:
%	$$\mathrm{sim_{Karypis}}(x,y) =
%	    \frac{\sum_{\forall i b_{i,x}} b_{i,y} }
%	    {\mathrm{Freq}(x) \mathrm{Freq}(y)^\alpha}$$
%where $\mathbf{B} = (b_{i,j})$
%is a normalized rating matrix where all rows sum up to $1$.
%$\mathrm{Freq}(y)^\alpha$ reduces the problem with rare $x$.
%
All pairwise similarities are stored in a $n \times n$
similarity matrix $\mat{S}$.
%%,
%%which is again normalized such that rows sum up to $1$.
To reduce the model size
to $n \times k$ with $k \ll n$,
for each item only a list of the $k$ most similar
items and their similarity values are stored.
The $k$ items which are most similar to item $i_l$ is
denoted by the set $\set{S}(l)$ which can be seen as the
neighborhood of size $k$ of the item.
Retaining only $k$ similarities per item improves the
space and time complexity significantly but potentially sacrifices
some recommendation quality~\citep{recommender:Sarwar:2001}.

To make a recommendation based on the model
we use the similarities to calculate a weighted sum of the user's ratings
for related items.


\begin{figure}
\centerline{\includegraphics[width=9cm]{item-based2}}
\caption{Item-based collaborative filtering}
\label{fig:IBCF}
\end{figure}

\begin{equation}
    \hat{r}_{al} =  \frac{1}{\sum_{i \in \set{S}(l)}{s_{li}}}
    \sum_{i \in \set{S}(l)}{s_{li} r_{ai}}
\end{equation}


Figure~\ref{fig:IBCF} shows an example for $n=8$ items with $k=3$. For the
similarity matrix $\mat{S}$ only the $k=3$ largest entries are
stored per row (these entries are marked using bold face).
For the example we assume that we have ratings for the
active user for items
$i_1, i_5$ and $i_8$. The rows corresponding to these items are highlighted
in the item similarity matrix. We can now compute the weighted sum using
the similarities (only the reduced matrix with the $k=3$
highest ratings is used) and the user's ratings.
The result (below the matrix) shows that $i_3$ has the highest estimated
rating for the active user.

Similar to user-based recommender algorithms, user-bias can be reduced
by first normalizing the user-item rating matrix before computing the
item-to-item similarity matrix.


Item-based CF is more efficient than user-based CF since the
model (reduced similarity matrix) is relatively small ($N \times k$) and
can be fully precomputed. Item-based CF is known to
only produce slightly inferior results compared to user-based
CF and higher order models
which take the joint distribution of
sets of items into account
are possible~\citep{recommender:Deshpande:2004}.
Furthermore, item-based CF is successfully applied
in large scale recommender systems (e.g., by Amazon.com).


\subsection{User and Item-Based CF using 0-1 Data}

Less research
is available for situations where no large amount of detailed
directly elicited rating data is available.
%~\citep[some notable exceptions are][]{recommender:Weiss:2001,recommender:Mild:2003,recommender:Lee:2005,recommender:Pan:2008}.
However, this is a common situation and occurs when
users do not want to directly reveal their preferences by rating an item
(e.g., because it is to time consuming). In
this case preferences can only be inferred by analyzing usage behavior.
For
example, we can easily record in a supermarket setting what items a customer
purchases. However, we do not know why other products were not purchased.
The reason might be one of the following.
\begin{itemize}
\item The customer does not need the product right now.
\item The customer does not know about the product. Such a product
is a good candidate for recommendation.
\item The customer does not like the product. Such a product should
obviously not
be recommended.
\end{itemize}

\cite{recommender:Mild:2003} and \cite{recommender:Lee:2005}
present and evaluate
recommender algorithms for this setting.
The same reasoning is true for
recommending pages of a web site given click-stream data. Here we only have
information about which pages were viewed but not why some pages were not
viewed. This situation leads to
binary data or more exactly to
0-1 data where 1 means that we inferred that
the user has a preference for an item and 0 means that either the user does not like the item or does not know about it.
\cite{recommender:Pan:2008} call
this type of data in the context of collaborative filtering
analogous to similar situations for classifiers
\emph{one-class data} since only the 1-class is pure and contains only
positive examples. The 0-class is a mixture of positive and negative examples.


In the 0-1 case with $r_{jl} \in {0,1}$ where we define:
\begin{equation}
r_{jl}=
\begin{cases}
1& \text{user $u_j$ is known to have a preference for item $i_l$} \\
0& \text{otherwise.}
\end{cases}
\end{equation}

Two strategies to deal with one-class data is to assume all missing ratings
(zeros) are negative examples or to assume that all missing ratings are unknown.
%Here will will follow mostly the first strategy based on the assumption that
%users typically favor only a small fraction of the items and thus most
%items with no rating will be indeed negative examples. However, it has to be
%said that
In addition,
\cite{recommender:Pan:2008} propose strategies which represent a
trade-off between the two extreme strategies based on wighted low rank
approximations of the rating matrix and on negative example sampling which
might improve results across all recommender algorithms.
%This is however outside
%the scope of this paper.

If we assume that users  typically favor only a small fraction of the items and thus most items with no rating will be indeed negative examples.
then we have no missing values and can use the approaches
described above for real valued rating data. However,
if we assume all zeroes are missing values, then this lead
to the problem that we cannot compute similarities using
Pearson correlation or Cosine similarity since the
not missing parts of the vectors only contains ones.
A similarity measure which
only focuses on matching ones
and thus prevents the problem with zeroes
is the \emph{Jaccard index}:
\begin{equation}
\mathrm{sim_{Jaccard}}(\set{X},\set{Y}) =
\frac{|\set{X}\cap \set{Y}|}{|\set{X}\cup \set{Y}|},
\end{equation}
where
$\set{X}$ and $\set{Y}$ are the sets of the items with a 1 in user profiles
$u_a$ and $u_b$, respectively.
The Jaccard index can be used between users for user-based filtering
and between items for item-based filtering as described above.

%For 0-1 data we suggest, following \cite{recommender:Weiss:2001},
%to calculate the following score.
%$$s_{aj} = \sum_{i\in\set{N}(a)}{w_{ai}\, r_{ij}}$$
%This score is not normalized but can be easily used to find the
%top-$N$ items with the highest score.
%\marginpar{More on this!}


\subsection{Recommendations for 0-1 Data Based on Association Rules}
\label{sec:AR}
Recommender systems using association rules
produce recommendations based on a dependency model for items
given by a set of association
rules~\citep{Fu:2000,Mobasher:2001,Geyer-Schulz:2002,Lin:2002,Demiriz:2004}.
The binary profile matrix $\mat{R}$ is seen as a database
where each user is treated as a transaction that contains
the subset of items in $\set{I}$ with a rating of 1.
Hence transaction $k$ is defined as
$\set{T}_k = \{i_j \in \set{I} | r_{jk} = 1\}$ and
the whole transaction data base is
$\set{D} = \{\set{T}_1, \set{T}_2, \ldots, \set{T}_U\}$ where $U$ is the number of users.
To build the dependency model,
a set of association rules $\set{R}$ is mined from
$\mat{R}$. Association rules are rules of the form
$\set{X} \rightarrow \set{Y}$  where $\set{X}, \set{Y} \subseteq \set{I}$
and $\set{X} \cap \set{Y} = \emptyset$.
For the model we only use association rules with a single item in
the right-hand-side of the rule ($|\set{Y}| = 1$).
To select a set of useful association rules,
thresholds on measures of significance and interestingness are used. Two widely applied measures are:
\begin{equation*}
\mathrm{support}(\set{X} \rightarrow \set{Y}) =
\mathrm{support}(\set{X} \cup \set{Y}) =
\mathrm{Freq}(\set{X} \cup \set{Y}) / |\set{D}| = \hat{P}(E_\set{X} \cap E_\set{Y})
\end{equation*}
\begin{equation*}
\mathrm{confidence}(\set{X} \rightarrow \set{Y}) = \mathrm{support}(\set{X} \cup \set{Y}) / \mathrm{support}(\set{X}) = \hat{P}(E_\set{Y}|E_\set{X})
\end{equation*}

$\mathrm{Freq}(\set{I})$ gives the number of transactions
in the data base $\set{D}$ that contains all items in $\set{I}$.
$E_\set{I}$ is the event that the itemset $\set{I}$ is contained in a transaction.

We now require $\mathrm{support}(\set{X} \rightarrow \set{Y}) > s$ and
$\mathrm{confidence}(\set{X} \rightarrow \set{Y}) > c$
and also include a length constraint $|\set{X} \cup \set{Y}|\leq l$.
The set of rules $\set{R}$ that satisfy these constraints form the dependency
model. Although finding all association rules given thresholds
on support and confidence is a hard problem (the model grows in the
worse case exponential with the number of items), algorithms that efficiently
find all rules in most cases are
available~\citep[e.g.,][]{arules:Agrawal:1994,arules:Zaki:2000,arules:Han:2004}. Also model size can be controlled by $l$, $s$ and $c$.


To make a recommendation for an active user $u_a$ given the
set of items $\set{T}_a$ the user likes and the
set of association rules $\set{R}$ (dependency model),
the following steps are necessary:
\begin{enumerate}
\item Find all matching rules $\set{X} \rightarrow \set{Y}$ for
which $\set{X} \subseteq \set{T}_a$
in $\set{R}$.
\item Recommend $N$ unique right-hand-sides ($\set{Y}$) of the matching rules
with the highest confidence (or another measure of interestingness).
\end{enumerate}

The dependency model is very similar to
item-based CF with conditional probability-based
similarity~\citep{recommender:Deshpande:2004}. It can be fully precomputed
and rules with more than one items in the left-hand-side ($\set{X}$),
it incorporates higher order effects between more than two items.
%Mbasher et al [Mobasher et al. 2000] presented an algorithm for recommending
%additional webpages to be visited by a user based on association rules. In this
%approach, the historical information about users and their web-access patterns
%were mined using a frequent itemset discovery algorithm and were used to
%generate a set of high con- fidence association rules. The recommendations were
%computed as the union of the consequent of the rules that were supported by the
%pages visited by the user. Lin et al [Lin et al. 2000] used a similar approach
%but they developed an algorithm that is guaranteed to find association rules
%for all the items in the database. Finally, within the context of using
%association rules to derive top-N recommendations, Demiriz [Demiriz 2001]
%studied the problem of how to weight the different rules that are supported
%by the active user. He presented a method that computes the similarity
%between a rule and the active user's basket as the product of the confi-
%dence of the rule and the Euclidean distance between items in the
%antecedent of the association rule and the items in the user¿s basket. He
%compared this approach both with the item-based scheme described in Section
%4 (based on our preliminary work presented in [Karypis 2001]) and the
%dependency network-based algorithm [Heckman et al. 2000].

\subsection{Other collaborative filtering methods}

Over time several other model-based approaches have been developed.
A popular simple item-based approach is the \emph{Slope One}
algorithm \citep{recommender:Lemire:2005}.
Another family of algorithms is based on latent factors approach using
matrix decomposition~\citep{recommender:Koren:2009}.
More recently, deep learning has become a very popular method for
flexible matrix completion, matrix factorization and collaborative ranking. A
comprehensive survey is presented by~\cite{recommender:Zhang:2019}.

These algorithms are outside the scope of this introductory paper.
%% FIXME: Expand here

\section{Evaluation of Recommender Algorithms}
\label{sec:evaluation}

Evaluation of recommender systems is an important topic and
reviews were presented
by \cite{recommender:Herlocker:2004}
and \cite{recommender:Gunawardana:2009}.
Typically,
given a rating matrix $\mat{R}$,
recommender algorithms are evaluated by first
partitioning the users (rows) in $\mat{R}$ into two
sets $\set{U}_\mathit{train}\, \cup\, \set{U}_\mathit{test} = \set{U}$.
The rows of $\mat{R}$ corresponding to the training users $U_\mathit{train}$
are used to learn the recommender model.
Then each user $u_a \in \set{U}_\mathit{test}$ is seen as an active user.
Before creating recommendations some items are withheld from the
profile $r_{u_a\cdot}$ and it measured
either how well the predicted rating matches the withheld value
or, for top-$N$ algorithms, if the items in the recommended
list are rated highly by the user. Finally, the evaluation measures calculated for all
test users are averaged.

%It is assumed that if
%a recommender algorithm performed better in predicting the withheld items,
%it will also perform better in finding good recommendations for unknown items.

To determine how to split $\set{U}$ into
$\set{U}_\mathit{train}$ and $\set{U}_\mathit{test}$ we can use
several approaches~\citep{recommender:Kohavi:1995}.
\begin{itemize}
\item {\bf Splitting:}
We can randomly assign a predefined proportion of the users
to the training set and all others to the test set.
\item {\bf Bootstrap sampling:}
We can sample from $\set{U}_\mathit{test}$ with replacement
to create the training set and then use the users not in the training set as
the test set. This procedure has the advantage that for smaller data sets
we can create larger training sets and still have users left for testing.
\item {\bf $k$-fold cross-validation:} Here we split $\set{U}$ into $k$ sets
(called folds) of approximately the same size. Then we evaluate $k$ times,
always using one fold for testing and all other folds for leaning. The $k$
results can be averaged. This approach makes sure that each user is at least
once in the test set and the averaging produces more robust results and error
estimates.
\end{itemize}

The items withheld in the test data are randomly chosen.
\cite{recommender:Breese:1998} introduced the four experimental protocols
called {\em Given 2}, {\em Given 5}, {\em Given 10} and {\em All-but-1}. For
the Given $x$ protocols for each user $x$ randomly chosen items are given to
the recommender algorithm and the remaining items are withheld for evaluation.
For All but $x$ the algorithm gets all but $x$ withheld items.

In the following we discuss the evaluation of predicted ratings
and then of top-$N$ recommendation lists.

\subsection{Evaluation of predicted ratings}

A typical way to evaluate a prediction is to compute
the deviation of the prediction from the true value.
This is the basis for the \emph{Mean
Average Error (MAE)}

\begin{equation}
\mathrm{MAE} = \frac{1}{|\set{K}|} \sum_{(j,l)\in \set{K}} |r_{jl} - \hat{r}_{jl}|,
\end{equation}
where $\set{K}$ is the set of all user-item pairings  $(j,l)$
for which we have a predicted rating $\hat{r}_{jl}$ and a known rating
$r_{jl}$ which was not used to learn the recommendation model.

Another popular measure is the \emph{Root Mean Square Error (RMSE)}.
\begin{equation}
\mathrm{RMSE} = \sqrt{\frac{\sum_{(j,l)\in \set{K}} (r_{jl} - \hat{r}_{jl})^2}{|\set{K}|}}
\end{equation}

RMSE penalizes larger errors stronger than MAE and thus is suitable for
situations where small prediction errors are not very important.

\subsection{Evaluation Top-$N$ recommendations}

The items in the predicted top-$N$ lists and the
withheld items  liked by the user (typically determined by a simple threshold
on the actual rating) for all test users $\set{U}_{test}$
can be aggregated into a so called {\em confusion matrix} depicted in
table~\ref{tab_confusion} (see \cite{recommender:Kohavi:1998})
which corresponds
exactly to the outcomes of a classical statistical experiment.
The confusion matrix shows how many of the items recommended in
the top-$N$ lists (column predicted positive; $d+b$)
were withheld items and thus correct recommendations
(cell $d$) and how many where potentially incorrect (cell $b$).
The matrix also shows
how many of the not recommended items
(column predicted negative; $a+c$)
should have actually been recommended since they represent withheld items (cell $c$).

\begin{table}[tbp]
\caption{2x2 confusion matrix
\label{tab_confusion}
}
\center
\begin{tabular}{|c||c|c|}
\hline
{\bf actual / predicted} & {\bf negative}  & {\bf positive} \\
\hline
\hline
{\bf negative} & $a$ & $b$ \\
\hline
{\bf positive} & $c$ & $d$ \\
\hline
\end{tabular}
\end{table}

From the confusion matrix several performance measures can be derived.
For the data mining task of a recommender system
the performance of an algorithm depends on its ability
to learn significant patterns in the data set.
Performance measures used to evaluate these algorithms
have their root in machine learning.
A commonly used measure is
{\em accuracy,} the fraction of correct
recommendations to total possible recommendations.

\begin{equation}
\mathit{Accuracy} = \frac{\mathit{correct\ recommendations}}{\mathit{total\ possible\ recommendations}}
=  \frac{a+d}{a+b+c+d}
\label{accur}
\end{equation}

A common error measure is the {\em mean absolute error} ({\em MAE}, also called {\em mean absolute deviation} or {\em MAD}).
\begin{equation}
\mathit{MAE} = \frac{1}{N}\sum_{i=1}^N{|\epsilon_i|} = \frac{b+c}{a+b+c+d},
\label{mae}
\end{equation}
where $N = a+b+c+d$ is the total number of items
which can be recommended and $|\epsilon_i|$ is the
absolute error of each item.
Since we deal with 0-1 data, $|\epsilon_i|$ can only be zero (in cells $a$ and $d$ in the confusion matrix) or one (in cells $b$ and $c$).
For evaluation recommender algorithms for rating data, the root mean square
error is often used. For 0-1 data it reduces to the square
root of MAE.

Recommender systems help to find items of interest from the set of all
available items. This can be seen as a retrieval task known from
information retrieval. Therefore, standard information retrieval
performance measures are frequently used to evaluate recommender performance.
{\em Precision} and {\em recall} are the best known measures used in
information retrieval \citep{recommender:Salton:1983,recommender:Rijsbergen:1979}.

\begin{equation}
\mathit{Precision} = \frac{\mathit{correctly\ recommended\ items}}{\mathit{total\ recommended\ items}}
= \frac{d}{b+d}
\label{prec}
\end{equation}

\begin{equation}
\mathit{Recall} = \frac{\mathit{correctly\ recommended\ items}}{\mathit{total\ useful\ recommendations}}
= \frac{d}{c+d}
\label{rec}
\end{equation}

%% fallout, generality

Often the number of {\em total\ useful\ recommendations} needed for recall is
unknown since the whole collection would have to be inspected. However, instead
of the actual {\em total\ useful\ recommendations} often the total number of
known useful recommendations is used.  Precision and recall are
conflicting properties, high precision means low recall and vice versa. To find
an optimal trade-off between precision and recall a single-valued measure like
the {\em E-measure} \citep{recommender:Rijsbergen:1979} can be used.
The parameter $\alpha$ controls the
trade-off between precision and recall.
\begin{equation}
\mbox{\textit{E-measure}} = \frac{1}{\alpha(1/\mathit{Precision}) + (1-\alpha)(1/\mathit{Recall})}
\label{e-measure}
\end{equation}

A popular single-valued measure is the {\em F-measure.} It is defined as the
harmonic mean of precision and recall.
\begin{equation}
\mbox{\textit{F-measure}} = \frac{2\, \mathit{Precision}\, \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} =
\frac{2}{1/\mathit{Precision} + 1/\mathit{Recall}}
\label{f-measure}
\end{equation}
It is
a special case of the E-measure with $\alpha=.5$ which places the same weight
on both, precision and recall.  In the recommender evaluation literature the
F-measure is often referred to as the measure {\em F1.}

Another method used in the literature to compare two classifiers at different
parameter settings is the {\em Receiver Operating Characteristic (ROC)}.
The method was developed for signal detection and goes back to the Swets model
\citep{recommender:Rijsbergen:1979}. The ROC-curve is a plot of the system's
{\em probability of detection} (also called $\mathit{sensitivity}$ or true
positive rate TPR which is equivalent to recall as defined in
formula \ref{rec}) by the {\em probability of false alarm}
(also called false positive rate FPR or
$1-\mathit{specificity}$, where
$\mathit{specificity} = \frac{a}{a+b}$) with
regard to model parameters.  A possible way to compare the efficiency of two
systems is by comparing the size of the area under the ROC-curve, where a
bigger area indicates better performance.
%% \cite{recommender:Schein2005} for CROC


\section{Recommenderlab Infrastructure}
\label{sec:infrastructure}
\pkg{recommenderlab} is implemented using formal classes in
the \proglang{S4} class system.
Figure~\ref{fig:class_diagram} shows the main classes and their relationships.

\begin{figure}
    \centerline{\includegraphics[width=1\textwidth, trim={0 23cm 0 0}, clip]{recommenderlab_classes}}
\caption{UML class diagram for
package~\pkg{recommenderlab}~\citep{misc:Fowler:2004}.}
\label{fig:class_diagram}
\end{figure}

The package uses the abstract \class{ratingMatrix}
to provide a common interface for rating data. \class{ratingMatrix}
implements many methods typically available for matrix-like
objects. For example, \func{dim}, \func{dimnames},
\func{colCounts}, \func{rowCounts},
\func{colMeans}, \func{rowMeans},
\func{colSums} and \func{rowSums}.
Additionally \func{sample} can be used to sample from users (rows) and
\func{image} produces an image plot.

For \class{ratingMatrix} we provide two concrete
implementations \class{realRatingMatrix} and
\class{binaryRatingMatrix} to represent different types of rating
matrices $\mat{R}$.
\class{realRatingMatrix} implements a rating matrix with real valued
ratings stored in sparse format defined in package \pkg{Matrix}.
Sparse matrices in \pkg{Matrix} typically do not store 0s explicitly,
however for \class{realRatingMatrix} we use these sparse matrices
such that instead of 0s, NAs are not explicitly stored.

\class{binaryRatingMatrix} implements a 0-1 rating matrix using the
implementation of \class{itemMatrix} defined in package~\pkg{arules}.
\class{itemMatrix} stores only the ones and internally uses a sparse
representation from package \pkg{Matrix}.  With this class structure
\pkg{recommenderlab} can be easily extended to other forms of rating matrices
with different concepts for efficient storage in the future.

Class \class{Recommender} implements the data structure to store recommendation
models. The creator method

\begin{center}
\code{Recommender(data, method, parameter = NULL)}
\end{center}

takes data as a \class{ratingMatrix},
a method name and some optional parameters for the method and
returns a \class{Recommender} object. Once we have a recommender object,
we can predict top-$N$ recommendations for active users using

\begin{center}
\code{predict(object, newdata, n=10, type=c("topNList", "ratings", "ratingMatrix"), ...)}.

\end{center}

Predict can return either top-$N$ lists (default setting) or predicted ratings.
\code{object} is the recommender
object, \code{newdata} is the data for the active users.
For top-$N$ lists \code{n} is
the maximal number of recommended items in each list and \func{predict} will
return an objects of class \class{topNList} which contains
one top-$N$ list for each active user. For \code{"ratings"}
and \code{"ratingMatrix"}, \code{n} is ignored
and an object of \class{realRatingMatrix} is returned. Each row
contains the predicted ratings for one active user. The difference is,
that for \code{"ratings"}, the items for which a
rating exists in \code{newdata} have a \code{NA} instead of a
predicted/actual ratings.


The actual implementations for the recommendation algorithms are
managed using the registry mechanism provided by package \pkg{registry}.
The registry called \code{recommenderRegistry}
and stores recommendation method names and a short description.
%uses the four fields
%\code{method} (name of the recommendation algorithm), \code{dataType} (name of
%the concrete implementation of \class{ratingMatrix} the algorithm works on),
%\code{fun} (function to create a recommender model as an instance of class
%\class{Recommender}), and \code{description} (verbal description of the
%algorithm) to describe a recommender algorithm.
Generally, the registry mechanism is hidden from the user and the
creator function \func{Recommender} uses it in the background to map
a recommender method name to its implementation. However, the
registry can be directly queried by

\begin{center}
\code{recommenderRegistry\$get\_entries()}
\end{center}

and new recommender algorithms can be added by the user.
We will give and example for this feature
in the examples section of this paper.

To evaluate recommender algorithms package~\pkg{recommenderlab}
provides the infrastructure to create and maintain evaluation schemes
stored as an object of class \class{evaluationScheme} from rating data.
The creator function

\begin{center}
\code{evaluationScheme(data, method="split", train=0.9, k=10, given=3)}
\end{center}

creates the evaluation scheme from a data set using a method
(e.g., simple split, bootstrap sampling, $k$-fold cross validation).
Testing is perfomed by withholding items (parameter \code{given}).
\cite{recommender:Breese:1998} introduced the
four experimental witholding protocols called \emph{Given 2, Given 5, Given 10} and \emph{All-but-1.}
During testing, the \emph{Given $x$} protocol presents the algorithm with
only $x$ randomly chosen items for the test user, and the algorithm
is evaluated by how well it is able to predict the withheld items.
For \emph{All-but-$x$}, a generalization of \emph{All-but-1},
the algorithm sees all but
$x$ withheld ratings for the test user.
\code{given} controls x in the evaluations scheme.
Positive integers result in a Given $x$ protocol, while negative values
produce a All-but-$x$ protocol.

The function \func{evaluate} is then used to
evaluate several recommender algorithms using an evaluation scheme resulting
in a evaluation result list (class~\class{evaluationResultList}) with
one entry (class~\class{evaluationResult}) per algorithm.
Each object of \class{evaluationResult}
contains one or several object of \class{confusionMatrix} depending on the
number of evaluations specified in the \class{evaluationScheme} (e.g., $k$
for $k$-fold cross validation).
With this infrastructure several recommender algorithms can be compared
on a data set with a single line of code.

In the following, we will illustrate the usage of \pkg{recommenderlab}
with several examples.
\section{Examples}
\label{sec:examples}
This fist few example shows how to manage data in recommender lab and then
we create and evaluate recommenders.
First, we load the package.

<<>>=
library("recommenderlab")
@

\subsection{Coercion to and from rating matrices}

For this example we create a small
artificial data set as a matrix.

<<>>=
m <- matrix(sample(c(as.numeric(0:5), NA), 50,
    replace=TRUE, prob=c(rep(.4/6,6),.6)), ncol=10,
    dimnames=list(user=paste("u", 1:5, sep=''),
	item=paste("i", 1:10, sep='')))
m
@

With coercion, the matrix can be easily converted into
a realRatingMatrix object which stores the data in sparse
format (only non-NA values are stored explicitly; NA values are
represented by a dot).

<<>>=
r <- as(m, "realRatingMatrix")
r
getRatingMatrix(r)
@

The realRatingMatrix can be coerced back into a matrix which is
identical to the original matrix.
<<>>=
identical(as(r, "matrix"),m)
@


It can also be coerced into a list
of users with their ratings for closer inspection
or into a data.frame with user/item/rating tuples.
<<>>=
as(r, "list")
head(as(r, "data.frame"))
@

The data.frame version is especially suited for writing rating data to a
file (e.g., by \func{write.csv}).
Coercion from data.frame (user/item/rating tuples)
and list into a sparse rating matrix is also provided.
This way, external rating data can easily be imported into \pkg{recommenderlab}.

\subsection{Normalization}
An important operation for rating matrices is to normalize the entries
to, e.g., centering to remove rating bias by subtracting the row mean from
all ratings in the row. This is can be easily done using
\code{normalize()}.

<<>>=
r_m <- normalize(r)
r_m
getRatingMatrix(r_m)
@

Normalization can be reversed using \code{denormalize()}.
<<>>=
denormalize(r_m)
@

Small portions of rating matrices can be visually inspected using
\code{image()}.

<<eval=FALSE>>=
image(r, main = "Raw Ratings")
image(r_m, main = "Normalized Ratings")
@

<<image1, fig=TRUE, include=FALSE, echo=FALSE>>=
print(image(r, main = "Raw Ratings"))
@
<<image2, fig=TRUE, include=FALSE, echo=FALSE>>=
print(image(r_m, main = "Normalized Ratings"))
@

\begin{figure}
\begin{minipage}[b]{.48\linewidth}
\centerline{\includegraphics[width=\linewidth]{recommenderlab-image1}}
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centerline{\includegraphics[width=\linewidth]{recommenderlab-image2}}
\end{minipage}
\caption{Image plot
the artificial rating data before and after normalization.}
\label{fig:image1}
\end{figure}


Figure~\ref{fig:image1} shows the resulting plots.

\subsection{Binarization of data}

A matrix with real valued ratings can be transformed into a 0-1 matrix
with \code{binarize()} and
a user specified threshold (\code{min\_ratings})
on the raw or normalized ratings. In the following only
items with a rating of 4 or higher will become a positive
rating in the new binary rating matrix.
<<>>=
r_b <- binarize(r, minRating=4)
r_b
as(r_b, "matrix")
@


\subsection{Inspection of data set properties}
We will use the data set Jester5k for the rest of this section.  This data set
comes with \pkg{recommenderlab} and contains a sample of 5000 users from the
anonymous ratings data from the Jester Online Joke Recommender System collected
between April 1999 and May 2003~\citep{recommender:Goldberg:2001}. The
data set contains ratings for 100 jokes on a scale from $-10$ to $+10$. All
users in the data set have rated 36 or more jokes.

<<>>=
data(Jester5k)
Jester5k
@

Jester5k contains
\Sexpr{nratings(Jester5k)}~ratings. For the following examples we use
only a subset of the data containing a sample of 1000 users (we set the
random number generator seed for reproducibility).
For random sampling \func{sample} is provided for rating matrices.

<<>>=
set.seed(1234)
r <- sample(Jester5k, 1000)
r
@

This subset still contains \Sexpr{nratings(r)}~ratings.
Next, we inspect the ratings for the first user. We can select
an individual user with the extraction operator.

<<>>=
rowCounts(r[1,])
as(r[1,], "list")
rowMeans(r[1,])
@

The user has rated \Sexpr{rowCounts(r[1,])} jokes, the list shows
the ratings and the user's rating average is \Sexpr{rowMeans(r[1,])} .

Next, we look at several distributions to understand the data better.
\code{getRatings()} extracts a vector with all
non-missing
ratings from a rating matrix.

<<hist1, fig=TRUE, include=FALSE>>=
hist(getRatings(r), breaks=100)
@
\begin{figure}
\centerline{\includegraphics[width=.5\linewidth]{recommenderlab-hist1}}
\caption{Raw rating distribution for as sample of Jester.}
\label{fig:hist1}
\end{figure}

In the histogram in Figure~\ref{fig:hist1}
shoes an interesting distribution
where all negative values occur with a almost identical frequency and
the positive ratings more frequent with a steady decline towards the
rating 10. Since this distribution can be the result of users with strong
rating bias, we look next at the rating distribution after normalization.

<<hist2, fig=TRUE, include=FALSE>>=
hist(getRatings(normalize(r)), breaks=100)
@
<<hist3, fig=TRUE, include=FALSE>>=
hist(getRatings(normalize(r, method="Z-score")), breaks=100)
@

\begin{figure}
\begin{minipage}[b]{.48\linewidth}
\centerline{\includegraphics[width=\linewidth]{recommenderlab-hist2}}
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centerline{\includegraphics[width=\linewidth]{recommenderlab-hist3}}
\end{minipage}
\caption{Histogram of normalized ratings using row centering (left) and
Z-score normalization (right).}
\label{fig:hist2}
\end{figure}

Figure~\ref{fig:hist2} shows that the distribution of ratings
ins closer to a normal distribution after row centering and
Z-score normalization additionally reduces the variance to a
range of roughly $-3$ to $+3$ standard deviations. It is interesting to see
that there is a pronounced peak of ratings between zero and two.

Finally, we look at how many jokes each user has rated and what the
mean rating for each Joke is.

<<hist4, fig=TRUE, include=FALSE>>=
hist(rowCounts(r), breaks=50)
@

<<hist5, fig=TRUE, include=FALSE>>=
hist(colMeans(r), breaks=20)
@

\begin{figure}
\begin{minipage}[b]{.48\linewidth}
\centerline{\includegraphics[width=\linewidth]{recommenderlab-hist4}}
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centerline{\includegraphics[width=\linewidth]{recommenderlab-hist5}}
\end{minipage}
\caption{Distribution of the number of rated items per user (left) and
of the average ratings per joke (right).}
\label{fig:hist3}
\end{figure}


Figure~\ref{fig:hist3} shows that there are unusually many
users with ratings around 70 and users who have rated all jokes.
The average ratings per joke look closer to a normal distribution with
a mean above 0.


\subsection{Creating a recommender}

A recommender is created using the creator function \func{Recommender}.
Available recommendation methods are stored in a registry.
The registry can be queried. Here we are only interested in
methods for real-valued rating data.

<<>>=
recommenderRegistry$get_entries(dataType = "realRatingMatrix")
@

Next, we create a recommender which generates recommendations
solely on the popularity of items (the number of users who have the item in
their profile). We create a recommender
from the first 1000 users in the Jester5k data set.

<<>>=
r <- Recommender(Jester5k[1:1000], method = "POPULAR")
r
@

The model can be obtained from a recommender using \func{getModel}.
<<>>=
names(getModel(r))
getModel(r)$topN
@

In this case the model has a
top-$N$ list to store the popularity order and further
elements (average ratings, if it used normalization and the
used aggregation function).

Recommendations are generated by \func{predict}
(consistent with its use for other types of models in
\proglang{R}). The result
are recommendations in the form of an object of class~\class{TopNList}.
Here we create top-5 recommendation lists for two
users who were not used to learn the model.

<<>>=
recom <- predict(r, Jester5k[1001:1002], n=5)
recom
@

The result contains two ordered top-$N$ recommendation lists,
one for each user. The recommended items can be inspected as a list.
<<>>=
as(recom, "list")
@

Since the top-$N$ lists are ordered, we can extract sublists of
the best items in the top-$N$. For example, we can get the best 3
recommendations for each list using \func{bestN}.
<<>>=
recom3 <- bestN(recom, n = 3)
recom3
as(recom3, "list")
@


Many recommender algorithms can also predict ratings. This is also
implemented using \func{predict} with the parameter \code{type}
set to \code{"ratings"}.

<<>>=
recom <- predict(r, Jester5k[1001:1002], type="ratings")
recom
as(recom, "matrix")[,1:10]
@

Predicted ratings are returned as an object of \class{realRatingMatrix}.
The prediction contains \code{NA} for the items rated by the active users.
In the example we show the predicted ratings for the first 10 items for
the two active users.

Alternatively, we can also request the complete rating matrix which includes
the original ratings by the user.

<<>>=
recom <- predict(r, Jester5k[1001:1002], type="ratingMatrix")
recom
as(recom, "matrix")[,1:10]
@

\subsection{Evaluation of predicted ratings}
Next, we will look at the evaluation of recommender algorithms.
\pkg{recommenderlab} implements several
standard evaluation methods for recommender systems.
Evaluation starts with creating an evaluation scheme
that determines what and how data is used for training and testing.
Here we create an evaluation scheme which splits the first 1000 users
in Jester5k into a training set (90\%) and a test set (10\%). For
the test set 15 items will be given to the recommender algorithm
and the other items will be held out
for computing the error.

<<>>=
e <- evaluationScheme(Jester5k[1:1000], method="split", train=0.9,
    given=15, goodRating=5)
e
@

We create two recommenders (user-based and item-based collaborative filtering)
using the training data.

<<>>=
r1 <- Recommender(getData(e, "train"), "UBCF")
r1

r2 <- Recommender(getData(e, "train"), "IBCF")
r2
@

Next, we compute predicted ratings for the known part of the test data (15 items
for each user) using
the two algorithms.

<<>>=
p1 <- predict(r1, getData(e, "known"), type="ratings")
p1
p2 <- predict(r2, getData(e, "known"), type="ratings")
p2
@

Finally, we can calculate the error between the prediction and the
unknown part of the test data.

<<>>=
error <- rbind(
  UBCF = calcPredictionAccuracy(p1, getData(e, "unknown")),
  IBCF = calcPredictionAccuracy(p2, getData(e, "unknown"))
)
error
@

In this example user-based collaborative filtering produces a smaller
prediction error.

\subsection{Evaluation of a top-$N$ recommender algorithm}

For this example we create a $4$-fold cross validation scheme
with the the Given-3 protocol, i.e.,
for the test users all but three randomly selected items are withheld
for evaluation.

<<>>=
scheme <- evaluationScheme(Jester5k[1:1000], method="cross", k=4, given=3,
    goodRating=5)
scheme
@

Next we use the created evaluation scheme to evaluate the recommender
method popular.
We evaluate top-1, top-3, top-5, top-10, top-15 and top-20 recommendation lists.

<<>>=
results <- evaluate(scheme, method="POPULAR", type = "topNList",
  n=c(1,3,5,10,15,20))
results
@

The result is an object of class~\class{EvaluationResult} which
contains several confusion matrices. \func{getConfusionMatrix}
will return the confusion matrices for the 4 runs
(we used 4-fold cross evaluation) as a list.
In the following we look at the first element of the list which
represents the first of the 4 runs.


<<>>=
getConfusionMatrix(results)[[1]]
@

For the first run we have 6 confusion matrices represented by rows, one for each
of the six different top-$N$ lists we used for evaluation.
$n$ is the number of recommendations per list. TP, FP, FN and TN are the
entries for
true positives, false positives, false negatives and true negatives in
the confusion matrix. The remaining columns contain precomputed performance
measures. The average for all runs can be obtained
from the evaluation results directly using \func{avg}.

<<>>=
avg(results)
@

Evaluation results can be plotted using \func{plot}. The default
plot is the ROC curve which plots the true positive rate (TPR) against the
false positive rate (FPR).
<<roc1, fig=TRUE, include=FALSE>>=
plot(results, annotate=TRUE)
@
\begin{figure}
\centerline{\includegraphics[scale=1]{recommenderlab-roc1}}
\caption{ROC curve for recommender method POPULAR.}
\label{fig:roc1}
\end{figure}

For the plot where we annotated the curve with the size of the top-$N$ list
is shown in Figure~\ref{fig:roc1}.
By using \code{"prec/rec"} as the second argument, a precision-recall plot
is produced (see Figure~\ref{fig:precrec1}).

<<precrec1, fig=TRUE, include=FALSE>>=
plot(results, "prec/rec", annotate=TRUE)
@
\begin{figure}
\centerline{\includegraphics[scale=1]{recommenderlab-precrec1}}
\caption{Precision-recall plot for method POPULAR.}
\label{fig:precrec1}
\end{figure}

\subsection{Comparing recommender algorithms}
\subsubsection{Comparing top-$N$ recommendations}

The comparison of
several recommender algorithms is one of the main functions of
\pkg{recommenderlab}. For comparison also \func{evaluate} is used.
The only change is to use \func{evaluate} with
a list of algorithms together with their parameters instead of
a single method name. In the following we use the
evaluation scheme created above to compare the five
recommender algorithms: random items, popular items,
user-based CF, item-based CF, and SVD approximation.
Note that when running the following code, the CF based algorithms
are very slow.

For the evaluation we use a ``all-but-5'' scheme. This is indicated by
a negative number for \code{given}.

<<>>=
set.seed(2016)
scheme <- evaluationScheme(Jester5k[1:1000], method="split", train = .9,
  given=-5, goodRating=5)
scheme

algorithms <- list(
  "random items" = list(name="RANDOM", param=NULL),
  "popular items" = list(name="POPULAR", param=NULL),
  "user-based CF" = list(name="UBCF", param=list(nn=50)),
  "item-based CF" = list(name="IBCF", param=list(k=50)),
  "SVD approximation" = list(name="SVD", param=list(k = 50))
)

## run algorithms
results <- evaluate(scheme, algorithms, type = "topNList",
  n=c(1, 3, 5, 10, 15, 20))
@

The result is an object of class~\class{evaluationResultList} for the
five recommender algorithms.
<<>>=
results
@

Individual results can be accessed by list subsetting using an index or
the name specified when calling \func{evaluate}.

<<>>=
names(results)
results[["user-based CF"]]
@


Again \func{plot} can be used to create ROC and precision-recall plots
(see Figures~\ref{fig:roc2} and \ref{fig:precrec2}). Plot accepts most
of the usual graphical parameters like pch, type, lty, etc. In addition
annotate can be used to annotate the points on selected curves with the
list length.

<<roc2, fig=TRUE, include=FALSE>>=
plot(results, annotate=c(1,3), legend="bottomright")
@
\begin{figure}
\centerline{\includegraphics[scale=1]{recommenderlab-roc2}}
\caption{Comparison of ROC curves for several recommender methods for the
given-3 evaluation scheme.}
\label{fig:roc2}
\end{figure}

<<precrec2, fig=TRUE, include=FALSE>>=
plot(results, "prec/rec", annotate=3, legend="topleft")
@
\begin{figure}
\centerline{\includegraphics[scale=1]{recommenderlab-precrec2}}
\caption{Comparison of precision-recall curves
for several recommender methods for the
    given-3 evaluation scheme.}
\label{fig:precrec2}
\end{figure}

For this data set and the given evaluation scheme
popular items and the user-based CF methods clearly
outperform the other methods. In Figure~\ref{fig:roc2} we see that they
dominate (almost completely) the other method since for each length of top-$N$ list they
provide a better combination of TPR and FPR.

\subsubsection{Comparing ratings}

Next, we evaluate not top-$N$ recommendations, but how well the algorithms
can predict ratings.

<<>>=
## run algorithms
results <- evaluate(scheme, algorithms, type = "ratings")
@

The result is again an object of class~\class{evaluationResultList} for the
five recommender algorithms.
<<>>=
results
@


<<real, fig=TRUE, include=FALSE>>=
plot(results, ylim = c(0,100))
@
\begin{figure}
\centerline{\includegraphics[scale=1]{recommenderlab-real}}
\caption{Comparison of RMSE, MSE, and MAE for recommender methods for the
given-3 evaluation scheme.}
\label{fig:real}
\end{figure}

Plotting the results shows a barplot with
the root mean square error, the mean square error and the mean absolute
error (see Figures~\ref{fig:real}).

\subsubsection{Using a 0-1 data set}
For comparison we will check how the algorithms compare given less information.
We convert the data set into 0-1 data and instead of a all-but-5 we use the given-3 scheme.

<<>>=
Jester_binary <- binarize(Jester5k, minRating=5)
Jester_binary <- Jester_binary[rowCounts(Jester_binary)>20]
Jester_binary
scheme_binary <- evaluationScheme(Jester_binary[1:1000],
	method="split", train=.9, k=1, given=3)
scheme_binary
@

<<>>=
results_binary <- evaluate(scheme_binary, algorithms,
  type = "topNList", n=c(1,3,5,10,15,20))
@

Note that SVD does not implement a method for binary data and is thus skipped.

<<roc3, fig=TRUE, include=FALSE>>=
plot(results_binary, annotate=c(1,3), legend="topright")
@
\begin{figure}
\centerline{\includegraphics[scale=1]{recommenderlab-roc3}}
\caption{Comparison of ROC curves for several recommender methods for the
given-3 evaluation scheme.}
\label{fig:roc3}
\end{figure}

From Figure~\ref{fig:roc3} we see that given less information,
the performance of user-based CF suffers the most and the simple
popularity based recommender performs almost a well as item-based CF.

Similar to the examples presented here, it is easy to compare different
recommender algorithms for different data sets or to compare
different algorithm settings (e.g.,
the influence of neighborhood formation using different distance measures or different neighborhood sizes).

%\marginpar{getting data in missing!}

%\subsection{Compare rankings}
%
%Top-$N$ recommendation lists can be considered rankings and can be represented
%as $N$-ary order relations ($\leq$) on the set of all items in the lists (items
%which do not occur in all lists are added to the end of the corresponding
%order).
%
%{\bf Average distance between methods:}
%E.g., the cardinality of the symmetric difference of two relations (the
%		number of tuples contained in exactly one of two relations)
%{\small\citep{Hornik:2008}}
%
%Total number of tuples in relations: 8649 (for 93 items)
%
%	\begin{verbatim}
%	UB   IB   AR Pop N
%	UB       0 2555 2634  3317
%	IB    2555    0 1611  2052
%	AR    2634 1611    0  2636
%	Pop N 3317 2052 2636     0
%	\end{verbatim}
%
%\small
%\begin{verbatim}
%u_a = {Toy Story (1995), Air Force One (1997),
%	Cop Land (1997), Michael (1996), Blade Runner (1982)}
%
%	UB IB AR Top N Consensus*
%	Contact (1997)                 1  4 25     4         4
%	Liar Liar (1997)               2 18 32    10         8
%	Conspiracy Theory (1997)       3 NA NA    NA        NA
%	Star Wars (1977)               4  1  1     1         1
%	Return of the Jedi (1983)      5  2  4     2         2
%	English Patient, The (1996)    6 26 41     6        12
%	Saint, The (1997)              7 NA NA    NA        NA
%	Fargo (1996)                   8  3 13     3         3
%	Mr. Holland's Opus (1995)      9 34 30    39        17
%	Scream (1996)                 10 20 35     7        10
%	\end{verbatim}
%
%	* Consensus method: Condorcet (minimizes the weighted sum
%			of symmetric difference distance)
%
%


%\subsection{ROCR interface}

\subsection{Implementing a new recommender algorithm}

Adding a new recommender algorithm to \pkg{recommenderlab}
is straight forward since
it uses a registry mechanism to manage the algorithms.
To implement the actual recommender algorithm
we need to implement a creator function which takes a training data set,
trains a model and provides a predict function which uses the
model to create recommendations for new data.
The model and the predict function are both encapsulated in an
object of class~\class{Recommender}.

For examples look at the files starting with \code{RECOM} in the
packages \code{R} directory. A good examples is in \code{RECOM_POPULAR.R}.

\section{Conclusion}
\label{sec:conclusion}

%% FIXME

%Being able to use automated recommender systems
%offers a big advantage for online
%retailers and for many other applications. But
%often no extensive data base of rating data is available and
%it makes sense to think about using 0-1 data, which is in many
%cases easier to obtain, instead.
%Unfortunately there is only limited research on collaborative filtering
%based recommender systems using 0-1 data available.
%
In this paper we described the \proglang{R}
extension package~\pkg{recommenderlab}
which is especially geared towards developing and testing
recommender algorithms.
%for 0-1 data.
The package allows to create evaluation schemes following accepted
methods and then use them to evaluate and compare recommender algorithms.
\pkg{recommenderlab} currently includes several standard algorithms and
adding new recommender algorithms to the package is facilitated by the
built in registry mechanism to manage algorithms.
%
%\pkg{recommenderlab} currently includes several algorithms for 0-1 data,
%however, the infrastructure is flexible enough to also
%extend to the more conventional non-binary rating data and its algorithms.
In the future we will add more and more of these algorithms to the package and
we hope that some algorithms will also
be contributed by other researchers.

%Applying collaborative filtering methods to binary and
%especially to 0-1 data is difficult because
%the sparsity of data has an adverse effect on similarity
%calculation/neighborhood formation.
%Also the meaning of $0$ in the data can be either of ``unknown'' or
%``dislike'' and thus it is not clear how to best handle it.
%Measures like the Jaccard index can be used to mitigate this
%problem since they focus mainly on $1$s.
%Overall, the quality of recommendations is extremely data set dependent.

%Also evaluation is biased since the complete set of ``liked'' items
%is unknown. Measured precision is only a lower bound to real precision.

%\begin{itemize}
%\item Many companies face heterogeneous data sources which can best be
%aggregated into binary data.
%\end{itemize}

%\item Model ``dislike'' in binary data {\small\citep{Mobasher:2001}}.
%\item Account for fast changing products
%    (e.g., vintages differ often significantly and change every year).
%\item Investigate recommender engines based on consensus rankings.

\section*{Acknowledgments}
This research was funded in part by the NSF Industry/University Cooperative Research Center for Net-Centric Software \& Systems.


%
%\bibliographystyle{abbrvnat}
%\bibliography{recommenderlab}
%

%\bibliographystyle{abbrvnat}
\bibliography{recommender,recom_talk,data,stuff,association_rules,recommenderlab_usage}

\end{document}