% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/recometrics.R
\name{calc.reco.metrics}
\alias{calc.reco.metrics}
\title{Calculate Recommendation Quality Metrics}
\usage{
calc.reco.metrics(
  X_train,
  X_test,
  A,
  B,
  k = 5L,
  item_biases = NULL,
  as_df = TRUE,
  by_rows = FALSE,
  sort_indices = TRUE,
  precision = TRUE,
  trunc_precision = FALSE,
  recall = FALSE,
  average_precision = TRUE,
  trunc_average_precision = FALSE,
  ndcg = TRUE,
  hit = FALSE,
  rr = FALSE,
  roc_auc = FALSE,
  pr_auc = FALSE,
  all_metrics = FALSE,
  rename_k = TRUE,
  break_ties_with_noise = TRUE,
  min_pos_test = 1L,
  min_items_pool = 2L,
  consider_cold_start = TRUE,
  cumulative = FALSE,
  nthreads = parallel::detectCores(),
  seed = 1L
)
}
\arguments{
\item{X_train}{Training data for user-item interactions, with users denoting rows,
items denoting columns, and values corresponding to confidence scores.
Entries in `X_train` and `X_test` for each user should not intersect (that is,
if an item is in the training data as a non-missing entry, it should not be in
the test data as non-missing, and vice versa).

Should be passed as a sparse matrix in CSR format (class `dgRMatrix` from
package `Matrix`, can be converted to that format using
`MatrixExtra::as.csr.matrix`). Items not consumed by the user should not
be present in this matrix.

Alternatively, if there is no training data, can pass `NULL`, in which case it
will look only at the test data.

This matrix and `X_test` are not meant to contain negative values, and if
`X_test` does contain any, it will still be assumed for all metrics other than
NDCG that such items are deemed better for the user than the missing/zero-valued
items (that is, implicit feedback is not meant to signal dislikes).}

\item{X_test}{Test data for user-item interactions. Same format as `X_train`.}

\item{A}{The user factors. If the number of users is 'm' and the number of
factors is 'p', should have dimension `[p, m]` if passing `by_rows=FALSE`
(the default), or dimension `[m, p]` if passing `by_rows=TRUE` (in which case
it will be internally transposed due to R's column-major storage order). Can
be passed as a dense matrix from base R (class `matrix`), or as a matrix from
package float (class `float32`) - if passed as `float32`, will do the
calculations in single precision (which is faster and uses less memory) and
output the calculated metrics as `float32` arrays.

It is assumed that the model score for a given item `j` for user `i` is
calculated as the inner product or dot product between the corresponding vectors
\eqn{\mathbf{a}_i \cdot \mathbf{b}_j}{<a[i], b[j]>}
(columns `i` and `j` of `A` and `B`, respectively, when passing
`by_rows=FALSE`), with higher scores meaning that the item is deemed better for
that user, and the top-K recommendations produced by ranking these scores in
descending order.

Alternatively, for evaluation of non-personalized models, can pass `NULL` here
and for `B`, in which case `item_biases` must be passed.}

\item{B}{The item factors, in the same format as `A`.}

\item{k}{The number of top recommendations to consider for the metrics (as
in "precision-at-k" or "P@K").}

\item{item_biases}{Optional item biases/intercepts (fixed base score that is
added to the predictions of each item). If present, it will append them to `B`
as an extra factor while adding a factor of all-ones to `A`.

Alternatively, for non-personalized models which have only item-by-item scores,
can pass `NULL` for `A` and `B` while passing only `item_biases`.}

\item{as_df}{Whether to output the result as a `data.frame`. If passing `FALSE`,
the results will be returned as a list of vectors or matrices depending on
what is passed for `cumulative`. If `A` and `B` are passed as `float32` matrices,
the resulting `float32` arrays will be converted to base R's arrays in order to
be able to create a `data.frame`.}

\item{by_rows}{Whether the latent factors/components are ordered by rows,
in which case they will be transposed beforehand (see documentation for `A`).}

\item{sort_indices}{Whether to sort the indices of the `X` data in case they
are not sorted already. Skipping this step will make it faster and will make
it consume less memory.

If the `X_train` and `X_test` matrices were created using functions from the
`Matrix` package such as `Matrix::spMatrix` or `Matrix::Matrix`, the indices
will always be sorted, but if creating it manually through S4 methods or as the
output of other software, the indices can end up unsorted.}

\item{precision}{Whether to calculate precision metrics or not.}

\item{trunc_precision}{Whether to calculate truncated precision metrics or not.
Note that this is output as a separate metric from "precision" and they are not
mutually exclusive options.}

\item{recall}{Whether to calculate recall metrics or not.}

\item{average_precision}{Whether to calculate average precision metrics or not.}

\item{trunc_average_precision}{Whether to calculate truncated average
precision metrics or not. Note that this is output as a separate metric from
"average_precision" and they are not mutually exclusive options.}

\item{ndcg}{Whether to calculate NDCG (normalized discounted cumulative gain)
metrics or not.}

\item{hit}{Whether to calculate Hit metrics or not.}

\item{rr}{Whether to calculate RR (reciprocal rank) metrics or not.}

\item{roc_auc}{Whether to calculate ROC-AUC (area under the ROC curve) metrics or not.}

\item{pr_auc}{Whether to calculate PR-AUC (area under the PR curve) metrics or not.}

\item{all_metrics}{Passing `TRUE` here is equivalent to passing `TRUE` to all the
calculable metrics.}

\item{rename_k}{If passing `as_df=TRUE` and `cumulative=FALSE`, whether to rename
the 'k' in the resulting column names to the actual value of 'k' that was used
(e.g. "p_at_k" -> "p_at_5").}

\item{break_ties_with_noise}{Whether to add a small amount of noise
`~Uniform(-10^-12, 10^-12)` in order to break ties
at random, in case there are any ties in the ranking. This is not recommended unless
one expects ties (can happen if e.g. some factors are set to all-zeros for some items),
as it has the potential to slightly alter the ranking.}

\item{min_pos_test}{Minimum number of positive entries
(non-zero entries in the test set) that users need to have in
order to calculate metrics for that user.
If a given user does not meet the threshold, the metrics
will be set to `NA`.}

\item{min_items_pool}{Minimum number of items (sum of positive and negative items)
that a user must have in order to
calculate metrics for that user. If a given user does not meet the threshold,
the metrics will be set to `NA`.}

\item{consider_cold_start}{Whether to calculate metrics in situations in
which some user has test data but no positive
(non-zero) entries in the training data. If passing `FALSE` and such cases are
encountered, the metrics will be set to `NA`.

Will be automatically set to `TRUE` when passing `NULL` for `X_train`.}

\item{cumulative}{Whether to calculate the metrics cumulatively
(e.g. [P@1, P@2, P@3] if passing `k=3`)
for all values up to `k`, or only for a single desired `k`
(e.g. only P@3 if passing `k=3`).}

\item{nthreads}{Number of parallel threads to use.
Parallelization is done at the user level, so passing
more threads than there are users will not result in a speed up. Be aware that, the more
threads that are used, the higher the memory consumption.}

\item{seed}{Seed used for random number generation. Only used when passing
`break_ties_with_noise=TRUE`.}
}
\value{
Will return the calculated metrics on a per-user basis (each user
corresponding to a row):\itemize{
\item If passing `as_df=TRUE` (the default), the result will be a `data.frame`,
with the columns named according to the metric they represent (e.g. "p_at_3",
see below for the other names that they can take). Depending on the value
passed for `rename_k`, the column names might end in "k" or in the number
that was passed for `k` (e.g. "p_at_3" or "p_at_k").

If passing `cumulative=TRUE`, they will have names ranging from 1 to `k`.
\item If passing `as_df=FALSE`, the result will be a list with entries named
according to each metric, with `k` as letter rather than number (`p_at_k`,
`tp_at_k`, `r_at_k`, `ap_at_k`, `tap_at_k`, `ndcg_at_k`, `hit_at_k`, `rr_at_k`,
`roc_auc`, `pr_auc`), plus an additional entry with the actual `k`.

The values under each entry will be vectors if passing
`cumulative=FALSE`, or matrices (users corresponding to rows) if passing
`cumulative=TRUE`.
}

The `ROC-AUC` and `PR-AUC` metrics will be named just "roc_auc" and "pr_auc",
since they are calculated for the full ranked predictions without stopping at `k`.
}
\description{
Calculates recommendation quality metrics for implicit-feedback
recommender systems (fit to user-item interactions data such as "number of
times that a user played each song in a music service") that are based on
low-rank matrix factorization or for which predicted scores can be reduced to
a dot product between user and item factors/components.

These metrics are calculated on a per-user basis, by producing a ranking of the
items according to model predictions (in descending order), ignoring the items
that are in the training data for each user. The items that were not consumed
by the user (not present in `X_train` and not present in `X_test`) are considered
"negative" entries, while the items in `X_test` are considered "positive" entries,
and the items present in `X_train` are ignored for these calculations.

The metrics that can be calculated by this function are:\itemize{
\item `P@K` ("precision-at-k"): denotes the proportion of items among the top-K
recommended (after excluding those that were already in the training data)
that can be found in the test set for that user:

\eqn{P@K = \frac{1}{k} \sum_{i=1}^k r_i \in \mathcal{T}}{
P@K = sum(reco[1..k] \%in\% test) / k}

This is perhaps the most intuitive and straightforward metric, but it can
present a lot of variation between users and does not take into account
aspects such as the number of available test items or the specific ranks at
which they are shown.
\item `TP@K` (truncated precision-at-k): a truncated or standardized version
of the precision metric, which will divide instead by the minimum between
`k` and the number of test items:

\eqn{TP@K = \frac{1}{\min\{k, |\mathcal{T}|\}} \sum_{i=1}^k r_i \in \mathcal{T}}{
TP@K = sum(reco[1..k] \%in\% test) / min(k, length(test))}

\bold{Note:} many papers and libraries call this the "P@K" instead. The
"truncated" prefix is a non-standard nomenclature introduced here to
differentiate it from the P@K metric that is calculated by this and
other libraries.
\item `R@K` ("recall-at-k"): proportion of the test items that are retrieved
in the top-K recommended list. Calculation is the same as precision, but the
division is by the number of test items instead of `k`:

\eqn{R@K = \frac{1}{|\mathcal{T}|} \sum_{i=1}^k r_i \in \mathcal{T}}{
R@K = sum(reco[1..k] \%in\% test) / length(test)
}
\item `AP@K` ("average precision-at-k"): precision and recall look at all the items
in the top-K equally, whereas one might want to take into account also the ranking
within this top-K list, for which this metric comes in handy.
"Average Precision" tries to reflect the precisions that would be obtained at
different recalls:

\eqn{AP@K = \frac{1}{|\mathcal{T}|} \sum_{i=1}^k (r_i \in \mathcal{T}) \cdot P@i}{
AP@K = sum(p_at_k[1..k] * (reco[1..k] \%in\% test)) / length(test)
}

This is a metric which to some degree considers precision, recall, and rank within
top-K. Intuitively, it tries to approximate the area under a precision-recall
tradeoff curve.

The average of this metric across users is known as "Mean Average Precision"
or "MAP@K".

\bold{IMPORTANT:} many authors define AP@K differently, such as dividing by
the minimum between `k` and the number of test items instead, or as the average
for P@1..P@K (either as-is or stopping after already retrieving all the test
items) - here, the version that divides by the minimum between `k` and the number
of test items is offered as a separate metric (TAP@K) instead.
This metric is likely to be a source of mismatches when comparing against
other libraries due to all the different definitions used by different authors.
\item `TAP@K` (truncated average precision-at-k): a truncated version of the
AP@K metric, which will instead divide it by the minimum between `k` and the
number of test items.

Many other papers and libraries call this the "average precision" instead.
\item `NDCG@K` (normalized discounted cumulative gain at K): a ranking metric
calculated by first determining the following:

\eqn{\sum_{i=1}^k \frac{C_i}{\log_2(i+1)}}{sum[i=1..k](C[i] / log2(i+1))}

Where \eqn{C_i}{C[i]} denotes the confidence score for an item (taken as the value
in `X_test` for that item), with `i` being the item ranked at a given position
for a given user according to the model. This metric is then standardized by
dividing by the maximum achievable such score given the test data.

Unlike the other metrics:\itemize{
\item It looks not only at the presence or absence of items, but also at their
 confidence score.
\item It can handle data which contains "dislikes" in the form of negative
values (see caveats below).
}

If there are only non-negative values in `X_test`, this metric will be bounded
between zero and one.

A note about negative values: the NDCG metric assumes that all the values are
non-negative. This implementation however can accommodate situations in which
a very small fraction of the items have negative values, in which case:
(a) it will standardize the metric by dividing by a number which does not
consider the negative values in its sum; (b) it will be set to `NA` if there
are no positive values. Be aware however that NDCG loses some of its desirable
properties in the presence of negative values.
\item `Hit@K` ("hit-at-k"): indicates whether any of the top-K recommended items
can be found in the test set for that user. The average across users is typically
referred to as the "Hit Rate".

This is a binary metric (it is either zero or one, but can also be `NA` when
it is not possible to calculate, just like the other metrics).
\item `RR@K` ("reciprocal-rank-at-k"): inverse rank (one divided by the rank)
of the first item among the top-K recommended that is in the test data.
The average across users is typically referred to as the "Mean Reciprocal Rank"
or MRR.

If none of the top-K recommended items are in the test data, will be set to zero.
\item `ROC-AUC` (area under the receiver-operating characteristic curve): see the
\href{https://en.wikipedia.org/wiki/Receiver_operating_characteristic}{Wikipedia entry}
for details. This metric considers the full ranking of items
rather than just the top-K. It is bounded between zero and one, with a value of
0.5 corresponding to a random order and a value of 1 corresponding to a perfect
ordering (i.e. every single positive item has a higher predicted score than every
single negative item).

Be aware that there are different ways of calculating AUC, with some methods
having higher precision than others. This implementation uses a fast formula
which implies dividing two large numbers, and as such might not be as precise
to the same number of decimals as the trapezoidal method used by e.g. scikit-learn.
\item `PR-AUC` (area under the precision-recall curve): while ROC AUC provides an
overview of the overall ranking, one is typically only interested in how well it
retrieves test items within top ranks, and for this the area under the
precision-recall curve can do a better job at judging rankings, albeit the metric
itself is not standardized, and under the worst possible ranking, it does not
evaluate to zero.

The metric is calculated using the fast but not-so-precise rectangular method,
whose formula corresponds to the AP@K metric with K=N. Some papers and libraries
call the average of this metric the "MAP" or "Mean Average Precision" instead
(without the "@K").
}

Metrics can be calculated for a given value of `k` (e.g. "P@3"), or for
values ranging from 1 to `k` (e.g. ["P@1", "P@2", "P@3"]).

This package does \bold{NOT} cover other more specialized metrics. One might
also want to look at other indicators such as:\itemize{
\item Metrics that look at the rareness of the items recommended
(to evaluate so-called "serendipity").
\item Metrics that look at "discoverability".
\item Metrics that take into account the diversity of the ranked lists.
}
}
\details{
Metrics for a given user will be set to `NA` in the following
situations:\itemize{
\item All the rankable items have the exact same predicted score.
\item One or more of the predicted scores evaluates to `NA`/`NaN`.
\item There are only negative entries (no non-zero entries in the test data).
\item The number of available items to rank (between positive and negative) is
smaller than the requested `k`, and the metric is not affected by the exact order
within the top-K items (i.e. precision, recall, hit, will be `NA` if there are
`k` or fewer items after discarding those from the training data).
\item There are inconsistencies in the data (e.g. number of entries being greater
than the number of columns in `X`, meaning the matrices do not constitute valid CSR).
\item A user does not meet the minimum criteria set by the configurable parameters
for this function.
\item There are only positive entries (i.e. the user already consumed all the items).
In this case, "NDCG@K" will still be calculated, while the rest will be set
to `NA`.
}

The NDCG@K metric with `cumulative=TRUE` will have lower decimal precision
than with `cumulative=FALSE` when using `float32` inputs - this is extremely
unlikely to be noticeable in typical datasets and small `k`, but for large `k`
and large (absolute) values in `X_test`, it might make a difference after a
couple of decimal points.

Internally, it relies on BLAS function calls, so it's recommended to use
R with an optimized BLAS library such as OpenBLAS or MKL for better speed - see
\href{https://github.com/david-cortes/R-openblas-in-windows}{this link}
for instructions on getting OpenBLAS in R for Windows
(alternatively, Microsoft's R distribution comes with MKL preinstalled).
}
\examples{
### (See the package vignette for a better example)
library(recometrics)
library(Matrix)
library(MatrixExtra)

### Generating random data
n_users <- 10L
n_items <- 20L
n_factors <- 3L
k <- 4L
set.seed(1)
UserFactors <- matrix(rnorm(n_users * n_factors), nrow=n_factors)
ItemFactors <- matrix(rnorm(n_items * n_factors), nrow=n_factors)
X <- Matrix::rsparsematrix(n_users, n_items, .5, repr="R")
X <- abs(X)

### Generating a random train-test split
data_split <- create.reco.train.test(X, split_type="all")
X_train <- data_split$X_train
X_test <- data_split$X_test

### Calculating these metrics
### (should be poor quality, since data is random)
metrics <- calc.reco.metrics(
    X_train, X_test,
    UserFactors, ItemFactors,
    k=k, as_df=TRUE,
    nthreads=1L
)
print(metrics)
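
### The following is a hedged base-R sketch (not the internal code of this
### package) of how the top-K metrics described above are defined for a
### single user. All values and names ending in _u are made-up illustrations.
scores_u <- c(0.9, 0.1, 0.8, 0.3, 0.7, 0.2) # hypothetical scores for 6 items
train_u <- 1L            # item present in the training data (ignored)
test_u <- c(3L, 4L)      # items present in the test data (positives)
conf_u <- c(2, 1)        # hypothetical confidence values for test_u
k_u <- 3L

### Rank the remaining items by descending score and keep the top-K
candidates_u <- setdiff(seq_along(scores_u), train_u)
ranked_u <- candidates_u[order(scores_u[candidates_u], decreasing=TRUE)]
top_k_u <- ranked_u[seq_len(k_u)]
is_hit_u <- is.element(top_k_u, test_u)

### Precision, truncated precision, recall, hit, and reciprocal rank
p_at_k_u <- sum(is_hit_u) / k_u
tp_at_k_u <- sum(is_hit_u) / min(k_u, length(test_u))
r_at_k_u <- sum(is_hit_u) / length(test_u)
hit_at_k_u <- as.numeric(any(is_hit_u))
rr_at_k_u <- if (any(is_hit_u)) 1 / which(is_hit_u)[1L] else 0

### Average precision: P@i summed over the positions that are hits
p_at_i_u <- cumsum(is_hit_u) / seq_len(k_u)
ap_at_k_u <- sum(p_at_i_u * is_hit_u) / length(test_u)

### NDCG: discounted gains divided by the best achievable value
gains_u <- numeric(length(scores_u))
gains_u[test_u] <- conf_u
dcg_u <- sum(gains_u[top_k_u] / log2(seq_len(k_u) + 1))
ideal_u <- head(sort(conf_u, decreasing=TRUE), k_u)
idcg_u <- sum(ideal_u / log2(seq_along(ideal_u) + 1))
ndcg_at_k_u <- dcg_u / idcg_u

print(c(p=p_at_k_u, tp=tp_at_k_u, r=r_at_k_u, ap=ap_at_k_u,
        ndcg=ndcg_at_k_u, hit=hit_at_k_u, rr=rr_at_k_u))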
}
