| Title: | Functions and Data Sets for "That's Weird: Anomaly Detection Using R" by Rob J Hyndman |
| Description: | All functions and data sets required for the examples in the book Hyndman (2026) "That's Weird: Anomaly Detection Using R" https://OTexts.com/weird/. All packages needed to run the examples are also loaded. |
| Version: | 2.0.0 |
| Depends: | R (≥ 4.1.0) |
| Imports: | aplpack, broom, cli (≥ 1.0.0), crayon (≥ 1.3.4), dbscan, distributional, dplyr (≥ 0.7.4), evd, ggplot2 (≥ 3.1.1), grDevices, ks, lookout (≥ 2.0.0), mvtnorm, purrr (≥ 0.2.4), RANN, rlang, robustbase, rstudioapi (≥ 0.7), stray, vctrs |
| Suggests: | testthat (≥ 3.0.0), tidyr |
| URL: | https://pkg.robjhyndman.com/weird/, https://github.com/robjhyndman/weird |
| BugReports: | https://github.com/robjhyndman/weird/issues |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| LazyDataCompression: | xz |
| RoxygenNote: | 7.3.3 |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-01-27 01:45:45 UTC; hyndman |
| Author: | Rob Hyndman |
| Maintainer: | Rob Hyndman <Rob.Hyndman@monash.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-01-27 06:10:03 UTC |
weird: Functions and Data Sets for "That's Weird: Anomaly Detection Using R" by Rob J Hyndman
Description
All functions and data sets required for the examples in the book Hyndman (2026) "That's Weird: Anomaly Detection Using R" https://OTexts.com/weird/. All packages needed to run the examples are also loaded.
Author(s)
Maintainer: Rob Hyndman <Rob.Hyndman@monash.edu> [copyright holder]
Other contributors:
RStudio [copyright holder]
See Also
Useful links:
https://pkg.robjhyndman.com/weird/
https://github.com/robjhyndman/weird
Report bugs at https://github.com/robjhyndman/weird/issues
Cricket batting data for international test players
Description
A dataset containing career batting statistics for all international test players (men and women) up to 6 October 2025.
Usage
cricket_batting
Format
A data frame with 3968 rows and 15 variables:
- Player
Player name in form of "initials surname"
- Country
Country played for
- Start
First year of test playing career
- End
Last year of test playing career
- Matches
Number of matches played
- Innings
Number of innings batted
- NotOuts
Number of times not out
- Runs
Total runs scored
- HighScore
Highest score in an innings
- HighScoreNotOut
Was highest score not out?
- Average
Batting average at end of career
- Hundreds
Total number of 100s scored
- Fifties
Total number of 50s scored
- Ducks
Total number of 0s scored
- Gender
"Men" or "Women"
Value
Data frame
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
Examples
cricket_batting |>
filter(Innings > 20) |>
select(Player, Country, Matches, Runs, Average, Hundreds, Fifties, Ducks) |>
arrange(desc(Average))
Create distributional object based on a specified density
Description
Creates a distributional object using a density specified as pair of vectors giving (x, f(x)). The density is assumed to be piecewise linear between the points provided, and 0 outside the range of x.
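Because the density is treated as piecewise linear, the integral of the supplied (x, f(x)) pairs can be checked with the trapezoidal rule before constructing the object. A base-R sketch (illustrative only; dist_density() does its own handling internally):

```r
# Check that a piecewise linear density integrates to (approximately) 1
# using the trapezoidal rule.
x <- seq(-4, 4, by = 0.01)
fx <- dnorm(x)
trap <- sum(diff(x) * (head(fx, -1) + tail(fx, -1)) / 2)
trap
```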
Usage
dist_density(x, density)
Arguments
x |
Numerical vector of ordinates, or a list of such vectors. |
density |
Numerical vector of density values, or a list of such vectors. |
Examples
dist_density(seq(-4, 4, by = 0.01), dnorm(seq(-4, 4, by = 0.01)))
Create distributional object based on a kernel density estimate
Description
Creates a distributional object using a kernel density estimate with a
Gaussian kernel obtained from the kde() function. The bandwidth
can be specified; otherwise the kde_bandwidth() function is used.
The cdf, quantiles and moments are consistent with the kde. Generating
random values from the kde is equivalent to a smoothed bootstrap.
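The smoothed bootstrap equivalence can be sketched in base R: drawing from a Gaussian KDE with bandwidth h is the same as resampling the data with replacement and adding N(0, h^2) noise. The function smoothed_bootstrap below is illustrative, not part of the package:

```r
# Smoothed bootstrap: resample the data, then perturb each draw with
# Gaussian noise whose sd equals the kernel bandwidth h.
set.seed(1)
smoothed_bootstrap <- function(y, h, n = length(y)) {
  sample(y, size = n, replace = TRUE) + rnorm(n, sd = h)
}
y <- rnorm(100)
draws <- smoothed_bootstrap(y, h = 0.3, n = 1000)
```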
Usage
dist_kde(
y,
h = NULL,
H = NULL,
method = c("normal", "robust", "plugin", "lookout"),
...
)
Arguments
y |
Numerical vector or matrix of data, or a list of such objects. If a list is provided, then all objects should be of the same dimension. e.g., all vectors, or all matrices with the same number of columns. |
h |
Bandwidth for univariate distribution. Ignored if the bandwidth matrix H is specified. |
H |
Bandwidth matrix for multivariate distribution. If NULL, the kde_bandwidth() function is used. |
method |
The method of bandwidth estimation to use. See kde_bandwidth(). |
... |
Other arguments are passed to kde_bandwidth(). |
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 2.7 and 3.9, https://OTexts.com/weird/.
Examples
dist_kde(c(rnorm(200), rnorm(100, 5)))
dist_kde(cbind(rnorm(200), rnorm(200, 5)))
Wine prices and points
Description
A data set of wines from 44 countries, taken from Wine Enthusiast Magazine during the week of 15 June 2017. The data are downloaded and returned.
Usage
fetch_wine_reviews()
Format
A data frame with 110,203 rows and 8 columns:
- country
Country of origin
- state
State or province of origin
- region
Region of origin
- winery
Name of vineyard that made the wine
- variety
Variety of grape
- points
Points allocated by WineEnthusiast reviewer on a scale of 0-100
- price
Price of a bottle of wine in $US
- year
Year of the wine, extracted from its title
Value
Data frame
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
Examples
## Not run:
wine_reviews <- fetch_wine_reviews()
wine_reviews |>
ggplot(aes(x = points, y = price)) +
geom_jitter(height = 0, width = 0.2, alpha = 0.1) +
scale_y_log10()
## End(Not run)
French mortality rates by age and sex
Description
A data set containing French mortality rates between the years 1816 and 1999, by age and sex.
Usage
fr_mortality
Format
A data frame with 31,648 rows and 4 columns.
Value
Data frame
Source
Human Mortality Database https://www.mortality.org
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
Examples
fr_mortality
Bagplot
Description
Produces a bivariate bagplot. A bagplot is analogous to a univariate boxplot, except it is in two dimensions. Like a boxplot, it shows the median, a region containing 50% of the observations, a region showing the remaining observations other than outliers, and any outliers.
Usage
gg_bagplot(data, var1, var2, color = "#00659e", show_points = FALSE, ...)
Arguments
data |
A data frame or matrix containing the data. |
var1 |
The name of the first variable to plot (a bare expression). |
var2 |
The name of the second variable to plot (a bare expression). |
color |
The base color to use for the median. Other colors are generated as a mixture of color with white. |
show_points |
A logical argument indicating if a regular bagplot is required (FALSE, the default), or a scatterplot of the data (TRUE). |
... |
Other arguments are passed to the aplpack::compute.bagplot() function. |
Value
A ggplot object showing a bagplot or scatterplot of the data.
Author(s)
Rob J Hyndman
References
Rousseeuw, P. J., Ruts, I., & Tukey, J. W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53(4), 382–387.
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 5.6, https://OTexts.com/weird/.
Examples
gg_bagplot(n01, v1, v2)
gg_bagplot(n01, v1, v2, show_points = TRUE)
Produce ggplot of densities from distributional objects in 1 or 2 dimensions
Description
Produce ggplot of densities from distributional objects in 1 or 2 dimensions
Usage
gg_density(
object,
prob = seq(9)/10,
hdr = NULL,
show_points = FALSE,
show_mode = FALSE,
show_anomalies = FALSE,
colors = c("#0072b2", "#D55E00", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442",
"#333333"),
alpha = NULL,
jitter = FALSE,
ngrid = 501
)
Arguments
object |
distribution object from the distributional package or dist_kde(). |
prob |
Probability of the HDRs to be drawn. |
hdr |
Character string describing how the HDRs are to be shown. Options
are "none", "fill", "points" and "contours" (the latter only for bivariate plots).
If NULL (the default), no HDRs are shown. |
show_points |
If TRUE, the observations used in estimating the density are plotted. |
show_mode |
If TRUE, the mode of the distribution is shown as a point. |
show_anomalies |
If TRUE, anomalies with surprisal probabilities less than 0.005 are shown in black. |
colors |
Color palette to use. If there are more than length(colors) distributions, the colors are recycled. |
alpha |
Transparency of points. Ignored if show_points is FALSE. |
jitter |
For univariate distributions, when show_points is TRUE, a logical value indicating if the points should be vertically jittered to reduce overplotting. |
ngrid |
Number of grid points to use for the density function. |
Details
This function produces a ggplot of a density from a distributional object.
For univariate densities, it produces a line plot of the density function, with
an optional ribbon showing some highest density regions (HDRs) and/or the observations.
For bivariate densities, it produces an HDR contour plot of the density function, with
the observations optionally shown as points.
The mode can also be drawn as a point.
The combination of hdr = "fill", show_points = TRUE,
show_mode = TRUE, and prob = c(0.5, 0.99) is equivalent to showing
HDR boxplots.
Value
A ggplot object.
Author(s)
Rob J Hyndman
Examples
# Univariate densities
kde <- dist_kde(c(rnorm(500), rnorm(500, 4, .5)))
gg_density(kde,
hdr = "fill", prob = c(0.5, 0.95), color = "#c14b14",
show_mode = TRUE, show_points = TRUE, jitter = TRUE
)
c(dist_normal(), kde) |>
gg_density(hdr = "fill", prob = c(0.5, 0.95))
# Bivariate density
tibble(y1 = rnorm(5000), y2 = y1 + rnorm(5000)) |>
dist_kde() |>
gg_density(show_points = TRUE, alpha = 0.1, hdr = "fill")
Add ggplot layer of densities from distributional objects in 1 dimension
Description
Add ggplot layer of densities from distributional objects in 1 dimension
Usage
gg_density_layer(object, scale = 1, ngrid = 501, ...)
Arguments
object |
distribution object from the distributional package or dist_kde(). |
scale |
Scaling factor for the density function. |
ngrid |
Number of grid points to use for the density function. |
... |
Additional arguments are passed to ggplot2::geom_line(). |
Details
This function adds a ggplot layer of a density from a distributional object. For univariate densities, it adds a line plot of the density function. For bivariate densities, it adds a contour plot of the density function.
Value
A ggplot layer
Author(s)
Rob J Hyndman
Examples
dist_mixture(
dist_normal(-2, 1),
dist_normal(2, 1),
weights = c(1 / 3, 2 / 3)
) |>
gg_density() +
gg_density_layer(dist_normal(-2, 1), linetype = "dashed", scale = 1 / 3) +
gg_density_layer(dist_normal(2, 1), linetype = "dashed", scale = 2 / 3)
HDR plot
Description
Produces a 1d or 2d box plot of HDR regions. The darker regions contain observations with higher probability, while the lighter regions contain points with lower probability. Observations outside the largest HDR are shown as individual points. Anomalies with leave-one-out surprisal probabilities less than 0.005 are optionally shown in black.
Usage
gg_hdrboxplot(
data,
var1,
var2 = NULL,
prob = c(0.5, 0.99),
color = "#0072b2",
show_points = FALSE,
show_anomalies = TRUE,
alpha = NULL,
jitter = TRUE,
ngrid = 501,
...
)
Arguments
data |
A data frame or matrix containing the data. |
var1 |
The name of the first variable to plot (a bare expression). |
var2 |
Optionally, the name of the second variable to plot (a bare expression). |
prob |
A numeric vector specifying the coverage probabilities for the HDRs. |
color |
The base color to use for the mode. Colors for the HDRs are generated by whitening this color. |
show_points |
A logical argument indicating if a regular HDR plot is required (FALSE, the default), or a scatterplot of the data (TRUE). |
show_anomalies |
A logical argument indicating if the surprisal anomalies should be shown (in black). These are points with leave-one-out surprisal probability values less than 0.005 (using a GPD approximation), and which lie outside the 99% HDR region. |
alpha |
Transparency of points. Ignored if show_points is FALSE. |
jitter |
A logical value indicating if the points should be vertically jittered for the 1d box plots to reduce overplotting. |
ngrid |
Number of grid points to use for the density function. |
... |
Other arguments passed to dist_kde(). |
Details
The original HDR boxplot proposed by Hyndman (1996) can be produced
with show_anomalies = FALSE, jitter = FALSE, alpha = 1, and all other
arguments set to their defaults.
Value
A ggplot object showing an HDR plot or scatterplot of the data.
Author(s)
Rob J Hyndman
References
Hyndman, R J (1996) Computing and Graphing Highest Density Regions, The American Statistician, 50(2), 120–126. https://robjhyndman.com/publications/hdr/
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 5.7, https://OTexts.com/weird/.
Examples
df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10))
gg_hdrboxplot(df, x, show_anomalies = TRUE)
cricket_batting |>
filter(Innings > 20) |>
gg_hdrboxplot(Average)
oldfaithful |>
gg_hdrboxplot(duration, waiting, show_points = TRUE)
GLOSH scores
Description
Compute Global-Local Outlier Score from Hierarchies. This is based
on hierarchical clustering where the minimum cluster size is k. The resulting
outlier score is a measure of how anomalous each observation is.
The function uses dbscan::hdbscan to do the calculation.
Usage
glosh_scores(y, k = 10, ...)
Arguments
y |
Numerical matrix or vector of data |
k |
Minimum cluster size. Default: 10. |
... |
Additional arguments passed to dbscan::hdbscan(). |
Value
Numerical vector containing GLOSH values
Author(s)
Rob J Hyndman
See Also
dbscan::glosh
Examples
y <- c(rnorm(49), 5)
glosh_scores(y)
Statistical tests for anomalies using Grubbs' test and Dixon's test
Description
Grubbs' test (proposed in 1950) identifies possible anomalies in univariate data using z-scores assuming the data come from a normal distribution. Dixon's test (also from 1950) compares the difference in the largest two values to the range of the data. Critical values for Dixon's test have been computed using simulation with interpolation using a quadratic model on logit(alpha) and log(log(n)).
Usage
grubbs_anomalies(y, alpha = 0.05)
dixon_anomalies(y, alpha = 0.05, two_sided = TRUE)
Arguments
y |
numerical vector of observations |
alpha |
size of the test. |
two_sided |
If TRUE, both the minimum and the maximum are considered as potential outliers; otherwise only the maximum. |
Details
Grubbs' test is based on z-scores, and a point is identified as an
anomaly when the associated absolute z-score is greater than a threshold value.
A vector of logical values is returned, where TRUE indicates an anomaly.
This version of Grubbs' test looks for outliers anywhere in the sample.
Grubbs' original test came in several variations which looked for one outlier,
or two outliers in one tail, or two outliers on opposite tails. These variations
are implemented in the outliers::grubbs.test() function.
Dixon's test only considers the maximum (and possibly the minimum) as potential outliers.
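The z-score logic described above can be sketched in base R. The critical value below is the classical one-outlier Grubbs threshold based on the t distribution; grubbs_anomalies() may use a different threshold, so this is illustrative only:

```r
# Flag observations whose absolute z-score exceeds the classical
# Grubbs critical value (one outlier, two-sided, size alpha).
set.seed(1)
grubbs_threshold <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / (2 * n), df = n - 2)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}
y <- c(rnorm(100), 8)
z <- abs(y - mean(y)) / sd(y)
flagged <- which(z > grubbs_threshold(length(y)))
flagged
```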
Value
A logical vector
Author(s)
Rob J Hyndman
References
Grubbs, F. E. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics, 21(1), 27–58.
Dixon, W. J. (1950). Analysis of extreme values. Annals of Mathematical Statistics, 21(4), 488–506.
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 4.4, https://OTexts.com/weird/.
Examples
x <- c(rnorm(1000), 5:10)
tibble(x = x) |> filter(grubbs_anomalies(x))
tibble(x = x) |> filter(dixon_anomalies(x))
y <- c(rnorm(1000), 5)
tibble(y = y) |> filter(grubbs_anomalies(y))
tibble(y = y) |> filter(dixon_anomalies(y))
Identify anomalies using the Hampel filter
Description
The Hampel filter is designed to find anomalies in time series data using mean absolute deviations in the vicinity of each observation.
Usage
hampel_anomalies(y, bandwidth, k = 3)
Arguments
y |
numeric vector containing time series |
bandwidth |
integer width of the window around each observation |
k |
numeric number of standard deviations to declare an outlier |
Details
First, a moving median is calculated using windows of size
2 * bandwidth + 1. Then the median absolute deviations from
this moving median are calculated in the same moving windows.
A point is declared an anomaly when its absolute deviation from the moving
median exceeds k standard deviations. The MAD is converted to a standard
deviation using MAD * 1.482602, which holds for normally distributed data.
The first bandwidth observations and the last bandwidth observations cannot
be declared anomalies.
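The steps described above can be sketched directly in base R (hampel_sketch below is illustrative; hampel_anomalies() is the package implementation):

```r
# Moving median + MAD within windows of size 2*bandwidth + 1; a point is
# an anomaly when it lies more than k estimated sds from the local median.
set.seed(1)
hampel_sketch <- function(y, bandwidth, k = 3) {
  n <- length(y)
  out <- rep(FALSE, n)
  for (i in (bandwidth + 1):(n - bandwidth)) {
    window <- y[(i - bandwidth):(i + bandwidth)]
    med <- median(window)
    sigma <- 1.482602 * median(abs(window - med))
    out[i] <- abs(y[i] - med) > k * sigma
  }
  out
}
y <- c(rnorm(20), 5, rnorm(20))
which(hampel_sketch(y, bandwidth = 3, k = 4))
```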
Value
logical vector identifying which observations are anomalies.
Author(s)
Rob J Hyndman
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 9.2, https://OTexts.com/weird/.
Examples
set.seed(1)
df <- tibble(
time = seq(41),
y = c(rnorm(20), 5, rnorm(20))
) |>
mutate(hampel = hampel_anomalies(y, bandwidth = 3, k = 4))
df |> ggplot(aes(x = time, y = y)) +
geom_line() +
geom_point(data = df |> filter(hampel), col = "red")
Table of Highest Density Regions
Description
Compute a table of highest density regions (HDR) for a distributional object.
The HDRs are returned as a tibble with one row per interval and columns:
prob (giving the probability coverage), and
density (the value of the density at the boundary of the HDR).
For one dimensional density functions, the tibble also has columns
lower (the lower ends of the intervals), and
upper (the upper ends of the intervals).
Usage
hdr_table(object, prob)
Arguments
object |
Distributional object such as that returned by dist_kde(). |
prob |
Vector of probabilities giving the HDR coverage (between 0 and 1) |
Value
A tibble
Author(s)
Rob J Hyndman
Examples
# Univariate HDRs
c(dist_normal(), dist_kde(c(rnorm(100), rnorm(100, 3, 1)))) |>
hdr_table(c(0.5, 0.95))
dist_kde(oldfaithful$duration) |> hdr_table(0.95)
# Bivariate HDRs
dist_kde(oldfaithful[, c("duration", "waiting")]) |> hdr_table(0.90)
Robust bandwidth estimation for kernel density estimation
Description
Bandwidth matrices are estimated using either a robust version of the normal reference rule, or using the approach of Hyndman, Kandanaarachchi & Turner (2026).
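For univariate data, a robust normal reference rule can be sketched as follows, using a robust scale estimate in place of the sample standard deviation. The constants here are the usual Silverman-style ones; the exact rule used by kde_bandwidth() may differ:

```r
# Normal reference rule with a robust scale estimate: the smaller of the
# sd and the IQR rescaled to be consistent with the sd under normality.
set.seed(1)
nrr_bandwidth <- function(y) {
  s <- min(sd(y), IQR(y) / 1.349)
  1.06 * s * length(y)^(-1/5)
}
h <- nrr_bandwidth(rnorm(500))
h
```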
Usage
kde_bandwidth(data, method = c("robust", "normal", "plugin", "lookout"), ...)
Arguments
data |
A numeric matrix or data frame. |
method |
A character string giving the method to use. Possibilities are "robust" (the default), "normal", "plugin" and "lookout". |
... |
Additional arguments are ignored unless |
Value
A matrix of bandwidths (or a scalar in the case of univariate data).
Author(s)
Rob J Hyndman
References
Rob J Hyndman, Sevvandi Kandanaarachchi & Katharine Turner (2026) "When lookout sees crackle: Anomaly detection via kernel density estimation", unpublished. https://robjhyndman.com/publications/lookout2.html
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 2.7 and 3.9, https://OTexts.com/weird/.
Examples
# Univariate bandwidth calculation
kde_bandwidth(oldfaithful$duration)
# Bivariate bandwidth calculation
kde_bandwidth(oldfaithful[, c("duration", "waiting")])
Local outlier factors
Description
Compute local outlier factors using k nearest neighbours. A local
outlier factor is a measure of how anomalous each observation is based on
the density of neighbouring points.
The function uses dbscan::lof to do the calculation.
Usage
lof_scores(y, k = 10, ...)
Arguments
y |
Numerical matrix or vector of data |
k |
Number of neighbours to include. Default: 10. |
... |
Additional arguments passed to dbscan::lof(). |
Value
Numerical vector containing LOF values
Author(s)
Rob J Hyndman
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 7.3, https://OTexts.com/weird/.
See Also
dbscan::lof
Examples
y <- c(rnorm(49), 5)
lof_scores(y)
Multivariate standard normal data
Description
A synthetic data set containing 1000 observations on 10 variables generated from independent standard normal distributions.
Usage
n01
Format
A data frame with 1000 rows and 10 columns.
Value
Data frame
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
Examples
n01
Old faithful eruption data
Description
A data set containing data on recorded eruptions of the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA, from 14 January 2017 to 29 December 2023. Recordings are incomplete, especially during the winter months when observers may not be present.
Usage
oldfaithful
Format
A data frame with 2097 rows and 4 columns:
- time
Time eruption started
- recorded_duration
Duration of eruption as recorded
- duration
Duration of eruption in seconds
- waiting
Time to the following eruption in seconds
Value
Data frame
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
Examples
oldfaithful |>
ggplot(aes(x = duration, y = waiting)) +
geom_point()
Anomalies according to Peirce's and Chauvenet's criteria
Description
Peirce's criterion and Chauvenet's criterion were both proposed in the 1800s as ways of determining which observations should be rejected in a univariate sample.
Usage
peirce_anomalies(y)
chauvenet_anomalies(y)
Arguments
y |
numerical vector of observations |
Details
These functions take a univariate sample y and return a logical
vector indicating which observations should be considered anomalies according
to either Peirce's criterion or Chauvenet's criterion.
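Chauvenet's criterion, as usually stated, rejects an observation when the expected number of equally extreme values in a normal sample, n * P(|Z| > |z|), falls below one half. A base-R sketch (chauvenet_sketch is illustrative; chauvenet_anomalies() may differ in detail, e.g. in how location and scale are estimated):

```r
# Reject y[i] when n * 2 * (1 - pnorm(|z_i|)) < 0.5, with z-scores
# computed from the sample mean and sd.
set.seed(1)
chauvenet_sketch <- function(y) {
  z <- abs(y - mean(y)) / sd(y)
  length(y) * 2 * (1 - pnorm(z)) < 0.5
}
y <- c(rnorm(100), 6)
which(chauvenet_sketch(y))
```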
Value
A logical vector
Author(s)
Rob J Hyndman
References
Peirce, B. (1852). Criterion for the rejection of doubtful observations. The Astronomical Journal, 2(21), 161–163.
Chauvenet, W. (1863). 'Method of least squares'. Appendix to Manual of Spherical and Practical Astronomy, Vol.2, Lippincott, Philadelphia, pp.469-566.
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Section 4.3, https://OTexts.com/weird/.
Examples
y <- rnorm(1000)
tibble(y = y) |> filter(peirce_anomalies(y))
tibble(y = y) |> filter(chauvenet_anomalies(y))
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
Stray anomalies
Description
Test if observations are anomalies according to the stray algorithm.
Usage
stray_anomalies(y, ...)
Arguments
y |
A vector, matrix, or data frame consisting of numerical variables. |
... |
Other arguments are passed to stray::find_HDoutliers(). |
Value
Logical vector indicating whether each observation is identified as an anomaly using the stray algorithm.
Author(s)
Rob J Hyndman
References
P D Talagala, R J Hyndman and K Smith-Miles (2021) Anomaly detection in high-dimensional data, Journal of Computational and Graphical Statistics, 30(2), 360-374.
Examples
# Univariate data
y <- c(6, rnorm(49))
stray_anomalies(y)
# Bivariate data
y <- cbind(rnorm(50), c(5, rnorm(49)))
stray_anomalies(y)
Stray scores
Description
Compute stray scores indicating how anomalous each observation is.
Usage
stray_scores(y, ...)
Arguments
y |
A vector, matrix, or data frame consisting of numerical variables. |
... |
Other arguments are passed to stray::find_HDoutliers(). |
Value
Numerical vector containing stray scores.
Author(s)
Rob J Hyndman
References
P D Talagala, R J Hyndman and K Smith-Miles (2021) Anomaly detection in high-dimensional data, Journal of Computational and Graphical Statistics, 30(2), 360-374.
Examples
# Univariate data
y <- c(6, rnorm(49))
scores <- stray_scores(y)
threshold <- stray::find_threshold(scores, alpha = 0.01, outtail = "max", p = 0.5, tn = 50)
which(scores > threshold)
Surprisals and surprisal probabilities
Description
A surprisal is given by s = -\log f(y) where f is the
density or probability mass function of the estimated or assumed distribution,
and y is an observation. This is returned by surprisals().
A surprisal probability is the probability of a surprisal at least as extreme
as s. This is returned by surprisals_prob().
Usage
surprisals(object, ...)
surprisals_prob(
object,
approximation = c("none", "gpd", "rank"),
threshold_probability = 0.1,
...
)
Arguments
object |
A model or numerical data set |
... |
Other arguments are passed to the appropriate method. |
approximation |
Character string specifying the method to use in computing the surprisal probabilities. See Details below. |
threshold_probability |
Probability threshold when computing the GPD
approximation. This is the probability below which the GPD is fitted. Only
used if approximation = "gpd". |
Details
The surprisal probabilities may be computed in three different ways.
- When approximation = "none" (the default), the surprisal probabilities are computed using the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
- When approximation = "gpd", the surprisal probabilities are computed using a Generalized Pareto Distribution fitted to the most extreme surprisal values (those with probability less than threshold_probability). For surprisal probabilities greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
- When approximation = "rank", the surprisal probability of each observation is estimated using the proportion of observations with greater surprisal values; i.e., 1 - rank(s)/n, where rank(s) is the rank of the surprisal value s among all surprisal values, and n is the number of observations. This is a nonparametric approach that is also insensitive to the distribution used in computing the surprisal values.
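For a standard normal distribution the surprisal is monotone in |y|, so the surprisal probability reduces to a two-sided tail probability. A short illustration in base R:

```r
# Surprisal s = -log f(y) under N(0,1), and the probability of a
# surprisal at least as extreme, which here equals 2*(1 - pnorm(|y|)).
y <- c(-1, 0, 2, 4)
s <- -log(dnorm(y))
p <- 2 * (1 - pnorm(abs(y)))
cbind(y, s, p)
```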
Value
A numerical vector containing the surprisals or surprisal probabilities.
Author(s)
Rob J Hyndman
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Chapter 6, https://OTexts.com/weird/.
See Also
For specific methods, see surprisals.numeric() and surprisals.lm().
Surprisals and surprisal probabilities computed from a model
Description
A surprisal is given by s = -\log f(y) where f is the
density or probability mass function of the estimated or assumed distribution,
and y is an observation. This is returned by surprisals().
A surprisal probability is the probability of a surprisal at least as extreme
as s. This is returned by surprisals_prob().
Usage
## S3 method for class 'lm'
surprisals(object, loo = FALSE, ...)
## S3 method for class 'lm'
surprisals_prob(
object,
approximation = c("none", "gpd", "rank"),
threshold_probability = 0.1,
loo = FALSE,
...
)
## S3 method for class 'gam'
surprisals(object, ...)
## S3 method for class 'gam'
surprisals_prob(
object,
approximation = c("none", "gpd", "rank"),
threshold_probability = 0.1,
...
)
Arguments
object |
A model object such as returned by |
loo |
Should leave-one-out surprisals be computed? For computational
reasons, this is only available for lm models. |
... |
Other arguments are ignored. |
approximation |
Character string specifying the method to use in computing the surprisal probabilities. See Details below. |
threshold_probability |
Probability threshold when computing the GPD
approximation. This is the probability below which the GPD is fitted. Only
used if approximation = "gpd". |
Details
The surprisal probabilities may be computed in three different ways.
- When approximation = "none" (the default), the surprisal probabilities are computed using the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
- When approximation = "gpd", the surprisal probabilities are computed using a Generalized Pareto Distribution fitted to the most extreme surprisal values (those with probability less than threshold_probability). For surprisal probabilities greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
- When approximation = "rank", the surprisal probability of each observation is estimated using the proportion of observations with greater surprisal values; i.e., 1 - rank(s)/n, where rank(s) is the rank of the surprisal value s among all surprisal values, and n is the number of observations. This is a nonparametric approach that is also insensitive to the distribution used in computing the surprisal values.
Value
A numerical vector containing the surprisals or surprisal probabilities.
Author(s)
Rob J Hyndman
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Chapter 6, https://OTexts.com/weird/.
See Also
For specific methods, see surprisals.numeric() and surprisals.lm().
Examples
# A linear model (i.e., a conditional Gaussian distribution)
lm_of <- lm(waiting ~ duration, data = oldfaithful)
oldfaithful |>
mutate(
fscore = surprisals_prob(lm_of),
prob = surprisals_prob(lm_of, loo = TRUE)
) |>
ggplot(aes(
x = duration, y = waiting,
color = prob < 0.01
)) +
geom_point()
# A Poisson GLM
glm_breaks <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
warpbreaks |>
mutate(prob = surprisals_prob(glm_breaks)) |>
filter(prob < 0.05)
Surprisals and surprisal probabilities computed from data
Description
A surprisal is given by s = -\log f(y) where f is the
density or probability mass function of the estimated or assumed distribution,
and y is an observation. This is returned by surprisals().
A surprisal probability is the probability of a surprisal at least as extreme
as s. This is returned by surprisals_prob().
Usage
## S3 method for class 'numeric'
surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...)
## S3 method for class 'matrix'
surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...)
## S3 method for class 'data.frame'
surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...)
## S3 method for class 'numeric'
surprisals_prob(
object,
approximation = c("none", "gpd", "rank"),
threshold_probability = 0.1,
distribution = dist_kde(object, ...),
loo = FALSE,
...
)
## S3 method for class 'matrix'
surprisals_prob(
object,
approximation = c("none", "gpd", "rank"),
threshold_probability = 0.1,
distribution = dist_kde(object, ...),
loo = FALSE,
...
)
## S3 method for class 'data.frame'
surprisals_prob(
object,
approximation = c("none", "gpd", "rank"),
threshold_probability = 0.1,
distribution = dist_kde(object, ...),
loo = FALSE,
...
)
Arguments
object |
A numerical data set (either a vector, matrix, or a data.frame containing only numerical columns). |
distribution |
A distribution object. By default, a kernel density
estimate is computed from the data object. |
loo |
Should leave-one-out surprisals be computed? |
... |
Other arguments are passed to the appropriate method. |
approximation |
Character string specifying the method to use in computing the surprisal probabilities. See Details below. For a multivariate data set, it needs to be set to either "gpd" or "rank". |
threshold_probability |
Probability threshold when computing the GPD
approximation. This is the probability below which the GPD is fitted. Only
used if approximation = "gpd". |
Details
The surprisal probabilities may be computed in three different ways.
- When approximation = "none" (the default), the surprisal probabilities are computed using the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
- When approximation = "gpd", the surprisal probabilities are computed using a Generalized Pareto Distribution fitted to the most extreme surprisal values (those with probability less than threshold_probability). For surprisal probabilities greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
- When approximation = "rank", the surprisal probability of each observation is estimated using the proportion of observations with greater surprisal values; i.e., 1 - rank(s)/n, where rank(s) is the rank of the surprisal value s among all surprisal values, and n is the number of observations. This is a nonparametric approach that is also insensitive to the distribution used in computing the surprisal values.
Value
A numerical vector containing the surprisals or surprisal probabilities.
Author(s)
Rob J Hyndman
References
Rob J Hyndman (2026) "That's weird: Anomaly detection using R", Chapter 6, https://OTexts.com/weird/.
Examples
# Univariate data
tibble(
y = c(5, rnorm(49)),
p_kde = surprisals_prob(y, loo = TRUE),
p_normal = surprisals_prob(y, distribution = dist_normal()),
p_zscore = 2 * (1 - pnorm(abs(y)))
)
tibble(
y = n01$v1,
prob1 = surprisals_prob(y),
prob2 = surprisals_prob(y, loo = TRUE),
prob3 = surprisals_prob(y, distribution = dist_normal()),
prob4 = surprisals_prob(y, distribution = dist_normal(), approximation = "gpd")
) |>
arrange(prob1)
# Bivariate data
tibble(
x = rnorm(50),
y = c(5, rnorm(49)),
prob = surprisals_prob(cbind(x, y), approximation = "gpd")
)
oldfaithful |>
mutate(
s = surprisals(cbind(duration, waiting), loo = TRUE),
p = surprisals_prob(cbind(duration, waiting), loo = TRUE, approximation = "gpd")
) |>
arrange(p)