AnthropMMD
was primarily designed as an R-Shiny
application (Santos 2018), that can be
launched with the following instruction:
Starting with the version 3.0.0, it can also be used from the command line, to make it convenient in a context of reproducible research.
AnthropMMD
from the command lineThe MMD is a dissimilarity measure among groups of individuals described by binary (presence-absence) traits (Sjøvold 1973, 1977; Harris and Sjøvold 2004). The MMD formula only requires to know the sample sizes and relative trait frequencies within each group: providing the raw individual data is not mandatory.
Usually, the user will have recorded the data in one of the the two following formats.
Most of the time, the data will be formatted as a classical dataframe
with \(n\) rows (one per individual)
and \(p+1\) columns (the first column
must be a group indicator; the \(p\)
other columns correspond to the \(p\)
binary traits observed). Each trait has two possible values: \(1\) if present, \(0\) if absent (missing values are allowed).
Rownames are optional but colnames (i.e., header
s) are
mandatory.
An artificial dataset, toyMMD
, is available in the
package to give an example of such a format:
data(toyMMD)
head(toyMMD)
#> Group Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> A_1 GroupA 1 NA 1 1 NA 0 NA 1 0
#> A_2 GroupA NA NA 1 NA NA NA NA NA 1
#> A_3 GroupA NA 1 1 0 0 1 0 1 NA
#> A_4 GroupA 1 1 0 0 0 NA 1 1 0
#> A_5 GroupA 1 1 1 1 NA 0 1 NA 1
#> A_6 GroupA NA 1 0 0 NA 1 0 NA 1
It is not necessary for the Trait
s to be manually
converted as factors prior to the analysis. toyMMD
includes
nine traits with many missing values, and the individuals belong to five
groups:
str(toyMMD)
#> 'data.frame': 200 obs. of 10 variables:
#> $ Group : Factor w/ 5 levels "GroupA","GroupB",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ Trait1: int 1 NA NA 1 1 NA 1 NA NA NA ...
#> $ Trait2: int NA NA 1 1 1 1 NA 1 NA 1 ...
#> $ Trait3: int 1 1 1 0 1 0 1 0 1 NA ...
#> $ Trait4: int 1 NA 0 0 1 0 1 1 1 0 ...
#> $ Trait5: int NA NA 0 0 NA NA NA 0 0 NA ...
#> $ Trait6: int 0 NA 1 NA 0 1 NA NA 1 NA ...
#> $ Trait7: int NA NA 0 1 1 0 NA NA NA 0 ...
#> $ Trait8: int 1 NA 1 1 NA NA NA NA 1 NA ...
#> $ Trait9: int 0 1 NA 0 1 1 NA 0 NA NA ...
If the data were recorded as a raw binary dataset, they
must be converted into a table of relative frequencies
prior to the MMD analysis. The R function binary_to_table()
with the argument relative = TRUE
can perform this
operation:
tab <- binary_to_table(toyMMD, relative = TRUE)
tab
#> Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> N_GroupA 10.000 22.000 36.000 39.000 20.000 23.000 19.000 10.000 19.000
#> N_GroupB 8.000 21.000 21.000 20.000 19.000 17.000 17.000 13.000 20.000
#> N_GroupC 27.000 18.000 27.000 25.000 19.000 26.000 31.000 17.000 36.000
#> N_GroupD 15.000 35.000 30.000 39.000 29.000 30.000 35.000 14.000 19.000
#> N_GroupE 15.000 21.000 21.000 24.000 22.000 20.000 22.000 11.000 38.000
#> Freq_GroupA 0.800 0.727 0.806 0.615 0.000 0.652 0.316 0.800 0.526
#> Freq_GroupB 0.875 0.571 0.952 0.600 0.000 0.882 1.000 0.692 0.500
#> Freq_GroupC 0.630 0.611 0.593 0.560 0.000 0.538 0.645 0.588 0.750
#> Freq_GroupD 0.333 0.886 0.633 0.359 0.069 0.400 0.857 0.429 0.579
#> Freq_GroupE 0.467 0.571 0.714 0.625 0.000 0.800 0.500 0.364 0.447
Sometimes, in particular if the data were extracted from research articles rather than true exprimentation, the user will only know the sample sizes and absolute trait frequencies in each group. As previously stated, those data are perfectly sufficient to compute MMD values. Here is an example of such a table:
data(absolute_freqs)
print(absolute_freqs)
#> Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> N_GroupA 10 22 36 39 20 23 19 10 19
#> N_GroupB 8 21 21 20 19 17 17 13 20
#> N_GroupC 27 18 27 25 19 26 31 17 36
#> N_GroupD 15 35 30 39 29 30 35 14 19
#> N_GroupE 15 21 21 24 22 20 22 11 38
#> Freq_GroupA 8 16 29 24 0 15 6 8 10
#> Freq_GroupB 7 12 20 12 0 15 17 9 10
#> Freq_GroupC 17 11 16 14 0 14 20 10 27
#> Freq_GroupD 5 31 19 14 2 12 30 6 11
#> Freq_GroupE 7 12 15 15 0 16 11 4 17
If there are \(k\) groups in the
data, the first \(k\) rows (whose names
must start with an N_
prefix, followed by the group labels)
indicates the sample sizes of each trait in each group, and the last
\(k\) rows indicates the
absolute trait frequencies. Such a table can be directly
imported by the user, but must be converted to
relative trait frequencies prior to the analysis, for instance
by using the function table_relfreq()
:
tab <- table_relfreq(absolute_freqs)
print(tab)
#> Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> N_GroupA 10.000 22.000 36.000 39.000 20.000 23.000 19.000 10.000 19.000
#> N_GroupB 8.000 21.000 21.000 20.000 19.000 17.000 17.000 13.000 20.000
#> N_GroupC 27.000 18.000 27.000 25.000 19.000 26.000 31.000 17.000 36.000
#> N_GroupD 15.000 35.000 30.000 39.000 29.000 30.000 35.000 14.000 19.000
#> N_GroupE 15.000 21.000 21.000 24.000 22.000 20.000 22.000 11.000 38.000
#> Freq_GroupA 0.800 0.727 0.806 0.615 0.000 0.652 0.316 0.800 0.526
#> Freq_GroupB 0.875 0.571 0.952 0.600 0.000 0.882 1.000 0.692 0.500
#> Freq_GroupC 0.630 0.611 0.593 0.560 0.000 0.538 0.645 0.588 0.750
#> Freq_GroupD 0.333 0.886 0.633 0.359 0.069 0.400 0.857 0.429 0.579
#> Freq_GroupE 0.467 0.571 0.714 0.625 0.000 0.800 0.500 0.364 0.447
Due to rounding issues, it is clearly not advised for the user to submit directly a table of relative frequencies. If the absolute frequencies do not perfectly sum up to 1 for each trait, an error will be triggered.
A table of relative frequencies is the starting point of all subsequent analyses, but it should be computed from the data loaded in R; it should not be loaded itself.
A quick summary:
binary_to_table(relative = TRUE)
as shown above;table_relfreq()
.AnthropMMD
proposes a built-in feature for trait
selection, in order to discard the traits that could be observed on too
few individuals, or that do not show enough variability among groups.
The function select_traits()
allows such a selection
according to several strategies (Harris and
Sjøvold 2004; Santos 2018).
For instance, with the following instruction, we can discard the traits that could be observed on at least 10 individuals per group, and that exhibit no significant variability in frequencies among groups according to Fisher’s exact tests:
tab_selected <- select_traits(tab, k = 10, strategy = "keepFisher")
tab_selected$filtered
#> Trait2 Trait3 Trait4 Trait6 Trait7 Trait9
#> N_GroupA 22.000 36.000 39.000 23.000 19.000 19.000
#> N_GroupB 21.000 21.000 20.000 17.000 17.000 20.000
#> N_GroupC 18.000 27.000 25.000 26.000 31.000 36.000
#> N_GroupD 35.000 30.000 39.000 30.000 35.000 19.000
#> N_GroupE 21.000 21.000 24.000 20.000 22.000 38.000
#> Freq_GroupA 0.727 0.806 0.615 0.652 0.316 0.526
#> Freq_GroupB 0.571 0.952 0.600 0.882 1.000 0.500
#> Freq_GroupC 0.611 0.593 0.560 0.538 0.645 0.750
#> Freq_GroupD 0.886 0.633 0.359 0.400 0.857 0.579
#> Freq_GroupE 0.571 0.714 0.625 0.800 0.500 0.447
Trait1
(which could be observed on only 8 individuals in
Group B), Trait5
and Trait8
(whose variation
in frequencies was not significant among groups) were removed from the
data.
Once the trait selection has been performed, the MMD can be computed
with the mmd()
function. With the following instruction,
the MMD is computed using Anscombe’s angular transformation of trait
frequencies:
mmd.result <- mmd(tab_selected$filtered, angular = "Anscombe")
mmd.result
#> $MMDMatrix
#> GroupA GroupB GroupC GroupD GroupE
#> GroupA 0.0000 0.4502 0.0831 0.2784 0.0000
#> GroupB 0.0532 0.0000 0.3434 0.3650 0.2563
#> GroupC 0.0456 0.0513 0.0000 0.0984 0.0596
#> GroupD 0.0434 0.0487 0.0411 0.0000 0.2821
#> GroupE 0.0481 0.0540 0.0469 0.0435 0.0000
#>
#> $MMDSym
#> GroupA GroupB GroupC GroupD GroupE
#> GroupA 0.0000 0.450 0.0831 0.2784 0.0000
#> GroupB 0.4502 0.000 0.3434 0.3650 0.2563
#> GroupC 0.0831 0.343 0.0000 0.0984 0.0596
#> GroupD 0.2784 0.365 0.0984 0.0000 0.2821
#> GroupE 0.0000 0.256 0.0596 0.2821 0.0000
#>
#> $MMDSignif
#> GroupA GroupB GroupC GroupD GroupE
#> GroupA NA "0.45" "0.083" "0.278" "0"
#> GroupB "*" NA "0.343" "0.365" "0.256"
#> GroupC "NS" "*" NA "0.098" "0.06"
#> GroupD "*" "*" "*" NA "0.282"
#> GroupE "NS" "*" "NS" "*" NA
#>
#> $MMDpval
#> GroupA GroupB GroupC GroupD GroupE
#> GroupA NA 0.4502 0.0831 0.2784 0.0000
#> GroupB 0.0000 NA 0.3434 0.3650 0.2563
#> GroupC 0.0513 0.0000 NA 0.0984 0.0596
#> GroupD 0.0001 0.0000 0.0277 NA 0.2821
#> GroupE 0.6073 0.0018 0.0506 0.0001 NA
#>
#> attr(,"class")
#> [1] "anthropmmd_result"
$MMDMatrix
, follows the
presentation adopted in most research articles (Donlon 2000): the true MMD values are indicated
above the diagonal, and their standard deviations are indicated below
the diagonal.*
in the component $MMDSignif
.$MMDSym
is a symmetrical matrix of MMD
values, with a null diagonal. This matrix of dissimilarities can be used
to perform a multidimensional scaling or a hierarchical clustering to
visualize the distances among the groups.$MMDpval
. Theoretical details for this test of significance
can be found in de Souza and Houghton
(1977).Although mmd.result$MMDSym
can perfectly be passed to
your favourite function to produce a MDS plot, AnthropMMD
also proposes a built-in generic function for such a graphical
representation: plot()
.
The MDS can be computed using the classical
stats::cmdscale()
function (and the produces a metric MDS),
or several variants of MDS algorithms implemented in the R package
smacof
.
For instance, we plot here the MDS coordinates computed with one variant of SMACOF algorithms:
par(cex = 0.8)
plot(x = mmd.result, method = "interval",
gof = TRUE, axes = TRUE, xlim = c(-1.2, 0.75))
The argument gof = TRUE
displays goodness of fits
statistics for the MDS configurations directly on the plot.
Starting with AnthropMMD 4.0.0
, a bootstrap method
introduced by Daniel Fidalgo and co-workers (Fidalgo, Hubbe, and Wesolowski 2021; Fidalgo,
Wesolowski, and Hubbe 2022) is now available. Please note
that:
Let’s start again from the dataset toyMMD
. The first
step is to perform the resampling and the MMD computations using the
separate function mmd_boot()
. Fidalgo, Wesolowski, and Hubbe (2022) suggest a
value of B = 100
bootstrap experiments, it is thus the
default value for the corresponding argument in this function. Below,
for this simple example, we will use a value of B = 50
. Be
careful: the computation time (in the current implementation) will grow
quite fast for high values of B
.
Note that mmd_boot()
takes care itself of the trait
selection. All arguments usually submitted to
select_traits()
can also be passed to
mmd_boot()
; see example below.
## Load the example data once again:
data(toyMMD)
## Compute MMD among bootstrapped samples:
set.seed(2023) # set seed for reproducibility
resboot <- mmd_boot(
data = toyMMD,
B = 50, # number of bootstrap samples
angular = "Anscombe",
strategy = "keepFisher", # strategy for trait selection
k = 10 # minimal number of observations required per trait
)
Now you can simply plot()
the result of your
computations, and set various parameters to customize your plot:
## MDS plot for bootstrapped samples:
plot(
x = resboot,
method = "interval", # algorithm used for MDS computation
level = 0.95, # confidence level for the contour lines
gof = TRUE # display goodness of fit statistic
)
This plot displays the contour lines of a 2D kernel density estimation, as in Figure 3 from Fidalgo, Wesolowski, and Hubbe (2022).