An overview of AnthropMMD

Frédéric Santos, <frederic.santos[at]u-bordeaux.fr>

1 Graphical user interface

AnthropMMD was primarily designed as an R-Shiny application (Santos 2018), that can be launched with the following instruction:

start_mmd()

Starting with the version 3.0.0, it can also be used from the command line, to make it convenient in a context of reproducible research.

2 Using AnthropMMD from the command line

2.1 Data formatting

The MMD is a dissimilarity measure among groups of individuals described by binary (presence-absence) traits (Sjøvold 1973, 1977; Harris and Sjøvold 2004). The MMD formula only requires to know the sample sizes and relative trait frequencies within each group: providing the raw individual data is not mandatory.

Usually, the user will have recorded the data in one of the the two following formats.

2.1.1 Raw binary data

Most of the time, the data will be formatted as a classical dataframe with \(n\) rows (one per individual) and \(p+1\) columns (the first column must be a group indicator; the \(p\) other columns correspond to the \(p\) binary traits observed). Each trait has two possible values: \(1\) if present, \(0\) if absent (missing values are allowed). Rownames are optional but colnames (i.e., headers) are mandatory.

An artificial dataset, toyMMD, is available in the package to give an example of such a format:

data(toyMMD)
head(toyMMD)
#>      Group Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> A_1 GroupA      1     NA      1      1     NA      0     NA      1      0
#> A_2 GroupA     NA     NA      1     NA     NA     NA     NA     NA      1
#> A_3 GroupA     NA      1      1      0      0      1      0      1     NA
#> A_4 GroupA      1      1      0      0      0     NA      1      1      0
#> A_5 GroupA      1      1      1      1     NA      0      1     NA      1
#> A_6 GroupA     NA      1      0      0     NA      1      0     NA      1

It is not necessary for the Traits to be manually converted as factors prior to the analysis. toyMMD includes nine traits with many missing values, and the individuals belong to five groups:

str(toyMMD)
#> 'data.frame':    200 obs. of  10 variables:
#>  $ Group : Factor w/ 5 levels "GroupA","GroupB",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Trait1: int  1 NA NA 1 1 NA 1 NA NA NA ...
#>  $ Trait2: int  NA NA 1 1 1 1 NA 1 NA 1 ...
#>  $ Trait3: int  1 1 1 0 1 0 1 0 1 NA ...
#>  $ Trait4: int  1 NA 0 0 1 0 1 1 1 0 ...
#>  $ Trait5: int  NA NA 0 0 NA NA NA 0 0 NA ...
#>  $ Trait6: int  0 NA 1 NA 0 1 NA NA 1 NA ...
#>  $ Trait7: int  NA NA 0 1 1 0 NA NA NA 0 ...
#>  $ Trait8: int  1 NA 1 1 NA NA NA NA 1 NA ...
#>  $ Trait9: int  0 1 NA 0 1 1 NA 0 NA NA ...

If the data were recorded as a raw binary dataset, they must be converted into a table of relative frequencies prior to the MMD analysis. The R function binary_to_table() with the argument relative = TRUE can perform this operation:

tab <- binary_to_table(toyMMD, relative = TRUE)
tab
#>             Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> N_GroupA    10.000 22.000 36.000 39.000 20.000 23.000 19.000 10.000 19.000
#> N_GroupB     8.000 21.000 21.000 20.000 19.000 17.000 17.000 13.000 20.000
#> N_GroupC    27.000 18.000 27.000 25.000 19.000 26.000 31.000 17.000 36.000
#> N_GroupD    15.000 35.000 30.000 39.000 29.000 30.000 35.000 14.000 19.000
#> N_GroupE    15.000 21.000 21.000 24.000 22.000 20.000 22.000 11.000 38.000
#> Freq_GroupA  0.800  0.727  0.806  0.615  0.000  0.652  0.316  0.800  0.526
#> Freq_GroupB  0.875  0.571  0.952  0.600  0.000  0.882  1.000  0.692  0.500
#> Freq_GroupC  0.630  0.611  0.593  0.560  0.000  0.538  0.645  0.588  0.750
#> Freq_GroupD  0.333  0.886  0.633  0.359  0.069  0.400  0.857  0.429  0.579
#> Freq_GroupE  0.467  0.571  0.714  0.625  0.000  0.800  0.500  0.364  0.447

2.1.2 Tables of sample sizes and absolute frequencies

Sometimes, in particular if the data were extracted from research articles rather than true exprimentation, the user will only know the sample sizes and absolute trait frequencies in each group. As previously stated, those data are perfectly sufficient to compute MMD values. Here is an example of such a table:

data(absolute_freqs)
print(absolute_freqs)
#>             Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> N_GroupA        10     22     36     39     20     23     19     10     19
#> N_GroupB         8     21     21     20     19     17     17     13     20
#> N_GroupC        27     18     27     25     19     26     31     17     36
#> N_GroupD        15     35     30     39     29     30     35     14     19
#> N_GroupE        15     21     21     24     22     20     22     11     38
#> Freq_GroupA      8     16     29     24      0     15      6      8     10
#> Freq_GroupB      7     12     20     12      0     15     17      9     10
#> Freq_GroupC     17     11     16     14      0     14     20     10     27
#> Freq_GroupD      5     31     19     14      2     12     30      6     11
#> Freq_GroupE      7     12     15     15      0     16     11      4     17

If there are \(k\) groups in the data, the first \(k\) rows (whose names must start with an N_ prefix, followed by the group labels) indicates the sample sizes of each trait in each group, and the last \(k\) rows indicates the absolute trait frequencies. Such a table can be directly imported by the user, but must be converted to relative trait frequencies prior to the analysis, for instance by using the function table_relfreq():

tab <- table_relfreq(absolute_freqs)
print(tab)
#>             Trait1 Trait2 Trait3 Trait4 Trait5 Trait6 Trait7 Trait8 Trait9
#> N_GroupA    10.000 22.000 36.000 39.000 20.000 23.000 19.000 10.000 19.000
#> N_GroupB     8.000 21.000 21.000 20.000 19.000 17.000 17.000 13.000 20.000
#> N_GroupC    27.000 18.000 27.000 25.000 19.000 26.000 31.000 17.000 36.000
#> N_GroupD    15.000 35.000 30.000 39.000 29.000 30.000 35.000 14.000 19.000
#> N_GroupE    15.000 21.000 21.000 24.000 22.000 20.000 22.000 11.000 38.000
#> Freq_GroupA  0.800  0.727  0.806  0.615  0.000  0.652  0.316  0.800  0.526
#> Freq_GroupB  0.875  0.571  0.952  0.600  0.000  0.882  1.000  0.692  0.500
#> Freq_GroupC  0.630  0.611  0.593  0.560  0.000  0.538  0.645  0.588  0.750
#> Freq_GroupD  0.333  0.886  0.633  0.359  0.069  0.400  0.857  0.429  0.579
#> Freq_GroupE  0.467  0.571  0.714  0.625  0.000  0.800  0.500  0.364  0.447

2.1.3 Important remark

Due to rounding issues, it is clearly not advised for the user to submit directly a table of relative frequencies. If the absolute frequencies do not perfectly sum up to 1 for each trait, an error will be triggered.

A table of relative frequencies is the starting point of all subsequent analyses, but it should be computed from the data loaded in R; it should not be loaded itself.

A quick summary:

  • if you load a raw binary dataframe, convert it to a table of relative frequencies with the function binary_to_table(relative = TRUE) as shown above;
  • if you load a table of absolute frequencies, convert it to a table of relative frequencies with the function table_relfreq().

2.2 Trait selection

AnthropMMD proposes a built-in feature for trait selection, in order to discard the traits that could be observed on too few individuals, or that do not show enough variability among groups. The function select_traits() allows such a selection according to several strategies (Harris and Sjøvold 2004; Santos 2018).

For instance, with the following instruction, we can discard the traits that could be observed on at least 10 individuals per group, and that exhibit no significant variability in frequencies among groups according to Fisher’s exact tests:

tab_selected <- select_traits(tab, k = 10, strategy = "keepFisher")
tab_selected$filtered
#>             Trait2 Trait3 Trait4 Trait6 Trait7 Trait9
#> N_GroupA    22.000 36.000 39.000 23.000 19.000 19.000
#> N_GroupB    21.000 21.000 20.000 17.000 17.000 20.000
#> N_GroupC    18.000 27.000 25.000 26.000 31.000 36.000
#> N_GroupD    35.000 30.000 39.000 30.000 35.000 19.000
#> N_GroupE    21.000 21.000 24.000 20.000 22.000 38.000
#> Freq_GroupA  0.727  0.806  0.615  0.652  0.316  0.526
#> Freq_GroupB  0.571  0.952  0.600  0.882  1.000  0.500
#> Freq_GroupC  0.611  0.593  0.560  0.538  0.645  0.750
#> Freq_GroupD  0.886  0.633  0.359  0.400  0.857  0.579
#> Freq_GroupE  0.571  0.714  0.625  0.800  0.500  0.447

Trait1 (which could be observed on only 8 individuals in Group B), Trait5 and Trait8 (whose variation in frequencies was not significant among groups) were removed from the data.

2.3 Compute the mean measure of divergence

Once the trait selection has been performed, the MMD can be computed with the mmd() function. With the following instruction, the MMD is computed using Anscombe’s angular transformation of trait frequencies:

mmd.result <- mmd(tab_selected$filtered, angular = "Anscombe")
mmd.result
#> $MMDMatrix
#>        GroupA GroupB GroupC GroupD GroupE
#> GroupA 0.0000 0.4502 0.0831 0.2784 0.0000
#> GroupB 0.0532 0.0000 0.3434 0.3650 0.2563
#> GroupC 0.0456 0.0513 0.0000 0.0984 0.0596
#> GroupD 0.0434 0.0487 0.0411 0.0000 0.2821
#> GroupE 0.0481 0.0540 0.0469 0.0435 0.0000
#> 
#> $MMDSym
#>        GroupA GroupB GroupC GroupD GroupE
#> GroupA 0.0000  0.450 0.0831 0.2784 0.0000
#> GroupB 0.4502  0.000 0.3434 0.3650 0.2563
#> GroupC 0.0831  0.343 0.0000 0.0984 0.0596
#> GroupD 0.2784  0.365 0.0984 0.0000 0.2821
#> GroupE 0.0000  0.256 0.0596 0.2821 0.0000
#> 
#> $MMDSignif
#>        GroupA GroupB GroupC  GroupD  GroupE 
#> GroupA NA     "0.45" "0.083" "0.278" "0"    
#> GroupB "*"    NA     "0.343" "0.365" "0.256"
#> GroupC "NS"   "*"    NA      "0.098" "0.06" 
#> GroupD "*"    "*"    "*"     NA      "0.282"
#> GroupE "NS"   "*"    "NS"    "*"     NA     
#> 
#> $MMDpval
#>        GroupA GroupB GroupC GroupD GroupE
#> GroupA     NA 0.4502 0.0831 0.2784 0.0000
#> GroupB 0.0000     NA 0.3434 0.3650 0.2563
#> GroupC 0.0513 0.0000     NA 0.0984 0.0596
#> GroupD 0.0001 0.0000 0.0277     NA 0.2821
#> GroupE 0.6073 0.0018 0.0506 0.0001     NA
#> 
#> attr(,"class")
#> [1] "anthropmmd_result"

2.4 Graphical representations of MMD

2.4.1 Multidimensional scaling

Although mmd.result$MMDSym can perfectly be passed to your favourite function to produce a MDS plot, AnthropMMD also proposes a built-in generic function for such a graphical representation: plot().

The MDS can be computed using the classical stats::cmdscale() function (and the produces a metric MDS), or several variants of MDS algorithms implemented in the R package smacof.

For instance, we plot here the MDS coordinates computed with one variant of SMACOF algorithms:

par(cex = 0.8)
plot(x = mmd.result, method = "interval", 
     gof = TRUE, axes = TRUE, xlim = c(-1.2, 0.75))

The argument gof = TRUE displays goodness of fits statistics for the MDS configurations directly on the plot.

2.4.2 Hierarchical clustering

AnthropMMD does not propose a built-in function for hierarchical clustering, but such a plot can easily be obtained with the usual R functions.

library(cluster)
par(cex = 0.8)
plot(agnes(mmd.result$MMDSym), which.plots = 2, main = "Dendrogram of MMD dissimilarities")

3 Alternative: analysis using bootstrapped samples

Starting with AnthropMMD 4.0.0, a bootstrap method introduced by Daniel Fidalgo and co-workers (Fidalgo, Hubbe, and Wesolowski 2021; Fidalgo, Wesolowski, and Hubbe 2022) is now available. Please note that:

Let’s start again from the dataset toyMMD. The first step is to perform the resampling and the MMD computations using the separate function mmd_boot(). Fidalgo, Wesolowski, and Hubbe (2022) suggest a value of B = 100 bootstrap experiments, it is thus the default value for the corresponding argument in this function. Below, for this simple example, we will use a value of B = 50. Be careful: the computation time (in the current implementation) will grow quite fast for high values of B.

Note that mmd_boot() takes care itself of the trait selection. All arguments usually submitted to select_traits() can also be passed to mmd_boot(); see example below.

## Load the example data once again:
data(toyMMD)
## Compute MMD among bootstrapped samples:
set.seed(2023) # set seed for reproducibility
resboot <- mmd_boot(
    data = toyMMD,
    B = 50, # number of bootstrap samples
    angular = "Anscombe",
    strategy = "keepFisher", # strategy for trait selection
    k = 10 # minimal number of observations required per trait
)

Now you can simply plot() the result of your computations, and set various parameters to customize your plot:

## MDS plot for bootstrapped samples:
plot(
    x = resboot,
    method = "interval", # algorithm used for MDS computation
    level = 0.95, # confidence level for the contour lines
    gof = TRUE # display goodness of fit statistic
)

This plot displays the contour lines of a 2D kernel density estimation, as in Figure 3 from Fidalgo, Wesolowski, and Hubbe (2022).

References

de Souza, Peter, and Philip Houghton. 1977. “The Mean Measure of Divergence and the Use of Non-Metric Data in the Estimation of Biological Distances.” Journal of Archaeological Science 4 (2): 163–69. https://doi.org/10.1016/0305-4403(77)90063-2.
Donlon, D. A. 2000. “The Value of Infracranial Nonmetric Variation in Studies of Modern Homo Sapiens: An Australian Focus.” American Journal of Physical Anthropology 113 (3): 349–68. https://doi.org/10.1002/1096-8644(200011)113:3<349::AID-AJPA6>3.0.CO;2-2.
Fidalgo, Daniel, Mark Hubbe, and Veronica Wesolowski. 2021. “Population History of Brazilian South and Southeast Shellmound Builders Inferred Through Dental Morphology.” American Journal of Physical Anthropology 176 (2): 192–207. https://doi.org/10.1002/ajpa.24342.
Fidalgo, Daniel, Veronica Wesolowski, and Mark Hubbe. 2022. “Biological Affinities of Brazilian Pre-Colonial Coastal Communities Explored Through Bootstrapped Biodistances of Dental Non-Metric Traits.” Journal of Archaeological Science 138: 105545. https://doi.org/10.1016/j.jas.2022.105545.
Harris, E. F., and T. Sjøvold. 2004. Calculations of Smith’s Mean Measure of Divergence for Intergroup Comparisons Using Nonmetric Data.” Dental Anthropology 17: 83–93.
Santos, Frédéric. 2018. AnthropMMD: An R package with a graphical user interface for the mean measure of divergence.” American Journal of Physical Anthropology 165 (1): 200–205. https://doi.org/10.1002/ajpa.23336.
Sjøvold, T. 1973. “Occurrence of Minor Non-Metrical Variants in the Skeleton and Their Quantitative Treatment for Population Comparisons.” Homo 24: 204–33.
———. 1977. Non-metrical divergence between skeletal populations: the theoretical foundation and biological importance of C.A.B. Smith’s mean measure of divergence.” Ossa 1: 1–133.