ZIBR: Zero-Inflated Beta regression model with Random effects

The longitudinal microbiome compositional data are highly skewed, bounded in \([0,1)\), and often sparse with many zeros. In addition, the observations from repeated measures are correlated. We propose a two-part zero-inflated Beta regression model with random effects (ZIBR) for testing the association between microbial abundance and clinical covariates for longitudinal microbiome data. The model includes a logistic component to model presence/absence of the microbe in samples and a Beta component to model non-zero microbial abundance and each component includes a random effect to take into account the correlation among repeated measurements on the same subject.

The ZIBR model combines the logistic regression and Beta regression in one model. Each regression part includes random effects to account for correlations acorss time points. We call these two regressions in ZIBR model as logistic component and Beta component. These two components model two different aspects of the data. The logistic component models presence/absence of the microbe and Beta component models non-zero microbial abundance.

Accordingly, we can test three biologically relevant null hypotheses:
- H0: α_j = 0. This is to test the coefficients in the logistic component, if the covariates are associated with the bacterial taxon by affecting its presence or absence;
- H0: β_j = 0. This is to test the coefficients in the Beta component, if the taxon is associated with the covariates by showing different abundances;
- H0: α_j = 0 and β_j = 0 for each covariate X_j and Z_j. This is to joinly test the coefficients in both logistic and Beta components, if the covariates affect the taxon both in terms of presence/absence and its non-zero abundance.

Simulated Data

The following function will simulate some data according to the zero-inflated beta random effect model. We specify the covariates in the logistic component (X) and covariates in the Beta component (Z) to be the same (i.e set Z=X).

sim <- simulate_zero_inflated_beta_random_effect_data(
  subject_n = 100, time_n = 5,
  X = as.matrix(c(rep(0, 50 * 5), rep(1, 50 * 5))),
  Z = as.matrix(c(rep(0, 50 * 5), rep(1, 50 * 5))),
  alpha = as.matrix(c(-0.5, 1)),
  beta = as.matrix(c(-0.5, 0.5)),
  s1 = 1, s2 = 0.8,
  v = 5,
  sim_seed = 100
)
names(sim)
#>  [1] "Y"           "X"           "Z"           "b"           "c"          
#>  [6] "u"           "v"           "alpha"       "beta"        "s1"         
#> [11] "s2"          "subject_ind" "time_ind"

The simulation function returns the the bacterial abundance (Y) simulated according to the above parameter settings. This function also returns the other variables such as X, Z, alpha, beta etc. that we just specified. It also returns two variables subject.ind and time.ind, which are subject IDs and time points for each subject.

We can run the zibr function to fit the zero-inflated beta random effect model on the simulated data.

zibr_fit <- zibr(
  logistic_cov = sim$X, beta_cov = sim$Z, Y = sim$Y,
  subject_ind = sim$subject_ind, time_ind = sim$time_ind
)
zibr_fit
#> $logistic_est_table
#>             Estimate      Pvalue
#> intersept -0.5719003 0.006957793
#> var1       0.8280726 0.005470065
#> 
#> $logistic_s1_est
#> [1] 1.068014
#> 
#> $beta_est_table
#>             Estimate       Pvalue
#> intersept -0.5930906 4.181627e-05
#> var1       0.5917456 1.699453e-03
#> 
#> $beta_s2_est
#> [1] 0.6303862
#> 
#> $beta_v_est
#> [1] 4.734064
#> 
#> $loglikelihood
#> [1] NA
#> 
#> $joint_p
#> [1] NA

If we use the same covariates for logistic and beta components, ZIBR can jointly test the covariate in both components. In the following example, we use the same covariate X in both components. We can obtain the joint.p as the pvalues of the joint test.

zibr_fit <- zibr(
  logistic_cov = sim$X, beta_cov = sim$X, Y = sim$Y,
  subject_ind = sim$subject_ind, time_ind = sim$time_ind
)
zibr_fit
#> $logistic_est_table
#>             Estimate      Pvalue
#> intersept -0.5719003 0.006957793
#> var1       0.8280726 0.005470065
#> 
#> $logistic_s1_est
#> [1] 1.068014
#> 
#> $beta_est_table
#>             Estimate       Pvalue
#> intersept -0.5930906 4.181627e-05
#> var1       0.5917456 1.699453e-03
#> 
#> $beta_s2_est
#> [1] 0.6303862
#> 
#> $beta_v_est
#> [1] 4.734064
#> 
#> $loglikelihood
#> [1] -286.5855
#> 
#> $joint_p
#>         var1 
#> 0.0001533285

Real Data

Let’s try anohter example on the real data. I will use a dataset from a longitudinal human microbiome study containing the bacterial abundance and clinical information from Lewis and Chen et al.. I only include the abundance from one genus.

head(ibd, n = 20)
#>     Sample Subject Time Treatment    Abundance
#> 1  5001-01    5001    1         0 0.000000e+00
#> 2  5001-02    5001    2         0 0.000000e+00
#> 3  5001-03    5001    3         0 0.000000e+00
#> 4  5001-04    5001    4         0 0.000000e+00
#> 5  5002-01    5002    1         0 3.961764e-03
#> 6  5002-02    5002    2         0 1.008613e-02
#> 7  5002-03    5002    3         0 0.000000e+00
#> 8  5002-04    5002    4         0 0.000000e+00
#> 9  5003-01    5003    1         0 2.545083e-03
#> 10 5003-02    5003    2         0 6.901096e-03
#> 11 5003-03    5003    3         0 3.038140e-04
#> 12 5003-04    5003    4         0 1.739832e-02
#> 13 5006-01    5006    1         0 2.052549e-03
#> 14 5006-02    5006    2         0 4.636235e-04
#> 15 5006-03    5006    3         0 3.403491e-04
#> 16 5006-04    5006    4         0 2.840756e-03
#> 17 5007-01    5007    1         0 1.634215e-03
#> 18 5007-02    5007    2         0 1.061013e-03
#> 19 5007-03    5007    3         0 5.522727e-05
#> 20 5007-04    5007    4         0 1.986368e-01

Note that each subject has four time points. In the Treatment column, 0 is for antiTNF treatment and 1 is for EEN treatment. Abundance column is the relative abundance for Eubacterium. The abundance is in the range of \([0,1)\).

The current model can not handle missing data. That is, each subject must have the same number of time points. If any time point is missing in your data, you can (1) remove some other time points so that all subject have the same time points (2) impute the missing data, for example, use the mean or median value from other subjects at the same time point in the same covariate group to replace the missing value. I’m currently working on the missing data problem and hope that our model can handle missing data soon.

We can run the zibr function to the real data. Here, I’m interested in comparing the two treatments and use treatment as the only covariate in both logistic and beta component. Depending on the scientific questions you are interested in, you can also include time and treament-time interaction in the covariates.

zibr_fit <- zibr(
  logistic_cov = data.frame(Treatment = ibd$Treatment),
  beta_cov = data.frame(Treatment = ibd$Treatment),
  Y = ibd$Abundance, subject_ind = ibd$Subject,
  time_ind = ibd$Time
)
zibr_fit
#> $logistic_est_table
#>            Estimate       Pvalue
#> intersept 2.6973234 6.171054e-06
#> Treatment 0.2000181 8.877107e-01
#> 
#> $logistic_s1_est
#> [1] 3.278764
#> 
#> $beta_est_table
#>             Estimate    Pvalue
#> intersept -2.7860540 0.0000000
#> Treatment -0.3233427 0.2063002
#> 
#> $beta_s2_est
#> [1] 0.5389234
#> 
#> $beta_v_est
#> [1] 7.622384
#> 
#> $loglikelihood
#> [1] 321.9018
#> 
#> $joint_p
#> Treatment 
#> 0.4454947

Neither of the logsitic or beta component show significant pvalues. This indicates the abundance of this genus is not different between the two treatments, considering all time points.

Real Data (from paper)

This follows the analysis from the paper.

### Load the raw data
PLEASE.file <- "https://raw.githubusercontent.com/chvlyl/PLEASE/master/1_Data/Raw_Data/MetaPhlAn/PLEASE/G_Remove_unclassfied_Renormalized_Merge_Rel_MetaPhlAn_Result.xls" # nolint
PLEASE.raw <- read.table(PLEASE.file,
  sep = "\t", header = TRUE, row.names = 1,
  check.names = FALSE, stringsAsFactors = FALSE
)
taxa.raw <- t(PLEASE.raw)

### Make sure you load the data correctly
cat("samples", "taxa", dim(taxa.raw), "\n")
#> samples taxa 335 105
taxa.raw[1:3, 1:3]
#>         g__Bacteroides g__Ruminococcus g__Faecalibacterium
#> 5001-01     0.42722310     0.000000000                   0
#> 5001-02     0.23188470     0.000000000                   0
#> 5001-03     0.02041114     0.006130341                   0

### Load total non-human read counts
human.read.file <- "https://raw.githubusercontent.com/chvlyl/PLEASE/master/1_Data/Raw_Data/MetaPhlAn/Human_Reads/please_combo_human_reads.xls" # nolint
human.read <- read.table(human.read.file,
  sep = "\t", header = TRUE,
  row.names = 1, stringsAsFactors = FALSE
)

### Filter low depth samples (low non human reads)
low.depth.samples <- subset(human.read, NonHumanReads < 10000)
low.depth.samples[, 1:5]
#>         NonHumanReads TotalReads HumanReads    HumanPer GroupFcp
#> 5010-02          1014       1104         90  8.15217391    TNF.N
#> 5018-03          6101       6121         20  0.32674400   Diet.R
#> 5023-04          1954       2679        725 27.06233669    TNF.R
#> 6007-01          1809       4249       2440 57.42527654    TNF.R
#> 7001-02          9965       9971          6  0.06017451   Diet.N
#> 7001-03           566        613         47  7.66721044   Diet.N
#> 7001-04           626        637         11  1.72684458   Diet.N
#> 7002-04           683        696         13  1.86781609   Diet.N
#> 7003-02          4681       5719       1038 18.15002623  Diet.NR
#> 7003-03          4935       5049        114  2.25787285  Diet.NR
#> 7009-02          1895       1916         21  1.09603340   Diet.N
#> 7009-03          5319       5361         42  0.78343593   Diet.N
#> 7010-02          2375       2400         25  1.04166667   Diet.N
#> 7010-03          6312       6495        183  2.81755196   Diet.N

### Delete these samples from PLEASE data.
rownames(taxa.raw)[which(rownames(taxa.raw) %in% rownames(low.depth.samples))]
#> [1] "5018-03" "7001-02" "7003-02" "7003-03" "7009-03" "7010-03"

### Before deletion
dim(taxa.raw)
#> [1] 335 105

### After deletion
taxa.raw <- taxa.raw[-which(rownames(taxa.raw) %in% rownames(low.depth.samples)), ]
dim(taxa.raw)
#> [1] 329 105


### Filter low abundant bacterial data
filter.index1 <- apply(taxa.raw, 2, function(X) {
  sum(X > 0) > 0.4 * length(X)
})
filter.index2 <- apply(taxa.raw, 2, function(X) {
  quantile(X, 0.9) > 1
})
taxa.filter <- taxa.raw[, filter.index1 & filter.index2]
taxa.filter <- 100 * sweep(taxa.filter, 1, rowSums(taxa.filter), FUN = "/")
cat("after filter:", "samples", "taxa", dim(taxa.filter), "\n")
#> after filter: samples taxa 329 18
cat(colnames(taxa.filter), "\n")
#> g__Bacteroides g__Ruminococcus g__Faecalibacterium g__Bifidobacterium g__Escherichia g__Clostridium g__Dialister g__Eubacterium g__Roseburia g__Streptococcus g__Dorea g__Parabacteroides g__Lactobacillus g__Veillonella g__Haemophilus g__Alistipes g__Collinsella g__Coprobacillus
head(rowSums(taxa.filter))
#> 5001-01 5001-02 5001-03 5001-04 5002-01 5002-02 
#>     100     100     100     100     100     100

###
taxa.data <- taxa.filter
dim(taxa.data)
#> [1] 329  18

#### Load sample information
sample.info.file <- "https://raw.githubusercontent.com/chvlyl/PLEASE/master/1_Data/Processed_Data/Sample_Information/2015_02_13_Processed_Sample_Information.csv" # nolint
sample.info <- read.csv(sample.info.file, row.names = 1)

#### create covariates,
#### Time, antiTNF+EEN
reg.cov <-
  data.frame(Sample = rownames(taxa.data), stringsAsFactors = FALSE) %>%
  left_join(add_rownames(sample.info, var = "Sample"), by = "Sample") %>%
  dplyr::filter(Treatment.Specific != "PEN") %>%
  dplyr::select(Sample, Time, Subject, Response, Treatment.Specific) %>%
  group_by(Subject) %>%
  summarise(count = n()) %>%
  dplyr::filter(count == 4) %>%
  dplyr::select(Subject) %>%
  left_join(add_rownames(sample.info, var = "Sample"), by = "Subject") %>%
  mutate(Treat = ifelse(Treatment.Specific == "antiTNF", 1, 0)) %>%
  dplyr::select(Sample, Subject, Time, Response, Treat) %>%
  dplyr::mutate(Subject = paste("S", Subject, sep = "")) %>%
  dplyr::mutate(Time = ifelse(Time == "1", 0,
                              ifelse(Time == "2", 1,
                                     ifelse(Time == "3", 4,
                                            ifelse(Time == "4", 8, NA))))) %>%
  dplyr::mutate(Time.X.Treatment = Time * Treat) %>%
  as.data.frame()
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

### take out first time point
reg.cov.t1 <- subset(reg.cov, Time == 0)
rownames(reg.cov.t1) <- reg.cov.t1$Subject
reg.cov.t234 <- subset(reg.cov, Time != 0)
reg.cov.t234 <- data.frame(
  baseline.sample = reg.cov.t1[reg.cov.t234$Subject, "Sample"],
  baseline.subject = reg.cov.t1[reg.cov.t234$Subject, "Subject"],
  reg.cov.t234,
  stringsAsFactors = FALSE
)

#### Fit ZIBR model on the real data
spe.all <- colnames(taxa.data)
p.species.list.zibr <- list()
p.species.list.lme <- list()
for (spe in spe.all) {
  # spe = 'g__Collinsella'
  ###### create covariates
  X <- data.frame(
    Baseline = taxa.data[reg.cov.t234$baseline.sample, spe] / 100,
    # reg.cov.t234[,c('log.days','Delivery','Delivery.X.log.days')]
    reg.cov.t234[, c("Time", "Treat")]
  )
  rownames(X) <- reg.cov.t234$Sample
  Z <- X
  subject.ind <- reg.cov.t234$Subject
  time.ind <- reg.cov.t234$Time
  ####
  # cat(spe,'\n')
  Y <- taxa.data[reg.cov.t234$Sample, spe] / 100
  # cat('Zeros/All',sum(Y==0),'/',length(Y),'\n')
  ####################
  ## fit linear random effect model with arcsin square transformation on Y
  tdata <- data.frame(Y.tran = asin(sqrt(Y)), X, SID = subject.ind)
  lme.fit <- nlme::lme(Y.tran ~ Baseline + Time + Treat, random = ~ 1 | SID, data = tdata)
  coef.mat <- summary(lme.fit)$tTable[-1, c(1, 5)]
  p.species.list.lme[[spe]] <- coef.mat[, 2]
  ###################
  if (sum(Y > 0) < 10 || sum(Y == 0) < 10 || max(Y) < 0.01) {
    p.species.list.zibr[[spe]] <- p.species.list.lme[[spe]]
    next
  } else {
    est <- zibr(
      logistic_cov = X, beta_cov = Z, Y = Y,
      subject_ind = subject.ind,
      time_ind = time.ind,
      quad_n = 30, verbose = TRUE
    )
    p.species.list.zibr[[spe]] <- est$joint.p
  }
  # break
}


#### adjust p values
p.species.zibr <- t(as.data.frame(p.species.list.zibr))
p.species.zibr.adj <-
  add_rownames(as.data.frame(p.species.zibr), var = "Species") %>% mutate_each(funs(p.adjust(., "fdr")), -Species)
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> i Please use a list of either functions or lambdas:
#> 
#> # Simple named list: list(mean = mean, median = median)
#> 
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#> 
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `mutate_each_()` was deprecated in dplyr 0.7.0.
#> i Please use `across()` instead.
#> i The deprecated feature was likely used in the dplyr package.
#>   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

# write.csv(p.species.zibr.adj,file=paste('4_Results/Real_Data_Estimation_Results_antiTNF_EEN_ZIBR.csv',sep=''))


p.species.lme <- t(as.data.frame(p.species.list.lme))
p.species.lme.adj <-
  add_rownames(as.data.frame(p.species.lme), var = "Species") %>% mutate_each(funs(p.adjust(., "fdr")), -Species)
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> i Please use a list of either functions or lambdas:
#> 
#> # Simple named list: list(mean = mean, median = median)
#> 
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#> 
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

# write.csv(p.species.lme.adj,file=paste('4_Results/Real_Data_Estimation_Results_antiTNF_EEN_LME.csv',sep=''))

####
intersect(
  p.species.lme.adj[p.species.lme.adj$Treat < 0.05, "Species"],
  p.species.zibr.adj[p.species.zibr.adj$Treat < 0.05, "Species"]
)
#> # A tibble: 0 x 1
#> # i 1 variable: Species <chr>
setdiff(
  p.species.lme.adj[p.species.lme.adj$Treat < 0.05, "Species"],
  p.species.zibr.adj[p.species.zibr.adj$Treat < 0.05, "Species"]
)
#> # A tibble: 7 x 1
#>   Species            
#>   <chr>              
#> 1 g__Ruminococcus    
#> 2 g__Faecalibacterium
#> 3 g__Bifidobacterium 
#> 4 g__Dialister       
#> 5 g__Streptococcus   
#> 6 g__Haemophilus     
#> 7 g__Alistipes
setdiff(
  p.species.zibr.adj[p.species.zibr.adj$Treat < 0.05, "Species"],
  p.species.lme.adj[p.species.lme.adj$Treat < 0.05, "Species"]
)
#> # A tibble: 0 x 1
#> # i 1 variable: Species <chr>

intersect(
  p.species.lme.adj[p.species.lme.adj$Time < 0.05, "Species"],
  p.species.zibr.adj[p.species.zibr.adj$Time < 0.05, "Species"]
)
#> # A tibble: 0 x 1
#> # i 1 variable: Species <chr>
setdiff(
  p.species.lme.adj[p.species.lme.adj$Time < 0.05, "Species"],
  p.species.zibr.adj[p.species.zibr.adj$Time < 0.05, "Species"]
)
#> # A tibble: 1 x 1
#>   Species       
#>   <chr>         
#> 1 g__Eubacterium
setdiff(
  p.species.lme.adj[p.species.lme.adj$Time < 0.05, "Species"],
  p.species.zibr.adj[p.species.zibr.adj$Time < 0.05, "Species"]
)
#> # A tibble: 1 x 1
#>   Species       
#>   <chr>         
#> 1 g__Eubacterium