A method to simulate pooled sequencing data (Pool-seq) is implemented
in the R language in the package poolHelper
. The aim of
this package is to provide users with a tool to chose the appropriate
pool size and depth of coverage when conducting experiments that require
pool sequencing. This vignette serves as a introduction, explaining how
the different functions of the package can be used to assess the impact
of different sequencing parameters.
At the end, we also included a section with details of specific functions. At that section, users can find a detailed step-by-step description of how to simulate Pool-seq data. The various subsections describe how to simulate the total depth of coverage and then partition that coverage among different pools and individuals. There is also a subsection describing how the number of reads with the reference allele can be computed according to the genotype of a given individual.
library(poolHelper)
With the poolHelper
package, users can evaluate the
effect of different pool errors, pool sizes and depths of coverage on
the allele frequencies. The frequencies obtained with Pool-seq are
compared to the allele frequencies computed directly from genotypes.
Briefly, we use scrm
to simulate a single population at
equilibrium and obtain polymorphic sites for each simulated locus. Then,
we compute allele frequencies by counting the total number of derived
alleles per site and dividing that by the total number of gene copies.
After obtaining the allele frequencies computed directly from genotypes,
we simulate Pool-seq data and obtain the Pool-seq allele frequencies.
Details on this procedure can be found in the last section of this
vignette.
We then use the mae
function from the
Metrics
package to compute the average absolute difference
between the Pool-seq allele frequencies and the ones obtained directly
from the genotypes. Mean Absolute Error (MAE) is calculated as the sum
of absolute errors divided by the sample size.
As mentioned, the main goal of the package poolHelper
is
to provide users with a tool to aid in the experimental design of pooled
sequencing. Researchers interested in Pool-seq are concerned in
obtaining accurate estimates of allelic frequencies, while keeping the
costs down. Thus, it is important to have an idea of how accurate the
allele frequencies can be when using different pool sizes or sequencing
at different mean coverage values. In the following sections we detail
how the poolHelper
package can help users in answering
those questions.
One important aspect to consider is whether DNA extraction should be
done using multiple batches of individuals, combining several of them
into larger pools for library preparation and sequencing, or using a
single batch of individuals. By using the maePool
function
we can check, under different conditions, what is the effect of using
multiple or a single batch of individuals.
The pools
input argument allows the user to simulate a
single pool, by creating a list with a single integer or multiple pools,
by creating a list with a vector containing various entries. The
maePool
function assumes that each entry of that vector is
the size, in number of diploids individuals, of a given pool.
# create a list with a single pool of 100 individuals
<- list(100)
pools # compute average absolute difference between allele frequencies
<- maePool(nDip = 100, nloci = 1000, pools = pools, pError = 100, sError = 0.01,
onePool mCov = 100, vCov = 250, min.minor = 0)
# create a list with 10 pools, each with 10 individuals
<- list(rep(10, 10))
pools # compute average absolute difference between allele frequencies
<- maePool(nDip = 100, nloci = 1000, pools = pools, pError = 100, sError = 0.01,
tenPool mCov = 100, vCov = 250, min.minor = 0)
# combine both
<- rbind(onePool, tenPool)
final # convert the number of individuals in the pool to a factor
$nPools <- as.factor(final$nPools)
final
# load the ggplot package
library(ggplot2)
# MAE value in the y-axis and the number of individuals in the pool in the
# x-axis
ggplot(final, aes(x = nPools, y = absError)) + geom_boxplot() + theme_classic()
In this example, we can see the effect of using a single or multiple
pools when a sample of 100 individuals is sequenced at a mean coverage
of 100x and for a given pool error. By varying the pError
and mCov
input arguments, users can evaluate the effect of
using a single or multiple pools at various pool error values and at
different coverages.
Another fundamental decision is what mean coverage should we try to
obtain when sequencing a pool of individuals. By using the
maeFreqs
function we can look at the average absolute
difference between genotype allele frequencies and Pool-seq allele
frequencies obtained using different mean coverages.
# create a vector with various mean coverages
<- c(20, 50, 100)
mCov # create a vector with the variance of the coverage
<- c(100, 250, 500)
vCov # compute average absolute difference between allele frequencies
<- maeFreqs(nDip = 100, nloci = 1000, pError = 100, sError = 0.01, mCov, vCov,
mydf min.minor = 0)
# convert the mean coverage into a factor
$mean <- as.factor(mydf$mean)
mydf# boxplot the MAE value in the y-axis and the coverage in the x-axis
ggplot(mydf, aes(x = mean, y = absError)) + geom_boxplot() + theme_classic()
Note that the mCov
input argument is a vector with
various mean coverage values. The maeFreqs
function
computes the average absolute difference for each user-defined coverage.
Additionally, vCov
should also be a vector, with each entry
being the variance of the corresponding coverage in mCov
.
In this example, we can see the effect of sequencing a sample of 100
individuals at 20x, 50x or 100x mean coverage. By varying the
mCov
or pError
input arguments, users can
evaluate the impact of different mean coverages at various pool error
values.
It is also important to define the number of individuals to sequence
or, in other words, the pool size. The maeFreqs
function
can also be used to compute the average absolute difference between the
allele frequencies computed from genotypes and Pool-seq allele
frequencies obtained with different pool sizes.
# create a vector with various mean coverages
<- c(10, 50, 100)
nDip
# compute average absolute difference between allele frequencies
<- maeFreqs(nDip = nDip, nloci = 1000, pError = 100, sError = 0.01, mCov = 100,
mydf vCov = 250, min.minor = 0)
# convert the number of individuals into a factor
$nDip <- as.factor(mydf$nDip)
mydf# boxplot the MAE value in the y-axis and the coverage in the x-axis
ggplot(mydf, aes(x = nDip, y = absError)) + geom_boxplot() + theme_classic()
As you can see, by varying the nDip
input argument, we
can evaluate what is the optimal pool size. In this example, we can see
the effect of sequencing a sample of 10, 50 or 100 individuals at 100x
coverage. For this coverage and pool error value, it is clear that
doubling the pool size, from 50 to 100 individuals, does not lead to a
significant decrease in the average absolute difference between allele
frequencies. Note that the maeFreqs
function assumes that
only a single pool was used to sequence the population and so, for this
example, a single pool of 10, 50 or 100 individuals was used.
The maeFreqs
function can also be used to simultaneously
test different combinations of parameters. By varying the
mCov
, pError
and/or nDip
input
arguments, the impact of multiple combinations of those parameters can
be quickly assessed. The maeFreqs
function will simulate
all possible combinations of those parameters and compute the average
absolute difference between allele frequencies.
# create a vector with various mean coverages
<- c(20, 50, 100)
mCov # create a vector with the variance of the coverage
<- c(100, 250, 500)
vCov # create a vector with various pool errors
<- c(5, 100, 250)
pError
# compute average absolute difference between allele frequencies
<- maeFreqs(nDip = 100, nloci = 1000, pError, sError = 0.01, mCov, vCov, min.minor = 0)
mydf
# convert the mean coverage into a factor
$mean <- as.factor(mydf$mean)
mydf# convert the pooling error to a factor
$PoolError <- as.factor(mydf$PoolError)
mydf
# boxplot the MAE value in the y-axis and the pool error in the x-axis
# producing one boxplot for each of the different coverages
ggplot(mydf, aes(x = PoolError, y = absError, fill = mean)) +
geom_boxplot() + theme_classic()
In this example, the number of sampled individuals was kept constant, meaning that the population was always sequenced using a pool of 100 individuals. Those 100 individuals were sequenced at 20x, 50x or 100x mean coverage and assuming a pool error value of 5%, 100% or 250%. By selecting multiple combinations of parameters, users can select a sequencing design that minimizes the average absolute difference between allele frequencies or get an idea of how much mismatch to expect in their Pool-seq data.
In this section and until the end of the vignette, we go over the steps required to simulate Pool-seq data and give details on some of the specific functions included in the package.
The simulateCoverage
function can be used to simulate
the total depth of coverage at each site. The mean
and
variance
input arguments of the function represent,
respectively the mean coverage and the variance of the coverage to
simulate. nLoci
represents the number of independent loci
to simulate and nSNPs
is the number of polymorphic sites to
simulate per locus.
# simulate number of reads for one population
<- simulateCoverage(mean = 50, variance = 250, nSNPs = 100, nLoci = 1)
reads # display the structure of the reads object
str(reads)
#> List of 1
#> $ : int [1, 1:100] 56 54 43 59 42 70 52 21 49 43 ...
As you can see, the resulting output is a list with one entry because
nLoci = 1
. That entry is a vector with
length = 100
because that was the number of
nSNPs
. We can also use this function to simulate the
coverage of multiple populations at the same time. To do that, the
mean
and variance
input arguments of the
function should be vectors. The function will assume that each entry of
those vectors is the mean and variance of a different population. For
instance, in the next example we set mcov <- c(50, 100)
,
meaning that we wish to simulate two populations, the first with a mean
coverage of 50x and the second with a mean coverage of 100x.
# create a vector with the mean coverage of each population
<- c(50, 100)
mcov # create a vector with the variance of the coverage for each population
<- c(250, 500)
vcov # simulate number of reads for two populations
<- simulateCoverage(mean = mcov, variance = vcov, nSNPs = 100, nLoci = 1)
reads # display the structure of the reads object
str(reads)
#> List of 1
#> $ : int [1:2, 1:100] 46 90 87 103 48 123 44 118 75 159 ...
Now, the output of the function is slightly different. We still have
a single locus (nLoci = 1
) and 100 sites on that locus
(nSNPs = 100
) but now that one list entry is a matrix with
two rows. Each row is the coverage per site for one population. Thus,
the first row is the coverage for the first population of the
mcov
input argument and the second row is the coverage for
the second population in that argument. If mcov
had the
mean coverage for more populations, the logic would remain the same.
The difference in the mean coverage of the two population can be
quickly visualized. In the following we simulate two populations, one
with 50x mean coverage and the other with 100x. We set
nSNPs = 10000
and visualize the coverage distribution using
a histogram.
# create a vector with the mean coverage of each population
<- c(50, 100)
mcov # create a vector with the variance of the coverage for each population
<- c(250, 500)
vcov # simulate number of reads for two populations
<- simulateCoverage(mean = mcov, variance = vcov, nSNPs = 10000, nLoci = 1)
reads # plot the coverage of the first population
hist(reads[[1]][1,], col = rgb(0,0,1,1/4), xlim = c(0, 200), main = "", xlab = "")
# add the coverage of the second population
hist(reads[[1]][2,], col = rgb(1,0,0,1/4), add = TRUE)
The coverage distribution of the population simulated with a mean of 50x is shown in blue and the distribution of the 100x population is shown in red.
It is also possible to remove sites with low or high coverage by
using the remove_by_reads
function. This function will
completely remove any site from the data (in this instance, the site
will be removed from both populations). Sites will be removed if their
coverage is below the minimum
allowed or if it is above the
maximum
allowed. In the next bit, we use the
reads
simulated before and remove all sites with a coverage
below 25x and above 150x.
# check the minimum and maximum coverage before removal
<- range(unlist(reads))
x # remove sites with coverage below 25x and above 150x
<- remove_by_reads(nLoci = 1, reads = reads, minimum = 25, maximum = 150)
reads # display the structure of the reads object after removal
str(reads)
#> List of 1
#> $ : int [1:2, 1:9497] 68 96 26 90 35 93 92 122 38 106 ...
# check the minimum and maximum coverage after removal
range(unlist(reads))
#> [1] 25 150
Accordingly, the minimum simulated coverage before running the
remove_by_reads
function was 6
and the maximum
was 198
but after removal of sites with a coverage below
25x and above 150x, the minimum and maximum coverage are, obviously,
25
and 150
respectively. It is also clear that
we no longer have nSNPs = 10000
in the data.
It is also possible to simulate the contribution of each pool, assuming that a single population was sequenced using multiple pools. Before computing the actual number of reads contributed by each pool, we first need to simulate the proportion of contribution.
To do this, we use the poolProbs
function. The
nPools
input argument of this function should represent the
number of pools used to sequence the population, while the
vector_np
contains the number of individuals per pool.
Thus, in the following example
vector_np = c(10, 10, 10, 10)
means that four pools were
used to sequence the population, each comprised of 10 individuals. The
pError
input argument defines the degree of pooling error.
Briefly, this pooling error controls the dispersion of the pool
contribution, centered around the expected value. Higher values of
pError
lead to a higher dispersion and thus, the
contributions will vary more between pools. In other words, with higher
values of pError
some pools will contribute a lot of reads
and others will not contribute much.
In the next chunk, we see the difference in proportion of
contribution when 4 pools of 10 individuals were used to sequence a
single population and the pooling error is either low
(pError = 5
) or high (pError = 250
). We can
also assess the impact of different pool sizes by including one pool
with 100 individuals instead of only 10.
# four pools with low sequencing error
poolProbs(nPools = 4, vector_np = c(10, 10, 10, 10), nSNPs = 6, pError = 5)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.2509828 0.2625226 0.2384353 0.2527059 0.2592638 0.2384960
#> [2,] 0.2578772 0.2364968 0.2495056 0.2426504 0.2521399 0.2538894
#> [3,] 0.2523269 0.2356066 0.2439813 0.2435967 0.2303962 0.2547097
#> [4,] 0.2388131 0.2653741 0.2680778 0.2610470 0.2582001 0.2529049
# four pools with high sequencing error
poolProbs(nPools = 4, vector_np = c(10, 10, 10, 10), nSNPs = 6, pError = 250)
#> Warning: pError was too high. It was replaced by 163.205080756888
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1.051242e-48 2.104075e-04 1.000000e+00 0.137867858 5.725909e-30
#> [2,] 9.956928e-01 3.576426e-01 7.515615e-20 0.620643503 1.000000e+00
#> [3,] 8.749371e-27 6.420946e-01 1.140421e-18 0.237039291 4.314123e-29
#> [4,] 4.307191e-03 5.240561e-05 4.474503e-21 0.004449349 4.419424e-24
#> [,6]
#> [1,] 1.000000e+00
#> [2,] 3.329394e-50
#> [3,] 4.650061e-15
#> [4,] 1.926231e-18
# four pools but one is much larger
poolProbs(nPools = 4, vector_np = c(10, 100, 10, 10), nSNPs = 6, pError = 5)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.07959090 0.07783151 0.07675677 0.08192707 0.07870818 0.07745896
#> [2,] 0.76288746 0.77333682 0.77538613 0.76463441 0.77378727 0.76763061
#> [3,] 0.08131869 0.07527085 0.07655418 0.07657703 0.07360999 0.07461381
#> [4,] 0.07620295 0.07356082 0.07130292 0.07686149 0.07389457 0.08029662
The output of the poolProbs
function is a matrix with
the proportion of contribution for each pool. Each row of the matrix
corresponds to a different pool and each column is a different site. You
can see that in the first example, the proportion of contribution is
roughly the same for all pools. The next example is similar but with
pError = 250
. With this higher pool error, it is clear that
some pools have a higher proportion of contribution and others have a
smaller. Thus, with higher pool errors, the proportion of contribution
is no longer the same for all pools.
This also happens when pool error is low but one of the pools is much larger. In the last example, the second pool has 100 individuals, while the other pools only have 10. In this instance, it is clear the the proportion of contribution of the larger pool is always higher.
After computing the proportion of contribution of each pool, this can
be used to simulate the actual number of reads contributed by each pool.
To do this, we use the pReads
function.
This functions requires as input argument the total number of pools
used to sequence the population (nPools
), a vector with the
total coverage
per site and the probabilities of
contribution computed with the poolProbs
function
(probs
). In the next chunk, we simulate coverage for 10
SNPs of a single population, compute the probability of contribution for
4 pools used to sequence that population and then simulate the actual
number of reads per pool.
# simulate total coverage per site
<- unlist(simulateCoverage(mean = 100, variance = 250, nSNPs = 10, nLoci = 1))
reads # compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 10, pError = 5)
probs # simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
pReads # output the number of reads per pool and per site
pReads#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 24 27 19 25 19 36 25 37 22 24
#> [2,] 23 17 24 33 22 29 27 29 24 24
#> [3,] 19 15 20 25 24 32 29 34 32 32
#> [4,] 30 17 28 30 22 27 30 32 33 21
It is clear that, when pool error is quite low
(pError = 5
in the previous chunk), the number of reads
contributed by each pool is quite similar. Thus, the total coverage of
any given site is well distributed among all pools. On the other hand,
if pool error is high (pError = 250
in the next chunk).
# simulate total coverage per site
<- unlist(simulateCoverage(mean = 100, variance = 250, nSNPs = 10, nLoci = 1))
reads # compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 10, pError = 250)
probs #> Warning: pError was too high. It was replaced by 163.205080756888
# simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
pReads # output the number of reads per pool and per site
pReads#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0 90 4 0 81 5 0 0 0 39
#> [2,] 0 0 0 0 0 101 0 0 0 6
#> [3,] 87 0 92 0 0 31 0 0 96 31
#> [4,] 0 0 0 97 0 0 113 111 0 0
Then the contributions are more uneven. In this instance, there are some sites where one or two pools contribute most of the reads while the remaining pools have few or even zero reads. Thus, the total coverage is not very well distributed among all pools when pool error is higher.
The difference between low or high pool errors can be (roughly) inspected with a histogram. In the next chunk we simulate the total coverage and then use the same coverage to compute the contribution of each pool, using either a low or a high pool error. We then plot the distribution of the number of reads contributed by each pool.
# simulate total coverage per site
<- simulateCoverage(mean = 100, variance = 250, nSNPs = 10000, nLoci = 1)
reads # unlist to create a vector with the coverage
<- unlist(reads)
reads
# compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 10000, pError = 5)
probs # simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
low.pReads
# compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 10000, pError = 250)
probs #> Warning: pError was too high. It was replaced by 163.205080756888
# simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
high.pReads
# create the plot of the contribution with low pool error
<- hist(unlist(low.pReads), plot = FALSE)
h1 # create the plot of the contribution with high pool error
<- hist(unlist(high.pReads), plot = FALSE)
h2 # get the maximum x-value from the two plots
<- max(h1[["breaks"]], h2[["breaks"]])
xmax # and the maximum y-value
<- max(h1[["counts"]], h2[["counts"]])
ymax # set the color for the contribution computed with low pool error
<- rgb(0, 0, 1, 1/4)
col1 # set the color for the contribution computed with high pool error
<- rgb(1, 0, 0, 1/4)
col2
# plot the contribution computed with low pool error
plot(h1, col = col1, xlim = c(0, xmax), ylim = c(0, ymax), main = "", xlab = "")
# add the plot of the contribution computed with high pool error
plot(h2, col = col2, add = TRUE)
The distribution of the contribution computed with a low pool error is shown in blue and the distribution computed with a high pool error in red. It is clear that high pool errors lead to more variation in the contribution of each pool towards the total coverage of the population. In particular, the number of pools that contribute zero (or close to zero) reads increases when the pool error is high.
After computing the number of reads contributed by each pool, the next step involves simulating the number of reads contributed by each individual inside their pool. For instance, if a pool of 10 individuals was used to sequence a population, how many reads were contributed by each of those 10 individuals?
As for the pools, the first step requires computing the probability
of contribution of each individual. This can be done with the
indProbs
function. This np
input argument of
this function corresponds to the total number of individuals in the
pool, while the nSNPs
is the number of sites to simulate.
As before, the pError
represents the degree of pooling
error and higher values of pError
mean that some
individuals will contribute more reads than others.
In the next chunk, we examine the probability of contribution of 10 individuals, sequenced at 6 sites, when pooling error is quite low.
# compute the probability of contribution of each individual
indProbs(np = 10, nSNPs = 6, pError = 5)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.10207943 0.09653788 0.11138571 0.10678804 0.09239712 0.10306432
#> [2,] 0.10352625 0.09081170 0.10582074 0.10620813 0.09675435 0.09821943
#> [3,] 0.10610745 0.09522770 0.09441745 0.09347440 0.10339808 0.09696297
#> [4,] 0.09470248 0.09175536 0.09129124 0.09096539 0.09736175 0.10524494
#> [5,] 0.09929265 0.10535360 0.10193489 0.09981682 0.09586010 0.10747931
#> [6,] 0.09506207 0.10184497 0.10610080 0.10582677 0.10666903 0.10051568
#> [7,] 0.09225355 0.10730948 0.10584477 0.09362741 0.10060737 0.09967834
#> [8,] 0.10060575 0.10105249 0.09634304 0.09846990 0.10235740 0.09392058
#> [9,] 0.10300783 0.10372678 0.09497862 0.10982374 0.10147758 0.10075905
#> [10,] 0.10336255 0.10638005 0.09188273 0.09499941 0.10311723 0.09415538
In this example, the probability of contribution is very similar
across individuals. In fact, the probability is around 0.1 for each
individual, meaning that, in a situation with low pooling error, all
individuals should contribute equally. If we simulate the same
conditions, but increasing the pooling error (pError = 150
)
we should see a different result. Note that we use the
round
function so that the the output is not printed in
scientific notation. This is just to make it easier to visualize the
differences.
# compute the probability of contribution of each individual
round(indProbs(np = 10, nSNPs = 5, pError = 150), digits = 5)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.03504 0.16325 0.00180 0.35141 0.00027
#> [2,] 0.87196 0.50819 0.00000 0.00001 0.00001
#> [3,] 0.02411 0.00031 0.00207 0.09831 0.04599
#> [4,] 0.00613 0.00560 0.24676 0.07768 0.02374
#> [5,] 0.02836 0.06174 0.33332 0.00466 0.00155
#> [6,] 0.00186 0.00092 0.00030 0.17244 0.00003
#> [7,] 0.00107 0.08814 0.06410 0.01574 0.07624
#> [8,] 0.03015 0.11658 0.24018 0.07302 0.70946
#> [9,] 0.00118 0.00000 0.00094 0.04692 0.11023
#> [10,] 0.00014 0.05526 0.11052 0.15981 0.03248
With this higher pooling error, it is evident that the probability of contribution is not the same across all individuals. Some individuals have a much higher probability of contribution while others have a probability of contribution very close to zero.
The probabilities of contribution of each individual can then be used
to simulate the total number of reads contributed by each individual,
using the indReads
function. This function requires as
input argument the total number of individuals sequenced in that pool
(np
), a vector with the total coverage
of that
particular pool per site and probabilities of contribution computed with
the indProbs
function (probs
).
In the next chunk, we start by simulating the total coverage per site. This total coverage is then partitioned among the different pools to obtain the total coverage per pool. Finally, we simulate the contribution of the 10 individuals sequenced at one of the pools towards the total coverage of that pool. All these steps are done assuming a low pooling error.
# simulate total coverage per site
<- unlist(simulateCoverage(mean = 100, variance = 250, nSNPs = 12, nLoci = 1))
reads # compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 12, pError = 5)
probs # simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
pReads # compute the proportion of contribution of each pool
<- indProbs(np = 10, nSNPs = 12, pError = 5)
probs # simulate the contribution in actual read numbers of each individual
indReads(np = 10, coverage = pReads[1,], probs = probs)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> [1,] 1 2 1 2 2 1 3 3 3 0 5 4
#> [2,] 3 5 3 1 2 3 0 4 3 2 1 2
#> [3,] 3 1 3 1 5 1 2 1 1 2 0 2
#> [4,] 1 2 3 2 4 0 0 2 1 3 7 2
#> [5,] 3 6 1 3 4 5 2 4 2 3 3 1
#> [6,] 7 2 3 2 4 3 3 0 3 2 1 2
#> [7,] 3 6 5 0 0 3 0 4 4 3 3 8
#> [8,] 5 6 1 1 3 3 0 6 5 2 1 2
#> [9,] 1 2 1 4 6 6 2 4 3 3 3 3
#> [10,] 1 4 2 2 3 3 3 1 2 3 1 3
It is clear that, when pool error is low (pError = 5
),
each individual contributes roughly the same number of reads towards the
total coverage of the pool. Thus, the overall dispersion is quite low.
If we repeat the same steps, changing only the pooling error to a much
higher value:
# simulate total coverage per site
<- unlist(simulateCoverage(mean = 100, variance = 250, nSNPs = 12, nLoci = 1))
reads # compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 12, pError = 150)
probs # simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
pReads # compute the proportion of contribution of each pool
<- indProbs(np = 10, nSNPs = 12, pError = 150)
probs # simulate the contribution in actual read numbers of each individual
indReads(np = 10, coverage = pReads[1,], probs = probs)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> [1,] 0 2 0 0 0 14 0 0 0 9 0 0
#> [2,] 0 4 0 1 0 0 0 0 0 21 0 0
#> [3,] 0 8 9 0 0 46 0 19 0 3 0 0
#> [4,] 0 21 0 0 0 11 0 26 0 1 0 0
#> [5,] 0 5 0 0 0 3 0 53 0 35 0 0
#> [6,] 0 0 5 0 0 0 0 1 0 0 0 0
#> [7,] 0 0 11 0 0 32 0 3 0 2 0 0
#> [8,] 0 0 0 3 0 0 0 0 0 0 0 0
#> [9,] 0 16 0 1 0 0 0 0 0 0 0 0
#> [10,] 0 1 21 0 0 1 0 0 0 13 0 0
We see that in this instance, there is much more dispersion and the individuals do not contribute the same number of reads. While some individuals do not contribute a single reads towards the total pool coverage, others contribute too many.
Following the computation of the number of reads contributed by each
individual, we should simulate how many of those reads have the
reference allele versus how many have the alternative allele. For a
single population this can be done using the
computeReference
function.
This function requires as input argument the individual contribution
i.e. the number of reads that each individual contributes and the
sequencing error - error
. The sequencing error is defined
as a error rate - the higher the error, the more likely it is for an
individual that is homozygous for the reference allele (coded as 0 in
the genotypes
matrix) to contribute reads with the
alternative allele. Note that this function also requires as input
argument the genotypes
of the individuals. Given that we
did not simulate genotypes in this vignette, we are going to create a
matrix of genotypes where half the individuals are homozygous for the
reference allele and the other half is homozygous for the alternative
allele (coded as 2 in the genotypes
matrix).
In the next chunk we go over all the previous steps, simulating the total coverage for one population, then partitioning that over all pools and computing the contribution of each individual in one of those pools. At the end, we simulate how many of those individually contributed reads have the reference allele.
# simulate total coverage per site
<- unlist(simulateCoverage(mean = 100, variance = 250, nSNPs = 12, nLoci = 1))
reads # compute the proportion of contribution of each pool
<- poolProbs(nPools = 4, vector_np = rep(10, 4), nSNPs = 12, pError = 5)
probs # simulate the contribution in actual read numbers
<- poolReads(nPools = 4, coverage = reads, probs = probs)
pReads # compute the proportion of contribution of each pool
<- indProbs(np = 10, nSNPs = 12, pError = 5)
probs # simulate the contribution in actual read numbers of each individual
<- indReads(np = 10, coverage = pReads[1,], probs = probs)
iReads # create fake genotypes - half the matrix is 0 and the other half is 2
<- rbind(matrix(0, nrow = 5, ncol = 12), matrix(2, nrow = 5, ncol = 12))
geno # simulate the number of reference reads
computeReference(genotypes = geno, indContribution = iReads, error = 0.001)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> [1,] 1 0 4 5 4 1 4 6 2 3 0 2
#> [2,] 3 1 1 2 5 1 4 2 6 3 4 3
#> [3,] 6 3 2 2 6 5 2 7 3 1 2 2
#> [4,] 2 0 2 3 5 0 0 4 3 2 3 3
#> [5,] 5 2 1 2 4 1 1 2 1 1 1 4
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [9,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [10,] 0 0 0 0 0 0 0 0 0 0 0 0
There is a clear division between the number of reads with the
reference allele for the first 5 individuals (coded as 0 in the
genotypes
matrix) and the remaining 5 individuals (coded as
2 in the genotypes
matrix). This is expected because the
error
was small. If we increased the error
,
then we would expect to see some reference allele reads contributed by
individuals that are homozygous for the other allele.