geneHapR is designed for gene haplotype statistics,
phenotype association and visualization.
Dataset required for haplotype statistic, visualization and phenotype association and the import function were listed in Table 1.
The genotype dataset is essential for haplotype identification and could be supplied in VCF, FASTA, P.link, HAPMAP and table format. The annotation were used for variants filtration and prepare schematic diagram.
Detailed information of individuals include phenotype data, group/category information and geo-coordinates. The phenotype data was used for comparison between different haplotypes. The group /category information was used for pie plot with haplotype network (eg. the second column in Table 4). And the geo-coordinates only used for demonstration of geographical distribution and include two columns: longitude and latitude (eg. the third and fourth column in Table 4).
Table 1: The required format of dataset and import functions for geneHapR
| Dataset | File format | Import function | 
|---|---|---|
| Genotype (necessary) | VCF: *.vcf, *.vcf.gz; Sequences: *.fa, .fasta; p.link: (*.ped & *.map); hmp: *.hmp; table (eg. Table2): .txt, *.csv | import_vcf(); import_seqs(); import_plink.pedmap(); import_hmp(); read.table(), read.csv() | 
| Annotation (optional) | GFF: .gff, .gff3, BED4/BED6 (eg. Table3): *.bed | import_gff() import_bed() | 
| Accession information (optional) | table (eg. Table4): .txt, .csv | import_AccINFO() | 
Table 2 is an example of genotypic data in table format: The first five column are fixed as chromosome name (CHROM), position (POS), reference nucleotide (REF), alter nucleotide (ALT) and additional information (INFO). Accession genotype should be in followed columns. “-” will be treated as Indel. “.” and “N” will be treated as missing data. Field in additional information column should be in format “tag=value”, and separated by semicolon “;”. Heterozygote should be looks like “A/G” or “A|G”.
Table 2: Table format of the genotypic dataset
| CHR | POS | REF | Alt | INFO | C001 | C002 | C003 | … | 
|---|---|---|---|---|---|---|---|---|
| Chr7 | 9154754 | T | C | CDS=G>A;AA=V>G | T | T | T | … | 
| Chr7 | 9154664 | G | T | CDS=A>C | G | G | G | … | 
| Chr7 | 9154489 | C | G | CDS=C>G | C | C | C | … | 
| Chr7 | 9154469 | G | A | CDS=T>C | G | G | G | … | 
| ︙ | ︙ | ︙ | ︙ | ︙ | ︙ | ︙ | ︙ | 
Table 3 is an example of annotation file in BED6 format. As described at UCSC, the BED6 file contains 6 columns: 1) chromosome name, 2) chromosome start, 3) chromosome end, 4) name, 5) score and 6) strand. The BED4 contains the first 4 column of BED6.
BE NOTE THAT: the fourth column was used to define the name and types, which were separated by a space. For example, the first line of Table 3 indicates that: the genomic interval from 9154280 (exclude) to 9154821 (include) on Chr7 chromosome is CDS of “LOC_Os07g15770.1” and the strand is “negative”.
Table 3: An annotation example in BED6 format
| # CHROM | START | END | GENEID TYPE | . | STRAND | 
|---|---|---|---|---|---|
| Chr7 | 9154380 | 9154821 | LOC_Os07g15770.1·CDS | . | - | 
| Chr7 | 9152403 | 9152730 | LOC_Os07g15770.1·CDS | . | - | 
Note: the red dot in fourth column indicate a space.
Table 4 is an example of detailed information of individuals, includes group/category, geo-coordinates and phenotype data. First column are names of accessions/individuals, phenotypic information are listed in followed columns.
Table 4: An example of accession detailed information dataset
| id | Subpopulation | Longitude | Latitude | Grain length | Grain width | Grain thickness | 
|---|---|---|---|---|---|---|
| C001 | Indica | 121 | 14.6 | 8.5 | 2.9 | 1.96 | 
| C002 | Intermediate | 121 | 14.6 | 10.2 | 2.63 | 1.96 | 
| C003 | Japonica | 51.3 | 35.45 | 8.75 | 3.32 | 2.12 | 
| C004 | Japonica | 116.28 | 39.54 | 7.83 | 3.22 | 2.08 | 
| C005 | Japonica | 121 | 14.6 | 10.47 | 3 | 1.95 | 
| C006 | Indica | 116.28 | 39.54 | 8.1 | 2.47 | 1.69 | 
The main results are hapResult and
hapSummary class in R, consist of a matrix which could be
divided into three parts as shown in Fig.1, and some additional
attributes.
Cartoon representation of hapResult and hapSummary contents
Part I consists of only one column. And the first four lines were
fixed as CHROM (chromosome name), POS (position), INFO (additional
information) and ALLELE (allele). And followed lines are names of each
haplotype. Part II consists of at least one column, contains site
information (first four lines) and genotypes (followed lines). The part
III of hapResult consists of one column named as Accession,
while hapSummary consists of two columns named as Accession
and freq (frequency of each haplotype).
The differences between hapResult and
hapSummary is that each line of hapResult
indicate an accession/individual, and each line in
hapSummary indicate a haplotype.
geneHapR is schemed to submit to CRAN. If accepted, this
package could be installed with
install.packages("geneHapR"). geneHapR has not
published yet, if you use geneHapR in your study, please
contact Zhang RenLiang
(Maintainer) (email: zhang_renliang@163.com) or Jia GuanQing
(jiaguanqing@caas.cn)
install.packages("geneHapR")The first step is library the geneHapR packages. I will
use the test data inside this package as an example for how to perform
statistics of a gene/range, visualization and phenotype association
analysis.
library(geneHapR)There are two options to conduct a gene haplotype analysis starts from a VCF file or DNA sequences file. Thus a VCF file or DNA sequences file is necessary. However, the GFF, phenos and accession groups are strongly recommend for visualization and phenotype associations.
The import functions takes file path as input.
import_vcf() could import VCF file with surfix of “.vcf”
and “.vcf.gz”. import_gff() import file format default as
“GFF” and import_seqs() file format default as “fasta”.
# import vcf file
vcf <- import_vcf("your_vcf_file_path.vcf")
# import gziped vcf file
vcf <- import_vcf("your_vcf_file_path.vcf.gz")plink <- import_plink.pedmap(mapfile = "p_link.map", pedfile = "p_link.ped", 
                             sep_ped = "\t", sep_map = "\t")
plink <- import_plink.pedmap(root = "p_link", sep_ped = "\t", sep_map = "\t")# import GFFs
gff <- import_gff("your_gff_file_path.gff", format = "GFF")# import GFFs
bed <- import_bed("your_gff_file_path.bed")# import DNA sequences in fasta format
seqs <- import_seqs("your_DNA_seq_file_path.fa", format = "fasta")# import phynotype data
pheno <- import_AccINFO("your_pheno_file_path.txt")
pheno##    GrainWeight.2021 GrainWeight.2022
## C1            16.76            18.76
## C2             6.66             8.66
## C3             7.80            16.30
## C4            19.73            23.73
## C5            11.95            16.95
## C6            12.43            30.45# import accession group/location information
AccINFO <- import_AccINFO("accession_group_file_path.txt")##     Sensitive             Type latitude longitude
## C1  Sensitive Mordern cultivar   65.216    33.677
## C2 Resistance Mordern cultivar   65.216    33.677
## C3        Mid Mordern cultivar   89.941    24.218
## C4  Sensitive Mordern cultivar   89.941    24.218
## C5 Resistance Mordern cultivar   89.941    24.218
## C6        Mid Mordern cultivar   25.231    42.761Be aware that the phenotype and accession group are effectively
tables. There are more than one ways to import a table format file with
R.
Be Note that: a. the accession/individual names
located in first column; b. the first row contents
phenotype/accession_group names; c. NA is allowed, it’s not
a wise option to replace NA by 0.
eg.
# import pheno from space ' ' delimed table
pheno <- read.table("your_pheno_file_path.csv", header = TRUE, row.names = 1, comment.char = "#")
# import pheno from ',' delimed table
pheno <- read.csv("your_pheno_file_path.csv", header = TRUE, comment.char = "#")There is a little work need to be done before haplotype calculations: (1) VCF filtration and (2) DNA sequences alignment.
There are three modes to filter a vcfR object after
import VCF into ‘R’: a. by position; b. by annotation; c. by both of
them.
# filter VCF by position
vcf_f1 <- filter_vcf(vcf, mode = "POS",
                     Chr = "scaffold_1",
                     start = 4300, end = 5890)
# filter VCF by annotation
vcf_f2 <- filter_vcf(vcf, mode = "type",
                     gff = gff,
                     type = "CDS")
# filter VCF by position and annotation
vcf_f3 <- filter_vcf(vcf, mode = "both",
                     Chr = "scaffold_1",
                     start = 4300, end = 5890,
                     gff = gff,
                     type = "CDS")It’s a time consuming work to import and manipulate a very large file
with ‘R’ on personal computer. It’ll be more efficiency to extract the
target ranges from origin VCF with filterLargeVCF() before
import. If your VCF file is just a few ‘MB’, this step was not necessary
at all.
Note: if extract more than one ranges, length of
output file names (VCFout) must be equal with
Chr and POS.
# new VCF file will be saved to disk
# extract a single gene/range from a large vcf
filterLargeVCF(VCFin = "Ori.vcf.gz",
               VCFout = "filtered.vcf.gz",
               Chr = "scaffold_8",
               POS = c(19802,24501),
               override = TRUE)
# extract multi genes/ranges from large vcf
filterLargeVCF(VCFin = "Ori.vcf.gz",          # surfix should be .vcf.gz or .vcf
               VCFout = c("filtered1.vcf.gz", # surfix should be .vcf.gz or .vcf
                          "filtered2.vcf.gz", 
                          "filtered3.vcf.gz"),
               Chr = c("scaffold_8",
                       "scaffold_8",
                       "scaffold_7"),
               POS = list(c(19802,24501), 
                          c(27341,28949),
                          c(38469,40344)),
               override = TRUE)               # if TRUE, existed file will be override without warningp.link <- filter_plink.pedmap(p.link, mode = "POS", 
                              Chr = "Chr08", start = 25947258, end = 25948258)The origin DNA sequences must be aligned and trimmed due to haplotype
calculation need all sequences have same length. Those operations could
be done with geneHapR. I still suggest users align and trim
DNA sequences with Mega software and then save the
result as FASTA format before import them into ‘R’.
# sequences alignment
seqs <- allignSeqs(seqs, quiet = TRUE)
# sequences trim
seqs <- trimSeqs(seqs, minFlankFraction = 0.1)
seqshap <- filter_hap(hapSummary,
                  rm.mode = c("position", "accession", "haplotype", "freq"),
                  position.rm = c(4879, 4950),
                  accession.rm = c("C1", "C9"),
                  haplotype.rm = c("H009", "H008"),
                  freq.min = 5)As mentioned before, haplotype could be calculated from VCF or
sequences with vcf2hap() or seqs2hap(). The
genotype of most sites should be known and homozygous, still, a few site
are unknown or heterozygous due to chromosome variant or error cased by
sequencing or SNP calling or gaps or other reasons. It’s a hard decision
whether to drop accessions/individuals contains heterozygous or unknown
sites for every haplotype analysis. Hence, I leave the choice to
users.
Calculate haplotype result from VCF.
hapResult <- vcf2hap(vcf,
                     hapPrefix = "H",
                     hetero_remove = TRUE,
                     na_drop = TRUE)
hapResult## Accessions:
##  All:      37
##  Removed: 4  C12, C16, C29, C33
##  Remain:  33
## 
## hap frequences:
## H001 H002 H003 H004 H005 H006 H007 H008 H009 
##   10    8    4    4    2    2    1    1    1 
## 
## Options:
##   hapPrefix: H
##       CHROM: scaffold_1
##         POS: 4300-6856
##  hetero_remove: YES
##   NA_remove: YES
## 
## # A tibble: 37 × 11
##    Hap    `4300` `6856` `5209` `5213` `4910` `4879` `4345` `4950` `5037` Acces…¹
##    <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  
##  1 CHR    scaff… scaff… scaff… scaff… scaff… scaff… scaff… scaff… scaff… ""     
##  2 POS    4300   6856   5209   5213   4910   4879   4345   4950   5037   ""     
##  3 INFO   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   ""     
##  4 ALLELE G/C    A/G    A/AC   C/G    GCCTA… T/A,G  T/A,GG T/AA,… A/AA,… ""     
##  5 H001   G      A      A      C      GCCTA  T      T      T      A      "C8"   
##  6 H001   G      A      A      C      GCCTA  T      T      T      A      "C9"   
##  7 H001   G      A      A      C      GCCTA  T      T      T      A      "C11"  
##  8 H001   G      A      A      C      GCCTA  T      T      T      A      "C14"  
##  9 H001   G      A      A      C      GCCTA  T      T      T      A      "C18"  
## 10 H001   G      A      A      C      GCCTA  T      T      T      A      "C25"  
## # … with 27 more rows, and abbreviated variable name ¹AccessionCalculate haplotype result from aligned DNA sequences.
hapResult <- seqs2hap(seqs,
                      Ref = names(seqs)[1],
                      hapPrefix = "H",
                      hetero_remove = TRUE,
                      na_drop = TRUE,
                      maxGapsPerSeq = 0.25)hapResultBefore visualization, there were a few details need to be adjusted. eg. add annotations and adjust position of “ATG”
hapResultWhile hapResult was calculated from vcfR
object, the INFO was taken from @fix
field. The VCF INFO may missing some annotations. or contents
format was inappropriate to display. Further more, INFO
contents nothing if hapResult was generated from sequences.
Here, we can introduce/replace the origin INFO by
addINFO().
Note that: length of values must be
equal with number of sites.
Let’s see how mant sites contains in the hapResult.
# Chech number of sites conclude in hapResult
sites(hapResult)## [1] 9Now we replace the old INFO field with new tag named as “PrChange”.
# add annotations to INFO field
hapResult <- addINFO(hapResult,
                     tag = "PrChange",
                     values = rep(c("C->D", "V->R", "G->N"),3),
                     replace = TRUE)Here, we add a tag named as “CDSChange” followed the old INFO.
# To replace the origin INFO by set 'replace' as TRUE
hapResult <- addINFO(hapResult,
                     tag = "CDSChange",
                     values = rep(c("C->A", "T->C", "G->T"),3),
                     replace = FALSE)This function was only used to adjust the position of “ATG” to 0 and hence convert the gene on negative strand to positive strand.
Be note that: GFF and hapResult need to adjust position of ATG with the same parameters.
# set ATG position as zero in gff
newgff <- gffSetATGas0(gff = gff, hap = hapResult,
                       geneID = "test1G0387",
                       Chr = "scaffold_1",
                       POS = c(4300, 7910)) 
# set position of ATG as zero in hapResult/hapSummary
newhap <- hapSetATGas0(gff = gff, hap = hapResult,
                       geneID = "test1G0387",
                       Chr = "scaffold_1",
                       POS = c(4300, 7910))hapResult summary and visualizationOnce we have the hapResult object, can we summarize and
visualize our hapResult by interact with annotations and
phenotypes.
Now, we have the hapResult object with INFOs we want
display in next step. The hap_summary() function convert
the object of hapResult class, which is a long table
format, into a short table belong to hapSummary class. In
hapResult each row represent a accession, while each row
represents a hap in hapSummary.
hapSummary <- hap_summary(hapResult)
hapSummary## 
## Accssions: 33
## Sites: 9
##   Indels: 5
##   SNPs:   4
## 
## Haplotypes: 9 
##   H001   10  C8, C9, C11, C14, C18, C25, ... 
##   H002   8   C5, C15, C17, C19, C22, C32, ...    
##   H003   4   C6, C20, C23, C37   
##   H004   4   C7, C13, C24, C30   
##   H005   2   C4, C21 
##   H006   2   C10, C27    
##   H007   1   C1  
##   H008   1   C2  
##   H009   1   C3  
##   
## Options:
##  hapPrefix:  H
##  CHROM:  scaffold_1
##  POS:    4300-6856
##  hetero_remove:  YES
##  NA_remove:  YES
## 
## # A tibble: 13 × 12
##    Hap    `4300` `6856` `5209` `5213` `4910` `4879` `4345` `4950` `5037` Acces…¹
##    <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  
##  1 CHR    scaff… scaff… scaff… scaff… scaff… scaff… scaff… scaff… scaff… ""     
##  2 POS    4300   6856   5209   5213   4910   4879   4345   4950   5037   ""     
##  3 INFO   PrCha… PrCha… PrCha… PrCha… PrCha… PrCha… PrCha… PrCha… PrCha… ""     
##  4 ALLELE G/C    A/G    A/AC   C/G    GCCTA… T/A,G  T/A,GG T/AA,… A/AA,… ""     
##  5 H001   G      A      A      C      GCCTA  T      T      T      A      "C8;C9…
##  6 H002   C      A      A      G      A      T      T      T      A      "C5;C1…
##  7 H003   G      A      AC     G      A      A      A      AA     AA     "C6;C2…
##  8 H004   G      A      A      G      A      T      T      T      A      "C7;C1…
##  9 H005   G      A      A      C      GCCTA  G      T      T      GGG    "C4;C2…
## 10 H006   G      G      A      G      A      T      T      T      A      "C10;C…
## 11 H007   C      A      A      G      A      G      GG     GG     CC     "C1"   
## 12 H008   G      G      A      G      A      T      GG     T      CC     "C2"   
## 13 H009   G      G      A      G      A      T      T      GG     GGG    "C3"   
## # … with 1 more variable: freq <int>, and abbreviated variable name ¹AccessionLet’s see how to visualization of our haplotype results.
At first let’s display the hapSummary as a table. In
this table like figure we can see all the variants and their positions,
haplotypes and their frequencies.
plotHapTable(hapSummary)Also we can add an annotation, “CDSChange”, to the table by assign
the INFO_tag. It’s your responsibility to verify whether
the INFO_tag was existed in the INFO field.
# add one annotation
plotHapTable(hapSummary,
             hapPrefix = "H",
             INFO_tag = "CDSChange", 
             tag_name = "CDS",
             displayIndelSize = 1, 
             angle = 45,
             replaceMultiAllele = TRUE,
             ALLELE.color = "grey90")Now let’s add another INFO_tag named as “PrChange”.
# add multi annotation
plotHapTable(hapSummary,
             hapPrefix = "H",
             INFO_tag = c("CDSChange", "PrChange"),
             displayIndelSize = 1, 
             angle = 45,
             replaceMultiAllele = TRUE,
             ALLELE.color = "grey90")Parameter tag_name was used to replace the character if
INFO_tag was too long.
# add multi annotation
plotHapTable(hapSummary,
             hapPrefix = "H",
             INFO_tag = c("CDSChange", "PrChange"),
             tag_name = c("CDS", "Pr"),
             displayIndelSize = 1, 
             angle = 45,
             replaceMultiAllele = TRUE,
             ALLELE.color = "grey90")I think it’s a good idea to figure out where are the variants by marking them on gene model.
displayVarOnGeneModel(gff,
                      hapSummary,
                      Chr = "scaffold_1",
                      startPOS = 4300, endPOS = 7910,
                      type = "pin", cex = 0.7,
                      CDS_h = 0.05, fiveUTR_h = 0.02, threeUTR_h = 0.01)hapNet calculation and visualizationThe hapNet could be generated from object of
hapSummary class. The accession group information could be
attached in this step.
hapNet <- get_hapNet(hapSummary,
                     AccINFO = AccINFO,
                     groupName = "Type")Once we have the hapNet object, we can plot it with
‘R’.
# plot haploNet
plotHapNet(hapNet,
           size = "freq",                   # circle size
           scale = "log2",                 # scale circle with 'log10(size + 1)'
           cex = 0.8,                       # size of hap symbol
           col.link = 2,                    # link colors
           link.width = 2,                  # link widths
           show.mutation = 2,               # mutation types one of c(0,1,2,3)
           legend = c(-12.5, 7))        # legend positionNow we get the haplotype result. There is a new question emerged: how did those main haplotypes distributed, are they related to geography?
# library(mapdata)
# library(maptools)
hapDistribution(hapResult,
                AccINFO = AccINFO,
                LON.col = "longitude",
                LAT.col = "latitude", 
                hapNames = c("H001", "H002", "H003"), 
                legend = TRUE)Finally, let’s see which haplotype has superiority at particular area by interact with phynotype.
Here are two options, merged or separated, to organized the heatmap
of p-values and violin plot. The figure as an object of
ggplot2, which means user could add/modified figure
elements with ggplot2.
Here is an example for merged arrangement:
results <-hapVsPheno(hapResult,
                     hapPrefix = "H",
                     title = "This is title",
                     mergeFigs = TRUE,
                     pheno = pheno,
                     phenoName = "GrainWeight.2021",
                     minAcc = 3)
plot(results$figs)An example for separated plot:
results <- hapVsPheno(hap = hapResult,
                      hapPrefix = "H",
                      title = "This is title",
                      pheno = pheno,
                      phenoName = "GrainWeight.2021",
                      minAcc = 3,
                      mergeFigs = FALSE)
plot(results$fig_pvalue)plot(results$fig_Violin)I believe the function of hapVsPhenos() will be useful
there are a lot of phenotype need to be associated with haplotype
results.
Note that: the pheno name will be added between the file name and surfix.
hapVsPhenos(hapResult,
            pheno,
            outPutSingleFile = TRUE,
            hapPrefix = "H",
            title = "Seita.0G000000",
            file = "mypheno.tiff",
            width = 12,
            height = 8,
            res = 300)