The R package PGRdup was developed as a tool to aid genebank managers in the identification of probable duplicate accessions from plant genetic resources (PGR) passport databases.
This package primarily implements a workflow designed to fetch groups or sets of germplasm accessions with similar passport data, particularly in fields associated with accession names, within or across PGR passport databases.
The functions in this package are primarily built using the following R packages:
The package can be installed from CRAN as follows:
# Install from CRAN
install.packages('PGRdup', dependencies=TRUE)
The development version can be installed from GitHub as follows:
# Install development version from GitHub
devtools::install_github("aravind-j/PGRdup")
The series of steps involved in the workflow, along with the associated functions, is illustrated below:
Function(s) :
DataClean
MergeKW
MergePrefix
MergeSuffix
Use these functions for the appropriate data standardisation of the relevant fields in the passport databases to harmonize punctuation, leading zeros, prefixes, suffixes etc. associated with accession names.
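The kind of harmonisation described above can be sketched in base R. This is only an illustration: `clean_name` is a hypothetical helper, not a PGRdup function, and the rules it applies (upper-casing, replacing punctuation with spaces, dropping leading zeros) are assumptions about what a reasonable clean-up might include.

```r
# Hypothetical helper (NOT part of PGRdup): a rough base-R approximation of
# accession-name standardisation.
clean_name <- function(x) {
  x <- toupper(x)                                          # ignore case
  x <- gsub("[[:punct:]]+", " ", x)                        # punctuation -> space
  x <- gsub("(?<![0-9])0+(?=[0-9])", "", x, perl = TRUE)   # strip leading zeros
  x <- gsub("\\s+", " ", x)                                # collapse whitespace
  trimws(x)
}

# Three spellings of what is probably the same accession number
acc <- c("EC-0012", "ec 12", "EC  012;")
clean_name(acc)
```

After cleaning, all three variants collapse to the same string, which is what makes the later matching steps tractable.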
Function(s) :
KWIC
Use this function to extract the information in the relevant fields as keywords or text strings in the form of a searchable Keyword in Context (KWIC) index.
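To make the idea of a keyword index concrete, here is a minimal base-R sketch. The toy table, its column names and the `toy_kwic` helper are all assumptions made for illustration; the structure of the index returned by the package's own KWIC function may differ.

```r
# Toy passport table; the column names are made up for this illustration.
pgr <- data.frame(
  ID    = c("A1", "A2", "A3"),
  NAME  = c("SHAN LI HONG", "SHANLI HONG", "TMV 2"),
  OTHER = c("EC 12", "", "TMV2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT the package's KWIC): one row per keyword, keeping
# the primary key and the source field as context.
toy_kwic <- function(x, key, fields) {
  rows <- lapply(fields, function(f) {
    kw <- strsplit(x[[f]], "\\s+")          # split each entry into keywords
    data.frame(
      KEYWORD = unlist(kw),
      PRIM_ID = rep(x[[key]], lengths(kw)), # repeat the key once per keyword
      FIELD   = rep(f, sum(lengths(kw))),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out[order(out$KEYWORD), ]                 # sorted, so it is easy to scan
}

idx <- toy_kwic(pgr, key = "ID", fields = c("NAME", "OTHER"))
idx
```

Each keyword can then be looked up together with the record and field it came from.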
Function(s) :
ProbDup
Execute fuzzy, phonetic and semantic matching of keywords to identify probable duplicate sets either within a single KWIC index or between two indexes using this function. For fuzzy matching, the Levenshtein edit distance is used, while for phonetic matching, the Double Metaphone algorithm is used. For semantic matching, synonym sets or ‘synsets’ of accession names can be supplied as an input, and the text strings in such sets will be treated as identical for matching. Various options to tweak the matching strategies are also available in this function.
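The fuzzy part of this step can be illustrated with `adist()` from base R (the utils package), which computes a generalized Levenshtein distance. This is a sketch of the technique only, not of how ProbDup is implemented; the keywords and the `max_dist` threshold are assumed values.

```r
# Example keywords (assumed data)
kw <- c("SHANLIHONG", "SHANLIHUNG", "TMV2", "TMV2", "KALINGA")

# Pairwise generalized Levenshtein distances between all keywords
d <- adist(kw)

# Assumed threshold: treat keywords at most one edit apart as a fuzzy match
max_dist <- 1

# Keep each pair once by looking only at the upper triangle
hits <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)

data.frame(
  a    = kw[hits[, "row"]],
  b    = kw[hits[, "col"]],
  dist = d[hits],
  stringsAsFactors = FALSE
)
```

A larger threshold retrieves more candidate pairs at the cost of more false positives, which is why the retrieved sets are reviewed in the next step.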
Function(s) :
DisProbDup
ReviewProbDup
ReconstructProbDup
Inspect, revise and improve the retrieved sets using these functions. If considerable intersections exist between the initially identified sets, then DisProbDup may be used to get the disjoint sets. The identified sets may be subjected to clerical review after transforming them into an appropriate spreadsheet format which contains the raw data from the original database(s) using ReviewProbDup and subsequently converted back using ReconstructProbDup.
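What "getting the disjoint sets" means can be shown with a small base-R sketch that merges any sets sharing a member until no overlaps remain. `merge_overlapping` is a hypothetical helper written for this illustration and is not claimed to reflect how DisProbDup works internally.

```r
# Hypothetical helper (NOT the package's DisProbDup): repeatedly merge any
# two sets that share at least one accession ID until all sets are disjoint.
merge_overlapping <- function(sets) {
  repeat {
    merged <- FALSE
    for (i in seq_along(sets)) {
      for (j in seq_along(sets)) {
        if (i < j && length(intersect(sets[[i]], sets[[j]])) > 0) {
          sets[[i]] <- sort(union(sets[[i]], sets[[j]]))  # absorb set j into i
          sets[[j]] <- NULL                               # drop the merged set
          merged <- TRUE
          break
        }
      }
      if (merged) break   # list changed, so restart the scan
    }
    if (!merged) return(sets)
  }
}

# Assumed example: three of the four sets are linked through shared IDs
sets <- list(c("A1", "A2"), c("A2", "A3"), c("B1", "B2"), c("A3", "A4"))
disjoint <- merge_overlapping(sets)
disjoint
```

The quadratic scan is fine for a handful of sets; for large collections a graph-based connected-components approach would be the more natural choice.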
Function(s) :
ValidatePrimKey
DoubleMetaphone
ParseProbDup
AddProbDup
SplitProbDup
MergeProbDup
ViewProbDup
KWCounts
read.genesys
Use these helper functions if needed. ValidatePrimKey can be used to check whether a column chosen in a data.frame as the primary key/ID conforms to the constraints of absence of duplicates and NULL values. DoubleMetaphone is an implementation of the Double Metaphone phonetic algorithm in R and is used for phonetic matching. ParseProbDup and AddProbDup work with objects of class ProbDup. The former can be used to parse the probable duplicate sets in a ProbDup object to a data.frame, while the latter can be used to add the data fields for these sets to the passport databases. SplitProbDup can be used to split an object of class ProbDup according to set counts. MergeProbDup can be used to merge two objects of class ProbDup. ViewProbDup can be used to plot summary visualizations of the probable duplicate sets retrieved in an object of class ProbDup. KWCounts can be used to compute keyword counts from PGR passport database fields (columns), which can give a rough indication of the completeness of the data. read.genesys can be used to import PGR data in a Darwin Core - germplasm zip archive downloaded from the Genesys database into the R environment.
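The primary key constraints mentioned above (no duplicates, no missing values) are easy to check by hand in base R. The sketch below uses a hypothetical `check_key` helper, not ValidatePrimKey itself, and assumes that both `NA` and empty strings count as missing.

```r
# Hypothetical helper (NOT the package's ValidatePrimKey): report the row
# numbers that violate the primary key constraints.
check_key <- function(x, key) {
  k <- x[[key]]
  list(
    # Assumption: NA and empty/blank strings both count as missing
    missing    = which(is.na(k) | trimws(as.character(k)) == ""),
    # Flag every row involved in a duplicate, not only the later copies
    duplicated = which(duplicated(k) | duplicated(k, fromLast = TRUE))
  )
}

# Assumed toy data: one repeated ID and one missing ID
pgr <- data.frame(ID = c("A1", "A2", "A2", NA), stringsAsFactors = FALSE)
res <- check_key(pgr, "ID")
res
```

Rows reported here would need to be fixed in the source database before the column can serve as a key for the index.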
Use fread from the data.table package to rapidly read large files instead of read.csv or read.table in base R.
If the passport data is held in a database, use the appropriate R-database interface packages to get the required table as a data.frame in R.
The ProbDup function can be memory hungry with large passport databases. In such cases, ensure that the system has sufficient memory for smooth functioning (see ?ProbDup).
For a detailed tutorial (vignette) on how to use this package, type:
browseVignettes(package = 'PGRdup')
The vignette for the latest version is also available online.
To know what's new in this version, type:
news(package='PGRdup')
PGRdup
To cite the methods in the package use:
citation("PGRdup")
To cite the R package 'PGRdup' in publications use:
Aravind, J., Radhamani, J., Kalyani Srinivasan, Ananda Subhash, B., and Tyagi, R. K. (). PGRdup:
Discover Probable Duplicates in Plant Genetic Resources Collections. R package version 0.2.3.9,
https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup.
A BibTeX entry for LaTeX users is
@Manual{,
title = {PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections},
author = {J. Aravind and J. Radhamani and {Kalyani Srinivasan} and B. {Ananda Subhash} and Rishi Kumar Tyagi},
note = {R package version 0.2.3.9 https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup},
}
This free and open-source software implements academic research by the authors and co-workers. If you use
it, please support the project by citing the package.