The aim of the dataset_df
is to create semantically
richer datasets.
small_country_dataset <- dataset_df(
country_name = defined(
c("AD", "LI"),
label = "Country name",
definition = "http://data.europa.eu/bna/c_6c2bb82d",
namespace = "https://www.geonames.org/countries/$1/"),
gdp = defined(
c(3897, 7365),
label = "Gross Domestic Product",
unit = "million dollars",
definition = "http://data.europa.eu/83i/aa/GDP"),
dataset_bibentry = dublincore(
title = "Small Country Dataset",
creator = person("Jane", "Doe"),
publisher = "Example Inc.")
)
The defined
class columns retain the important semantic
information about the label, exact definition link, and the unit of
measure:
And the dataset_df
object has a the important semantic
information about the dataset as a whole, with pre-filled default values
for not yes assigned or announced information. The metadata of the
entire dataset conforms to the two most universally used definitions,
Dublin Core (DCTERMS) and DataCite (DataCite).
Each standard Dublinc Core/DataCite metadata field has a simple interface function, for example, you can add the language of the dataset with the following assignment command:
And to review the descriptive data of the entire dataset as a whole:
print(get_bibentry(small_country_dataset), "bibtex" )
#> @Misc{,
#> title = {Small Country Dataset},
#> author = {Jane Doe},
#> identifier = {:tba},
#> publisher = {Example Inc.},
#> year = {:tba},
#> relation = {:unas},
#> format = {:unas},
#> rights = {:tba},
#> type = {DCMITYPE:Dataset},
#> datasource = {:unas},
#> coverage = {:unas},
#> language = {eng},
#> }
The wbdataset package is an extension of the dataset, which in turn is an R package that helps to exchange, publish and combine datasets more easily by improving their semantics. The wbdataset extends the usability of dataset by connecting the Wikibase API with the R statistical environment. It is in an early, exploratory stage, and it is not yet on CRAN.
# Select the following country profiles from Wikidata:
wikidata_countries <- c(
"http://www.wikidata.org/entity/Q756617", "http://www.wikidata.org/entity/Q347",
"http://www.wikidata.org/entity/Q3908", "http://www.wikidata.org/entity/Q1246")
# Retrieve their labels into a dataset called 'European countries':
wikidata_countries_df <- get_item(qid = wikidata_countries,
language ="en",
title = "European countries",
creator = person("Daniel", "Antal"))
# Add properties from Wikidata as attributes and variables of this dataset:
ds <- wikidata_countries_df %>%
left_join_column(
label = "ISO 3166-1 alpha-2 code",
property = "P297" ) %>%
left_join_column(
property = "P1566",
label = "Geonames ID",
namespace = "https://www.geonames.org/")
We are planning an extension of the dataset package to conform the datacube definition used by statistical data repositories and exchanges, which will require a management of the multi-dimensional datasets and the data structure definition of the Statistical Data and Metadata eXchange. This feature was part of the early release of dataset, but as reviewers pointed out, they were out of focus from the core functionality.
Our package is aiming to work seamlessly with the rOpenSci rdflib package. For this purpose, we are working on a simple turtle package that writes the bibentry about the dataset, and the dataset contents into an RDF-annotated text file, which can be then used or stored with rdflib.
The tuRtle package is being co-developed with dataset, and
will serialise the data and metadata of the dataset_df
objects.
library(tuRtle)
tdf <- data.frame (s = c("eg:01","eg:02", "eg:01", "eg:02", "eg:01" ),
p = c("a", "a", "eg-var:", "eg-var:", "rdfs:label"),
o = c("qb:Observation",
"qb:Observation",
"\"1\"^^<xs:decimal>",
"\"2\"^^<xs:decimal>",
'"Example observation"')
)
knitr::kable(tdf)
s | p | o |
---|---|---|
eg:01 | a | qb:Observation |
eg:02 | a | qb:Observation |
eg:01 | eg-var: | “1”^^ |
eg:02 | eg-var: | “2”^^ |
eg:01 | rdfs:label | “Example observation” |