Create Datasets that are Easy to Share Exchange and Extend

library(dataset)

The aim of the dataset_df is to create semantically richer datasets.

small_country_dataset <- dataset_df(
  country_name = defined(
    c("AD", "LI"),
    label = "Country name",
    definition = "http://data.europa.eu/bna/c_6c2bb82d", 
    namespace = "https://www.geonames.org/countries/$1/"), 
  gdp = defined(
    c(3897, 7365), 
    label = "Gross Domestic Product", 
    unit = "million dollars", 
    definition = "http://data.europa.eu/83i/aa/GDP"),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset", 
    creator = person("Jane", "Doe"), 
    publisher = "Example Inc.")
  )

The defined class columns retain the important semantic information about the label, exact definition link, and the unit of measure:

var_label(small_country_dataset$gdp)
#> [1] "Gross Domestic Product"
var_unit(small_country_dataset$gdp)
#> [1] "million dollars"
var_definition(small_country_dataset$gdp)
#> [1] "http://data.europa.eu/83i/aa/GDP"

And the dataset_df object has a the important semantic information about the dataset as a whole, with pre-filled default values for not yes assigned or announced information. The metadata of the entire dataset conforms to the two most universally used definitions, Dublin Core (DCTERMS) and DataCite (DataCite).

Each standard Dublinc Core/DataCite metadata field has a simple interface function, for example, you can add the language of the dataset with the following assignment command:

language(small_country_dataset) <- "en"

And to review the descriptive data of the entire dataset as a whole:

print(get_bibentry(small_country_dataset), "bibtex" )
#> @Misc{,
#>   title = {Small Country Dataset},
#>   author = {Jane Doe},
#>   identifier = {:tba},
#>   publisher = {Example Inc.},
#>   year = {:tba},
#>   relation = {:unas},
#>   format = {:unas},
#>   rights = {:tba},
#>   type = {DCMITYPE:Dataset},
#>   datasource = {:unas},
#>   coverage = {:unas},
#>   language = {eng},
#> }

Wikidata

The wbdataset package is an extension of the dataset, which in turn is an R package that helps to exchange, publish and combine datasets more easily by improving their semantics. The wbdataset extends the usability of dataset by connecting the Wikibase API with the R statistical environment. It is in an early, exploratory stage, and it is not yet on CRAN.

devtools::install_github("dataobservatory-eu/wbdataset")
# Select the following country profiles from Wikidata:
wikidata_countries <- c(
  "http://www.wikidata.org/entity/Q756617", "http://www.wikidata.org/entity/Q347",
  "http://www.wikidata.org/entity/Q3908",   "http://www.wikidata.org/entity/Q1246")
  
# Retrieve their labels into a dataset called 'European countries':
wikidata_countries_df <- get_item(qid = wikidata_countries,
                                  language ="en",
                                  title = "European countries",
                                  creator = person("Daniel", "Antal"))
                                  
# Add properties from Wikidata as attributes and variables of this dataset:                             
ds <- wikidata_countries_df %>%
  left_join_column( 
    label = "ISO 3166-1 alpha-2 code", 
    property = "P297" ) %>%
  left_join_column( 
    property = "P1566", 
    label = "Geonames ID",
    namespace = "https://www.geonames.org/") 
    

Further data platforms

We are planning an extension of the dataset package to conform the datacube definition used by statistical data repositories and exchanges, which will require a management of the multi-dimensional datasets and the data structure definition of the Statistical Data and Metadata eXchange. This feature was part of the early release of dataset, but as reviewers pointed out, they were out of focus from the core functionality.

Serialise to RDF

Our package is aiming to work seamlessly with the rOpenSci rdflib package. For this purpose, we are working on a simple turtle package that writes the bibentry about the dataset, and the dataset contents into an RDF-annotated text file, which can be then used or stored with rdflib.

The tuRtle package is being co-developed with dataset, and will serialise the data and metadata of the dataset_df objects.

remotes::install_github("dataobservatory-eu/tuRtle", build = FALSE)
library(tuRtle)
tdf <- data.frame (s = c("eg:01","eg:02",  "eg:01", "eg:02", "eg:01" ),
                   p = c("a", "a", "eg-var:", "eg-var:", "rdfs:label"),
                   o = c("qb:Observation",
                         "qb:Observation",
                         "\"1\"^^<xs:decimal>",
                         "\"2\"^^<xs:decimal>", 
                         '"Example observation"')
                   )

knitr::kable(tdf)
s p o
eg:01 a qb:Observation
eg:02 a qb:Observation
eg:01 eg-var: “1”^^
eg:02 eg-var: “2”^^
eg:01 rdfs:label “Example observation”
example_file<- file.path(tempdir(), "example_ttl.ttl")
ttl_write(tdf=tdf, ttl_namespace=NULL, file_path=example_file)

readLines(example_file)