The dataverse package is the official R client for Dataverse 4 data repositories. The package enables data search, retrieval, and deposit with any Dataverse installation, thus allowing R users to integrate public data sharing into the reproducible research workflow.
In addition to this introduction, the package contains further vignettes, including ones on search and download.
They can be accessed from CRAN or from within R using vignette(package = "dataverse").
The dataverse client package can be installed from CRAN, and you can find the latest development version and report any issues on GitHub.
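A minimal installation sketch follows. The CRAN mirror URL, the use of the remotes package, and the repository path IQSS/dataverse-client-r are assumptions not stated in this text, so check them against the package's CRAN page before relying on them.

```r
# Install the released version from CRAN if it is not already available.
# The mirror URL is an assumption; any CRAN mirror should work.
if (!requireNamespace("dataverse", quietly = TRUE)) {
  install.packages("dataverse", repos = "https://cloud.r-project.org")
}

# Development version from GitHub (assumed repository path; requires 'remotes').
# Uncomment to use:
# remotes::install_github("IQSS/dataverse-client-r")
```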
Dataverse has some terminology that is worth quickly reviewing before showing how to work with Dataverse in R. Dataverse is an application that can be installed in many places. As a result, dataverse can work with any installation, but you need to specify which installation you want to work with. This can be set by default with an environment variable, DATAVERSE_SERVER, for example through Sys.setenv().
The value should be the Dataverse server's host name (for example, dataverse.harvard.edu), without the “https” prefix or the “/api” URL path. The package attempts to compensate for malformed values, though.
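As a small sketch of that setting, the following points the session at Harvard's installation. The host name dataverse.harvard.edu is an illustrative choice; substitute whichever installation you use.

```r
# Set the default Dataverse installation for this R session.
# Host name only: no "https://" prefix and no "/api" path.
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

# Confirm the value that later calls will pick up by default.
Sys.getenv("DATAVERSE_SERVER")
```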
Within a given Dataverse installation, organizations or individuals can create objects that are also called “dataverses”. These dataverses can then contain other dataverses, which can contain other dataverses, and so on. They can also contain datasets, which in turn contain files. You can think of Harvard’s Dataverse as a top-level installation, where an institution might have a dataverse that contains a subsidiary dataverse for each researcher at the organization, who in turn publishes all files relevant to a given study as a dataset.
You can search for and retrieve data without a Dataverse account for that specific Dataverse installation. For example, dataverse_search() can find data files or datasets that mention “ecological inference”.
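A hedged sketch of such a search is shown here. It assumes network access and passes Harvard's installation through the server argument as an illustrative choice; the exact shape of the returned object may differ between package versions, so it is only previewed with head().

```r
library("dataverse")

# Search one installation for the phrase from the text above.
# The server name is an assumption; use your own installation if different.
results <- dataverse_search(
  "ecological inference",
  server = "dataverse.harvard.edu"
)

# Preview the first few matches.
head(results)
```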
The search vignette describes this functionality in more detail.
To retrieve a data file, we need to investigate the dataset being returned and look at what files it contains using a variety of functions, such as get_dataset(), dataset_files(), get_file_metadata(), get_file(), and get_dataframe_by_name().
The most practical of these is likely get_dataframe_by_name(), which imports the object directly as a data frame. get_file() is more primitive and returns a raw vector.
Recall that a dataset in Dataverse is a collection of files rather than,
say, a single CSV file, so the get_dataset() function does not return
data but rather information about a Dataverse dataset.
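The distinction between dataset information and file contents can be sketched with a small helper. The name fetch_table and its arguments are hypothetical, introduced only for illustration, and the placeholder DOI in the usage comment is not a real dataset.

```r
# Sketch: fetch information about a dataset, then import one of its files.
# `doi` and `file_name` are supplied by the caller; `server` defaults to
# the DATAVERSE_SERVER environment variable.
fetch_table <- function(doi, file_name,
                        server = Sys.getenv("DATAVERSE_SERVER")) {
  # get_dataset() returns information about the dataset, not the data itself.
  info <- dataverse::get_dataset(dataset = doi, server = server)

  # get_dataframe_by_name() downloads the named file and imports it.
  data <- dataverse::get_dataframe_by_name(
    filename = file_name,
    dataset  = doi,
    server   = server
  )

  list(info = info, data = data)
}

# Usage (placeholders, not a real dataset):
# result <- fetch_table("doi:10.xxxx/XXXXXX", "example.tab")
# head(result$data)
```

Wrapping the two calls keeps the identifiers in one place; in interactive use the same calls can be run directly.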
The download vignette describes this functionality in more detail.
For “native” Dataverse features (such as user account controls) or to
create and publish a dataset, you will need an API key linked to a
Dataverse installation account. Instructions for obtaining an account
and setting up an API key are available in the Dataverse
User Guide. (Note: if your key is compromised, it can be regenerated
to preserve security.) Once you have an API key, store it
as an environment variable called DATAVERSE_KEY. It can be
set within R using Sys.setenv(), with your own key in place of the
placeholder examplekey12345.
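A minimal sketch of storing the key for the current session, using the placeholder value from the text. The check afterwards is an optional addition, not something the package requires.

```r
# Store the API key for this session; replace the placeholder with your own key.
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")

# Optional sanity check: fail early if the key is empty.
if (!nzchar(Sys.getenv("DATAVERSE_KEY"))) {
  stop("DATAVERSE_KEY is not set")
}
```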
With that set, you can create a new dataverse, create a dataset within that dataverse, push files to the dataset, and release it, using functions such as create_dataverse(), create_dataset(), add_dataset_file(), and publish_dataset().
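The last two steps of that workflow might look like the sketch below. The helper deposit_file and its arguments are hypothetical. The argument names passed to add_dataset_file() and publish_dataset() reflect the package documentation as best understood here, so confirm them with ?add_dataset_file and ?publish_dataset. The sketch assumes DATAVERSE_SERVER and DATAVERSE_KEY are already set and that the dataset already exists.

```r
# Sketch: add a local file to an existing dataset, then publish the dataset.
# Relies on DATAVERSE_SERVER and DATAVERSE_KEY being set in the environment.
deposit_file <- function(path, doi, description = NULL) {
  # Fail early if the file is missing or no API key is configured.
  stopifnot(file.exists(path), nzchar(Sys.getenv("DATAVERSE_KEY")))

  # Upload the file to the dataset identified by `doi`.
  dataverse::add_dataset_file(
    file        = path,
    dataset     = doi,
    description = description
  )

  # Release the dataset as a minor version.
  dataverse::publish_dataset(dataset = doi, minor = TRUE)
}

# Usage (placeholders, not a real dataset):
# deposit_file("results.csv", "doi:10.xxxx/XXXXXX", description = "Model output")
```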
As of dataverse version 0.3.0, we recommend the Python
client (https://github.com/gdcc/pyDataverse) for these
upload and maintenance functions.