datagovindia is a wrapper around more than 80,000 APIs of the Government of India’s open data platform, data.gov.in. Here is a short guide to take you through the package. The functionality is centered around three aspects: finding the right API, getting information about it, and extracting data from it.
The APIs from the portal are scraped every week to update a list of all APIs and the information attached to them, such as sector, source and field names. The website data.gov.in provides search through string matching and drop-down menus, but these are very limited. The functions in this package allow more robust string-based searches.
A user can search by API title, description, organization type, organization (ministry), sector and source. There are two types of functions here: the first returns a list of all available, unique values of organization type, organization (ministry), sector or source, and the second lets one “search” by these criteria and more.
Here is a demonstration of the former (showing only the first few values):
###List of organizations (or ministries)
get_list_of_organizations() %>% 
  head
#> [1] "Ministry of Environment and Forests"             
#> [2] "Central Pollution Control Board"                 
#> [3] "Ministry of Home Affairs"                        
#> [4] "Department of Home"                              
#> [5] "Registrar General and Census Commissioner, India"
#> [6] "Ministry of Agriculture and Farmers Welfare"
###List of sectors
get_list_of_sectors() %>% 
  head
#> [1] "Industrial Air Pollution" "Census and Surveys"      
#> [3] "Census"                   "Statistics"              
#> [5] "Agriculture"              "Agricultural Marketing"
Once you have an idea of what you want to look for, search queries can be constructed using titles and descriptions as well as the categories explored earlier. Each search returns a data.frame with information on the APIs matching the search keywords. Because the result is a data.frame, multiple search functions can be combined with each other.
| index_name | title | description | org_type | org | sector | source | created_date | updated_date | 
|---|---|---|---|---|---|---|---|---|
| 583f10fa-a19e-4a08-85f1-69dcf64438f4 | Details of Number of industries inspected and Directions issued under Section 5 of Environment (Protection) Act, 1986 by Central Pollution Control Board (CPCB) since 2016-17 till 14.06.2019 (From: Ministry of Environment, Forest and Climate Change) | Details of Number of industries inspected and Directions issued under Section 5 of Environment (Protection) Act, 1986 by Central Pollution Control Board (CPCB) since 2016-17 till 14.06.2019 (From: Ministry of Environment, Forest and Climate Change) | Central | Rajya Sabha | All | data.gov.in | 2021-03-04T06:52:31Z | 2021-03-12T17:56:27Z | 
| b8e4ff80-ec3c-439c-aebb-f27eabe410b3 | State/UT-wise Number of Complying and Non-Complying Locations w.r.t. Heavy Metals According Central Pollution Control Board (CPCB) during 2017 (From : Ministry of Environment, Forest and Climate Change) | State/UT-wise Number of Complying and Non-Complying Locations w.r.t. Heavy Metals According Central Pollution Control Board (CPCB) during 2017 (From : Ministry of Environment, Forest and Climate Change) | Central | Rajya Sabha | All | data.gov.in | 2021-03-04T06:37:26Z | 2021-03-04T06:37:26Z | 
##Multiple Criteria
dplyr::intersect(search_api_by_title(title_contains = "pollution"),
                 search_api_by_organization(organization_name_contains = "pollution"))
| index_name | title | description | org_type | org | sector | source | created_date | updated_date | 
|---|---|---|---|---|---|---|---|---|
| 0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08 | Details of Comprehensive Environmental Pollution Index (CEPI) Scores and Status of Moratorium in Critically Polluted Areas (CPAs) in India | NA | Central | Ministry of Environment and Forests|Central Pollution Control Board | Industrial Air Pollution|Water Quality|Natural Resources|Environment and Forest | data.gov.in | 2017-06-08T16:36:24Z | 2018-11-30T02:35:16Z | 
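Because search results are plain data.frames, they can also be narrowed with ordinary R code. The following is a minimal sketch, not part of the package: it uses a small stand-in data.frame built from two rows of the tables above (reduced to three columns), and `has_entry` is a hypothetical helper that treats the `org` and `sector` columns as "|"-separated lists, as they appear in the table.

```r
# Stand-in for a search result: two rows copied from the tables above,
# reduced to the columns used here. A real result would come from the
# package's search functions.
apis <- data.frame(
  index_name = c("583f10fa-a19e-4a08-85f1-69dcf64438f4",
                 "0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08"),
  org = c("Rajya Sabha",
          "Ministry of Environment and Forests|Central Pollution Control Board"),
  sector = c("All",
             "Industrial Air Pollution|Water Quality|Natural Resources|Environment and Forest"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a pipe-separated column contains `value`
# as one complete entry (not merely as a substring).
has_entry <- function(column, value) {
  vapply(strsplit(column, "|", fixed = TRUE),
         function(entries) value %in% trimws(entries),
         logical(1))
}

# Keep only the APIs tagged with the "Water Quality" sector.
water_apis <- apis[has_entry(apis$sector, "Water Quality"), ]
water_apis$index_name
```

Matching complete entries rather than substrings avoids picking up, say, "Agricultural Marketing" when looking for "Agriculture".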
Once you have found the right API for your use, take note of its “index_name”. For example, “0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08” corresponds to the API for “Details of Comprehensive Environmental Pollution Index (CEPI) Scores and Status of Moratorium in Critically Polluted Areas (CPAs) in India”. The index_name is essential both for learning more about the API and for getting data from it.
There are two functions in this section: one to get API information, the other to get the available “field” names and types of the chosen API (using its index_name obtained above).
| index_name | title | description | org_type | org | sector | source | created_date | updated_date | 
|---|---|---|---|---|---|---|---|---|
| 0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08 | Details of Comprehensive Environmental Pollution Index (CEPI) Scores and Status of Moratorium in Critically Polluted Areas (CPAs) in India | NA | Central | Ministry of Environment and Forests|Central Pollution Control Board | Industrial Air Pollution|Water Quality|Natural Resources|Environment and Forest | data.gov.in | 2017-06-08T16:36:24Z | 2018-11-30T02:35:16Z | 
Fields are essentially the variables in the dataset obtained from the API. Knowing the fields before querying for the data is essential to perform tasks such as filtering, sorting and subsetting the data obtained from the API’s server.
| id | name | type | 
|---|---|---|
| document_id | document_id | double | 
| status_of_moratorium | Status of Moratorium | keyword | 
| industrial_cluster_area | Industrial Cluster / Area | keyword | 
| state | State | keyword | 
| cepi_score_2009 | CEPI SCORE-2009 | double | 
| cepi_score_2011 | CEPI SCORE-2011 | double | 
| cepi_score_2013 | CEPI SCORE-2013 | double | 
| resource_uuid | resource_uuid | keyword | 
The id of these fields is going to be useful while querying the data.
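As a small illustration of going from the human-readable names to the ids, the sketch below rebuilds the table above by hand as a data.frame (a stand-in for what the package would return) and looks up ids with base R only.

```r
# Field table copied from above; a stand-in for the package's output.
fields <- data.frame(
  id = c("document_id", "status_of_moratorium", "industrial_cluster_area",
         "state", "cepi_score_2009", "cepi_score_2011", "cepi_score_2013",
         "resource_uuid"),
  name = c("document_id", "Status of Moratorium", "Industrial Cluster / Area",
           "State", "CEPI SCORE-2009", "CEPI SCORE-2011", "CEPI SCORE-2013",
           "resource_uuid"),
  type = c("double", "keyword", "keyword", "keyword",
           "double", "double", "double", "keyword"),
  stringsAsFactors = FALSE
)

# Translate display names into the ids used when querying.
wanted <- c("State", "CEPI SCORE-2013")
wanted_ids <- fields$id[match(wanted, fields$name)]
wanted_ids

# List the ids of all fields of type "double".
double_ids <- fields$id[fields$type == "double"]
double_ids
```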
The function get_api_data is the powerhouse of this package. By utilizing the data.frame structure of the underlying data, it goes beyond what a manually constructed API query can do: it allows the user to filter, sort, select variables and decide how much of the data to extract. The website itself can filter on only one field with one value at a time, whereas a single command through the wrapper can make multiple requests and append their results.
But before we dive into data extraction, we first need to validate the API key received from data.gov.in. To get the key, you need to register first and then get the key from your “My Account” page after logging in. More instructions can be found in the official guide. Once you have your API key, you can validate it as follows (this is only needed once per session):
##Using a sample key
register_api_key("579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b")
#> Connected to the internet
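#> The API key is valid and you won't have to set it again
Once your key is registered, you are ready to extract data from a chosen API with get_api_data. Its arguments, as they appear in the example further down, are api_index (the index_name of the API), results_per_req (how many results to request at a time), filter_by (field ids and the values to keep), field_select (field ids to return) and sort_by (field ids to sort by).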
To recap: first find the API you want using the search functions, get its index_name from the results, optionally take a look at the fields present in its data, and then use the get_api_data function to extract the data. Suppose we choose the API “Real time Air Quality Index from various location” with index_name 3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69. First we will look at which fields are available to construct the right query.
Suppose we want data for only two cities, Chandigarh and Gurugram, and two pollutants, PM10 and NO2. We will let all fields (dataset columns) be returned.
We will use a sample key from the website for this demonstration.
register_api_key("579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234")
#> Connected to the internet
#> The API key is valid and you won't have to set it again
We now look at the fields available to play with.
| id | name | type | 
|---|---|---|
| document_id | document_id | double | 
| id | id | double | 
| country | country | keyword | 
| state | state | keyword | 
| city | city | keyword | 
| station | station | keyword | 
| last_update | last_update | date | 
| pollutant_id | pollutant_id | keyword | 
| pollutant_min | pollutant_min | double | 
| pollutant_max | pollutant_max | double | 
| pollutant_avg | pollutant_avg | double | 
| pollutant_unit | pollutant_unit | keyword | 
| resource_uuid | resource_uuid | keyword | 
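Since filters refer to fields by id, a misspelled id is easy to miss. The sketch below shows one way to check filter names against the table above before sending a query. `unknown_fields` is a hypothetical helper, not a datagovindia function, and the filter vectors follow the named-vector form that filter_by takes in the query further down.

```r
# Field ids copied from the table above.
field_ids <- c("document_id", "id", "country", "state", "city", "station",
               "last_update", "pollutant_id", "pollutant_min", "pollutant_max",
               "pollutant_avg", "pollutant_unit", "resource_uuid")

# Hypothetical helper (not part of datagovindia): returns the filter names
# that do not match any known field id.
unknown_fields <- function(filter_by, known_ids) {
  setdiff(names(filter_by), known_ids)
}

good <- c(city = "Gurugram,Chandigarh", pollutant_id = "PM10,NO2")
bad  <- c(city = "Gurugram,Chandigarh", polutant_id = "PM10,NO2")  # misspelled id

unknown_fields(good, field_ids)  # character(0)
unknown_fields(bad, field_ids)   # "polutant_id"
```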
We accordingly select the city and pollutant_id fields for constructing our query. Note that only the field id is used. Finally, we query the data:
get_api_data(api_index = "3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69",
             results_per_req = 10,
             filter_by = c(city = "Gurugram,Chandigarh",
                           pollutant_id = "PM10,NO2"),
             field_select = c(),
             sort_by = c("state", "city"))
#> Connected to the internet
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Gurugram&filters[pollutant_id]=PM10
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Chandigarh&filters[pollutant_id]=PM10
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Gurugram&filters[pollutant_id]=NO2
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Chandigarh&filters[pollutant_id]=NO2
#> gave the API a rest
| id | country | state | city | station | last_update | pollutant_id | pollutant_min | pollutant_max | pollutant_avg | pollutant_unit | 
|---|---|---|---|---|---|---|---|---|---|---|
| 432 | India | Haryana | Gurugram | Sector-51, Gurugram - HSPCB | 04-04-2021 07:00:00 | PM10 | 116 | 412 | 238 | NA | 
| 439 | India | Haryana | Gurugram | Teri Gram, Gurugram - HSPCB | 04-04-2021 07:00:00 | PM10 | 7 | 100 | 40 | NA | 
| 103 | India | Chandigarh | Chandigarh | Sector-25, Chandigarh - CPCC | 04-04-2021 07:00:00 | PM10 | 88 | 130 | 106 | NA | 
| 433 | India | Haryana | Gurugram | Sector-51, Gurugram - HSPCB | 04-04-2021 07:00:00 | NO2 | 15 | 18 | 17 | NA | 
| 440 | India | Haryana | Gurugram | Teri Gram, Gurugram - HSPCB | 04-04-2021 07:00:00 | NO2 | 9 | 17 | 12 | NA | 
| 446 | India | Haryana | Gurugram | Vikas Sadan, Gurugram - HSPCB | 04-04-2021 07:00:00 | NO2 | 4 | 115 | 70 | NA | 
| 104 | India | Chandigarh | Chandigarh | Sector-25, Chandigarh - CPCC | 04-04-2021 07:00:00 | NO2 | 12 | 97 | 38 | NA |
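The request log above suggests that a filter with several comma-separated values is expanded into one request per combination of values. The sketch below reproduces that URL pattern in base R. It is inferred from the log only and is not the package's actual implementation: `expand_filters` and `build_urls` are hypothetical helpers, `YOUR_API_KEY` is a placeholder, and values are not URL-encoded.

```r
# Hypothetical helper: expands a named vector of comma-separated filter
# values into a data.frame with one row per combination of values.
expand_filters <- function(filter_by) {
  values <- lapply(filter_by, function(x) {
    trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  })
  expand.grid(values, stringsAsFactors = FALSE)
}

# Hypothetical helper: builds one request URL per filter combination,
# following the pattern visible in the log above.
build_urls <- function(api_index, filter_by, api_key, limit = 10, offset = 0) {
  combos <- expand_filters(filter_by)
  base <- sprintf(
    "https://api.data.gov.in/resource/%s?api-key=%s&format=json&offset=%d&limit=%d",
    api_index, api_key, offset, limit
  )
  vapply(seq_len(nrow(combos)), function(i) {
    filters <- paste0("&filters[", names(combos), "]=", unlist(combos[i, ]),
                      collapse = "")
    paste0(base, filters)
  }, character(1))
}

urls <- build_urls(
  api_index = "3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69",
  filter_by = c(city = "Gurugram,Chandigarh", pollutant_id = "PM10,NO2"),
  api_key   = "YOUR_API_KEY"
)
length(urls)  # 4: two cities times two pollutants
urls[1]
```

With two values for each of two fields, four URLs are produced, in the same order as the requests in the log. The number of requests grows multiplicatively with each additional multi-valued filter.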