epitweetr: user documentation

European Centre for Disease Prevention and Control (ECDC)

Description

The epitweetr package allows you to automatically monitor trends of tweets by time, place and topic. This automated monitoring aims at the early detection of public health threats through the detection of signals (e.g. an unusual increase in the number of tweets for a specific time, place and topic). The epitweetr package was designed to focus on infectious diseases, and it can be extended to all hazards or other fields of study by modifying the topics and keywords.

The general principle behind epitweetr is that it collects tweets and related metadata from the Twitter Standard API versions 1.1 (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview) and 2 (https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) according to specified topics and stores these tweets on your computer in a database that can be used to calculate statistics or as a search engine. epitweetr geolocalises the tweets and collects information on keywords, URLs and hashtags within a tweet, as well as entities and context detected by the Twitter API version 2. Tweets are aggregated according to topic and geographical location. Next, a signal detection algorithm identifies when the number of tweets (by topic and geographical location) exceeds what is expected for a given day. If the number of tweets exceeds what is expected, epitweetr sends out email alerts to notify those who need to further investigate these signals following the epidemic intelligence processes (filtering, validation, analysis and preliminary assessment).

The package includes an interactive web application (Shiny app) with six pages: the dashboard, where a user can visualise and explore tweets (Fig 1), the alerts page, where you can view the current alerts and train machine learning models for alert classification on user defined categories (Fig 2), the geotag page, where you can evaluate the geolocation algorithm and provide annotations for improving its performance (Fig 3), the data protection page, where the user can search, anonymise and delete tweets from the epitweetr database to support data deletion requests (Fig 4), the configuration page, where you can change settings and check the status of the underlying processes (Fig 5), and the troubleshoot page, with automatic checks and hints for using epitweetr with all its functionalities (Fig 6).

On the dashboard, you can view the aggregated number of tweets over time, the location of these tweets on a map and the most frequent elements found in or extracted from these tweets (words, hashtags, URLs, contexts and entities). These visualisations can be filtered by the topic, location and time period you are interested in. Other filters are available and include the possibility to adjust the time unit of the timeline, whether retweets/quotes should be included, what kind of geolocation types you are interested in, the sensitivity of the prediction interval for the signal detection, and the number of days used to calculate the threshold for signals. This information is also downloadable directly from this interface in the form of data, pictures, and/or reports.

More information on the methodology used is available in the epitweetr peer-reviewed publication. You can also visit the general post in the discussion forum of the GitHub epitweetr repository for additional materials and training.

Shiny app dashboard:

Fig 1: Shiny app dashboard figure

Shiny app alerts page:

Fig 2: Shiny app alerts page

Shiny app geotag evaluation page:

Fig 3: Shiny app geotag evaluation page

Shiny app data protection page:

Fig 4: Shiny app data protection page

Shiny app configuration page:

Fig 5: Shiny app configuration page

Shiny app troubleshoot page:

Fig 6: Shiny app troubleshoot page

Background

Epidemic Intelligence at ECDC

Article 3 of the European Centre for Disease Prevention and Control (ECDC) founding regulation and the Decision No 1082/2013/EU on serious cross-border threats to health have established the detection of public health threats as a core activity of ECDC.

ECDC performs Epidemic Intelligence (EI) activities aiming at rapidly detecting and assessing public health threats, focusing on infectious diseases, to ensure the EU’s health security. ECDC uses social media as one of its sources for the early detection of signals of public health threats. Until 2020, the monitoring of social media was mainly performed through the screening and analysis of posts from pre-selected experts or organisations, mainly on Twitter and Facebook.

More information and an online tutorial are available:

EI sources

EI tutorial

Objectives of epitweetr

The primary objective of epitweetr is to use the Twitter Standard Search API version 1.1 and Twitter Recent Search API version 2 in order to detect early signals of potential threats by topic and by geographical unit.

Its secondary objective is to enable the user through an interactive web interface to explore the trend of tweets by time, geographical location and topic, including information on top words and numbers of tweets from trusted users, using charts and tables.

Repository of epitweetr material and training

More information on epitweetr is available in the epitweetr GitHub discussions. This post contains a summary of links and materials of relevance for new users.

Hardware requirements

The minimum and suggested hardware requirements for the computer are in the table below:

Hardware requirements | Minimum | Suggested
RAM needed | 8 GB | 16 GB
CPU needed | 4 cores | 12 cores
Space needed for 3 years of storage | 3 TB | 5 TB

The CPU and RAM usage can be configured in the Shiny app configuration page (see section The interactive user application (Shiny app)>The configuration page). The RAM, CPU and space needed may depend on the amount and size of the topics you request in the collection process.

Installation

epitweetr is designed to be platform independent, working on Windows, Linux and Mac. We recommend that you use epitweetr on a computer that can be run continuously. You can switch the computer off, but you may miss some tweets if the downtime is long enough, which will have implications for the alert detection.

If you need to upgrade or reinstall epitweetr after activating its tasks, you must stop the tasks from the Shiny app or restart the machine running epitweetr first.

You can find below a summary of the steps required to install epitweetr. Further detailed information is available in the corresponding sections.

  1. Ensure all pre-requisites are installed
  2. Install epitweetr (CRAN version or different version using tar.gz file)
  3. Select the folder (or create a new folder) for epitweetr
  4. Launch the epitweetr Shiny app (ensure to indicate the full path to your data directory)
  5. Check the troubleshoot page
  6. Modify the parameters in the configuration page as needed. The following must be set up by the user to enable all functionalities: Twitter credentials, SMTP settings for sending alert and status emails, and the list of subscribers. The remaining parameters have default values that can be modified by the user if needed. Always remember to save settings.
  7. Activate ‘Requirements & alerts’ pipeline in the configuration page
  8. When requested in the dependencies task, activate ‘epitweetr database’
  9. After the task languages is completed, activate ‘Data collection & processing’
  10. The alerts task may show an error if tweets have not been aggregated yet. Wait a few minutes and click on ‘Run alerts’

Prerequisites for running epitweetr

Before using epitweetr, the following items need to be installed:

Prerequisites for some of the functionalities in epitweetr

Extra prerequisites for R developers

If you would like to develop epitweetr further, then the following development tools are needed:

External dependencies

epitweetr will need to download some dependencies in order to work. The tool will do this automatically the first time the alert detection process is launched. The Shiny app configuration page will allow you to change the target URLs of these dependencies, which are the following:

Please note that during the dependencies download, you will be prompted first to stop the embedded database and then to enable it again. If you are on Windows and you have activated the tasks using the ‘activate’ buttons on the configuration page, you can perform these tasks by disabling and enabling the tasks in the ‘Windows Task Scheduler’. For more information see the section ‘Setting up tweet collection and the alert detection loop’.

Installing epitweetr from CRAN

After installing all required dependencies listed in the section “Prerequisites for running epitweetr”, you can install epitweetr:

install.packages("epitweetr")

Environment variables

Additionally, the R environment needs to know where the Java installation home is. To check this, type in the R console:

Sys.getenv("JAVA_HOME")

If the command returns an empty string (""), then you will need to set the Java Home environment variable (JAVA_HOME) for your operating system (OS). Please see your specific OS instructions. In some cases, epitweetr can work without setting the Java Home environment variable.
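As a minimal sketch of the check described above, the following base R snippet tests whether JAVA_HOME is set and, if it is not, sets it for the current R session only. The helper names env_is_set and ensure_java_home are hypothetical (they are not part of epitweetr), and the path in the usage comment is a placeholder that you would replace with your own Java installation directory.

```r
# Returns TRUE when an environment variable is set to a non-empty value.
env_is_set <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

# Sets JAVA_HOME for the current R session only if it is not already set.
# This does not change the operating system configuration.
ensure_java_home <- function(java_home) {
  if (!env_is_set("JAVA_HOME")) {
    Sys.setenv(JAVA_HOME = java_home)
  }
  Sys.getenv("JAVA_HOME")
}

# Usage (placeholder path, replace with your own Java home):
# ensure_java_home("/path/to/your/java/home")
```

A session-level setting such as this one is lost when R is closed, so a permanent setting through your operating system remains the more robust option.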

The first time you run the application, if the tool cannot identify a secure password store provided by the operating system, you will see a pop-up window requesting a keyring password (Linux and Mac). This is a password necessary for storing encrypted Twitter credentials. Please choose a strong password and remember it. You will be asked for this password each time you run the tool. You can avoid this by setting a system environment variable named ecdc_twitter_tool_kr_password containing the chosen password.

Launching the epitweetr Shiny app

You can launch the epitweetr Shiny app from the R session by typing the following in the R console. Replace “data_dir” with the designated data directory (full path), which is a local folder you choose to store tweets, time series and configuration files in:

library(epitweetr)
epitweetr_app("data_dir")

Please note that the data directory entered in R should have ‘/’ instead of ‘\’ (an example of a correct path would be ‘C:/user/name/Documents’). This applies especially in Windows if you copy the path from the File Explorer.
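If you copy a path from the Windows File Explorer, a small helper can convert the backslashes for you. The sketch below uses only base R; the function name to_forward_slashes is hypothetical and the path is the illustrative one from the paragraph above.

```r
# Convert Windows-style backslashes to forward slashes.
# fixed = TRUE treats the pattern as a literal backslash, not a regex.
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# In R source code a literal backslash must be written as "\\".
data_dir <- to_forward_slashes("C:\\user\\name\\Documents")
print(data_dir)
```

On Windows, the base R function normalizePath() with winslash = "/" is an alternative for paths that already exist.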

Alternatively, you can use a launcher: In an executable .bat or shell file type the following (replacing “data_dir” with the designated data directory):

R --vanilla -e "epitweetr::epitweetr_app('data_dir')"

You can check that all requirements are properly installed in the troubleshoot page. More information is available in section The interactive user application (Shiny app)>Dashboard:The interactive user interface for visualisation>The troubleshoot page

Migrating to epitweetr v2

Migrating epitweetr from previous versions (before January 2022) to version 2.0.0 or higher is possible without any data loss. In this section, we will describe the necessary steps to perform the migration.

This migration is not necessary if you are installing epitweetr for the first time.

In epitweetr v2, we redesigned the way tweets and series are stored. In previous versions, tweets were saved as compressed JSON files and series as RDS data frames in the ‘tweets’ and ‘series’ folders, respectively. In epitweetr v2 or higher, we have moved to a different storage system allowing epitweetr to work as a search engine and allowing efficient updates, deletions and faster aggregation of data. To do so, data is stored using Apache Lucene indexes in the ‘fs’ folder. Note that during migration, Twitter data are moved to the ‘fs’ folder and series are left as they are. epitweetr reports will combine data from the old and new storage systems.

If you have an existing installation that contains data in the previous format, you have to migrate it following the steps detailed in this section. This applies to any epitweetr version before v2.0.0. You can also check this by looking in ‘tweets/geo’ or ‘tweets/search’ folders. If there is a json.gz file, migration is needed.
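The folder check described above can be sketched in base R as follows. The function name needs_migration is hypothetical (it is not an epitweetr function), and searching the sub-folders recursively is an assumption about how the legacy files are laid out.

```r
# Returns TRUE if legacy compressed JSON files are found under
# <data_dir>/tweets/geo or <data_dir>/tweets/search.
# Assumption: legacy files may sit in sub-folders, hence recursive = TRUE.
needs_migration <- function(data_dir) {
  legacy_dirs <- file.path(data_dir, "tweets", c("geo", "search"))
  legacy_files <- list.files(
    legacy_dirs,
    pattern = "\\.json\\.gz$",
    recursive = TRUE
  )
  length(legacy_files) > 0
}

# Usage (replace with the full path to your data directory):
# needs_migration("C:/user/name/Documents")
```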

The migration steps are the following:

Setting up tweet collection and the alert detection loop

In order to use epitweetr, you will need to collect and process tweets, and run the ‘epitweetr database’ and ‘Requirements & alerts’ pipelines. Further details are also available in subsequent sections of this user documentation. A summary of the steps needed is as follows:

library(epitweetr)
epitweetr_app("data_dir")

# epitweetr database
library(epitweetr)
fs_loop("data_dir")

# Data collection & processing
library(epitweetr)
search_loop("data_dir")

# epitweetr database
library(epitweetr)
fs_loop("data_dir")

# Requirements & alerts
library(epitweetr)
detect_loop("data_dir")

For more details, you can go through the section How does it work? General architecture behind epitweetr, which describes the underlying processes behind the tweet collection and the signal detection. Also, the section “The interactive Shiny application (Shiny app)>The configuration page” describes the different settings available in the configuration page.

How does it work? General architecture behind epitweetr

The following sections describe in detail the above general principles. The settings of many of these elements can be configured in the Shiny app configuration page, which is explained in the section The interactive Shiny application (Shiny app)>The configuration page.

Collection of tweets

Use of the Twitter Standard Search API version 1.1 and Twitter Recent Search API version 2

epitweetr uses the Twitter Standard Search API version 1.1 and/or Twitter Recent Search API version 2. The advantage of these APIs is that these are a free service provided by Twitter enabling users of epitweetr to access tweets free of charge. The search API is not meant to be an exhaustive source of tweets. It searches against a sample of recent tweets published in the past 7 days and it focuses on relevance and not completeness. This means that some tweets and users may be missing from search results.

While this may be a limitation in other fields of public health or research, the epitweetr development team believes that for the objective of signal detection a sample of tweets is sufficient to detect potential threats of importance in combination with other types of sources.

Other attributes of the Twitter Standard Search API version 1.1 include:

  • Only tweets from the last 5–8 days are indexed by Twitter

  • A maximum of 180 requests every 15 minutes are supported by the Twitter Standard Search API (450 requests every 15 minutes if you are using the Twitter developer app credentials; see next section)

  • Each request returns a maximum of 100 tweets and/or retweets

Other attributes of the Twitter Recent Search API version 2 include:

  • Only tweets from the last 7 days are indexed by Twitter

  • A maximum of 300 requests every 15 minutes are supported

  • Each request returns a maximum of 100 tweets and/or retweets

  • 500,000 tweets per month at the Essential access level. You can upgrade for free to the Elevated access level, which allows up to 2 million tweets per month.

If you are using both endpoints, epitweetr will alternate between them when the limits are reached.
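To put the limits listed above into perspective, the following base R sketch multiplies the number of requests allowed per 15-minute window by the maximum number of tweets per request. The figures are the ones quoted in this section; the resulting values are theoretical upper bounds, since a request may return fewer than 100 tweets.

```r
# Request limits per 15-minute window, as quoted in this section.
limits <- data.frame(
  endpoint = c("v1.1, Twitter account",
               "v1.1, developer app",
               "v2, developer app"),
  requests_per_window = c(180, 450, 300),
  tweets_per_request = c(100, 100, 100),
  stringsAsFactors = FALSE
)

# Theoretical maximum number of tweets per 15-minute window.
limits$max_tweets_per_window <-
  limits$requests_per_window * limits$tweets_per_request

# There are 24 * 60 / 15 = 96 windows in a day.
limits$max_tweets_per_day <- limits$max_tweets_per_window * (24 * 60 / 15)

print(limits)
```

For version 2, the monthly tweet cap of the access level applies in addition to these per-window limits.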

Twitter authentication

You can authenticate the collection of tweets by using a Twitter account (this approach uses the rtweet package app) or by using a Twitter application. For the latter, you will need a Twitter developer account, which can take some time to obtain, due to verification procedures. We recommend using a Twitter account via the rtweet package for testing purposes and short-term use, and the Twitter developer application for long-term use.

  • Using a Twitter account: delegated via rtweet (user authentication)

    • You will need a Twitter account (username and password)

    • The rtweet package will send a request to Twitter, so it can access your Twitter account on your behalf

    • A pop-up window will appear where you can enter your Twitter username and password to confirm that the application can access Twitter on your behalf. The resulting token is sent each time you access tweets. If you are already logged in to Twitter, this pop-up window may not appear and the credentials of the ‘active’ Twitter account on the machine are used automatically

    • You can only use Twitter API version 1.1

  • Using a Twitter developer app: via epitweetr (app authentication)

    • You will need to create a Twitter developer account, if you have not created it yet: [https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api]

    • Follow the instructions and answer the questions to activate the Twitter API v2 using Essential or Elevated access.

    • Next, you need to create a project and an associated developer app during the onboarding process, which will provide you a set of credentials that you will use to authenticate all requests to the API.

    • Save your OAuth settings

      • Add them to the configuration page in the Shiny app (see image below)

      • With this information, epitweetr can request a token at any time directly to Twitter. The advantage of this method is that the token is not connected to any user information and tweets are returned independently of any user context.

      • With this app, you can perform 450 requests every 15 minutes instead of the 180 requests every 15 minutes that authenticating using Twitter account allows.

      • You can activate Twitter API version 2 in the configuration page

      • If you have rtweet 1.0.2+, you will need to enter your bearer token. For previous versions the information to enter is: App Name, API key, API key secret, access token and access token secret

Topics and tweet collection queries

After the Twitter authentication, you need to specify a list of topics in epitweetr to indicate which tweets to collect. For each topic, you have one or more queries that epitweetr uses to collect the relevant tweets (e.g. several queries for a topic using different terminology and/or languages).

A query consists of keywords and operators that are used to match tweet attributes. Keywords separated by a space indicate an AND clause. You can also use an OR operator. A minus sign before the keyword (with no space between the sign and the keyword) indicates the keyword should not be in the tweet attributes. While queries can be up to 512 characters long, best practice is to limit your query to 10 keywords and operators and limit complexity of the query, meaning that sometimes you need more than one query per topic. If a query surpasses this limit, it is recommended to split the topic in several queries.
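The length and complexity guidance above can be checked before adding a query to a topic. The sketch below uses only base R; the function name check_query and the example queries are hypothetical, and the default limits are the 512 characters and 10 keywords and operators mentioned above.

```r
# Counts characters and space-separated terms (keywords and operators)
# in a query and compares them with the recommended limits.
check_query <- function(query, max_chars = 512, max_terms = 10) {
  terms <- strsplit(trimws(query), "\\s+")[[1]]
  list(
    n_chars = nchar(query),
    n_terms = length(terms),
    within_char_limit = nchar(query) <= max_chars,
    within_term_limit = length(terms) <= max_terms
  )
}

# "measles" AND "outbreak", excluding tweets that contain "movie".
and_query <- check_query("measles outbreak -movie")

# "ebola" OR "marburg"; the OR operator counts as a term.
or_query <- check_query("ebola OR marburg")

str(and_query)
str(or_query)
```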

epitweetr comes with a default list of topics as used by the ECDC Epidemic Intelligence team. You can view details of the list of topics in the Shiny app configuration page (see screenshot below). In addition, the colour coding in the downloadable file allows users to see if the query for a topic is too long (red colour) and the topic should be split in several queries.

In the configuration page, you can also download the list of topics, modify and upload it to epitweetr. The new list of topics will then be used for tweet collection and visible in the Shiny app. The list of topics is an Excel file (*.xlsx) as it handles user-specific regional settings (e.g. delimiters) and special characters well. You can create your own list of topics and upload it too, noting that the structure should include at least:

  • The name of the topic, with the header “Topic” in the Excel spreadsheet. This name should include alphanumeric characters, spaces, dashes and underscores only. Note that it should start with a letter.

  • The query, with the header “Query” in the Excel spreadsheet. This is the query epitweetr uses in its requests to obtain tweets from the Twitter Standard Search API. See above for syntax and constraints of queries.

The topics.xlsx file additionally includes the following fields:

  • An ID, with the header “#” in the Excel spreadsheet, noting a running integer identifier for the topic.

  • A label, with the header “Label” in the Excel spreadsheet, which is displayed in the drop-down topic menu of the Shiny app tabs.

  • An alpha parameter, with the header “Signal alpha (FPR)” in the Excel spreadsheet. FPR stands for “false positive rate”. Increasing the alpha will decrease the threshold for signal detection, resulting in an increased sensitivity and possibly obtaining more signals. Setting this alpha can be done empirically and according to the importance and nature of the topic.

  • “Length_charact” is an automatically generated field that calculates the length of all characters used in the query. This field is helpful as a request should not exceed 500 characters.

  • “Length_word” indicates the number of words used in a request, including operators. Best practice is to limit your number of keywords to 10.

  • An alpha parameter, with the header “Outlier alpha (FPR)” in the Excel spreadsheet. FPR stands for “false positive rate”. This alpha sets the false positive rate for determining what an outlier is when downweighting previous outliers/signals. The lower the value, the fewer previous outliers will potentially be included. A higher value will potentially include more previous outliers.

  • “Rank” is the number of queries per topic

When uploading your own file, please modify the topic and query fields, but do not modify the column titles.
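Before uploading your own list, it can help to check the two required columns and the naming rule for topics. The following base R sketch works on a data frame standing in for the spreadsheet content; the topics, queries and the helper valid_topic_name are hypothetical examples, and reading or writing the Excel file itself is left out.

```r
# A data frame standing in for the content of a topics spreadsheet.
topics <- data.frame(
  Topic = c("Measles", "West Nile virus", "2019-nCoV"),
  Query = c("measles outbreak", "west nile virus", "coronavirus -movie"),
  stringsAsFactors = FALSE
)

# Both required headers must be present.
has_required_columns <- all(c("Topic", "Query") %in% names(topics))

# Topic names: start with a letter, then only letters, digits, spaces,
# dashes and underscores.
valid_topic_name <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9 _-]*$", x)
}

topics$valid_name <- valid_topic_name(topics$Topic)
print(has_required_columns)
print(topics)
```

In this example the third topic is flagged as invalid because its name starts with a digit.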

Scheduled plans to collect tweets

As a reminder, epitweetr is scheduled to make 180 requests (queries) to the Twitter API every 15 minutes with the user authentication; or 450 (v1.1) or 300 (v2) requests every 15 minutes if you are using Twitter developer app credentials, depending on the API version you use. Each request can return 100 tweets. The requests return tweets and retweets. These are returned in JSON format, which is a lightweight data format.

In order to collect the maximum number of tweets, given the API limitations, and in order for popular topics not to prevent other topics from being adequately collected, epitweetr uses “search plans” for each query.

The first “search plan” for a query will collect tweets from the current date-time backwards until 7 days (7 days because of the Standard Search API limitation) before the current “search plan” was implemented. The first “search plan” is the biggest, as no tweets have been collected so far.

All subsequent “search plans” are done in scheduled intervals that are set up in the configuration page of the epitweetr Shiny app (see section The interactive Shiny app > the configuration page > General). For illustration purposes, let us consider the search plans are scheduled at four-hour intervals. The plans collect tweets for a specific query from the current date-time back until four hours before the date-time when the current “search plan” is implemented (see image below). epitweetr will make as many requests (each returning up to 100 tweets) during the four-hour interval as needed to obtain all tweets created within that four-hour interval.

For example, if the “search plan” begins at 4 am on the 10th of November 2021, epitweetr will launch requests for tweets corresponding to its queries for the four-hour period from 4 am back to midnight (00:00) on the 10th of November 2021. epitweetr starts by collecting the most recent tweets (the ones from 4 am) and continues backwards. If the API returns no more results for the period between 4 am and midnight, the “search plan” for this query is considered completed.
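The time window of this example can be reproduced with base R date-time arithmetic. This is only an illustration of the four-hour window, not epitweetr's internal code; the variable names, the UTC time zone and the count of 2,500 tweets are assumptions made for the sketch.

```r
# The plan starts at 4 am and looks back over a four-hour interval.
plan_start <- as.POSIXct("2021-11-10 04:00:00", tz = "UTC")
interval_hours <- 4

# Subtracting seconds from a POSIXct value moves backwards in time.
window_oldest <- plan_start - interval_hours * 3600

format(plan_start, "%Y-%m-%d %H:%M")
format(window_oldest, "%Y-%m-%d %H:%M")

# Each request returns up to 100 tweets, so collecting n tweets
# takes at least ceiling(n / 100) requests.
requests_needed <- function(n_tweets, tweets_per_request = 100) {
  ceiling(n_tweets / tweets_per_request)
}
requests_needed(2500)
```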

However, if topics are very popular (e.g. COVID-19 in 2020 and 2021), then the “search plan” for a query in a given four-hour window may not be completed. If this happens, epitweetr will move on to the “search plans” for the subsequent four-hour window, and put any previous incomplete “search plan” in a queue to execute when “search plans” for this new four-hour window are completed.

Each “search plan” stores the following information:

Field | Type | Description
expected_end | Timestamp | End DateTime of the current search window
scheduled_for | Timestamp | The scheduled DateTime for the next request. On plan creation this will be the current DateTime and after each request this value will be set to a future DateTime. To establish the future DateTime, the application will estimate the number of requests necessary to finish. If it estimates that N requests are necessary, the next schedule will be in 1/N of the remaining time.
start_on | Timestamp | The DateTime when the first request of the plan was finished
end_on | Timestamp | The DateTime when the last request of the plan was finished if that request reached a 100% plan progress.
max_id | Long | The max Twitter id targeted by this plan, which will be defined after the first request
since_id | Long | The last tweet id returned by the last request of this plan. The next request will start collecting tweets before this value. This value is updated after each request and allows the Twitter API to return tweets before min_time(pi)
since_target | Long | If a previous plan exists, this value stores the first tweet id that was downloaded for that plan. The current plan will not collect tweets before that id. This value allows the Twitter API to return tweets after pi-time_back
requests | Int | Number of requests performed as part of the plan
progress | Double | Progress of the current plan as a percentage. It is calculated as (current$max_id - current$since_id)/(current$max_id - current$since_target). If the Twitter API returns no tweets the progress is set to 100%. This only applies for non error responses containing an empty list of tweets.
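The progress formula and the scheduling rule from the table can be written out as a short base R sketch. The function names are hypothetical, the ids are small illustrative numbers (real tweet ids are 64-bit integers that R's numeric type cannot represent exactly), and measuring the remaining time up to expected_end is an assumption about how "remaining time" is defined.

```r
# Progress as a fraction between 0 and 1, following the formula in the table.
plan_progress <- function(max_id, since_id, since_target) {
  (max_id - since_id) / (max_id - since_target)
}

# Next schedule: if N requests are estimated to remain, the next request
# is scheduled in 1/N of the remaining time.
# Assumption: the remaining time runs from now until expected_end.
next_schedule <- function(now, expected_end, requests_left) {
  remaining_secs <- as.numeric(difftime(expected_end, now, units = "secs"))
  now + remaining_secs / requests_left
}

# Illustrative plan: ids 1000 down to 200, currently at 600.
progress <- plan_progress(max_id = 1000, since_id = 600, since_target = 200)

now <- as.POSIXct("2021-11-10 04:00:00", tz = "UTC")
expected_end <- now + 4 * 3600
scheduled_for <- next_schedule(now, expected_end, requests_left = 8)

print(progress)
format(scheduled_for, "%Y-%m-%d %H:%M")
```

With four hours remaining and eight requests left, the next request in this example is scheduled 30 minutes later.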

epitweetr will execute plans according to these rules:

  • epitweetr will detect the newest unfinished plan for each search query with the scheduled_for variable located in the past.

  • epitweetr will execute the plans with the minimum number of requests already performed. This ensures that all scheduled plans perform the same number of requests.

  • As a result of the two previous rules, requests for topics under the 180 limit of the Twitter Standard Search API (or 450 if you are using Twitter developer app authentication) will be executed first and will produce higher progress than topics over the limit.
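The first two rules above can be illustrated on a small table of plans. This is a sketch under stated assumptions, not epitweetr's implementation: the function next_plans and the example data are hypothetical, progress is stored as a fraction (1 means finished), and the "newest" plan of a query is taken to be the one with the latest expected_end.

```r
ts <- function(x) as.POSIXct(x, tz = "UTC")

# Hypothetical plans: two for "covid" (one old, one new), one each otherwise.
plans <- data.frame(
  query = c("covid", "covid", "measles", "ebola"),
  expected_end = ts(c("2021-11-10 04:00:00", "2021-11-10 08:00:00",
                      "2021-11-10 08:00:00", "2021-11-10 08:00:00")),
  scheduled_for = ts(c("2021-11-10 03:50:00", "2021-11-10 07:55:00",
                       "2021-11-10 07:50:00", "2021-11-10 09:00:00")),
  progress = c(0.2, 0.1, 0.5, 0.3),
  requests = c(40, 12, 2, 1),
  stringsAsFactors = FALSE
)

next_plans <- function(plans, now) {
  # Unfinished plans whose scheduled_for lies in the past.
  due <- plans[plans$progress < 1 & plans$scheduled_for <= now, , drop = FALSE]
  if (nrow(due) == 0) return(due)
  # Rule 1: keep the newest due plan for each query.
  due <- due[order(due$expected_end, decreasing = TRUE), , drop = FALSE]
  newest <- due[!duplicated(due$query), , drop = FALSE]
  # Rule 2: execute the plan(s) with the fewest requests performed so far.
  newest[newest$requests == min(newest$requests), , drop = FALSE]
}

selected <- next_plans(plans, now = ts("2021-11-10 08:00:00"))
print(selected$query)
```

In this example the "ebola" plan is not yet due, the older "covid" plan is set aside in favour of the newer one, and the "measles" plan is selected because it has performed the fewest requests.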

The rationale behind this is that topics with such a large number of tweets that the 4-hour search window is not sufficient to collect them, are likely to already be a known topic of interest. Therefore, priority should be given to smaller topics and possibly less well-known topics.

An example is the COVID-19 pandemic in 2020. In early 2020, there was limited information available regarding COVID-19, which allowed detecting signals with meaningful information or updates (e.g. new countries reporting cases or confirming that it was caused by a coronavirus). However, throughout the pandemic, this topic became more popular and the broad topic of COVID-19 was not effective for signal detection and was taking up a lot of time and requests for epitweetr. In such a case it is more relevant to prioritise the collection of smaller topics such as sub-topics related to COVID-19 (e.g. vaccine AND COVID-19), or to make sure you do not miss other events with less social media attention.

If search plans cannot be finished, several search plans per query may be in a queue:

This design can have the drawback of slowing down the collection of big topics, since epitweetr tries to rebuild the last 7 days of history. If you are not interested in rebuilding history at a particular point in time, you can click on the “Dismiss past tweets” button, which will discard all previous/historical plans and start collecting new data.

Geolocation

In a parallel process to the collection of tweets, epitweetr attempts to geolocate all collected tweets using a supervised machine learning process. This process runs automatically after tweets are collected.

epitweetr stores two types of geolocation for a tweet: tweet location, which is geolocation information within the text of a tweet (or a retweeted or quoted tweet), and user location from the available metadata. For signal detection, the preferred location is used (i.e., tweet location) while in the dashboard both types can be visualised.

Geolocation based on tweet location

The tweet location is extracted and stored by epitweetr based on the geolocation information found within a tweet text. In case of a retweet or quoted tweet, it will extract the geolocation information from the original tweet