Load the sparseDFM package and the exports dataframe into R. We also require the gridExtra package for this vignette.
library(sparseDFM)
library(gridExtra)
data <- exports
This vignette provides a tutorial on how to apply the package sparseDFM to a large-scale data set for the purpose of nowcasting UK trade in goods. The data contains 445 columns, including 9 target series (UK exports of the 9 main commodities worldwide) and 434 monthly indicator series, and 226 rows representing monthly values from January 2004 to October 2022. For a small-scale example see the vignette inflation-example.
Nowcasting¹ is a method used in econometrics to estimate the current state of the economy from the most recent data available. It is an important tool because it allows policy makers and businesses to make informed decisions in real time, rather than relying on information that, owing to publication delays, may be outdated and no longer accurate. Trade in Goods (imports and exports) is currently published with a 2 month lag by the UK's Office for National Statistics (ONS), which is a long time to wait for a current assessment of trade, especially during times of economic uncertainty or instability. Nowcasting UK trade has become particularly important in recent years due to two key events: the Brexit referendum, held in 2016, and the coronavirus pandemic, which reached UK shores in early 2020. While the causes of these shocks are drastically different, both imposed restrictions on trade in goods.
We consider the task of understanding and nowcasting the movements of 9 monthly target series representing 9 of the main commodities the UK exports worldwide. These include:
- Food & live animals
- Beverages and tobacco
- Crude materials
- Fuels
- Animal and vegetable oils and fats
- Chemicals
- Material manufactures
- Machinery and transport
- Miscellaneous manufactures
These target series are released with a 2 month publication delay and hence the last two rows of the dataframe are missing for these variables. To estimate the targets in these months, we use a large collection of 434 monthly indicator series, including:
- Index of Production (IoP) series (89)
- Consumer price inflation (CPI) series (166)
- Producer price inflation (PPI) series (153)
- Exchange rate series (12)
- Business and consumer confidence indices, BCI & CCI (2)
- Google trends search words (14)
This vignette uses the sparseDFM() function to fit a regular DFM and a Sparse DFM to the entire dataset of January 2004 to October 2022, with the goal of estimating the missing target series data in September and October of 2022. We explore the plot() and predict() capabilities of the package and assess the benefit of a Sparse DFM in terms of interpreting factor structure and accuracy of predictions.
Before fitting any models, it is worthwhile to perform some exploratory data analysis to assess stationarity and missing data.
# Dimension of the data: n = 226, p = 445.
dim(data)
#> [1] 226 445
# Plot the 9 target series using ts.plot with a legend on the right
def.par <- par(no.readonly = TRUE) # store initial graphic parameters
goods <- data[,1:9]
layout(matrix(c(1,2), nrow = 1), widths = c(4,3))
par(mar=c(5,4,4,0))
ts.plot(goods, gpars= list(col=10:1,lty=1:10))
par(mar=c(5,0,4,2))
plot(c(0,1),type="n", axes=F, xlab="", ylab="")
legend("center", legend = colnames(goods), col = 10:1, lty = 1:10, cex = 0.7)
par(def.par) # reset graphic parameters to initial
This plot shows the monthly dynamics of UK exports worldwide for 9 categories of goods. Exports of machinery and transport are the largest. We also see two main drops, during the 2009 and 2020 recessions, and an upward trend over the past year or so.
The only missing data in the sample is at the end, during the months of September and October 2022, depending on the publication delays of the variables. We can see this ragged edge² structure at the end of the sample by zooming in on the past 12 months:
# last 12 months
data_last12 = tail(data, 12)
# Missing data plot. Too many variable names so use.names is set to FALSE for clearer output.
missing_data_plot(data_last12, use.names = FALSE)
We see the 2 month delay for the targets and IoP, the 1 month delay for CPI, PPI, exchange rates, BCI and CCI, and no delay for google trends. We hope to exploit this available data when predicting September and October 2022.
We first make the data stationary by simply taking first-differences like so:
# first-differences correspond to stationary_transform set to 2 for each series
new_data = transformData(data, stationary_transform = rep(2,ncol(data)))
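To make the transform concrete, here is a minimal sketch of what first-differencing does to a single series, assuming (as the comment above states) that code 2 denotes first-differencing; it matches base R's diff() padded with an NA for the first observation:

```r
# Toy series, purely for illustration
x <- c(100, 102, 101, 105)

# First differences: NA for the first observation, then x_t - x_{t-1}
dx <- c(NA, diff(x))
dx
```

The first value is lost to differencing, which is why a differenced sample of length n has n - 1 usable observations.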
We now tune for the number of factors to use:
tuneFactors(new_data)
#> Data contains missing values: imputing data with fillNA()
#> [1] "The chosen number of factors using criteria type 2 is 7"
According to the Bai and Ng (2002)³ information criteria, the best number of factors to use is 7. However, the scree plot suggests that, beyond 4 factors, additional factors explain little extra variance in the data. For this reason, we choose to use 4 factors when modelling.
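The scree-plot check behind this choice can be reproduced on any complete numeric matrix via principal components. A minimal sketch, using a synthetic matrix X purely for illustration:

```r
# Synthetic data standing in for the (imputed, scaled) indicator panel
set.seed(1)
X <- matrix(rnorm(200 * 20), 200, 20)

# Principal components of the standardised data
pcs <- prcomp(X, scale. = TRUE)

# Proportion of variance explained by each component
vars <- pcs$sdev^2 / sum(pcs$sdev^2)

# Scree plot: look for the "elbow" after which extra factors add little
plot(vars, type = "b", xlab = "Factor number",
     ylab = "Proportion of variance explained")
```

On real data the elbow location, not the formal criterion alone, often guides the final choice of r, as it does here.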
We now fit a regular DFM and a Sparse DFM to the data with 4 factors:
# Regular DFM fit - takes around 18 seconds
fit.dfm <- sparseDFM(new_data, r = 4, alg = 'EM')
# Sparse DFM fit - takes around 2 mins to tune
# set q = 9 as the first 9 variables (targets) should not be regularised
# L1 penalty grid set to logspace(0.4,1,15) after exploration
fit.sdfm <- sparseDFM(new_data, r = 4, q = 9, alg = 'EM-sparse', alphas = logspace(0.4,1,15))
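As a hedged note on the penalty grid: assuming logspace() follows the usual base-10 convention (as in MATLAB), logspace(0.4, 1, 15) is equivalent to:

```r
# 15 penalty values, evenly spaced on a log10 scale
alphas <- 10^seq(0.4, 1, length.out = 15)

round(range(alphas), 2)   # roughly 2.51 to 10
```

The optimal penalty selected below, 4.54, sits comfortably inside this range, which suggests the grid endpoints were chosen sensibly.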
We can explore the convergence and tuning of each algorithm like so:
# Number of iterations the DFM took to converge
fit.dfm$em$num_iter
#> [1] 14
# Number of iterations the Sparse DFM took to converge at each L1 norm penalty
fit.sdfm$em$num_iter
#> [1] 17 6 14 2 2 2 3 3 3 3 18 3 12 5 5
# Optimal L1 norm penalty chosen
fit.sdfm$em$alpha_opt
#> [1] 4.54091
# Plot of BIC values for each L1 norm penalty
plot(fit.sdfm, type = 'lasso.bic')
We first explore the estimated factors and loadings for the regular
DFM. We are able to group the indicator series into colours depending on
the source of the indicator and use the
type = "loading.grouplineplot"
setting in
plot()
. We set the trade in goods (TiG) target black, IoP
blue, CPI red, PPI pink, exchange rate (Exch) green, BCI & CCI
(Conf) navy and google trends (GT) brown. This will make it easier to
visualise which indicators are loading onto specific factors.
## Plot the estimated factors for the DFM
plot(fit.dfm, type = 'factor')
## Plot the estimated loadings for each of the 4 factors in a grid
# Specify the name of the group each indicator belongs to
groups = c(rep('TiG',9), rep('IoP',89), rep('CPI',166), rep('PPI',153),
           rep('Exch',12), rep('Conf',2), rep('GT',14))

# Specify the colours for each of the groups
group_cols = c('black','blue','red','pink','green','navy','brown')
# Plot the group lineplot in a 2 x 2 grid
p1 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 1, group.names = groups, group.cols = group_cols)
p2 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 2, group.names = groups, group.cols = group_cols)
p3 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 3, group.names = groups, group.cols = group_cols)
p4 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 4, group.names = groups, group.cols = group_cols)
grid.arrange(p1, p2, p3, p4, nrow = 2)
As all variables load onto all the factors in a regular DFM, it is very difficult to interpret the factor structure from these loading plots. The loadings from all groups are quite large in every factor, making it impossible to conclude which data groups are related to specific factors. For greater interpretability we can fit a Sparse DFM instead, hoping to induce sparsity in the loadings. Let us now observe the factors and loading structure of the Sparse DFM:
## Plot the estimated factors for the Sparse DFM
plot(fit.sdfm, type = 'factor')
## Plot the estimated loadings for each of the 4 factors in a grid
# Plot the group lineplot in a 2 x 2 grid
p1 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 1, group.names = groups, group.cols = group_cols)
p2 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 2, group.names = groups, group.cols = group_cols)
p3 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 3, group.names = groups, group.cols = group_cols)
p4 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 4, group.names = groups, group.cols = group_cols)
grid.arrange(p1, p2, p3, p4, nrow = 2)
This time, with sparse factor loadings, we can draw much clearer conclusions on factor structure, and clearly visualise which indicator series are the driving force behind each factor. Factor 2, for example, with its obvious drop in early 2020 due to the covid pandemic, is heavily loaded with indicators from the Index of Production, confidence indices and google trends. The Index of Production does not appear in any other factor, and we can view it as a clear indicator of the covid drop. It is interesting that google search terms related to trade in goods are also present in factor 2. With lots of economic volatility in recent years, google trends search terms may be a very useful indicator of economic activity. Factor 1 is mainly loaded with PPI data, while factor 3 is heavily loaded with CPI data. Some inflation and exchange rate indicators are present in factor 4, which shows shocks during the 2009 and 2020 recessions. Note that in all 4 loading plots the trade in goods target series are loaded, as we specified q = 9 in the sparseDFM fit to ensure these variables are not regularised.
It is very easy to extract nowcasts from the sparseDFM fit. As the data was inputted with the ragged edge structure, with NAs coded in for September and October 2022 for the target series, the sparseDFM output will provide us with estimates for these missing months. There are two ways to extract these values:
## DFM nowcasts (on the differenced data)
# directly from fit.dfm
dfm.nowcasts = tail(fit.dfm$data$fitted.unscaled[,1:9],2)

# is the same as from fitted()
dfm.nowcasts = tail(fitted(fit.dfm)[,1:9],2)
## Sparse DFM nowcasts (on the differenced data)
sdfm.nowcasts = tail(fit.sdfm$data$fitted.unscaled[,1:9],2)
To transform these first-difference nowcasts into nowcasts on the original levels, we need to undifference: take the most recent observed value (August 2022) and add the September first-difference nowcast to obtain the September level nowcast, then add the October first-difference nowcast to that value to obtain the October level nowcast:
## August 2022 figures for targets
obs_aug22 = tail(data,3)[1,1:9]
## DFM nowcast for original level
dfm_sept_nowcast = obs_aug22 + dfm.nowcasts[1,]
dfm_oct_nowcast = dfm_sept_nowcast + dfm.nowcasts[2,]
## Sparse DFM nowcast for original level
sdfm_sept_nowcast = obs_aug22 + sdfm.nowcasts[1,]
sdfm_oct_nowcast = sdfm_sept_nowcast + sdfm.nowcasts[2,]
# Print
cbind(dfm_sept_nowcast,
dfm_oct_nowcast,
sdfm_sept_nowcast,
      sdfm_oct_nowcast)
#> dfm_sept_nowcast dfm_oct_nowcast
#> target.Food & live animals 1466.20494 1487.89722
#> target.Beverages and tobacco 892.52105 910.32919
#> target.Crude materials 931.82943 935.86939
#> target.Fuels 5434.76988 5281.31709
#> target.Animal and vegetable oils and fats 81.11094 82.15963
#> target.Chemicals 5450.35939 5496.92373
#> target.Material manufactures 4250.23561 4348.36403
#> target.Machinery and transport 14966.96739 15394.04752
#> target.Miscellaneous manufactures 4575.18045 4759.21107
#> sdfm_sept_nowcast sdfm_oct_nowcast
#> target.Food & live animals 1373.09109 1379.51984
#> target.Beverages and tobacco 806.32814 810.16403
#> target.Crude materials 843.30349 828.98577
#> target.Fuels 5444.14078 5284.05460
#> target.Animal and vegetable oils and fats 79.86784 80.48402
#> target.Chemicals 5233.46691 5237.28705
#> target.Material manufactures 3881.44541 3923.95003
#> target.Machinery and transport 12776.89494 12788.07209
#> target.Miscellaneous manufactures 3896.18607 3951.78687
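The two-step undifferencing above can be written more compactly as the last observed level plus the cumulative sum of the difference nowcasts. A minimal sketch with toy numbers (the values of d here are hypothetical, chosen only to illustrate the arithmetic):

```r
# Last observed level (hypothetical August figure for one series)
last_level <- 1450

# Hypothetical Sept/Oct first-difference nowcasts
d <- c(16.2, 21.7)

# Sept and Oct level nowcasts in one step
levels <- last_level + cumsum(d)
levels
```

This generalises directly to longer horizons: h difference nowcasts undifference to h level nowcasts via one cumsum().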
The results suggest the Sparse DFM performs better than a regular DFM in this nowcasting exercise for trade in goods: it achieves a lower average mean absolute error and tighter bands around the median. As expected, the error at horizon 1 is slightly lower than at horizon 2, as horizon 1 is able to exploit all indicators with a 1 month lag in its estimation.
¹ For a detailed survey of the nowcasting literature see: Bańbura, M., Giannone, D., Modugno, M., & Reichlin, L. (2013). Now-casting and the real-time data flow. In Handbook of Economic Forecasting (Vol. 2, pp. 195-237). Elsevier.
² At the end of the sample, different variables have missing points corresponding to different dates, in accordance with their publication release. This forms a ragged edge structure at the end of the sample.
³ Bai, J., & Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1), 191-221.