Load the sparseDFM package and the exports dataframe into R. We also require the gridExtra package for this vignette.
library(sparseDFM)
library(gridExtra)
data <- exports
This vignette provides a tutorial on how to apply the package sparseDFM to a large-scale data set for the purpose of nowcasting UK trade in goods. The data contains 445 columns, including 9 target series (UK exports of the 9 main commodities worldwide) and 434 monthly indicator series, and 226 rows representing monthly values from January 2004 to October 2022. For a small-scale example see the vignette inflation-example.
Nowcasting¹ is a method used in econometrics to estimate the current state of the economy from the most recent data available. It is an important tool because it allows policy makers and businesses to make informed decisions in real time, rather than relying on information that, owing to publication delays, may be outdated and no longer accurate. Trade in Goods (imports and exports) is currently published with a 2 month lag by the UK's Office for National Statistics (ONS), which is a long time to wait for a current assessment of trade, especially during times of economic uncertainty or instability. Nowcasting UK trade has become particularly important in recent years due to two key events: the Brexit referendum, held in 2016, and the coronavirus pandemic, which reached UK shores in early 2020. While the causes of these shocks are drastically different, both imposed restrictions on trade in goods.
We consider the task of understanding and nowcasting the movements of 9 monthly target series representing 9 of the main commodities the UK exports worldwide. These include:
- Food & live animals
- Beverages and tobacco
- Crude materials
- Fuels
- Animal and vegetable oils and fats
- Chemicals
- Material manufactures
- Machinery and transport
- Miscellaneous manufactures
These target series are released with a 2 month publication delay and hence the last two rows of the dataframe are missing for these variables. To estimate the targets in these months, we use a large collection of 434 monthly indicator series, including:
- Index of Production (IoP) series (89)
- Consumer price inflation (CPI) series (166)
- Producer price inflation (PPI) series (153)
- Exchange rate series (12)
- Business and consumer confidence indices, BCI & CCI (2)
- Google trends search words (14)
This vignette uses the sparseDFM() function to fit a regular DFM and a Sparse DFM to the entire dataset of January 2004 to October 2022, with the goal of estimating the missing target series data in September and October of 2022. We explore the plot() and predict() capabilities of the package and assess the benefit of a Sparse DFM in terms of interpreting factor structure and accuracy of predictions.
Before fitting any models, it is worthwhile to perform some exploratory data analysis to assess stationarity and missing data.
# Dimension of the data: n = 226, p = 445.
dim(data)
#> [1] 226 445
# Plot the 9 target series using ts.plot with a legend on the right
def.par <- par(no.readonly = TRUE) # store initial graphic parameters
goods <- data[,1:9]
layout(matrix(c(1,2), nrow = 1), widths = c(4,3))
par(mar=c(5,4,4,0))
ts.plot(goods, gpars= list(col=10:1,lty=1:10))
par(mar=c(5,0,4,2))
plot(c(0,1),type="n", axes=F, xlab="", ylab="")
legend("center", legend = colnames(goods), col = 10:1, lty = 1:10, cex = 0.7)
par(def.par) # reset graphic parameters to initial
This plot shows the monthly dynamics of UK exports worldwide for 9 categories of goods. Exports of machinery and transport are the largest. We also see two main drops, during the 2009 and 2020 recessions, and an upward trend over the past year or so.
The only missing data in the sample is at the end, during the months of September and October 2022, depending on the publication delays of the variables. We can see this ragged edge² structure at the end of the sample by zooming in on the past 12 months:
# last 12 months
data_last12 = tail(data, 12)
# Missing data plot. Too many variable names so use.names is set to FALSE for clearer output.
missing_data_plot(data_last12, use.names = FALSE)
We see the 2 month delay for the targets and IoP, the 1 month delay for CPI, PPI, exchange rates, BCI and CCI, and no delay for google trends. We hope to exploit this available data when predicting September and October 2022.
We first make the data stationary by simply taking first-differences like so:
# first-differences correspond to stationary_transform set to 2 for each series
new_data = transformData(data, stationary_transform = rep(2,ncol(data)))
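To make the transform concrete, here is a minimal sketch of what first-differencing does to a single series, assuming (as the comment above states) that code 2 denotes first-differencing; it matches base R's diff() padded with an NA for the first observation:

```r
# Toy series, purely for illustration
x <- c(100, 102, 101, 105)

# First differences: NA for the first observation, then x_t - x_{t-1}
dx <- c(NA, diff(x))
dx
```

The first value is lost to differencing, which is why a differenced sample of length n has n - 1 usable observations.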
We now tune for the number of factors to use:
tuneFactors(new_data)
#> Data contains missing values: imputing data with fillNA()
#> [1] "The chosen number of factors using criteria type 2 is 7"
According to the Bai and Ng (2002)³ information criteria, the best number of factors to use is 7. However, the scree plot suggests that, beyond 4 factors, additional factors explain little extra variance in the data. For this reason, we choose to use 4 factors when modelling.
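The scree-plot check behind this choice can be reproduced on any complete numeric matrix via principal components. A minimal sketch, using a synthetic matrix X purely for illustration:

```r
# Synthetic data standing in for the (imputed, scaled) indicator panel
set.seed(1)
X <- matrix(rnorm(200 * 20), 200, 20)

# Principal components of the standardised data
pcs <- prcomp(X, scale. = TRUE)

# Proportion of variance explained by each component
vars <- pcs$sdev^2 / sum(pcs$sdev^2)

# Scree plot: look for the "elbow" after which extra factors add little
plot(vars, type = "b", xlab = "Factor number",
     ylab = "Proportion of variance explained")
```

On real data the elbow location, not the formal criterion alone, often guides the final choice of r, as it does here.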
We now fit a regular DFM and a Sparse DFM to the data with 4 factors:
# Regular DFM fit - takes around 18 seconds
fit.dfm <- sparseDFM(new_data, r = 4, alg = 'EM')
# Sparse DFM fit - takes around 2 mins to tune
# set q = 9 as the first 9 variables (targets) should not be regularised
# L1 penalty grid set to logspace(0.4,1,15) after exploration
fit.sdfm <- sparseDFM(new_data, r = 4, q = 9, alg = 'EM-sparse', alphas = logspace(0.4,1,15))
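As a hedged note on the penalty grid: assuming logspace() follows the usual base-10 convention (as in MATLAB), logspace(0.4, 1, 15) is equivalent to:

```r
# 15 penalty values, evenly spaced on a log10 scale
alphas <- 10^seq(0.4, 1, length.out = 15)

round(range(alphas), 2)   # roughly 2.51 to 10
```

The optimal penalty selected below, 4.54, sits comfortably inside this range, which suggests the grid endpoints were chosen sensibly.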
We can explore the convergence and tuning of each algorithm like so:
# Number of iterations the DFM took to converge
fit.dfm$em$num_iter
#> [1] 14
# Number of iterations the Sparse DFM took to converge at each L1 norm penalty
fit.sdfm$em$num_iter
#> [1] 17 6 14 2 2 2 3 3 3 3 18 3 12 5 5
# Optimal L1 norm penalty chosen
fit.sdfm$em$alpha_opt
#> [1] 4.54091
# Plot of BIC values for each L1 norm penalty
plot(fit.sdfm, type = 'lasso.bic')
We first explore the estimated factors and loadings for the regular
DFM. We are able to group the indicator series into colours depending on
the source of the indicator and use the
type = "loading.grouplineplot"
setting in
plot()
. We set the trade in goods (TiG) target black, IoP
blue, CPI red, PPI pink, exchange rate (Exch) green, BCI & CCI
(Conf) navy and google trends (GT) brown. This will make it easier to
visualise which indicators are loading onto specific factors.
## Plot the estimated factors for the DFM
plot(fit.dfm, type = 'factor')
## Plot the estimated loadings for each of the 4 factors in a grid
# Specify the name of the group each indicator belongs to
groups = c(rep('TiG',9), rep('IoP',89), rep('CPI',166), rep('PPI',153),
           rep('Exch',12), rep('Conf',2), rep('GT',14))

# Specify the colours for each of the groups
group_cols = c('black','blue','red','pink','green','navy','brown')
# Plot the group lineplot in a 2 x 2 grid
p1 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 1, group.names = groups, group.cols = group_cols)
p2 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 2, group.names = groups, group.cols = group_cols)
p3 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 3, group.names = groups, group.cols = group_cols)
p4 = plot(fit.dfm, type = 'loading.grouplineplot', loading.factor = 4, group.names = groups, group.cols = group_cols)
grid.arrange(p1, p2, p3, p4, nrow = 2)
As all variables load onto all the factors in a regular DFM, it is very difficult to interpret the factor structure from these loading plots. The loadings from all groups are quite large in every factor, making it impossible to conclude which data groups are related to specific factors. For greater interpretability we can fit a Sparse DFM instead, hoping to induce sparsity in the loadings. Let us now observe the factors and loading structure of the Sparse DFM:
## Plot the estimated factors for the Sparse DFM
plot(fit.sdfm, type = 'factor')
## Plot the estimated loadings for each of the 4 factors in a grid
# Plot the group lineplot in a 2 x 2 grid
p1 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 1, group.names = groups, group.cols = group_cols)
p2 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 2, group.names = groups, group.cols = group_cols)
p3 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 3, group.names = groups, group.cols = group_cols)
p4 = plot(fit.sdfm, type = 'loading.grouplineplot', loading.factor = 4, group.names = groups, group.cols = group_cols)
grid.arrange(p1, p2, p3, p4, nrow = 2)
This time, with sparse factor loadings, we can draw much clearer conclusions on factor structure, and clearly visualise which indicator series are the driving force behind each factor. Factor 2, for example, with its obvious drop in early 2020 due to the covid pandemic, is heavily loaded with indicators from the Index of Production, confidence indices and google trends. The Index of Production does not appear in any other factor, and we can view it as a clear indicator of the covid drop. It is interesting that google search terms related to trade in goods are also present in factor 2. With lots of economic volatility in recent years, google trends search terms may be a very useful indicator of economic activity. Factor 1 is mainly loaded with PPI data, while factor 3 is heavily loaded with CPI data. Some inflation and exchange rate indicators are present in factor 4, which shows shocks during the 2009 and 2020 recessions. Note that in all 4 loading plots the trade in goods target series are loaded, as we specified q = 9 in the sparseDFM fit to ensure these variables are not regularised.
It is very easy to extract nowcasts from the sparseDFM fit. As the data was inputted with the ragged edge structure, with NAs coded in for September and October 2022 for the target series, the sparseDFM output will provide us with estimates for these missing months. There are two ways to extract these values:
## DFM nowcasts (on the differenced data)
# directly from fit.dfm
dfm.nowcasts = tail(fit.dfm$data$fitted.unscaled[,1:9],2)

# is the same as from fitted()
dfm.nowcasts = tail(fitted(fit.dfm)[,1:9],2)
## Sparse DFM nowcasts (on the differenced data)
sdfm.nowcasts = tail(fit.sdfm$data$fitted.unscaled[,1:9],2)
To transform these first-difference nowcasts into nowcasts on the original levels, we need to undifference: take the most recent observed value (August 2022) and add the September first-difference nowcast to obtain the September level nowcast, then add the October first-difference nowcast to that value to obtain the October level nowcast:
## August 2022 figures for targets
obs_aug22 = tail(data,3)[1,1:9]
## DFM nowcast for original level
dfm_sept_nowcast = obs_aug22 + dfm.nowcasts[1,]
dfm_oct_nowcast = dfm_sept_nowcast + dfm.nowcasts[2,]
## Sparse DFM nowcast for original level
sdfm_sept_nowcast = obs_aug22 + sdfm.nowcasts[1,]
sdfm_oct_nowcast = sdfm_sept_nowcast + sdfm.nowcasts[2,]
# Print
cbind(dfm_sept_nowcast,
dfm_oct_nowcast,
sdfm_sept_nowcast,
      sdfm_oct_nowcast)
#> dfm_sept_nowcast dfm_oct_nowcast
#> target.Food & live animals 1466.20494 1487.89722
#> target.Beverages and tobacco 892.52105 910.32919
#> target.Crude materials 931.82943 935.86939
#> target.Fuels 5434.76988 5281.31709
#> target.Animal and vegetable oils and fats 81.11094 82.15963
#> target.Chemicals 5450.35939 5496.92373
#> target.Material manufactures 4250.23561 4348.36403
#> target.Machinery and transport 14966.96739 15394.04752
#> target.Miscellaneous manufactures 4575.18045 4759.21107
#> sdfm_sept_nowcast sdfm_oct_nowcast
#> target.Food & live animals 1373.09109 1379.51984
#> target.Beverages and tobacco 806.32814 810.16403
#> target.Crude materials 843.30349 828.98577
#> target.Fuels 5444.14078 5284.05460
#> target.Animal and vegetable oils and fats 79.86784 80.48402
#> target.Chemicals 5233.46691 5237.28705
#> target.Material manufactures 3881.44541 3923.95003
#> target.Machinery and transport 12776.89494 12788.07209
#> target.Miscellaneous manufactures 3896.18607 3951.78687
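The two-step undifferencing above can be written more compactly as the last observed level plus the cumulative sum of the difference nowcasts. A minimal sketch with toy numbers (the values of d here are hypothetical, chosen only to illustrate the arithmetic):

```r
# Last observed level (hypothetical August figure for one series)
last_level <- 1450

# Hypothetical Sept/Oct first-difference nowcasts
d <- c(16.2, 21.7)

# Sept and Oct level nowcasts in one step
levels <- last_level + cumsum(d)
levels
```

This generalises directly to longer horizons: h difference nowcasts undifference to h level nowcasts via one cumsum().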
The results suggest the Sparse DFM performs better than a regular DFM in this nowcasting exercise for trade in goods: it achieves a lower average mean absolute error and tighter bands around the median. As expected, the error at horizon 1 is slightly lower than at horizon 2, as horizon 1 is able to exploit all indicators with a 1 month lag in its estimation.
¹ For a detailed survey of the nowcasting literature see: Bańbura, M., Giannone, D., Modugno, M., & Reichlin, L. (2013). Now-casting and the real-time data flow. In Handbook of Economic Forecasting (Vol. 2, pp. 195-237). Elsevier.
² At the end of the sample, different variables have missing points corresponding to different dates, in accordance with their publication release. This forms a ragged edge structure at the end of the sample.
³ Bai, J., & Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1), 191-221.