A Visual Guide to Time Series

This article will talk about how to model a single (univariate) time series. However, I have usually left a comment when a concept is relevant to multivariate time series (i.e. when you want to study many time series concurrently).

We will focus on seasonal autoregressive integrated moving average (ARIMA) models, but we will also discuss other common methods.

This article will discuss two typical goals in studying a time series:

  • Understanding the key aspects of a time series (such as trend, seasonality, and cyclical behaviour including business cycles)
  • Forecasting future values of a time series

Some additional assumptions are also made:

  • The frequency of our time series data is assumed to be at most 12 times per year (monthly data). See “Appendix – Complex seasonality”.
  • We do not have a time series of counts where the counts are smaller than 100. If they are, see “Appendix – Time series of small counts”.

The approach we will follow is largely based off concepts discussed in Forecasting: Principles and Practice. 3rd Ed. (2020) by Rob J Hyndman and George Athanasopoulos. I have tried to combine as many concepts as possible from that text that fit together in the analysis of a single dataset (although sometimes I have had to resort to introducing other datasets when our main dataset is not able to display the phenomena in question).

R is used to for the images and occasional code snippets. In particular, the tidyverts and the tidyverse collection of packages are utilised heavily. The former includes packages such as tsibble, fable and fabletools, while the latter has packages such as ggplot2, dplyr and tidyr. These packages are automatically loaded by loading the package fpp3, which is the R package for the aforementioned textbook. The tidyverse collection of packages provides an alternative to R base graphics and matrix manipulation functions and take some getting used to (see “Appendix – tidyverse”).

tidyverse packages are useful for large datasets and manipulating and sifting through matrices in that context is reminiscent of relational database syntax (such as SQL).

The forecasting process

The forecasting process can be summarised by the image below.

0 - process

The process will involve at least one instance of each of the following steps.

  • Data preparation (tidy)
  • Plot the data (visualise)
  • Define a model (specify)
  • Train the model (estimate)
  • Check model performance (evaluate)
  • Produce forecasts (forecast)

We will explore these steps one at a time with respect to a sample dataset from the R package fpp3. The dataset is called PBS, but in particular we filter according to ATC2 code H02, as demonstrated in the code below.

h02sib <- PBS %>%
filter(ATC2 == “H02”) %>%
summarise(Cost = sum(Cost)/1e6)

The dataset records the monthly government subsidies in Australia of a type of drug called a corticosteroid (more specifically focussed on “corticosteroids for systemic use”). The data is from July 1991 to June 2008 and is in the units of millions of dollars. The data is part of the Pharmaceutical Benefit Scheme for products falling under Anatomical Therapeutic Chemical (ATC) code H02 as recorded by the Australian Health Insurance Commission and Medicare Australia.

This dataset is also discussed in Hyndman and Athanasopoulos (2020).

This article is quite long, so below is an informal ‘table of contents’, with some self-explanatory steps removed.

[Automatic version]
0. Preparing for the forecasting project
1. Data preparation (tidy)
1a. Issues with data length
1ai. Very short time series
1aii. Very long time series
1b. Outliers and missing values
1c. Data adjustments
1ci. Calendar adjustments
1cii. Population adjustments
1ciii. Inflation adjustments
2. Plot the data (visualise)
2a. Studying the plot
2b. Various types of plots
2bi. Plot of the data (time plot)
2bi. Seasonal plot
2bii. Seasonal subseries plot
2biii. Scatterplot
2biv. Comparison plot
2bv. Scatterplot matrix
2bvi. Lagplot
2c. Decompositions
2ci. X11 decomposition
2cii. SEATS decomposition
2ciii. STL decomposition
2d. Time series features
2di. Simple statistics
2dii. Autocorrelation/ACF features
2diii. STL features (shown previously)
2div. Most heavily trended and heavily seasonal series
2dv. All features
2dvi. Outlying time series
3. Define a model (specify)
3a. Seasonally adjusted data
3b. Mathematical transformation
3bi. Adjust for heteroskedasticity
3bii. Ensuring forecasts stay within limits
4. Train the model (estimate) – ARIMA approach
4a. Set up for ARIMA
4b. Automatic ARIMA() or manually determine ARIMA order
4c. Check the residuals (residual diagnostics)
4ci. Residual plots and ACF plots
4cii. The Ljung-Box test
4ciii. Q-Q plot
5. Check model performance (evaluate)
5a. Training and test data
5b. Accuracy
5bi. Evaluating point forecast accuracy
5bii. Evaluating distributional forecast accuracy
5biii. Accuracy results
5c. Cross-validation
6. Produce forecasts (forecast)
6a. Option 1: Calculate prediction intervals using the normal distribution
6b. Option 2: Calculate prediction intervals using bootstrapping
6c. Forecast combinations (Simple average)
6d. Bootstrapping and bagging time series
7. Prediction intervals for aggregates
8. Judgmental adjustments

[Manual version – original 4b replaced by 4b, 4c and 4d]
0. Preparing for the forecasting project
1. Data preparation (tidy)
1a. Issues with data length
1ai. Very short time series
1aii. Very long time series
1b. Outliers and missing values
1c. Data adjustments
1ci. Calendar adjustments
1cii. Population adjustments
1ciii. Inflation adjustments
2. Plot the data (visualise)
2a. Studying the plot
2b. Various types of plots
2bi. Plot of the data (time plot)
2bi. Seasonal plot
2bii. Seasonal subseries plot
2biii. Scatterplot
2biv. Comparison plot
2bv. Scatterplot matrix
2bvi. Lagplot
2c. Decompositions
2ci. X11 decomposition
2cii. SEATS decomposition
2ciii. STL decomposition
2d. Time series features
2di. Simple statistics
2dii. Autocorrelation/ACF features
2diii. STL features (shown previously)
2div. Most heavily trended and heavily seasonal series
2dv. All features
2dvi. Outlying time series
3. Define a model (specify)
3a. Seasonally adjusted data
3b. Mathematical transformation
3bi. Adjust for heteroskedasticity
3bii. Ensuring forecasts stay within limits
4. Train the model (estimate) – ARIMA approach
4a. Set up for ARIMA
4b. If the data are non-stationary, take first differences of the data until the data are stationary.
4bi. Sample ACF and sample PACF plots
4bii. Confirm differencing
4c. Examine the ACF/PACF: Is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?
4ci. SACF and SPACF plots
4cii. SEACF plot
4d. Use the AICc to determine the seasonal orders
4e. Check the residuals (residual diagnostics)
4ei. Residual plots and ACF plots
4eii. The Ljung-Box test
4eiii. Q-Q plot
5. Check model performance (evaluate)
5a. Training and test data
5b. Accuracy
5bi. Evaluating point forecast accuracy
5bii. Evaluating distributional forecast accuracy
5biii. Accuracy results
5c. Cross-validation
6. Produce forecasts (forecast)
6a. Option 1: Calculate prediction intervals using the normal distribution
6b. Option 2: Calculate prediction intervals using bootstrapping
6c. Forecast combinations (Simple average)
6d. Bootstrapping and bagging time series
7. Prediction intervals for aggregates
8. Judgmental adjustments

0. Preparing for the forecasting project

Firstly, we will discuss some principles with regards to new forecasting projects. Some of these principles will be revisited in step 8 on judgmental adjustments.

Principle 1: Set the forecasting task clearly and concisely

  • All definitions should be clear and comprehensive.
  • Avoid incorporating emotive terms and irrelevant information that may distract the forecaster.
  • It may sometimes be useful to conduct a preliminary round of information gathering before setting the forecasting task.

Principle 2: Implement a systematic approach

It is worthwhile to put in significant effort and resources to create decision rules that will lead to a good systematic approach.

Use checklists of categories of information which are relevant to the forecasting task. In particular, it is helpful to identify what information is important and how this information is to be weighted.

For example, when forecasting the demand for a new product, we need to consider what factors we should account for and how we should we account for them. The factors might include

  • price,
  • the quality and/or quantity of the competition,
  • the economic environment at the time, and
  • the target population of the product.

Principle 3: Document and justify

Formalising and documenting the decision rules and assumptions implemented in the systematic approach will allow the same rules to be implemented consistently. Required justification for forecast changes leads to greater accountability. Such changes will be discussed more in step 8 on judgmental adjustments.

Principle 4: Segregate forecasters and users

Forecasters and forecast users should be clearly segregated, such as in the launch of a new product, where company management may be overly optimistic.

However, it is still important that forecasters communicate forecasts to potential users thoroughly so that users will have confidence in the forecasts.

Example: Judgemental adjustments for policy impacts

Some things to avoid:

  • Forecast adjustments often heavily reliant on the work of one person.
  • Forecasts not subject to a formal review process.
  • No guidelines for how to construct judgmental forecasts for policy impacts.
  • Lack of adequate documentation about how these forecasts were obtained, the assumptions underlying them, etc.

Best practices include:

  • Guidelines for forecasting new policy impacts be developed
  • The forecast methodology should be documented in each case, including all assumptions made in forming the forecasts.
  • New policy forecasts be made by at least two people from different areas of the organisation
  • A review of forecasts be conducted one year after the implementation of each new policy by a review committee, especially for new policies that have a significant annual projected cost or saving.
  • The review committee should include those involved in generating the forecasts, but also others.

1. Data preparation (tidy)

The idea in this step is that the data should be in a format ready for plotting and subsequent analysis.

Tasks may include identifying missing values and filtering the time series.

It is possible that old data are less useful due to significant system changes. That being said, in general, good statistical models can handle evolutionary changes in the system.

1a. Issues with data length

1ai. Very short time series

In some situations, we can have little or no data. Judgmental forecasts should be mentioned here, although it is a non-statistical approach.

There are several cases of interest:

  • There are no available data (e.g. a new product).
  • Data are incomplete or have delayed availability. For example, for central banks, GDP is only available on a quarterly basis.
  • Data are available, statistical and judgmental forecasts are calculated separately and then combined.
  • Data are available, statistical forecasts are generated, and judgement is used to adjust the forecasts.

This final case we will discuss in step 8. For the other cases, please refer to Hyndman and Athanasopoulos’ (2020) section on judgmental forecasts.

When it comes to very short time series, the only theoretical limit is that we need more observations than there are parameters in our forecasting model. However, giving a rule of thumb is difficult and, in practice, we usually need considerably more observations than that.

1aii. Very long time series

When the number of observations is over 200, we may be crossing over into undesirable territory.

One perennial concern in statistics is the flexibility of models. Flexibility sounds like a good thing, but in a statistical context, flexibility is equivalent to fitting a more complicated model or, more specifically, estimating a larger number of parameters. The other side of the spectrum is simpler (or more ‘parsimonious’) models, which fit fewer parameters and often have stronger predictive power.

If we are only interested in forecasting the next few observations, then the solution is easy. Just throw away the earliest observations and only fit a model to the most recent observations. A simple (inflexible) model can work well because there is not enough time for the relationships to change substantially. For example, if we have 26 years of weekly gasoline production data and want to forecast a few periods into the future, it is reasonable just to use the most recent years instead.

In some cases we do want to use a more flexible model. Flexible models allows the trend-cycle and seasonal components to evolve over time. (We will discuss those components of time series in greater detail later.)

It should be noted that to incorporate a larger amount of observations into our model, we assume that the model structure will work over the entire period of the data. Two models that allow for greater flexibility are ARIMA and exponential smoothing (ETS) models. In contrast, dynamic regression models do not allow any evolution of model components.

1b. Outliers and missing values

1bi. Outliers

Outliers could be errors or unusual observations in our data, and will interfere with statistical models.

While we have logical methods for dealing with outliers, one should first consider why they have occurred. For example, a special weather event that caused significant cloud cover may have affected your satellite data for a period of time. Outliers may need to be explained by experts in the context of the data.

If you are convinced that the outlier(s) are genuinely errors, or that they won’t occur in the forecasting period, then replacing them is a reasonable option.

One useful way to find outliers is to apply Seasonal and Trend decomposition using Loess (STL). This decomposes the time series into a trend-cycle, seasonal and irregular component, which we will discuss in more detail later.

For now, know that we will apply the STL() function to the series with the argument robust=TRUE. Any outliers should show up in the remainder series.

Then, a boxplot of the remainder series will show outliers as those that are greater than 1.5 interquartile ranges (IQRs) from the central 50% of the data. The preferred rule to define outliers are observations that are greater than 3 interquartile ranges (IQRs) from the central 50% of the data. This means 1 in 500,000 normally distributed observations are outliers.

I have purposely inserted an outlier into the corticosteroid data.

56 - original outlier

Below, STL easily finds the outlier.

57 - STL for outliers

The code below finds the outlier row in the data.

outliers <- STL_decomposition %>%
filter(
remainder < quantile(remainder, 0.25) – 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
)
outliers

We will proceed to delete the outlier from the data and replace it with a ‘missing value’ placeholder. We will discuss how to replace the outlier in the next section.

1bii. Missing values

Missing data can arise for many reasons:

  1. Context dependent, such as businesses closed on a public holiday.
  2. Human or equipment error, such as someone forgetting the sales figured or measuring equipment malfunctioning
  3. Purposely removing unusual observations.

It is worth considering whether the missing data is due to some important reason, because then it may induce bias in your model. For example, for the second case of missing data, if the timing of the missing data is not relevant (informative) for the forecasting analysis, then the missing values are not of considerable concern and can be handled easily.

Public holidays are a typical example of the first category of missing data when studying business operations. If no adjustments for public holidays are made, we actually experience problems in two ways. We will

  • under-estimate sales on the day after the public holiday, and
  • over-estimate sales on subsequent days after that.

A dynamic regression model is useful in counteracting this, with dummy variables indicating if the day is a public holiday or the day after a public holiday. (Subsequent days are ‘normal’ days.) No automated method can handle such an effect.

When missing values cause errors, there are at least two ways to handle the problem.

  1. Only use data after the last missing value, assuming it is of reasonable length.
  2. Interpolate the missing observations.

Although the first case may be reasonable for the outlier shown above, let us apply the second case.

ARIMA models, dynamic regression models and neural network (NNAR) models will allow us to ‘retrospectively’ estimate the missing value. Note that ETS() and STL() will not work.

h02_miss <- h02siboutlier %>%
# Remove outlying observations
anti_join(outliers) %>%
# Replace with missing values (notice number of observations)
fill_gaps()
h02_fill <- h02_miss %>%
# Fit ARIMA model to the data containing missing values
model(ARIMA(Cost,stepwise=FALSE)) %>%
# Estimate Cost for all periods
interpolate(h02_miss) # same number of observations as h02_miss

ARIMA fills in the outlier we removed.

You can confirm the value that was filled in using the following code.

h02_fill %>%
right_join(outliers %>% select(-Cost))

In the plot below, we can’t even tell where the outlier used to be.

58 - missing value filled2

(We will work with the original, non-manipulated dataset from now on.)

1c. Data adjustments

This section will focus on adjustments to data that make it a fairer representation of the underlying values. We will discuss adjustments used for satisfying model assumptions later on.

1ci. Calendar adjustments

The fact that there are differing numbers of days in each month can interfere with the analysis. For example, it may not be desirable to have the total amount of sales in a month. Instead, for each month, divide the total amount of sales in the month with the number of trading days in that month to determine the “average sales per trading day” for that month. You will still have 12 observations per year.

A useful function here is days_in_month(). For example, instead of using the total amount of orange juice produced per month, Orange, use Orange/days_in_month(Orange) instead.

Another example is in the context of stock prices. Because stock prices are not observed every day, we should set up a new time index based on the trading days rather than calendar days.

(For simplicity, we do not apply this adjustment to our dataset, although we probably should.)

1cii. Population adjustments

Population adjustments should be performed on data that are affected by population changes. The idea is to use per-capita data instead. You want data in the format “X per 1000 people”, or whatever number of people is more convenient.

For example instead of using GDP, use the variable GDP/population. (You need the actual population numbers at each time of observation of GDP.)

1ciii. Inflation adjustments

Some data can be affected by inflation. Most financial time series are adjusted so that the effect of inflation is removed. For example, the house price data may be stated in year 2000 dollars.

These adjustments can be made using price indices. To adjust to year 2000 dollars, simply divide each observation by its price index at the time of the observation, then multiply it by the price index for the year 2000.

2. Plot the data (visualise)

We proceed with plotting the data.

If we have more than one time series variable, we need to consider how strong the relationships are between the variables.

2a. Studying the plot

The patterns in a time series are:

  • Trend
  • Seasonality
  • Cyclical behaviour – fluctuations that are do not occur with fixed frequency, including the ‘business cycle’
  • Non-constant variance (dealt with in the section “Mathematical transformation”)

Note that we often consider a slightly different way to decompose the time series for modelling purposes.

  • Trend-cycle – the trend and cyclical behaviour are combined; often this is simply referred to as “trend”
  • Seasonality
  • Irregular/remainder – anything left after the other components

We will generally consider this decomposition from now on.

2b. Various types of plots

2bi. Plot of the data (time plot)

This is simply the plot of the time series values against time, as discussed in step 2a.

3 - time plot2

General comment on the plot: The last few observations appear to be more variable than the rest of the data. This may be because the data are sometimes corrected when earlier sales are reported late.

Stationarity is often a major concern in time series. If the time plot of the original data is stationary, it will tend to be roughly horizontal with constant variance. However, cyclic behaviour may still be present. Cyclic behaviour does not rule out stationarity, but level changes do, since they form part of the trend.

The other elements can be summarised as follows:

  • Trend: There is an obvious upward trend.
  • Seasonality: There is clear seasonality within each year.
  • Cyclical behaviour: Not obvious in this plot.
  • Non-constant variance: There is a slight increase in variance as we move across the time axis. We can deal with this using a mathematical transformation later.

2bi. Seasonal plot

6 - seasonal plot

The clearest feature here is the big drop in February followed by a gradual increase throughout the year, although it’s difficult to distinguish between December and January.

2bii. Seasonal subseries plot

7 - seasonal subseries

The first plot on the left is a collection of all the January’s together throughout 1991-2008. The second plot contains all the February’s, etc. The horizontal blue lines are the means for each graph.

We can compare between graphs to see the seasonal pattern again. We can look for changes within seasons, but there is not much to be gained here.

[Comment about X11 decomposition]

We will discuss X11 decomposition soon, but it should be noted that we can develop both seasonal and seasonal sub-series plots from X11 decomposition as well as other decomposition methods.

15 - seasonal plot of x11 seasonal

14 - seasonal subseries of x11 seasonal

These help us to visualise the variation in the seasonal component over time. In the seasonal sub-series plot, the changes over time are significant.

2biii. Scatterplot

I haven’t described another time series yet, but I’ve just chosen one at random (antithrombotic agents subsidies, ATC2 code B01) from the same parent dataset.

8 - scatterplot

Clearly antithrombotic agent subsidies has received a lot more funding over 1991-2008 as time as progressed, compared to corticosteroids.

2biv. Comparison plot

Here are simply examples of antithrombotic agent subsidies and corticosteriod subsidies plotted in the same window.

9 - comparison plot

This is probably the rough differences in shape we suspected from the scatterplot.

2bv. Scatterplot matrix

The scatterplot matrix, otherwise known as a ‘pairs’ plot is shown below, showing correlations between three time series (antihemorrhagics now included, which is known by ATC2 code B02).

10 - scatterplot matrix

Density plots are shown on the diagonal, while correlations and scatterplots are also shown. The correlation is clearly high between B01 and B02.

2bvi. Lagplot

The lagplot plots the relationship between y_{t} and y_{t-k} for k from 1 to 16 in this case.

11 - lagplot path

The behaviour is complex here as seasonality seems to affect the plots considerably. The relationships have the strongest correlation in the plots for lags 1 and 12. Note it is also possible to plot with points rather than line graphs.

2c. Decompositions

Decompositions allow us to visualise the components of a time series described earlier (the second version I described, to be exact).

  • Trend-cycle (often simply referred to as “trend”)
  • Seasonality
  • Irregular/remainder

We will discuss three main decomposition methods.

  • X11
  • Seasonal Extraction in ARIMA Time Series (SEATS)
  • Seasonal and Trend decomposition using Loess (STL)

2ci. X11 decomposition

X11 can only be used on quarterly and monthly data.

Unlike classical decomposition, the seasonal component can vary slowly over time.

X11 uses sophisticated methods for handling

  • trading day variation,
  • holiday effects, and
  • the effects of known predictors.

X11 can be used for both additive and multiplicative decomposition. (We focus on additive, which is more general.)

X11 is also robust to outliers and level shifts in a time series. For example, compared to STL, which we will also discuss, the trend-cycle that X11 estimates will typically do a better job at capturing unusual/uncommon events, such as the sudden fall in the markets during the 2007-2008 global financial crisis.

12 - x11 decomp

The main subtle point to note is the little rectangular bar on the left. On the top plot, the plot of the original data, the size of the rectangular bar provides a comparison for how big the vertical ‘amplitude’ of these graphs are. For example, the final plot has a very large bar compared to the comparison bar, which means the final plot is deliberately zoomed in a lot on the vertical axis so that its patterns are visible. Its true effect is much smaller in magnitude.

The plots overlaid on each other are shown below.

13 - x11 all

As already commented on before, the seasonal and seasonal subseries plots of the seasonal component can also be plotted. (Plots below are repeats of previously shown plots.)

15 - seasonal plot of x11 seasonal

14 - seasonal subseries of x11 seasonal

These help us to visualise the variation in the seasonal component over time. In the seasonal sub-series plot, the changes over time are significant.

2cii. SEATS decomposition

Seasonal Extraction in ARIMA Time Series (SEATS) is again only suitable for quarterly or monthly data. The method it is based on (ARIMA, discussed later) is different to X11 and it is frequently used by government agencies.

The images are similar to those in X11 decomposition. Note that by default SEATS uses a multiplicative decomposition.

Note the seasonal package has many options for handling variations of X11 and SEATS.

2ciii. STL decomposition

Seasonal and Trend decomposition using Loess (STL) can handle any type of seasonality including

  • any frequency, including monthly and quarterly data
  • seasonality that changes over time (the rate of change is a parameter)

In addition, there are other parameters, such as

  • the smoothness of the trend-cycle.
  • option to use robust modelling (occasional unusual observations will still affect the remainder component however)

Disadvantages of STL include

  • not handling trading day or calendar variation automatically (we discuss how to manually do this in the next section)
  • no option for multiplicative decomposition (but this is easily done manually)
  • in comparison to X11, STL typically does a worse job at capturing unusual/uncommon events, such as the sudden fall in the markets during the 2007-2008 global financial crisis

15 - STL

The standard STL output is shown, but there are many parameters you can change. One parameter is robustness, which can be set to TRUE.

The two parameters to be chosen when using STL are

  • the trend-cycle window trend(window = ?), and
  • the seasonal window season(window = ?).

These control how rapidly the trend-cycle and seasonal components can change. Larger values lead to smoother plots.

The defaults are

  • trend(window = nextodd(ceiling((1.5*period) / (1-(1.5/s.window)))), degree, jump), and
  • season(period = NULL, window = 13, degree, jump).

By default, the settings is season(window=13),  while the trend window is chosen automatically based on the seasonal period. For example, for monthly data, the default is trend(window=21). Usually this is suitable, but some time series require adjustments.

For example, the default trend window setting could produce a trend-cycle component that is too rigid. As a result, signal from the 2008 global financial crisis has leaked into the remainder component. Selecting a shorter trend window improves this. (Images: worse, better)

Both trend and seasonal windows should be odd numbers.

  • Trend window is the number of consecutive observations to be used when estimating the trend-cycle.
  • Season window is the number of consecutive years to be used in estimating each value in the seasonal component.

Note that setting the seasonal window to be infinite is equivalent to forcing the seasonal component to be periodic (setting window to “periodic” is equivalent to setting it to Inf). In other words, the seasonal pattern is identical across years (the seasonal pattern is fixed). For this setting, we don’t specify a number for the window, but simply use season(window=’periodic’).

[Comment on STL features]

Note that we can obtain a summary of seasonal metrics using the following command.

PBS %>%
    features(Cost, feat_stl)

17 - stl features

Among other things, the feat_stl() function returns:

  • linearity measures the linearity of the trend component of the STL decomposition. It is based on the coefficient of a linear regression applied to the trend component.
  • curvature measures the curvature of the trend component of the STL decomposition. It is based on the coefficient from an orthogonal quadratic regression applied to the trend component.
  • stl_e_acf1 is the first autocorrelation coefficient of the remainder series.
  • stl_e_acf10 is the sum of squares of the first ten autocorrelation coefficients of the remainder series.

2d. Time series features

2di. Simple statistics

You can pass a function to the features function to explore different aspects of the data. For example

tourism %>% features(Trips, mean), or

tourism %>% features(Trips, quantile).

2dii. Autocorrelation/ACF features

12 - ACF

We can see two main features:

  • The slow decrease in the magnitude is due to trend in the data, while
  • the “scalloped” shape is caused by seasonality.

The sum of the first ten squared autocorrelation coefficients is a useful summary of how much autocorrelation there is in a series, regardless of lag.

We may want ot compute the seasonal differences of a series. If we had monthly data, for example, we would compute the difference between consecutive Januaries, consecutive Februaries, and so on. This allows us to look at how the series is changing between years, rather than between months. The autocorrelations of the seasonally differenced series are also useful.

The feat_acf() function computes a selection of the autocorrelations discussed here. It will return six or seven features:

  • the first autocorrelation coefficient from the original data;
  • the sum of square of the first ten autocorrelation coefficients from the original data;
  • the first autocorrelation coefficient from the differenced data;
  • the sum of square of the first ten autocorrelation coefficients from the differenced data;
  • the first autocorrelation coefficient from the twice differenced data;
  • the sum of square of the first ten autocorrelation coefficients from the twice differenced data;

For seasonal data, the autocorrelation coefficient at the first seasonal lag is also returned.

2diii. STL features (shown previously)

We can obtain a summary of seasonal metrics using the following command.

PBS %>%
    features(Cost, feat_stl)

17 - stl features

Among other things, the feat_stl() function returns:

  • linearity measures the linearity of the trend component of the STL decomposition. It is based on the coefficient of a linear regression applied to the trend component.
  • curvature measures the curvature of the trend component of the STL decomposition. It is based on the coefficient from an orthogonal quadratic regression applied to the trend component.
  • stl_e_acf1 is the first autocorrelation coefficient of the remainder series.
  • stl_e_acf10 is the sum of squares of the first ten autocorrelation coefficients of the remainder series.

2div. Most heavily trended and heavily seasonal series

The STL features allow us to obtain the most heavily trended and heavily seasonal series.

18 - heaviest trend and seasonality

The most seasonal series is easily isolated.

hsib %>%
features(Cost, feat_stl) %>%
filter(seasonal_strength_year == max(seasonal_strength_year)) %>%
left_join(hsib, by = c(“Concession”, “Type”, “ATC2”)) %>%
ggplot(aes(x = Month, y = Cost)) + geom_line() +
facet_grid(vars(Concession, Type, ATC2))

19 - most seasonal

2dv. All features

A table of all the features discussed can also be obtained.

hsib %>%
features(Cost, feature_set(pkgs=”feasts”))

20 - all features

2dvi. Outlying time series

We can use PCA to identify outlying time series. This is fairly involved computationally, so we refer you to Hyndman and Athanasopoulos (2020).

3. Define a model (specify)

Before we formally choose a modelling method, there are some further adjustments we might want to make for our data.

3a. Seasonally adjusted data

We will discuss seasonality more later on, but let’s quickly touch on why we might want to model seasonally adjusted data rather than the original data (although often we won’t).

Working with seasonally adjusted data is useful if the variation due to seasonality is not the main interest. For example, monthly unemployment data is highly affected by seasonality but we are more interested in the underlying state of the economy. Many economic series such as unemployment are usually seasonally adjusted.

Seasonally adjusted series are comprised of the trend-cycle subseries and the remainder subseries. Downturns or upturns can be misleading in such series and interpreting turning points and changes in direction, should be done with the trend-cycle component only.

Seasonally-adjusted series can be estimated using moving average smoothing, although more sophisticated methods are available. For more on moving average smoothing, see “Appendix – moving average smoothing”.

A common technique is to use STL() to decompose the data, then use ETS() to model the seasonally adjusted series.

Below is a plot of the seasonally adjusted data.

20 - seasonally adjusted

3b. Mathematical transformation

The main idea here is to adjust for heteroskedasticity (non-constant variance). However, there are other considerations that can be taken into account, such as requiring forecasts to be positive or within certain limits.

3bi. Adjust for heteroskedasticity

This is typically accomplished using a Box-Cox transformation.

2 - box cox

\lambda typically ranges between -1 and 2, but can obviously be any other value. A larger magnitude of lambda indicates a stronger transformation.

For interpretation, it often pays to choose a simple value of \lambda. For example, the logarithmic transformation \lambda=0 is often used and probably the best-regarded transformation.

The log transformation is useful because it can help with non-constant variance and has an easy interpretation. For example, compare an increase in the original versus the log scale.

Original scale

x\mapsto x+3

Log base 10 scale (base 10 is used to make the effect look clearer, but in almost all cases, we should replace all instances of 10 with e in the example below)

log_{10}(x)\mapsto log_{10}(x)+3

x\mapsto10^{log_{10}(x)+3}

x\mapsto10^{log_{10}(x)}\times10^{3}

x\mapsto10^{3}x

A square root transformation is potentially useful for data with a Poisson distribution. Often no transformation is needed.

Forecasting results (the predicted future values) are relatively insensitive to the value of \lambda, but prediction intervals are strongly affected.

Note: We discuss additive decomposition in this article. Multiplicative decomposition is like a Box-Cox transformation with \lambda=1, except we apply the log to both sides of the time series regression equation.

Here is the plot after a log transformation, which appears to a more stable variable.

21 - post transformation 2

3bii. Ensuring forecasts stay within limits

For a positive forecast, simply apply a log (or square root) transformation.

To constrain within (a,b), use the scaled logit transformation on your series v_t. Let us called the transformed version z_t.

z_t = scaledlogit(v_t) = \log\left(\frac{v_t-a}{b-v_t}\right)

Now we have transformed a variable v_t which naturally ranges between (a,b) and transformed it so that its values range between (-\infty,\infty). (Informally, the ‘link’ function is the scaled logit.)

At the end, get the predicted values for z_t and reverse the transformation.

v_t = \frac{(b-a)e^y}{1+e^y} + a.

Notes

  • Prediction intervals from these transformations have the same coverage probability as on the transformed scale. This is because quantiles are preserved when using monotonically increasing transformations.

  • If you choose an artificial (and possibly unrealistic) constraint, the forecast distributions can become extremely skewed.

4. Train the model (estimate) – ARIMA approach

Introduction

At this point we are at a cross-roads and can choose different models for our time series modelling.

I have chosen to discuss (seasonal) ARIMA in this article, but I would like to mention several other methods too.

Other methods (some of which use combinations of many methods):

  • Forecasting with decomposition (see “Appendix – Forecasting with decomposition”)
  • Exponential smoothing (ETS)
  • Dynamic regression
  • Neural network
  • Frequency domain models
  • Average of various methods

ARIMA models study autocorrelations in the data. In contrast, exponential smoothing models are more focussed on the trend and seasonality in the data. Frequency domain models are useful for studying cyclic behaviour.

It should be noted that taking the simple average of many methods is often considered the optimal approach for forecasting (Hyndman & Athanasopoulos, 2020). We will discuss this later in the article.

Another thing to note is that a disadvantage of a purely univariate approach is that other factors are ignored, such as changing market conditions, increased advertising and innovation of competitors.

When we try to use another time series to try to explain a time series we are interested in, this is called multivariate time series. For example, we can try to model daily ice cream sales according to the maximum temperature on those days, along with previous ice cream sales. A common technique for that this kind of analysis in economics is vector autoregression (VAR), but it should be noted that other approaches are available.

Note about residuals

Regardless of the model, we will be studying the behaviour of the residuals. In particular, we are interested in the “innovation residuals”, the residuals on the transformed scale (if a transformation has been used in the model).

For example, suppose we modelled the logarithms of the data, w_t = log(y_t). Then the innovation residuals are given by w_t - \hat{w}_t whereas the regular residuals are given by y_t - \hat{y}_t.

When we refer to residuals from now on, we refer to the innovation residuals unless otherwise specified.

ARIMA approach

The ARIMA modelling process is shown below.

arimaflowchart

Figure: General process for forecasting using an ARIMA model. (Hyndman & Athanasopoulos, 2020)

We will advocate for using the automated way to find Box-Jenkins Seasonal ARIMA models using the ARIMA() function, effectively jumping steps 3-5 in the diagram above. (See the right side of the flowchart.) However, the resulting model does not always perform well on diagnostics (step 6). Assuming step 6 is produces unsatisfactory results, we will then explore steps 3-5 on the left side of the flowchart. This allows us to return to step 6 and, if satisfactory, move onto step 7.

We have already completed the steps.

  • Plot the data and identify any unusual observations.
  • If necessary, transform the data (using a Box-Cox transformation) to stabilise the variance.

Hence, the remaining steps are stated below.

  1. Decide on s, the seasonality parameter
  2. Automatic ARIMA()
  3. Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model.
  4. Once the residuals look like white noise, calculate forecasts.
  5. [Steps 5-9 are an alternative to steps 2-4. Specifically Steps 5-7 are a manual substitute to Step 2.] If the data are non-stationary, take first differences of the data until the data are stationary.
  6. Examine the ACF/PACF: Is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?
  7. Try your chosen model(s), and use the AICc to search for a better model.
  8. (Repeat step 3) Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model. (Part 2) Assuming the previous diagnostics were successful, check the normality of the residuals to see if confidence intervals will be accurate.
  9. (Repeat step 4) Once the residuals look like white noise, calculate forecasts.

Note that we assume step 3 in the steps above are initially unsatisfactory (i.e. the residuals do not look like white noise). If step 3 is initially successful, we can simply omit steps 4-8.

Note on step 7: When models are compared using AICc values, it is important that all models have the same orders of differencing.

4a. Set up for ARIMA

The full seasonal ARIMA model is ARIMA(p,d,q)(P,D,Q)[s], where {P,D,Q} are related to seasonality and s is how many time periods it takes for seasonal behaviour to repeat.

The equation can be written as

(1-\phi_1B - \cdots - \phi_p B^p)(1-B)^d y_t = c + (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t.

There are some things to keep in mind when thinking about ARIMA models.

22 - ARIMA comments

(Hyndman & Athanasopoulos, 2020)

These parameters also have a significant affect on prediction intervals. The higher the value of d, the faster the prediction intervals increase in size.

  • For d = 0, the long-term forecast standard deviation will go to the standard deviation of the historical data, so the prediction intervals will be predominantly uniform.
  • For d \geq 1, the prediction intervals will continue to grow into the future.

4b. Automatic ARIMA()

The ARIMA() function from the fable package in R will automatically return an ARIMA model for you.

The default parameters in ARIMA() are stepwise variable/feature selection and the use of approximations depending on sample size. Here, keep everything as default except that we set stepwise to FALSE, so that exhaustive search is used.

We obtain ARIMA(2,1,3)(0,1,1)[12]. Note that we have monthly data, so s=12.

22 - model2

Note: The fable::ARIMA() function uses an alternate parameterisation of constants to stats::arima() and forecast::Arima(). (See “Appendix – Fable parameterisation”.)

Note: The example from Hyndman and Athanasopoulos (2020) uses the model ARIMA(3,0,1)(0,1,2)[12] instead.

4c. Check the residuals (residual diagnostics)

In this section, we check important assumptions regarding residuals.

  • The residuals plot allow us to check if the residuals have zero mean (i.e. the forecast is biased).
  • The Ljung-Box test allows us to check if the residuals are correlated.

Note that adjusting for bias is simple as we can simply add a constant to all forecasts. Despite being easy, it is important to make this adjustment if necessary.

We will also check less important assumptions for residuals:

  • The residuals plot allows us to check if the residuals have constant variance.
  • We can use a histogram and/or Q-Q plot to check if the residuals are normally distributed.

4ci. Residual plots and ACF plots

Here are some diagnostic plots of the residuals, w_t, of the model.

23 - residual diagnostics

There is no obvious pattern in the residuals plot and we appear to have zero mean. There may be some increasing variance at the end, but overall the plot does not appear to have non-constant variance. (If there were non-constant variance, we might need to adjust our Box-Cox transformation, but sometimes the problem cannot be fixed using a Box-Cox transformation.)

The ACF plot is an initial precursor to check for correlation between residuals, but we will use the more accurate Ljung-Box test.

The histogram plot of the residuals allows us to look at how they are distributed. A normally distributed histogram is helpful, as we will discuss soon, but we can also use a Q-Q plot to double check.

4cii. The Ljung-Box test

The Ljung-Box test is used to check if the residuals are uncorrelated (as otherwise there is extra information that can be incorporated into the model).

We have the hypotheses below.

H_0: w_t is white noise

H_1: w_t is not white noise

OR

H_0: \rho_w(k+1)=...=\rho_w(H)=0 is white noise

H_1: At least one of the \rho_w(i) is not equal to 0, where i={k,...,H}

We have chosen H=120 here. k is the number of coefficients in our model, which is 6.

The p-value given by the feasts::ljung_box is 0.2553928.

We cannot reject the null hypothesis at any typical significance level and conclude that there is insufficient evidence that w_t is not white noise. (Or more informally, we ‘accept’ the null hypothesis at any typical significance level and conclude that w_t ‘is’ white noise.)

Note: If the Ljung-Box test is failed, the model point predictions can still be used, but the prediction intervals may be misleading.

4ciii. Q-Q plot

Note: We might skip this step if the Ljung-Box test is unsatisfactory.

23 - q-q plot2

The residuals appear to have a heavy-tailed distribution here. This means out prediction intervals may not be accurate. (Our point forecasts can still be quite good if the other assumptions are satisfied.)

Note: we can also look at the distribution of residuals on the bottom right of the diagram in the previous section.

5. Check model performance (evaluate)

5a. Training and test data

Note: 5c. Cross-validation provides a more sophisticated version of this step, but it is highly computationally taxing.

If you want to leave out some of your observations (e.g. leave out a few most recent observations as ‘test data’) to see if your model can forecast it well, you can split your data in training and test data. Sometimes test data is referred to as the “hold-out set” or “out-of-sample data”.

The size of the test set is typically about 20% of the observations or larger to meet the maximum forecast horizon required. However, this is also affected by how large the sample is.

Let us leave out the final two years of data from our training data, i.e.

  • training data is from July 1991 to June 2006, and
  • test data is from July 2006 to June 2008.

5b. Accuracy

We will provide a quick outline of point and interval forecasting accuracy measures below. For more information, see “Appendix – Accuracy”.

5bi. Evaluating point forecast accuracy

The forecast errors we will focus on are:

  • Scale-dependent error type 1 – root mean square error (RMSE); complex and related to the mean
  • Scale-dependent error type 2 – mean absolute error (MAE); simpler and related to median
  • Percentage error – mean absolute percentage error (MAPE)
  • Scaled error – mean absolute scaled error (MASE)

5bii. Evaluating distributional forecast accuracy

Types of scores

  • Quantile scores – evaluates quantiles
  • Winkler score – evaluate a prediction interval, rather than a few quantiles
  • Continuous ranked probability score – whole forecasting distribution, rather than particular quantiles or prediction intervals
  • Skill scores – Scale-free comparisons

40 - snaive interval

5biii. Accuracy results

Below, we have results for:

  • Point accuracy measures
  • Interval accuracy measures (Winkler)
  • Distributional accuracy measures

39 - accuracy3

Interestingly, the seasonal naïve method does well for interval accuracy.

Note: Unlike comparing AICc values, we can mix and match different orders of first differencing and seasonal differencing here.

Note: In this example, it is hard to find a model that passes all the diagnostic tests. In practice, we normally still use the best model we could find.

5c. Cross-validation

Cross-validation in some sense ‘generalises’ the steps from 5a. Training and test data and 5b. Accuracy, by considering many more possible splits of training and test data than just the single one chosen in those steps. A disadvantage is that it is highly computationally intensive.

As usual, we want to use a model that incorporates different amounts of the earlier observations to predict one or more of the observations held out.

The forecast error typically increases as the forecast horizon increases.

Results for cross-validation (test) and training using ETS and ARIMA(2,1,3)(0,1,1)[12] for the forecasting performance of 1- to 4-step-ahead forecasts.

We used the ETS model as it has better convergence properties for the cross-validation algorithm we used.

52 - cross validation results2

A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation. Here the ETS model outperforms our ARIMA(2,1,3)(0,1,1)[12] model.

Note

Usually we compute

  • one-step forecasts on the training data (known as the “fitted values”), and
  • multi-step forecasts on the test data.

For the latter, the forecast variance usually increases with the forecast horizon, so if we are simply averaging the absolute or squared errors from the test set, we are combining results with different variances. 1-step errors on the test data is one resolution to this issue (note: the test data are still not used to estimate the parameters and the model is not re-estimated in this case). This approach can be used to compare one-step forecasts from different models.

To apply compute

  • multi-step forecasts on the training data, or
  • one-step forecasts on the test data…

please see Hyndman and Athanasopoulos (2020), as these are uncommonly used.

6. Produce forecasts (forecast)

We can calculate forecasts once we achieve satisfactory results in the Ljung-Box test.

Note that a forecaster should keep records of forecasts and use them to obtain feedback when the corresponding observations become available. Changes occur, and you need to monitor these in order to evaluate the decision rules and assumptions from step 0.

6a. Option 1: Calculate prediction intervals using the normal distribution

We can only do this if we are happy that our residuals sufficiently resemble a normal distribution. We cannot do this here due to conclusions from the Q-Q plot.

6b. Option 2: Calculate prediction intervals using bootstrapping

To use this option, we still need to obtain satisfactory results in the Ljung-Box test, but we no longer need to achieve a normal distribution for the residuals.

The idea is that, assuming future errors will be similar to past errors, we can replace the forecast error, e_T+1, by sampling from the residuals (past errors).

Unlike calculating prediction intervals using the normal distribution, bootstrapping does not produce symmetric prediction intervals.

1,000 is a standard amount of repeated sampling via bootstrapping.

The forecast is shown below.

24 - bootstrap forecast

The following image is a close-up on the most recent periods.

24 - bootstrap zoom

WARNING

Regardless of modelling method, prediction intervals are usually too narrow since various sources of error are ignored. Modelling methods only account the variation in the errors. There is also

  • variation in the parameter estimates, and
  • in the model order.

Another assumption is that historical patterns will continue into the future.

6c. Forecast combinations (Simple average)

We combine four models: ETS, ARIMA, STL-ETS, NNAR

(Alternatively, just use ETS, ARIMA, STL-ETS.)

Taking the simple average of many methods is often considered the optimal approach (Clemen, 1989, as cited in Hyndman & Athanasopoulos, 2020).

We simulate the models to create the prediction interval.

65 - forecast combination

6d. Bootstrapping and bagging time series

6di. Creating bootstrapped time series

Later on, we bootstrap the residuals of a time series in order to forecast (similar) future values. This bypasses the assumption that the residuals should be normally distributed.

However, it is possible to use bootstrapping immediately to enhance our modelling.

The idea is that we can generate new time series that are similar to our observed series using the following steps.

  1. (Already discussed) Transform the time series if necessary.
  2. Use STL to decompose into trend-cycle, seasonal and remainder components.
  3. Check that the decomposition has adequately captured the trend and seasonality. There should be no clear remaining signal in the remainder series.
  4. Obtain shuffled versions of the remainder component to get bootstrapped remainder series.
  5. Use a “blocked bootstrap,” where contiguous (next in sequence) sections of the time series are selected at random and joined together.
  6. Add the bootstrapped remainder series to the trend and seasonal components.
  7. Reverse the transformation from the first step.

Note: For the blocked bootstrap step, because there may be autocorrelation present in the remainder series determined by STL, we cannot use simple re-draws to complete this step.

In the end, we are able to obtain variations on the original time series.

6dii. Bagging

For each of these series, we fit an ETS model. A different ETS model may be selected in each case, although it will most likely select the same model because the series are similar. However, the estimated parameters will be different, so the forecasts will be different even the selected model is the same. This is a time-consuming process as there are a large number of time series to model.

Bergmeir et al. (2016) show that, on average, bagging gives better forecasts than just applying ETS() directly. Of course, it is slower because a lot more computation is required.

The resultant model with bagging results in red is displayed below.

41 - bagged

6e. (Interlude)

We have succeeded in producing what look to be reasonable forecasts.

However, let us assume that we did not obtain a satisfactory Ljung-Box test result (or, for whatever reason, we don’t want to use bootstrapped prediction intervals to circumvent our non-normal residuals). We can use manual methods to look for a better seasonal ARIMA model. That is what the next few steps focus on. After that, we will study methods of assessing the accuracy of forecasts, which is relevant to both the bootstrapped prediction intervals as well as the models obtained in the next few steps. 

[The following parts are an optional return to Step 4, but using manual steps rather than an automatic algorithm. Steps 4d to 4f are an alternative to Step 4b.]

4d. If the data are non-stationary, take first differences of the data until the data are stationary.

First differences are the change between adjacent observations. Seasonal differences are the change between one year to the next.

4di. Sample ACF and sample PACF plots

We look at the sample ACF and PACF plots, otherwise known as the SACF and SPACF plots respectively.

25 - original sacf2

In the SACF plot, we see damping oscillations with peaks at observations 12, 24, 36, etc. This is classic seasonal behaviour. In the PACF, we see a spike near lag 12 (although there is also an unexpected spike at lag 1). These both tend to suggest, particularly the SACF, that we take the first seasonal difference.

We take the first seasonal difference and replot.

26 - seasondiff sacf

An indication of a linear trend is slow exponential decay in SACF plot and a spike at lag 1 (at the very first lag) in the SPACF plot. There is some indication of this behaviour here, so we proceed to take a first difference too.

27 - doublediff sacf

Finally, the plots look relatively acceptable.

4dii. Confirm differencing

We can use the following command (Kwiatkowski-Phillips-Schmidt-Shin or KPSS test) to confirm how many differences we need to take.

h02sib %>% features(Cost, unitroot_ndiffs) # first differences

h02sib %>% features(Cost, unitroot_nsdiffs) # seasonal differences

In this case, we can confirm that one basic and one seasonal difference is sufficient.

In practice, it is virtually never necessary to go beyond second-order differences.

4e. Examine the ACF/PACF: Is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?

Recall that the full seasonal ARIMA model is ARIMA(p,d,q)(P,D,Q)[s], where {P,D,Q} are related to seasonality. We ignore the seasonal parameters for now and focus on ARIMA(p,d,q).

We have already settled on d=1.

4ei. SACF and SPACF plots

We can study the plot after double differencing again.

27 - doublediff sacf

ARIMA(p,d,0) is an AR model, so we need

  • SACF undergoes exponential decay or sinusoidal.
  • SPACF(k) \approx0 for k=p+1,p+2,..., and
  • Pick p as the last lag that exceeds the white noise confidence intervals in the SPACF plot.

ARIMA(0,d,q) is an MA model, so we need

  • SPACF undergoes exponential decay or sinusoidal.
  • SACF(k) \approx0 for k=q+1,q+2,..., and
  • Pick q as the last lag that exceeds the white noise confidence intervals in the SACF plot.

Neither of these two cases are relevant here as there are lags that exceed the boundaries quite far down, which are unlikely to be good models.

4eii. SEACF plot

It is possible to skip this step and use AICc to determine all of {p,q,P,Q}, but we will mention a faster alternative here, which is to use the ‘sample EACF’ (SEACF) plot to determine candidates for p and q.

Note that AICc is not appropriate for determining the value of d, as these AICc values are not comparable. When models are compared using AICc values, it is important that all models have the same orders of differencing.

When models are compared using AICc values, it is important that all models have the same orders of differencing.

29 - seacf2

Note: We perform this plot on the (transformed and) differenced series.

We can discern two things from the SEACF plot.

  • Confirming the seasonality – a lot x’s occur near MA(12), so we are confident that there is seasonality and that it occurs within each 12 months.
  • Possible values for p,q are obtained by looking for the triangles closest to the top left corner that include the least x’s and start on an ‘o’. There are usually multiple options.

The possible ARIMA(p,1,q) models are

  • ARIMA(2,1,3) (pictured) but there is an x at MA(13)
  • ARIMA(4,1,1) but there are some x’s at MA(7) for this one
  • ARIMA(1,1,4)
  • ARIMA(0,1,5)

Note that we expect x’s around MA(12) due to seasonality.

ARIMA(2,1,4) and ARIMA (4,1,3) avoid some of the x’s mentioned but we are generally looking for a parsimonious model.

4f. Use the AICc to determine the seasonal orders

Recall that the full seasonal ARIMA model is ARIMA(p,d,q)(P,D,Q)[s], where {P,D,Q} are related to seasonality.

We have already determined d=1, D=1 and s=12.

Let us continue with, {p=2,q=3}, giving us ARIMA(2,1,3)(P,1,Q)[12]. This happens to be the same result as in the automatic exhaustive ARIMA fitting, but let us proceed with finding a good model ‘manually’. In practice, we might want to use one of the alternatives discussed in the SEACF section, as the primary thing we’re trying to fix is the Q-Q plot. However, I tried all the alternatives and even though this section on AICc loops will provide different answers, all of them have the same issues with the Q-Q plot.

Now we can use a loop in R to find appropriate values for P and Q, as there aren’t many good ways to determine these parameters visually.

30 - aicc4

Out of a maximum P and Q of 3, we determine P=0, Q=1.

Therefore, our model is ARIMA(2,1,3)(0,1,1)[12]. This is the same exact model as the automatic exhaustive ARIMA fitting.

Note: For your interest, the other SEACF-determined candidate models would have provided the following results under this step assuming maximum P and Q of 3.

  • ARIMA(4,1,1)(2,1,1)[12]
  • ARIMA(1,1,4)(0,1,2)[12]
  • ARIMA(0,1,5)(2,1,2)[12]

Note: When models are compared using AICc values, it is important that all models have the same orders of differencing.

4g. Check the residuals (residual diagnostics)

Now, it is appropriate to return to 4c. Check the residuals (residual diagnostics). Work onwards from there and when you get back to 6e. (Interlude), skip these sections again and jump to step 7.

(In other words, Steps 4d to 4f are an alternative to Step 4b. Refer back to the ARIMA flowchart.)

7. Prediction intervals for aggregates

A common situation is wanting to forecast the total amount for the next year, despite the fact that we have monthly predictions.

We can simply add up our next 12 point forecasts (as they are ‘means’ by default).

Prediction intervals are more complicated due to correlations between forecast errors. We can simply use simulations.

For 12 months,

  • our mean estimate is $12.0 million, and
  • our 95% prediction interval is [$10.7 million, $13.3 million].

Our simulation mean estimate is also $12.0 million as expected. The scale of these numbers is in agreement with our monthly predictions from earlier. (The monthly estimates are around $1 million, so to add up to $12 million in a year seems reasonable.)

8. Judgmental adjustments

There may be factors that we have not accounted for in our statistical model yet, such as promotions, public holidays or extremely recent events.

Principle 1: Use sparingly

It should be noted that, ideally, the previous (‘methodical’) steps should be implemented so as to minimise the need for judgmental adjustments. In general, adjustments should be used sparingly. In particular, end users, who are usually not trained in these methods, often make undesirable adjustments (Hyndman & Athanasopoulos, 2020).

Principle 2: Go big when appropriate

Judgmental adjustments are most effective when there a lot of extra information available or there is strong evidence that an adjustment is needed. Adjustments seem to be most accurate when they are large in size (Hyndman & Athanasopoulos, 2020).

Principle 3: Apply a structured approach

A policy of having to document and justify adjustments is recommended (Hyndman & Athanasopoulos, 2020).

Principle 4: Consider forecasts first 

If adjustments are implemented in a group meeting, the team should consider the forecasts of key markets or products first, which is the most tiring part of the process. This leads to fewer adjustments throughout the day, which is usually preferred.

Principle 5: Don’t confuse setting targets with producing forecasts

The way in which forecasts is used will clearly depend on managerial decision making. For example, management may decide to adjust a forecast upwards (over-optimistic) after a cost-benefit analysis reveals that the cost of holding excess stock is much lower than that of lost sales. This type of adjustment should be part of setting goals or planning supply, rather than part of the forecasting process.

Appendix – tidyverse

tidyverse packages are useful for large datasets and manipulating and sifting through matrices in that context is reminiscent of relational database syntax (such as SQL).

# last 20 observations
h02sib %>%
slice(n()-19:0) # note: 19

# subset observations from each key (i.e. slice works with groups)
# e.g. subset the first year of data from each time
# series in the data
aus_retail %>%
group_by(State, Industry) %>%
slice(1:12) # notice drop in number of rows

PBS %>% filter(ATC2 == “H02”) %>%
select(Scripts) %>%
group_by(ATC1, Type) %>%
slice(1:12) # notice drop in number of rows

# slice_max is useful for extracting the most extreme
# observations. For example, the highest closing stock
# price for Google, Apple, Facebook and Amazon
gafa_stock %>%
group_by(Symbol) %>%
slice_max(Close)

PBS %>%
group_by(ATC1) %>%
slice_max(Scripts)

Appendix – Complex seasonality

So far, we have mostly considered relatively simple seasonal patterns such as quarterly and monthly data. However, higher frequency time series often exhibit more complicated seasonal patterns. For example, daily data may have a weekly pattern as well as an annual pattern. Hourly data usually has three types of seasonality: a daily pattern, a weekly pattern, and an annual pattern. Even weekly data can be challenging to forecast as there are not a whole number of weeks in a year, so the annual pattern has a seasonal period of 52.179 on average.

We don’t necessarily want to include all of the possible seasonal periods in our models — just the ones that are likely to be present in the data. For example, if we have only 180 days of data, we may ignore the annual seasonality. If the data are measurements of a natural phenomenon (e.g., temperature), we can probably safely ignore any weekly seasonality.

We will discuss several options.

  • The STL approach (option 1) is preferable when the seasonality changes over time.
  • The dynamic harmonic regression approach (option 2) is preferable if there are covariates that are useful predictors as these can be added as additional regressors.

Note: we omit Option 3: Complex seasonality.

Option 1: Simple approach with STL decomposition

Weekly data is difficult to work with because the seasonal period (the number of weeks in a year) is both large and non-integer. The average number of weeks in a year is 52.18. Most of the methods we have considered require the seasonal period to be an integer. Even if we approximate it by 52, most of the methods will not handle such a large seasonal period efficiently.

The simplest approach is to use an STL decomposition along with a non-seasonal method applied to the seasonally adjusted data.

# The following code creates a template for a future
# function call. It never references us_gasoline.
my_dcmp_spec <- decomposition_model(
STL(Barrels),
ETS(season_adjust ~ season(“N”))
)
# Barrels will be decomposed using STL
# Then the seasonally adjusted series (originl series
# with seasonality removed) will be modelled using ETS

us_gasoline %>%
model(stl_ets = my_dcmp_spec) %>% report()

us_gasoline %>%
model(stl_ets = my_dcmp_spec) %>%
forecast(h = “2 years”) %>%
autoplot(us_gasoline)

Option 2: Medium difficulty approach with dynamic harmonic regression model

The number of Fourier terms was selected by minimising the AICc. The order of the ARIMA model is also selected by minimising the AICc, although that is done within the ARIMA() function.

gas_dhr <- us_gasoline %>%
model(dhr = ARIMA(Barrels ~ PDQ(0,0,0) + fourier(K=6)))

gas_dhr %>%
forecast(h = “2 years”) %>%
autoplot(us_gasoline)

Appendix – Time series of small counts

If our data contains small counts (less than 100), then we need to use forecasting methods that are more appropriate for a sample space of non-negative integers.

There is one simple method which gets used in this context Croston’s method. Actually, this method does not properly deal with the count nature of the data either, but it is used so often, that it is worth knowing about it.

This model is not stochastic.

j06 <- PBS %>%
filter(ATC2==”J06″) %>%
summarise(Scripts = sum(Scripts)) %>%
fill_gaps(Scripts=0)
j06 %>% autoplot(Scripts)

fc <- j06 %>%
model(CROSTON(Scripts)) %>%
forecast(h = 6)

fc %>% autoplot(j06)

Appendix – Forecasting with decomposition

y_t = \hat{S}_t + \hat{A}_t

(Multiplicative approach also possible)

To forecast a decomposed time series, we forecast the seasonal component, \hat{S}_t, and the seasonally adjusted component \hat{A}_t, separately. It is usually assumed that the seasonal component is unchanging, or changing extremely slowly, so it is forecast by simply taking the last year of the estimated component. In other words, a seasonal naïve method is used for the seasonal component.

To forecast the seasonally adjusted component, any non-seasonal forecasting method may be used. For example, a random walk with drift model, or Holt’s method (discussed in Chapter 8), or a non-seasonal ARIMA model (discussed in Chapter 9), may be used.

Appendix – Differencing versus detrending

The first two methods are relatively similar and the only difference is how to deal with the trend-cycle component of the time series. In the former, we use ‘differencing’ and in the latter, we ‘detrend’ by subtracting an estimate of the trend-cycle using either a parametric or nonparametric method.

In other words, detrending involves decomposing a time series into trend-cycle, seasonal and irregular components; while, the differencing method uses differencing to remove (overall) trend-cycle as well as the trend in the seasonality, while the remaining seasonality and irregular components are considered together. Usually differencing is more widely applicable and preferred, unless you are specifically interested in the irregular component (Shumway and Stoffer, 2017).

Appendix – Fable parameterisation

The only difference is that the coefficient(s) for the constant/mean will differ.

In fable, the parameterisation used is:

(1-\phi_1B - \cdots - \phi_p B^p)(1-B)^d y_t = c + (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t

In stats and forecast, an ARIMA model is parameterised as:

(1-\phi_1B - \cdots - \phi_p B^p)(y_t' - \mu) = (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t

where (1-B)^d (y_t - \mu t^d/d!) = y_t' - \mu,

\mu =E[(1-B)^d y_t] (i.e. expected value refers to the mean of the expression), and

c = \mu(1-\phi_1 - \cdots - \phi_p ).

Appendix – Moving average smoothing

Moving average smoothing allows us to model the trend-cycle of a time series, although more sophisticated methods are available.

The trend-cycle of a time series is the same as a seasonally-adjusted version of a time series.

An equally weighted moving average is shown below.

3 - moving avg

where

m=2k+1.

k is an integer so that m is odd and we can have a symmetric series.

A larger m or k leads to a smoother (moving averaged) series.

Though is often odd, we can apply moving average smoothing with an even m twice to also achieve a symmetric series.

For example, apply with m as 4, then apply with m as 2. This is called a “2\times4-MA”.

5 - 2x4 moving avg

The result is

4 - moving avg2

For quarterly data, each quarter of the year is given equal weight since the first and last components combine to a quarter.

The seasonal variation will be averaged out. This can similarly be achieved using 2\times8-MA or 2\times12-MA.

If the seasonal period is even and of order m, we use 2\times m-MA. For example, 2\times12-MA is used for monthly data.

If the seasonal period is odd and of order m, we use m-MA. For example, 7-MA is used to estimate the trend-cycle of daily data with weekly seasonality.

Other choices will not fully remove seasonality.

Sources

Forecasting: Principles and Practice. 3rd Ed. (2020) by Rob J Hyndman and George Athanasopoulos

Time Series Analysis and Its Applications. 4th Ed. (2017) by Shumway and Stoffer

Applied Time Series Analysis (notes) by Tao Zou

Graphical Data Analysis (notes) by Michael Martin