Forecasting

At long last, we’re ready to look at the actual data. Along the way, I also need to add a constant to the regression.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(lubridate)
library(rstan) #2.17 hasn't been released yet for rstan, but we'll be using cmdstan 2.17.
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

To backtest (or cross-validate, depending on how you came by your statistical knowledge), we will run the model on successive segments of the data, predicting one period into the future each time. We can generate the full posterior distribution of the forecast by adding a generated quantities block to the model and passing in an additional matrix of exog values for the latest date.
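
In outline, the backtest loop looks something like the sketch below; fit_and_forecast(), FIRST_ORIGIN, and N_WEEKS are hypothetical placeholders around the actual stan() call that appears later in this notebook.

# Rolling one-step-ahead backtest (sketch only): fit on weeks 1..n, forecast week n+1.
forecasts <- list()
for (n in FIRST_ORIGIN:(N_WEEKS - 1)) {
  forecasts[[length(forecasts) + 1]] <- fit_and_forecast(train = 1:n, predict = n + 1)
}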

Here is the full GP model with added hooks for forecasting¹:

cat(read_file("../stan/gp_forecasts.stan"))
// Full gp model, plus forecasting implementation 
data {
  int<lower=1> N1; // # of dts for regression
  int<lower=1> N2;  // # of dts for forecast
  int<lower=1> A; // # of endos
  int<lower=1> B; // # of betas
  matrix[N1,A] y;
  matrix[B,A] x[N1+N2]; // Exog
  cov_matrix[A] y_cov[N1+N2]; //contemporaneous covariance of ys
}
transformed data {
  int<lower=1> N = N1+N2;
  real t[N];
  matrix[A,A] L_y_cov[N];
  real delta = 1e-9;
  vector[B] beta_mu_mu = rep_vector(0, B);
  for(n in 1:N) {
    t[n] = n;
    L_y_cov[n] = cholesky_decompose(y_cov[n]);
  }
}
parameters {
  matrix[N,B] beta;
  real<lower=0> alpha[B];
  vector<lower=0>[B] sigma;
  real<lower=0> length_scale[B];
  cholesky_factor_corr[B] L_omega; //eta correlation
  matrix[N,B] beta_mu; //Instead of 0s, I'm going to 
  //put the beta-wise covariance on this term to avoid
  //having a massive kronecker product matrix.
}
model {
  matrix[N,N] cov[B];
  matrix[B,B] L_Sigma_b;

  to_vector(alpha) ~ normal(0.25, 0.5);
  sigma ~ cauchy(0, 1);
  L_omega ~ lkj_corr_cholesky(4);
  to_vector(length_scale) ~ gamma(4, 4);

  L_Sigma_b = diag_pre_multiply(sigma, L_omega);
  for(b in 1:B) 
    cov[b] = cov_exp_quad(t, alpha[b], length_scale[b]);
  for(n in 1:N) {
    beta_mu[n] ~ multi_normal_cholesky(beta_mu_mu, L_Sigma_b);
    for(b in 1:B)
      cov[b, n, n] = cov[b, n, n] + delta;
  }
  for(b in 1:B)
    beta[,b] ~ multi_normal_cholesky(beta_mu[,b], cholesky_decompose(cov[b]));
  for(n in 1:N1)
    y[n] ~ multi_normal_cholesky(beta[n] * x[n], L_y_cov[n]);
}
generated quantities {
  vector[A] new_y[N2];
  for(n in 1:N2)
    new_y[n] = multi_normal_cholesky_rng(to_vector(beta[N1+n] * x[N1+n]),
                                        L_y_cov[N1+n]);
}
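
Once the model has been fit (the actual stan() call comes later, in the Testing section), the forecast draws from the generated quantities block can be pulled out on the R side. A minimal sketch, assuming the fit object is called gp_fit as it is below:

# Draws of new_y: an iterations x N2 x A array of one-step-ahead forecasts.
new_y_draws <- rstan::extract(gp_fit, pars = "new_y")$new_y
dim(new_y_draws)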

Exogs

Let’s review the data:

load("../data/calculating_factors.rData")

bind_rows(exogs, endo %>% rename(value=endo) %>% 
            select(Date,asset,value) %>%
            mutate(exog="endo")) %>% 
  ggplot(aes(x=Date,y=value)) + geom_line() + facet_grid(asset~exog)

Right, so I decided not to scale carry on a rolling Z-score, as I did with everything else. Have I subtracted USD carry from that yet, at least?

exogs %>% filter(exog=="carry") %>% 
  ggplot(aes(x=Date, y=value)) + facet_wrap(~asset, ncol=3) + 
    geom_line() + ggtitle("Carry values")

Yes. What are the ranges of the different series?

exogs %>% group_by(exog) %>% summarize(high=quantile(value, 0.95, na.rm=T),
                                       low=quantile(value, 0.05, na.rm=T))
## # A tibble: 5 x 3
##      exog       high         low
##     <chr>      <dbl>       <dbl>
## 1   carry 0.08541506 -0.01607872
## 2 d_carry 1.93473119 -1.88780961
## 3   eqy_z 1.48709312 -1.35064430
## 4  spot_z 1.81702173 -1.94670759
## 5    yc_z 1.90032961 -2.52255470

So multiplying carry by 25 will get it roughly onto the same scale as the other series, but not centered. That’s fine, as I want to preserve the meaning of zero as it relates to interest rate differentials.
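
A quick sanity check on that factor, using the 5%/95% quantiles printed above:

c(0.08541506, -0.01607872) * 25  # about 2.14 and -0.40, comparable to the other series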

We need to do four things: add a constant, scale carry, filter the data down, and reshape into arrays. I’ll reshape last, so we can use tidy methods for nearly everything else.

Add Constant, Scale Carry, Filter Down

Note that if we had some kind of multivariate prior on the X values, the zero-variance constant column would make our \(\Sigma_x\) singular, and therefore not positive definite.
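
To see why, here’s a tiny illustration with made-up data: a constant column has zero variance, so the sample covariance matrix picks up a zero eigenvalue.

set.seed(1)
x_demo <- cbind(rnorm(100), rnorm(100), constant = 1)  # made-up factors plus a constant
eigen(cov(x_demo))$values  # last eigenvalue is 0, so the matrix is only positive semi-definite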

On the scaling, the choice of factor is informed by the very data I will be using to fit the model, but I feel ok with this since the factor happens to wind up being a round number.

exogs <- bind_rows(exogs %>% mutate(value=ifelse(exog=="carry",value*25,value)),
                   exogs %>% filter(exog=="carry") %>%
                     mutate(exog="constant",
                            value=1)) %>%
         group_by(asset, exog) %>% arrange(Date) %>% fill(value) %>% ungroup() %>%
         filter(Date >= ymd(20080101) & Date <= ymd(20170825) & wday(Date) == 6) %>%
         filter(!(asset=="USD"))

exogs %>% ggplot(aes(x=Date, y=value)) + facet_grid(asset~exog) + geom_line() + ylim(-4,4)

Reshape

As before, Mr. Wickham’s tidyverse has abandoned functions for handling higher-dimensional arrays, but he wrote the definitive package for handling them, reshape2, years ago.

library(reshape2)
exogs <- acast(exogs, Date~asset~exog)

Endos

How do these look?

endo %>% ggplot(aes(x=Date,y=endo)) + facet_wrap(~asset, ncol=3) + geom_line()

Fine.

I need to lag the exogs by a week since we are forecasting. Never forget this step, or you will get very excited, then very disappointed, by your new strategy. I’m actually going to lead the endos by a week instead, so the endo for this week matches against the exogs from last week. Lag exogs, or lead endos; either works. Just remember which one you did when you put it in production and have to automate updating the data.

We also need to put endo into an NxA matrix instead of long-form.

endo <- endo %>% group_by(asset) %>% 
          mutate(lead_endo=lead(endo, order_by=Date)) %>%
          select(-spd,-carry,-endo) %>%
          ungroup() %>%
          spread(asset,lead_endo)

endo %>% glimpse()
## Observations: 557
## Variables: 12
## $ Date <date> 2007-01-05, 2007-01-12, 2007-01-19, 2007-01-26, 2007-02-...
## $ AUD  <dbl> 0.007586785, 0.009064914, -0.019799756, 0.003121001, 0.00...
## $ CAD  <dbl> 0.0032056513, -0.0008906238, -0.0054733138, -0.0029870450...
## $ CHF  <dbl> -8.324283e-03, -8.054717e-04, -3.439112e-03, 4.800366e-03...
## $ CZK  <dbl> -0.0114433849, 0.0044268978, -0.0151051100, 0.0012768959,...
## $ EUR  <dbl> -0.0054059408, 0.0033167625, -0.0024807297, 0.0041648328,...
## $ GBP  <dbl> 0.0164439622, 0.0084379747, -0.0063536954, 0.0044373690, ...
## $ JPY  <dbl> -1.411995e-02, -7.452373e-03, -2.388342e-03, 3.384612e-03...
## $ NOK  <dbl> -1.327980e-02, 7.869853e-04, 1.783203e-02, 1.208534e-02, ...
## $ NZD  <dbl> 0.007955010, 0.008623572, 0.002832169, -0.020669711, 0.00...
## $ SEK  <dbl> -0.0088588251, 0.0012706074, -0.0015342861, 0.0067909809,...
## $ TRY  <dbl> 0.0084858089, 0.0143930461, -0.0064208005, 0.0194207882, ...

Additional Parameters - \(\Sigma_y\)

We will also need to provide the covariance estimate for Y. As discussed before, I’m assuming that we have a separate forecast for endogenous covariance. In these notebooks I’ll just use the exponentially weighted observed covariance, and repeat the most recently available data for the new periods.

FIRST_DATA_SET <- 52 #Start the forecast after we have at least 52 weeks of data
N1 <- nrow(exogs) - 1 - FIRST_DATA_SET
N2 <- 1 # # of extra periods
A <- ncol(endo)-1

y_cov <- array(NA, dim=c(N1+N2, A, A), 
               dimnames=list(Date=dimnames(exogs)[[1]][(FIRST_DATA_SET+1):dim(exogs)[1]], dimnames(exogs)[[2]], dimnames(exogs)[[2]]))
y_offset <- which(as.Date(rownames(y_cov)[1]) == endo$Date) - 1

lambda <- exp(log(0.5)/52) # 1Y half-life exponential decay
wts <- lambda ^ (seq.int(0, y_offset+N1+N2))
for(n in 1:nrow(y_cov)) {
  ind <- seq.int(y_offset+n-1) #extra -1 in the index because we are lagging these an extra step
  y_cov[n,,] <- cov.wt(endo[ind,-1],rev(wts[ind]), center=F)$cov  
}
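
A couple of quick checks on the weighting scheme, just to confirm the decay rate and the direction of rev():

lambda^52      # 0.5: an observation a year old gets half the weight of the newest one
rev(wts[1:5])  # increasing, so the largest weight lands on the most recent row passed to cov.wt()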

Test

Backtesting and Processing Power

If we run the model on the entire data history at each period, the processing time needed for each fit will keep growing: the GP covariance matrices grow quadratically in size with the number of periods, and their Cholesky decompositions grow roughly cubically in cost. This might be ok, and it’s certainly what I would prefer to do. We want to learn from 2008, and if we forecast on a rolling window, we lose much of the appeal of using a Gaussian Process over a Kalman Filter or Hidden Markov Model. Advances in Stan to permit using GPUs and MPI should make it possible to throw arbitrarily large amounts of processing power at the problem, whereas in the current formulation we are limited by the clock speed of at most 4 processors (one per chain). Alternatively, as I’ve mentioned before, we could abandon generative modeling and use TensorFlow or some other system that already permits parallelization.
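
As a rough illustration of that growth (back-of-the-envelope only, assuming the O(N^3) Cholesky decompositions dominate):

# Relative cost of one fit as the history grows, taking a 52-week fit as the unit.
N <- c(52, 104, 208, 416)
round((N / 52)^3)   # 1, 8, 64, 512: each doubling of history costs roughly 8x per fit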

Testing

Let’s see how long the system takes for just the first year.

N1 <- 52
gp_fit <- stan(file="../stan/gp_forecasts.stan", model_name="gp_forecast",
               data=list(N1=N1, N2=N2, A=A, B=dim(exogs)[3], 
                         y=na.omit(endo[(1+y_offset):(y_offset+N1),-1]),
                         x=aperm(exogs[(1+FIRST_DATA_SET):(FIRST_DATA_SET+N1+N2),,],c(1,3,2)),
                         y_cov=y_cov[1:(N1+N2),,]), chains=4, iter=2000)

Run time was about 45 minutes, and that’s with the fewest parameters we’re going to have. However, if the chains are mixing well, we can probably get away with substantially fewer than 2000 iterations.

require(bayesplot)
mcmc_trace(as.array(gp_fit), pars=c("alpha[1]", "sigma[2]", "length_scale[3]", "L_omega[2,1]"))

Those look good. Let’s look at some of the betas.

mcmc_trace(as.array(gp_fit), pars=paste0("beta[",c(2,5,25,50),",",1:4,"]"))

These are mixing well, but note the scale.

Parameter Scaling

Ideally, we want all fitted parameters on a unit scale. Stan slows down when the best values for parameters are far from that size, and it functions especially poorly when some parameters are orders of magnitude away from others in scale. We could specify a smaller default stepsize to help with the betas, but then the model would have trouble converging on larger parameters, like length_scale. The solution is to scale down the exogenous factors, X, so that the betas are estimated on a scale comparable to the other parameters. For ease of use, I will do this in the model itself; that way, we won’t have to remember to scale in a separate step in production. An alternate solution would be to increase the size of both \(y\) and \(\Sigma_y\), but scaling X is slightly easier and leaves the forecast ys directly interpretable. As we are specifying \(\Sigma_y\), we needn’t worry that \(y\) is nowhere near unit scale.
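
As a toy illustration of why this works (made-up numbers, ordinary least squares rather than the actual model): if a factor is multiplied by s, its coefficient shrinks by a factor of s, so scaling X appropriately lands the betas near unit scale.

set.seed(2)
x <- rnorm(200, sd = 0.02)            # a factor on a very small scale
y <- 50 * x + rnorm(200, sd = 0.1)    # the true beta of 50 is far from unit scale
coef(lm(y ~ x))[2]                    # roughly 50
coef(lm(y ~ I(x * 25)))[2]            # scaling x by 25 brings the beta down to roughly 2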

Back to Testing

How about the beta correlation matrix?

mcmc_trace(as.array(gp_fit), pars=paste0("L_omega[",c(2,3,2,3,3),",",c(1,1,2,2,3),"]"))

Good. I could probably get away with 100 iterations, but let’s do 500 to be safe.

How do the posterior predictions look? I’m anticipating a pretty wide scale around zero.

mcmc_recover_hist(as.array(gp_fit)[,,paste0("new_y[",N2,",",1:A,"]")],
                  true=as.numeric(endo[N1+1,-1]))

Yup. This reflects the low signal-to-noise ratio of the data. The good news is that the values are all reasonable.

On Currency Forecasting in General

The true model for returns in any asset class is the combination of carry and price movements. In FX, the carry is relatively stable but subject to punctuated equilibrium as new data comes to light. In developed markets, this data primarily consists of central bank rate decisions and economic data that might affect those decisions. Price movements, however, are the deterministic result of countless iterative, interacting agents. While we may know many of these agents’ motives, it is not possible to aggregate their behavior with any accuracy, because the actions of each agent are affected by those of all of the other agents, and small measurement errors compound. It is impossible to tell, for example, the periodicity of market data. A day’s worth of price movements at 5-minute increments looks the same as a year’s worth of daily movements. The asset price measures the result of a chaotic, nonlinear dynamical system.

So why bother? What I have given you is the reason I believe no amount of layers of neural nets—or of any other algorithm which takes as an assumption that the data is stochastic—will be able to accurately predict market movements on any human timescale. However, we know that for currencies, interest rate expectations (i.e., forecasts of carry) and other fundamental factors impact the decision-making of the agents I described above. The hope is that, over time, the small differences between predicted means, plus better-than-nothing modeling of the relationships between these factors, and the relationships between the predictions, will add up. The investor will have an edge in harvesting the carry over a portfolio of currencies, whilst weathering idiosyncratic price movements.

Save the additional data manipulations.

save(y_cov,y_offset,N1,N2,A,exogs,endo,file="../data/forecasting.rData")

  1. Unfortunately, knitr doesn't render Stan code correctly in HTML yet, so there's no syntax highlighting; sequences like "int<lower=1>" aren't escaped properly.