Gaussian Process Regression for Multiple Assets, Single Factor

Following on from the model overview and simple implementation notebook, we’ll now expand from a single Y to a set of multiple, correlated Ys.

In equation form: \[ y_{t+1} \sim N(\beta_t X_t, \Sigma_y) \\ \beta \sim N(0, \Sigma_N) \\ \Sigma(N)_{i,j} = \alpha^2exp\left (-\frac{1}{2\rho^2}(i-j)^2 \right ) + \delta_{i,j}\\ \rho \sim \gamma^{-1}(5,5) \\ \begin{matrix}\alpha \sim N(0, 1) & \alpha > 0 \end{matrix} \delta_{i,j} \{\begin{matrix} \sim N(0,1) & i \equiv j\\ 0 & i \neq j \end{matrix} \]

\[
\Sigma_y = \operatorname{diag}(\sigma)\, \Omega_y \, \operatorname{diag}(\sigma) \\
\Omega_y \sim \mathrm{LKJCorr}(3) \\
\sigma_a \sim \mathrm{Cauchy}(0, 3), \quad \sigma_a > 0
\]
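To make the kernel concrete, here’s a direct R transcription of \(\Sigma_N\) (the helper name is mine; the Stan simulation code below uses cov_exp_quad for the same thing, with a tiny jitter standing in for \(\delta\)):

# R transcription of the Sigma_N kernel above, for illustration only.
# The Stan code below uses cov_exp_quad() plus a small diagonal jitter
# (the jitter stands in for the delta term).
sq_exp_cov <- function(N, alpha, rho, jitter = 1e-9) {
  d <- outer(seq_len(N), seq_len(N), "-")             # (i - j) for every pair of time indices
  alpha^2 * exp(-0.5 * (d / rho)^2) + diag(jitter, N)
}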

We need two more parameters to define \(\Sigma_y\). Following best practice in the Stan manual, we separate the correlations from the variances. In the actual code we’ll estimate the Cholesky factor of the correlation matrix, as it is much easier to generate stable positive-definite matrices that way.
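A quick R illustration of why the Cholesky factor of the correlation matrix is all we need: scaling it by the standard deviations yields a valid Cholesky factor of the full covariance, which is what Stan’s diag_pre_multiply construction relies on (toy numbers here, not the simulated values):

# If Omega = L_Omega %*% t(L_Omega) is a correlation matrix and sigma a vector of
# scales, then diag(sigma) %*% L_Omega is a Cholesky factor of
# Sigma_y = diag(sigma) %*% Omega %*% diag(sigma). Toy example:
sigma   <- c(0.2, 0.3, 0.4)
Omega   <- matrix(c( 1, .5, .2,
                    .5,  1, .3,
                    .2, .3,  1), 3, 3)
L_Omega <- t(chol(Omega))            # lower-triangular correlation factor
L_Sigma <- diag(sigma) %*% L_Omega   # what diag_pre_multiply(sigma, L_Omega) builds in Stan
all.equal(tcrossprod(L_Sigma), diag(sigma) %*% Omega %*% diag(sigma))  # TRUE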

As before, we’ll validate the new model by simulating fake data with known parameters, then seeing if we can recover those parameters.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(rstan) #2.17 hasn't been released yet for rstan, but we'll be using cmdstan 2.17.
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Here’s the simulation code:

cat(read_file("../stan/single_x_many_y_gp_sim.stan"))
//latent variable GP with single x and multiple ys
functions {
  vector eta_rng(int N, real alpha, real rho) {
    matrix[N,N] K;
    matrix[N,N] L_K;
    real time[N];

    for(n in 1:N)
      time[n] = n;
    K = cov_exp_quad(time, alpha, rho);
    for(n in 1:N)
      K[n,n] = K[n,n] + 1e-9;
    L_K = cholesky_decompose(K);
    return multi_normal_cholesky_rng(rep_vector(0,N), L_K);
  }
  matrix L_Omega_rng(int A, real eta) {
    return lkj_corr_cholesky_rng(A, eta);
  }
}
data {
  int<lower=1> N;
  int<lower=1> A; // # of assets
  matrix[N,A] x; //Exogs
  real<lower=0> rho;
  real<lower=0> alpha;
  vector[N] eta;
  matrix[A,A] Sigma_y; // covariance matrix for endos
}
transformed data {
  real delta = 1e-9;
  vector[N] mu = rep_vector(0, N);
  real time[N]; // Time sequence
  matrix[A,A] L_y = cholesky_decompose(Sigma_y);
  row_vector[A] mu_y[N]; // I'd like to do diag_pre_multiply(eta, x), but
  // that makes a matrix, which, AFAIK, I would still have to loop through
  // to convert to an array of row_vectors to provide to multi_normal_chol
  for(n in 1:N) {
    time[n] = n;
    mu_y[n] = eta[n] * x[n];
  }
}
parameters {
  row_vector[A] y[N]; 
}
model {
  y ~ multi_normal_cholesky(mu_y, L_y);
}

As I mention in the code, because there seems to be no efficient way to convert from a matrix to an array of row_vectors, I’m having to loop to create the XBs (\(\mu_y\)). I would prefer to have the last line be y ~ multi_normal_cholesky(diag_pre_multiply(eta,x), L_y);, and skip creating mu_y altogether.

Generate Fake Data

I’ll generate the \(\beta\) using a Stan function, as in the previous notebook. I’ll also generate \(\Sigma_y\) partly with a Stan function, as the Lewandowski onion method is not implemented in R (at least not in any package I know of).

sim_model <- stan_model(model_name='single_x_many_y_gp_sim',
                   model_code=read_file("../stan/single_x_many_y_gp_sim.stan"))
rstan::expose_stan_functions(sim_model)
SEED <- 8675309 #Let's keep the randomness consistent.
set.seed(SEED) 
fake_data = list(N=100, A=3, alpha=0.25, rho=1.0)
fake_data <- within(fake_data, {
  x <- matrix(rnorm(N*A), nrow=N, ncol=A)
  eta <- eta_rng(N, alpha, rho,seed=SEED)
  
  L_Omega <- L_Omega_rng(A, 3)
  sigma_y <- sqrt(c(0.05, 0.1, 0.15))
  Sigma_y <- diag(sigma_y) %*% tcrossprod(L_Omega) %*% diag(sigma_y)
})

Let’s get a feeling for the lkj_corr so we know how to set the priors.

cross(list(shape=1:10,iter=1:100)) %>% 
  map_dfr(function(x, seed, A){ 
    rho <- tcrossprod(L_Omega_rng(A, x$shape,seed=seed)) %>% as_tibble() %>%
      rename_all(funs(LETTERS[as.numeric(substr(.,2,3))])) %>%
      mutate(asset=LETTERS[seq.int(A)]) %>%
      gather(asset2,value,-asset) #much uglier than melt(tcrossprod(x))
    rho %>% mutate(shape=x$shape, iteration=x$iter)
  }, seed=SEED, A=3) %>%
  ggplot(aes(x=as.factor(shape),y=value)) + facet_grid(asset~asset2) + geom_violin() +
    ggtitle("Random Correlations for different shape parameters") + xlab("shape")

Here’s what the correlations look like based on the actual weekly returns:

stopifnot(require(reshape2))
load("../data/calculating_factors.rData")
endo %>% select(-spd,-carry) %>% na.omit() %>% 
  spread(asset,endo) %>% select(-Date) %>% cor() %>% 
  melt() %>% filter(value < 1) %>%
  ggplot(aes(x=value)) + geom_density() +
    ggtitle("Distribution of Weekly FX return correlations")

The bulk of the mass sits a bit above 0.5, which you might chalk up to the predominance of USD-specific drivers in most of these rates.

I would say we could set the LKJ prior at 5 or 6 and not miss much. Realistically, when it comes time to make this model more efficient, we might end up just specifying an exponentially-weighted covariance matrix rather than trying to fit one. The LKJ prior is more important for the betas, as we don’t know going in how they are related. We could fit a multivariate GARCH within this model, but I haven’t found that those add much value over a weighted covariance, at least for weekly FX.
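For reference, here’s a minimal sketch of what I mean by an exponentially-weighted covariance, assuming a T-by-A matrix of weekly returns R and a decay lambda (both names are placeholders):

# Minimal sketch: exponentially-weighted covariance of a T x A return matrix R.
# lambda is the decay; the most recent week gets the largest weight.
ew_cov <- function(R, lambda = 0.94) {
  w  <- lambda^((nrow(R) - 1):0)
  w  <- w / sum(w)
  Rc <- sweep(R, 2, colMeans(R))   # de-mean each asset
  crossprod(Rc * sqrt(w))          # sum over t of w_t * r_t r_t'
}

With lambda = 0.94 the effective window is on the order of \(1/(1-\lambda) \approx 17\) weeks.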

fake_data$eta %>% as_tibble() %>% mutate(t=seq.int(100)) %>% 
  gather(beta,value, -t) %>% ggplot(aes(x=t,y=value)) +
  geom_line() + ggtitle("Fake beta over time")

with(fake_data,x*eta) %>% as_tibble() %>% mutate(t=seq.int(100)) %>% 
  gather(beta,value, -t) %>% ggplot(aes(x=t,y=value, col=beta)) +
  geom_line() + ggtitle("Fake XBs over time")

What does the covariance look like?

print(tcrossprod(fake_data$L_Omega))
##             [,1]         [,2]        [,3]
## [1,]  1.00000000 -0.458583348 0.099047423
## [2,] -0.45858335  1.000000000 0.002536903
## [3,]  0.09904742  0.002536903 1.000000000
print(fake_data$Sigma_y)
##              [,1]          [,2]         [,3]
## [1,]  0.050000000 -0.0324267395 0.0085777585
## [2,] -0.032426739  0.1000000000 0.0003107059
## [3,]  0.008577758  0.0003107059 0.1500000000

Now for the ys.

sim_samples <- sampling(sim_model, fake_data, cores=2, chains=2)

This is already taking significantly longer than last time.

How do these ys look? sim_y is going to be a 3D array, and the new ‘tidy data’ world only wants data frames. I’ve always used melt from reshape2 to get arrays into shape. I gather it’s deprecated now, but it still seems faster than adply from plyr, and that package has been superseded by dplyr in any case. My guess is that if things keep going the way they have been, before long rstan::extract will return parameters as data frames anyway.

library(reshape2)
sim_y <- extract(sim_samples)$y
sim_y <- melt(sim_y, varnames = c("iteration", "Date", "asset"))
sim_y %>% filter(iteration %in% sample.int(2000, 100)) %>%
  ggplot(aes(x=Date,y=value, group=iteration)) + facet_wrap(~asset) +
  geom_line(alpha=0.05) +  ggtitle("Y sims")

Hey, apart from scale, these actually look more like the Y values we might expect from weekly currency returns! They were simulated with the following correlations:

cov2cor(fake_data$Sigma_y)
##             [,1]         [,2]        [,3]
## [1,]  1.00000000 -0.458583348 0.099047423
## [2,] -0.45858335  1.000000000 0.002536903
## [3,]  0.09904742  0.002536903 1.000000000
sim_y %>% spread(asset,value) %>% select(-iteration, -Date) %>% cor()
##             1           2          3
## 1  1.00000000 -0.14489546 0.01724343
## 2 -0.14489546  1.00000000 0.04289853
## 3  0.01724343  0.04289853 1.00000000

The XBs affect the final value, too. Let’s see what Stan recovers.

Recover Known Parameters

single_gp_model <- stan_model(file="../stan/single_x_many_y_gp.stan", 
                              model_name='single_x_many_y_gp')
fake_data$y <- sim_y %>% group_by(asset, Date) %>%
  summarize(value=median(value)) %>%
  spread(asset, value) %>% select(-Date) %>% as.matrix()
 
single_fit <- sampling(single_gp_model, fake_data[c("N","A","x","y")])

This one took less than 4 minutes! Are we sure it ran properly?

stopifnot(require(bayesplot))
mcmc_trace(as.array(single_fit), pars=c("alpha", "rho", "L_omega[2,1]",
                                        "L_omega[3,2]", "L_omega[3,1]",
                                        paste0("sigma_y[",seq.int(fake_data$A),"]")))

Looks like we came up far too low on the \(\sigma_y\)s.

mcmc_intervals(as.array(single_fit), pars=c("alpha", "rho", "L_omega[1,1]", 
                                            "L_omega[2,1]", "L_omega[2,2]",
                                            "L_omega[3,1]", "L_omega[3,2]", "L_omega[3,3]",
                                            paste0("sigma_y[",seq.int(fake_data$A),"]")))

fake_data$L_Omega
##             [,1]       [,2]      [,3]
## [1,]  1.00000000 0.00000000 0.0000000
## [2,] -0.45858335 0.88865140 0.0000000
## [3,]  0.09904742 0.05396762 0.9936182

\(L_\Omega\) is passable, although I think the [2,1] element falls outside the estimated interval.

I think in the next complication I’m going to go with the exponentially-weighted covariance I mentioned earlier for the endos. When we get to multiple \(\beta\)s, having two freely determined correlation matrices is probably going to be too much to fit anyway. Additionally, this way the covariance can move over time without specifying \(\frac{A^2-A}{2}\) additional Gaussian processes.

Lastly, I’ve been looking at hyperparameter recovery, but not at \(\eta\) (a.k.a. \(\beta\)). I doubt the model recovered the actual values very well, but are they at least inside the 90% interval roughly 90% of the time?

extract(single_fit, pars="eta")$eta %>% as_tibble() %>%
  mutate(iterations=seq.int(nrow(.))) %>%
  gather(time, value, -iterations) %>% mutate(time=as.numeric(substr(time,2,6))) %>% 
  group_by(time) %>% summarize(min=quantile(value, 0.05),
                               low=quantile(value, 0.16),
                               mid=median(value),
                               high=quantile(value, 0.84),
                               max=quantile(value, 0.95)) %>%
  mutate(actual=fake_data$eta) %>%
  ggplot(aes(x=time)) + geom_ribbon(aes(ymin=min, ymax=max), fill="salmon") +
    geom_ribbon(aes(ymin=low, ymax=high), fill="red") +
    geom_line(aes(y=mid), colour="white") + geom_line(aes(y=actual), alpha=0.5) + 
    labs(title=expression(eta),
         caption="Salmon is the 90% interval, red the 68% (~1 SD) interval, white the median, actual values in black")
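To put a number on the coverage question, here’s a quick check of how often the true \(\eta\) lands inside the central 90% posterior interval:

# Empirical coverage: fraction of time points where the true eta falls inside
# the central 90% posterior interval (should be roughly 0.9 if well calibrated).
eta_draws <- extract(single_fit, pars="eta")$eta   # iterations x N
lower <- apply(eta_draws, 2, quantile, probs=0.05)
upper <- apply(eta_draws, 2, quantile, probs=0.95)
mean(fake_data$eta >= lower & fake_data$eta <= upper)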

Wow, that’s nuts. Want to bet whether we’ll be this accurate with real data? The \(\alpha\) prior should be much higher than what I picked for the simulated data. Which reminds me: we need a constant in the exogenous factors.
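Concretely, that just means prepending a column of ones to the factor matrix (and bumping A to match) before the data go to Stan, something like:

# Not run here: what adding a constant to the exogenous factors would look like.
fake_data$x <- cbind(const=1, fake_data$x)
fake_data$A <- fake_data$A + 1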

On to the multiple X, multiple Y validation.