Model Overview

This follows Jim Savage’s Bayesian Workflow

We are going to run a dynamic linear regression, in which inputs X relate to asset returns Y on the basis of betas B, which vary over time as a Gaussian process. Gaussian processes are discussed at length in chapter 18 of the Stan reference manual.

In full: \[ y_{t+1} \sim N(\beta_t X_t, \Sigma_y) \\ \beta_d \sim MN(0, \Sigma_{N,d}, \Sigma_{\beta}) \\ \left(\Sigma_{N,d}\right)_{i,j} = \alpha_d^2 \exp\left(-\frac{1}{2\rho_d^2}(i-j)^2\right) + \delta_{i,j} \\ \rho_d \sim \gamma^{-1}(4,4) \\ \alpha_d \sim N(0.25, 0.5), \quad \alpha_d > 0 \]

\[ \delta_{i,j} \begin{cases} \sim N(0,1) & i = j \\ = 0 & i \neq j \end{cases} \]

\[ \Sigma_\beta = \Omega_\beta' \tau \Omega_\beta \\ \Omega_\beta \sim \mathrm{LKJCorr}(4) \\ \tau \sim \mathrm{Cauchy}(0, 1), \quad \tau > 0 \]

\[ \Sigma_y = \Omega_y' \sigma \Omega_y \\ \Omega_y \sim \mathrm{LKJCorr}(3) \\ \sigma_a \sim \mathrm{Cauchy}(0, 3), \quad \sigma_a > 0 \]

In plain English:

  1. Next period’s Y depends on this period’s inputs X, weighted by this period’s betas \(\beta\). The Ys move together across assets with covariance \(\Sigma_y\); \(N\), here, denotes the multivariate normal distribution.

  2. Each \(\beta_d\) is distributed matrix-variate normal, with \(\mu = 0\). Each \(\beta_d\) is correlated with itself over all times \(t \in \{1, \dots, T\}\) with covariance matrix \(\Sigma_{N, d}\). Additionally, all \(\beta\)s covary with each other according to \(\Sigma_\beta\). We include this term because the exogenous variables are not independent.

  3. Each \(\Sigma_{N, d}\) is a Gaussian process, in which the degree of similarity between elements varies according to their squared Euclidean distance from one another (cf. Stan Reference Manual 2.17.0, pp. 246-247). As the length of each measurement period is constant, the distance between row \(i\) and row \(j\) is simply the difference between \(i\) and \(j\). (See the short R sketch after this list.)

  4. \(\rho\) and \(\alpha\) are parameters defining the speed at which \(\Sigma_{N,d}\) terms become unrelated to each other. \(\delta\) is the scale of the noise term in the regression, and guarantees a positive-definite matrix. Betas are not independent of each other, but their parameters \(\alpha\) and \(\rho\) are.

  5. \(\Sigma_\beta\) is the covariance matrix between betas. My exogenous variables are correlated. For example, I have both the relative level of 2Y rates and the relative level of changes in 2Y rates. We account for this with two additional parameters: \(\Omega\), representing the correlation, and \(\tau\), representing the variance. \(\tau\) is going to interact with \(\Sigma_{N,d}\) as a whole, so we’ll need to do some testing to decide what the prior distribution should look like. The ratio between the two is going to determine whether covariation or variation over time dominates, and we want the variation over time to be more important. I’m putting a half-Cauchy distribution in for now, as that’s the prior the Stan developers recommend, and it is weighted closer to 0 than, say, the Gaussian distribution.

  6. Similarly, \(\Sigma_y\) is the covariance matrix between endogenous variables, indexed as \(a\).
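
To make item 3 concrete, here is a short base-R sketch of that kernel over integer time indices (illustration only; the Stan programs below use the built-in cov_exp_quad, and the 1e-9 jitter mirrors the delta term in that code):

# Sketch: squared-exponential kernel over integer time indices, with a small
# diagonal jitter to keep the matrix positive-definite.
sq_exp_kernel <- function(N, alpha, rho, jitter = 1e-9) {
  d <- outer(seq_len(N), seq_len(N), "-")   # i - j for every pair of times
  alpha^2 * exp(-0.5 * (d / rho)^2) + diag(jitter, N)
}
K <- sq_exp_kernel(10, alpha = 0.25, rho = 1.0)
round(K[1:5, 1:5], 4)  # covariance decays as |i - j| grows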

We may never get to the full implementation of the model above, particularly as Y is directly observable, so there are better and easier ways to estimate its covariance. We are going to start with a single-asset, single-beta implementation of that model. Here, \(\beta\) becomes just multivariate normal with respect to time, and y becomes normally distributed.
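
Concretely, the reduced model (with the priors used in the Stan program below, and writing \(\beta\) for what the code calls eta) is:

\[ y_t \sim N(\beta_t x_t, \sigma) \\ \beta \sim MVN(0, \Sigma_N) \\ \left(\Sigma_N\right)_{i,j} = \alpha^2 \exp\left(-\frac{1}{2\rho^2}(i-j)^2\right) \\ \rho \sim \gamma^{-1}(5, 5) \\ \alpha \sim N(0.5, 1), \quad \sigma \sim N(0, 1), \quad \alpha, \rho, \sigma > 0 \]

with a small (1e-9) jitter added to the diagonal of \(\Sigma_N\) for numerical stability.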

Gaussian Process Regression for a Single Asset, Single Factor

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(rstan) #2.17 hasn't been released yet for rstan, but we'll be using cmdstan 2.17.
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

NB: RStudio actually provides a stan chunk engine, which will compile the model when the chunk is executed. However, in my experience RStudio gets unstable when one runs a large stan model inside it. Additionally, the chunks don’t show up when you compile notebooks to html with knitr. I’m just going to state the model here. This is stan/single_xy_gp.stan in the repository.

cat(read_file("../stan/single_xy_gp.stan"))
// multi-dimensional input latent variable GP with single y
// cf. manual 2.17.0, pp 253-254, 
// Gaussian Processes for Machine Learning, pp 119-122
data {
  int<lower=1> N;
  vector[N] x; //Exogs
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
  real time[N]; // Time sequence
  vector[N] mu = rep_vector(0, N);
  for(n in 1:N)
    time[n] = n;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  matrix[N, N] K = cov_exp_quad(time, alpha, rho);
  matrix[N, N] L_K;
  //perturb diagonal elements
  for (n in 1:N) 
    K[n, n] = K[n, n] + delta;
  L_K = cholesky_decompose(K);
    
  rho ~ inv_gamma(5, 5);
  alpha ~  normal(0.5, 1);
  sigma ~ normal(0, 1);
  eta ~ multi_normal_cholesky(mu, L_K);

  y ~ normal(eta .* x, sigma);
}

Let’s follow the workflow I linked at the top and see if this simple model can recover a set of known parameters.

Generate Fake Data

We’ll need to write a function in stan that can simulate results when given the parameters. The model I stated above can simulate Ys (and the distribution of those simulated Ys gives you the forecast distribution), but the code has \(\rho\), \(\alpha\), and \(\sigma\) as parameters to be estimated. We want to specify these, generate fake data on the basis of that specification, then apply the model above to see if it can recover them.

This is the simulation code:

cat(read_file("../stan/single_xy_gp_sim.stan"))
// multi-dimensional input latent variable GP with single x and y
// cf. manual 2.17.0, pp 253-254, 
// Gaussian Processes for Machine Learning, pp 119-122
functions {
  vector eta_rng(int N, real alpha, real rho) {
    matrix[N,N] K;
    matrix[N,N] L_K;
    real time[N];

    for(n in 1:N)
      time[n] = n;
    K = cov_exp_quad(time, alpha, rho);
    for(n in 1:N)
      K[n,n] = K[n,n] + 1e-9;
    L_K = cholesky_decompose(K);
    return multi_normal_cholesky_rng(rep_vector(0,N), L_K);
  }
}
data {
  int<lower=1> N;
  vector[N] x; //Exogs
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
transformed data {
  real delta = 1e-9;
  vector[N] mu = rep_vector(0, N);
  real time[N]; // Time sequence
  for(n in 1:N)
    time[n] = n;
}
parameters {
  vector[N] y;
}
model {
  y ~ normal(eta .* x, sigma);
}

The body of the model is the same as the fitting version, except I’ve moved the parameters into data, and y into parameters.

I’ve also written a function at the top to generate the \(\eta\)s. The first time I did this, I tried to get away with a simple AR(1) process, but \(\eta\) depends on two of the other parameters, and the priors are not necessarily intuitive. Let’s generate \(\eta\) over a range of parameter values so you can see what I mean.
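
For contrast, my first attempt looked roughly like the sketch below (the persistence and innovation scale here are arbitrary illustration values, not parameters of the model):

# Illustration only: a simple AR(1) path, roughly what I first tried for the betas.
ar1_beta <- function(N, phi = 0.9, innov_sd = 0.1) {
  beta <- numeric(N)
  for (t in 2:N) beta[t] <- phi * beta[t - 1] + rnorm(1, sd = innov_sd)
  beta
}
plot(ar1_beta(100), type = "l", main = "AR(1) sketch of a beta path")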

First we need to compile the model.

sim_model <- stan_model(model_name='single_xy_gp_sim',
                   model_code=read_file("../stan/single_xy_gp_sim.stan"))

Now we can expose the eta_rng function to play with.

rstan::expose_stan_functions(sim_model)
SEED <- 8675309 #Let's keep the randomness consistent.
cross(list(alpha=seq(0.01,1,by=0.2), rho=seq(0.1,1,by=0.2))) %>%
        map_dfr(function(x, N, seed){data.frame(alpha=x$alpha,
                                             rho=x$rho,
                                             time=seq.int(N),
                                             eta=eta_rng(N,x$alpha,x$rho,seed))}, 
                seed=SEED, N=100) %>%
  ggplot(aes(x=time,y=eta)) + facet_grid(alpha~rho) + geom_line() +
    ggtitle("Random Betas for different values of alpha & rho") +
    labs(x="Facets by rho",y="Facets by alpha")

You can clearly see how \(\alpha\) (vertical) controls the amplitude and \(\rho\) (horizontal) is the length scale.

set.seed(SEED) 
fake_data = list(N=100, alpha=0.25, rho=1.0, sigma=0.05)
fake_data <- within(fake_data, {
  x <- rnorm(N)
  eta <- eta_rng(N, alpha, rho,seed=SEED)
})
fake_data$eta %>% as_tibble() %>% mutate(t=seq.int(100)) %>% 
  gather(beta,value, -t) %>% ggplot(aes(x=t,y=value)) +
  geom_line() + ggtitle("Fake beta over time")

with(fake_data,x*eta) %>% as_tibble() %>% mutate(t=seq.int(100)) %>% 
  gather(beta,value, -t) %>% ggplot(aes(x=t,y=value)) +
  geom_line() + ggtitle("Fake XB over time")

It’s not all that likely that we’ll be able to get close to these betas with only 100 data points, but I’d hope at least that we’ll have something in range on \(\alpha\), \(\rho\), and \(\sigma\).

sim_samples <- sampling(sim_model, fake_data, cores=2, chains=2)

Now we have 2000 possible series based on those parameters (two chains, 1,000 post-warmup draws each). Let’s see how different they are. We’re going to wind up using the median to test recovering parameters.

sim_y <- extract(sim_samples)$y
sim_y %>% t() %>% as_tibble() %>%
  select(num_range("V", sample.int(2000, 100))) %>%
  mutate(Date=seq.int(100)) %>% gather(sample,value,-Date) %>%
  ggplot(aes(x=Date,y=value, group=sample)) + geom_line(alpha=0.05) + 
  ggtitle("Y sims")

This is wildly optimistic compared to the data we have. Relatively speaking, the real Y probably has a \(\sigma\) an order of magnitude higher, as the noise overwhelms the signal. Keep in mind, this is meant to be a stochastic time series, not a random walk. Anyway, let’s see if we can recover those parameters.
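
Before doing that, here is a rough check on how clean the fake series is, comparing the scale of the simulated signal \(\eta \cdot x\) to the noise \(\sigma\) (using the fake data already in memory):

# Rough check: the simulated signal dwarfs the noise (sigma = 0.05);
# real return series would look far noisier.
with(fake_data, c(signal_sd = sd(eta * x), noise_sd = sigma))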

Recover Known Parameters

single_gp_model <- stan_model(file="../stan/single_xy_gp.stan", model_name='single_xy_gp')
single_fit <- sampling(single_gp_model, 
                       within(fake_data, {y <- apply(sim_y, 2, median)})[c("N","x","y")])

Amazingly, I had no divergent transitions on this one. This sample took 5 minutes to run on my laptop’s i7, 2.3GHz. The full model may take a couple of orders of magnitude longer. Stan is supposed to be adding GPU support for Gaussian processes very soon. That will make this a lot more practical on consumer-grade hardware, as will the addition of cov_exp_quad_cholesky. There’s probably an optimization that could be applied to this kernel, given that the distance term is always an integer, but I don’t know it.

Anyway, let’s see if we came reasonably close to recovering the true parameters.

stopifnot(require(bayesplot))

mcmc_intervals(as.array(single_fit), pars=c("alpha", "rho", "sigma"))

\(\rho\) and \(\alpha\) came in on target, but \(\sigma\) is a little low. Given that I took the median of 2000 simulated ys, that isn’t too surprising.
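
To put numbers on that, we can compare the posterior summaries against the values we simulated with (\(\alpha = 0.25\), \(\rho = 1.0\), \(\sigma = 0.05\)):

# Posterior summaries for the three hyperparameters, to compare with the
# values used to generate the fake data.
print(single_fit, pars = c("alpha", "rho", "sigma"), probs = c(0.05, 0.5, 0.95))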

Were the chains well-mixed?

mcmc_trace(as.array(single_fit), pars=c("alpha", "rho", "sigma"))

So-so for \(\sigma\).
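
We can also check mixing numerically; rstan’s summary reports effective sample sizes and split R-hat for each parameter:

# Rhat should be close to 1 and n_eff not too small if the chains mixed well.
summary(single_fit, pars = c("alpha", "rho", "sigma"))$summary[, c("n_eff", "Rhat")]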

The concept is validated for the simplest possible case. Next, let’s look at multiple endogenous variables. I’m saving multiple exogs for last, because the matrix-variate normal is the most complicated distribution we will need to handle.