Now that we have a validated model, and we’ve tested it forecasting some of our actual data, it’s time to run a backtest. This will probably take a while, so let’s take a minute to think about model efficiency.
We know that, unless we cap the total number of weeks on which we run the regression, the model complexity is going to rise at least at \(O(n^2)\), because the Gaussian Process requires us to pass an NxN matrix as the beta covariance. The good news is that we have, relatively, a tiny data set compared to what a machine learning expert would want for, say, a recursive neural net. And Stan is improving quickly, with the possibility to parallelize individual chains and use GPUs promised soon (I’m writing this in September, 2017). At the moment, I’m estimating the last week for which we have data, in which N will be 451, should take about 4 & 1/2 hours. So we are early on the inevitable quadratic complexity curve. It may be possible to keep ahead of the increasing complexity of the model due to improvements in Stan’s algorithms and in the hardware on which they run. Failing that, we might take shortcuts like using the optimizer with initial values from the previous run.
To see if it’s viable to use the optimizer, we will need to see how much previous betas change over time in the full generative model. I’m going to use Syracuse University’s Research Computing Lab to run all the backtesting periods independently so we can see how much drift we might expect.
This means I’m going to have to prepare 400 separate data files, one for each week with exactly the right amount of data. I could probably rewrite the Stan code to do it in the transformed data block, but R is much easier to debug, and these files are small.
library(tidyverse)
library(lubridate)
library(rstan)
if(!dir.exists("../data/weekly")) {
  dir.create("../data/weekly")
}
load("../data/forecasting.rData")The command line version of Stan wants data in a particular format that can be generated by rstan::stan_rdump. Annoyingly, stan_rdump insists on receiving a vector of character strings naming objects in the environment, rather than a list of the actual data, so I’ll be making my modifications in a pocket environment rather than just assigning everything to a list. We could mess around with a new environment directly, using new.env, but any function creates its own environment, so let’s go that route.
dump_data <- function(N, path, A=dim(exogs)[2], min_N1=52, 
                      N2=1, B=dim(exogs)[3], y_offset=104,
                      p_endo=endo, p_exogs=exogs, p_y_cov=y_cov,
                      FIRST_DATA_SET=52, debug=F) {
  N1 <- min_N1 + N
  y <- na.omit(p_endo[(1+y_offset):(y_offset+N1),-1])
  x <- aperm(p_exogs[(1+FIRST_DATA_SET):(FIRST_DATA_SET+N1+N2),,],c(1,3,2))
  y_cov <- p_y_cov[1:(N1+N2),,]
  if(debug) {
    return(list(N1=N1,N2=N2,A=A,B=B,y=y,x=x,y_cov=y_cov))
  } else {
    rstan::stan_rdump(c("N1","N2","A","B","y","x","y_cov"), 
                    file=paste0(path,"gp_",N,".dat"))
  }
}
walk(0:399, dump_data, path="../data/weekly/")I compiled the gp_forecast.stan using CmdStan which lets you run a single chain directly from the bash shell. This means it’s easy to script. Syracuse uses Condor in its computer labs, which runs jobs on idle processors. The disadvantage is that your job can be arbitrarily killed if somebody logs on. As a result, about 1450 of the 1600 jobs I needed to run (400 sets of data above times 4 chains each) finished in 8 hours, but the remainder took days. Condor has hooks for resuming a run, and really you could just start over with fewer iterations and concatenate the samples if interrupt, but Stan doesn’t have an easy mechanism to make that happen. I think after the MPI stuff comes through (and that should be soon!), this would be structured quite differently.
In case you’re curious, the condor job submit code looks like this:
executable      =/home/cnaylor/bin/gp_forecasts 
arguments       ="data file=/home/cnaylor/weekly/gp_$(Process).dat sample num_samples=500 num_warmup=500 random seed=8675$(Process) output file=samples_$(Process)_0" 
output          = ../logs/output_$(Process)_0 
error           = ../logs/error_$(Process)_0 
log             = ../logs/gp.log 
request_cpus    = 1 
Requirements    = Memory >= 100 
Image_Size      = 101000 
Rank            = kflops 
queue 400 Then repeat with _[1-3] for an additional set of chains. I actually wound up running 5, for reasons explained below. If someone knows a way to loop the chains in Condor, I’m all ears.
At the end of this process, I have 1600 files full of generated parameter samples, totaling about 13GB of data. For obvious reasons, I’m not including them in git.
Let’s have a look.
I don’t have 13GB of RAM on this machine, and R keeps everything in memory, so I’m going to load and process this data one fit at a time. It’s good practice not to assume you’ll be able to hold all results in memory anyway. ggplot2 is a memory hog, and you do not want to be stuck moving stuff in and out of swap when you try to graph something big.
As a result, I’ll do my analysis in one pass of loading data files, then look at the result later.
We need to: * Check model fits and convergence on at least a representative sample of chains * Check the stability of betas across runs * Assemble a complete set of weekly forecast distributions to compare to actual returns
A little house-keeping first. We need to translate the numbered files to corresponding dates.
N2dt <- function(N, return_all=F, MIN_N1=52, FIRST_DATA_SET=52,
                 N2=1, dts=as.Date(dimnames(exogs)[[1]])) {
  OFFSET <- MIN_N1+N2
  if(return_all) {
    return(dts[OFFSET:(OFFSET+FIRST_DATA_SET+N)])
  }
  else {
    return(dts[OFFSET+FIRST_DATA_SET+N])
  }
}
load_fit <- function(N, DATA_PATH="../data/backtest", 
                     dts=as.Date(dimnames(exogs)[[1]]),
                     kill_stuck=T) {
  #We have an issue with stuck chains in my current formulation, so remove those.
  #Since R is also 1-indexed, the index equals the number on the data files + 1
  files <- paste0(DATA_PATH,"/samples_",N,"_",0:4)
  exist <- sapply(files,file.exists)
  if(!all(exist)) {
    if(all(!exist)) {
      message(paste("No samples for",N))
      return(NULL)
    }
    nc <- nchar(files[1]) #filenames will always be same length
    #warning(paste0("Loading files for ", N, 
    #               ", but missing chains: ", 
    #               paste(substr(files[!exist],nc,nc), 
    #                    collapse=", ")))
    files <- files[exist]
  }
  fit <- tryCatch( {
    fit <- rstan::read_stan_csv(files)
    stuck <- get_stuck_chains(fit)
    if(any(stuck)) { #Slow, but safer than trying to drop chains from the StanFit object. 
      #Stan team disapproves of this.
      if(sum(stuck) > 3) {
        stop("Too many stuck chains")
      }
      fit <- rstan::read_stan_csv(files[!stuck])
    }
    fit@model_name <- format(N2dt(N, dts=dts), "%Y-%m-%d")
    return(fit)
  },
  error=function(cond) {
    message(paste(N, "failed.",cond))
    return(NULL)
  }
  )
}CmdStan has generated a large number of log files, which I can search for common error messages like ‘divergent transitions’, or ‘max_treedepth reached’. Just use the classic find -> grep pattern: find ./logs/* -exec grep divergent {} \;, and see if anything comes up. Nothing did for me. You could potentially have divergences if you got unlucky with the seed. This is generally not likely, but certainly a possibility with GPs.
The other thing to check is that all of the chains actually ran to completion. Recall this is not a guarantee with the HTCondor setup. find ./data/backtest/ -iname "samples*" -exec grep -LH terminated {} \; > not_terminated reveals for me 9 stragglers, which isn’t bad out of 1600 total runs. I reran these guys on dedicated hardware. I could also have grepped ‘Abnormal Termination’ on the log files.
For convergence, our main problems will be the length_scale and the beta correlation matrix. From what I can find in the literature, there’s no canonical way to check convergence of chains. Usually I would eyeball them, but here we have too many runs.
My backtesting is producing a lot of stuck chains. Michael Betancourt (recently published)[https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html] a set of case studies of GP fits which recommends strong (i.e. informative) priors on the length_scale parameter. Because I am fitting several length_scales over a number of different periods, this isn’t ideal. For now, I’m running 5 chains for the backtest, and discarding stuck chains. We’ll see if this is adequate.
get_stuck_chains <- function(fit, tol=1e-5) {
  #infer stuck chains from mean step size
  require(stringr)
  stuck <- get_adaptation_info(fit) %>%
          str_match("Step size = (.*)\n") %>%
          `[`(,2) %>% #is this approved tidy syntax? probably not.
          as.numeric() <= tol
  return(stuck)
}I will be keeping only point estimates of historical betas by run. If I kept the full distribution, we would have too much to look at. The exception is the last row of betas in each of these runs, which are forecasts. I want to keep those distributions so I can see how much of the endogenous forecast variance is due to variance in the betas, and how much to the given covariance of y.
get_betas <- function(fit, N, dts=N2dt(N,T),
                      beta_names=dimnames(exogs)[[3]],
                      n_iter=500) {
  betas <- extract(fit, pars="beta")[["beta"]]
  median_beta <- apply(betas,2:3,median)
  dimnames(median_beta) <- list(Date=format(dts,"%Y-%m-%d"),beta=beta_names)
  forecast_beta <- betas[,dim(betas)[2],]
  dimnames(forecast_beta) <- list(iterations=NULL,beta=beta_names)
  return(list(median=median_beta,forecast=forecast_beta))
}I want the full distribution for forecasts, too.
get_yhat <- function(fit, y_names=dimnames(exogs)[[2]]) {
  ys <- extract(fit, pars="new_y")[["new_y"]][,1,]  #The stan code is written so you can forecast beyond one week, but we didn't do that
  colnames(ys) <- y_names
  return(ys)
}Now we need to iterate over these functions. I’m going to pre-allocate data structures since we know exactly how each should be shaped. I think an old-fashioned loop is clearer than lapply or the purrr functions here, as we’re doing four things at once. If I didn’t want to minimize disk access, I’d use separate lapplys for each variable (or call purrr:map).
As a further wrinkle, I have variable numbers of chains for each fit. Best to put all the data in long form to accomodate this. It’ll end up there anyway when I’m graphing. We do not want to grow a data frame, however. This is (circle 2 of the R Inferno)[http://www.burns-stat.com/pages/Tutor/R_inferno.pdf]. Best to allocate a list and combine everything at once at the end.
N <- 400
all_dts <- N2dt((N-1), T)
betas <- vector('list', N)
forecast_betas <- vector('list', N)
length_scales <- vector('list', N)
yhat <- vector('list', N)
length_scales <- vector('list', N)
for(i in 0:(N-1)) {
  if(i %% 50 == 0) {
    message(i)
  }
  dt <- N2dt(i, dts=all_dts)
  fit <- load_fit(i, dts=all_dts)
  if(!is.null(fit)) {
    p_betas <- get_betas(fit, i, beta_names=beta_names)
    betas[[i+1]] <- p_betas$median
    forecast_betas[[i+1]] <- p_betas$forecast
    
    yhat[[i+1]] <- get_yhat(fit, y_names=y_names)
    length_scales[[i+1]] <- extract(fit,pars=c("length_scale"))[["length_scale"]]
    names(betas)[i+1] <- format(dt,"%Y-%m-%d")
  }
  else {
    message(paste(i,"failed"))
  }
}3 of these failed, but had ‘Adaption terminated’ in the samples. Looks like I’d be better off checking if the sample file has “Elapsed” at the end.
betas_l <- map_dfr(betas[!sapply(betas, is.null)],
               ~data.frame(Date=as.Date(rownames(.x)),.x) %>%
                 gather(beta,value,-Date), .id="run_date") %>%
        mutate(run_date=N2dt(as.numeric(run_date)))
betas_l %>% ggplot(aes(x=Date,y=value,col=run_date)) + facet_wrap(~beta) +
  geom_line(alpha=0.1) + scale_color_gradient() + ylim(-1,1) + 
  xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45,hjust=1))Variance of some betas drops considerably as we add more data. There also looks to be a bad run or two around the 2015 holiday. This would be right around the time the Yen made a historic run cheaper. Just a guess, at this point.
Let’s take a closer look at one.
betas_l %>% filter(beta=="eqy_z") %>%
  filter(Date %in% sample(unique(Date),9)) %>%
  ggplot(aes(x=run_date,y=value)) + facet_wrap(~Date) +
    geom_line() + ggtitle("Values for eqy_z at specific Dates over time")Doesn’t look too bad, although in production I think we’d need to let the chains run quite a bit longer.
How does the overall distribution look, ignoring the ordering given by run_date? I’ll be using these ribbon plots time and again. I always used to graph distributions over time as violins, and I’m endebted to Jim Savage for demonstrating this setup instead.
betas_l %>% group_by(Date, beta) %>%
  summarize(min=quantile(value, 0.025),
            low=quantile(value, 0.16),
            mid=median(value),
            high=quantile(value,0.84),
            max=quantile(value,0.975)) %>%
 ggplot(aes(x=Date)) + facet_wrap(~beta,ncol=1,scales="free_y") +
  geom_ribbon(aes(ymin=min,ymax=max), fill="salmon") +
  geom_ribbon(aes(ymin=low,ymax=high), fill="red") +
  geom_line(aes(y=mid), col="black", alpha=0.7) +
  ggtitle("Beta distribution over different run dates")Ok, I’d better take a quick look at the data close to NYE 2014.
map_dfr(dimnames(exogs)[[3]],
        ~ data.frame(Date=as.Date(dimnames(exogs)[[1]]), exogs[,,.]) %>%
          gather(asset,value,-Date), .id="beta") %>% 
  mutate(beta=dimnames(exogs)[[3]][as.numeric(beta)]) %>%
  filter(Date >= ymd(20141201) & Date <= ymd(20150131)) %>%
  ggplot(aes(x=Date,y=value,col=asset)) + facet_wrap(~beta) +
    geom_line() + guides(col="none") 
Hm. What about endos?
endo %>% gather(asset,value,-Date) %>%
  filter(Date >= ymd(20141201) & Date <= ymd(20150131)) %>%
  ggplot(aes(x=Date,y=value,col=asset)) + geom_line()Ah-ha! Not the Yen, but when CHF went off the peg1. Well, that’s legit. If this were a Kalman filter it would blow out \(\Sigma_\beta\) for a while, assuming you have put an exponential decay on that matrix (the standard example has fixed values). We’ll probably see similar for the Gaussian Process regression.
Of course, these are all betas fitted with hindsight. What did the forecast betas look like?
not_run <- sapply(forecast_betas, is.null)
names(forecast_betas) <- sapply(1:400, function(x){format(N2dt(x),"%Y%m%d")})
f_beta_dist <- map2_dfr(forecast_betas[!not_run], 
                        names(forecast_betas)[!not_run],
                       function(x, dt) {
                         x %>% as_tibble() %>%
                           gather(beta,value) %>%
                           group_by(beta) %>%
                           summarize(min=quantile(value, 0.025),
                              low=quantile(value, 0.16),
                              mid=median(value),
                              high=quantile(value,0.84),
                              max=quantile(value,0.975),
                              Date=ymd(dt)) %>%
                          ungroup()
                       })
f_beta_dist %>% ggplot(aes(x=Date)) + 
  facet_wrap(~beta,ncol=1,scales="free_y") +
  geom_ribbon(aes(ymin=min,ymax=max), fill="salmon") +
  geom_ribbon(aes(ymin=low,ymax=high), fill="red") +
  geom_line(aes(y=mid), col="black") +
  ggtitle("Forecast Betas")You can really see the impact of the CHF. Note also the high degree of uncertainty in the true beta value compared to the size of the median beta. Simple optimization would lead you to a point estimate (hopefully) at the median displayed above, but you would have to model this uncertainty separately. Here, we have an integrated model, and it’s telling us that FX forecasting is hard.
In production, the portfolio manager would know about the Swiss peg, while the model does not. You could handle this several ways. At my fund, we just temporarily constrained our optimizer to reject positions in CHF. As it was pegged to the EUR, we reasoned, we would prefer to take our risk there, rather than lose a few basis points in carry each month on the CHFEUR cross waiting for the peg to break. This decision was at least methodologically parsimonious.
With the formulation we’re using here, we are specifying the asset return covariance, which implies that we are modeling it separately. We could artificially raise the CHF \(\Sigma\) to reflect the risk endogenous to our covariance model (in our case, simple exponentially-weighted historical covariance).
The best solution would probably be to choose an alternate distribution for the Y forecasts. We know multivariate normal is not correct; currency returns have fat tails, and the CHF breaking its peg is a perfect example. Stan offers multivariate Student’s T, which is a bit better, and we would only need to estimate a single additional parameter, \(\nu\). For anything more complicated than that, we would probably need to switch to a vectorized univariate likelihood, in which we stack the Y values on top of each other and alter the distribution spread parameter according to the asset to which each Y refers (examples of hierarchical models in the Stan manual show how to set this up).
All of this is beyond the scope of what I want to demonstrate here. Instead, we should expect the spike in uncertainty after the Swiss broke their peg to follow through into the forecast Y values.
Here is the distribution of forecasts for each week generated by Stan:
not_run <- sapply(yhat, is.null)
names(yhat) <- sapply(1:400, function(x){format(N2dt(x),"%Y%m%d")})
yhat_dist <- map2_dfr(yhat[!not_run], 
                        names(yhat)[!not_run],
                       function(x, dt) {
                         x %>% as_tibble() %>%
                           gather(asset,value) %>%
                           group_by(asset) %>%
                           summarize(min=quantile(value, 0.025),
                              low=quantile(value, 0.16),
                              mid=median(value),
                              high=quantile(value,0.84),
                              max=quantile(value,0.975),
                              Date=ymd(dt)) %>%
                          ungroup()
                       })
yhat_dist %>% ggplot(aes(x=Date)) + 
  facet_wrap(~asset,ncol=1) +
  geom_ribbon(aes(ymin=min,ymax=max), fill="salmon") +
  geom_ribbon(aes(ymin=low,ymax=high), fill="red") +
  geom_line(aes(y=mid), col="darkgrey") +
  ggtitle("Forecast Asset Returns") Once again, we have very low signal to noise. We can check how well the model forecasts fit the actual data by lagging the asset returns back to reality, then overlaying them on our forecast.
yhat_dist %>% ggplot(aes(x=Date)) + 
  facet_wrap(~asset,ncol=1, scales="free_y") +
  geom_ribbon(aes(ymin=min,ymax=max), fill="salmon") +
  geom_ribbon(aes(ymin=low,ymax=high), fill="red") +
  geom_line(aes(y=mid), col="darkgrey") +
  geom_line(data=endo %>% gather(asset,value,-Date) %>%
              group_by(asset) %>% 
              mutate(`Actual Return`=lag(value, order_by = Date)),
            aes(y=`Actual Return`)) +
  ggtitle("Forecast Asset Returns (actual in black)")We expect the actual returns to exceed the light pink band 5% of the time, and the red band about 28% of the time.
require(knitr)
yhat_dist %>% left_join(endo %>% gather(asset,value,-Date) %>%
              group_by(asset) %>% 
              mutate(value=lag(value, order_by = Date)),
              by=c("asset","Date")) %>%
  summarize(`2 SDs`=sum(value > max | value < min) / length(value),
            `1 SD`=sum(value > high | value < low) / length(value)) %>%
  mutate_all(funs(sprintf("%.0f%%",.*100))) %>% 
  kable(caption="% Values outside of")| 2 SDs | 1 SD | 
|---|---|
| 3% | 23% | 
My covariance is a bit wider than a normal distribution would be. But then we know the Ys have fat tails, so that’s fine.
We now have a full backtest of the factor model, and we’ve validated the forecasts against actual values. We’ve also confirmed that the signals from these factors are swamped by the noise of weekly movements. To reiterate, the trick with these models is to be consistently a bit better than random. The next step is to create optimized weights using the forecast. As we have used generative modeling, we have the covariance of the forecasts in addition to their point estimate, which should be equivalent to the median. We therefore do not need a separate risk model, although it would be advisable to validate forecast covariance against one. Apart from better factors, the single biggest area in which there is room for improvement is in the Y covariance modeling. With judicious risk management, we can build a winning portfolio on this basis that is mostly2 uncorrelated to equities or other traditional assets.
This looks a week early because I led the endogenous data back in Forecasting.↩
I say mostly because, in a crisis, otherwise unrelated assets become correlated by the virtue of demands on the capital of investors.↩