Please see my markdown, hosted on GitHub, which shows some surprising results regarding investments based on forecasts and uncorrelated data: it turns out that uncorrelated data can yield better returns.

# Tag Archives: stats

# Forecast ensemble

Over at GitHub I have put the following:

This introduces a few known and a few new forecast functions. It then builds an **ensemble forecast** out of 13 models, using the following steps:

- Learn all models over training period
- Predict h periods ahead and build a weighted Bayesian model of the forecasts
- Retrain the model on training + h to give new forecasts beyond this period (using previous weights)
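
As a rough Python sketch of these steps (a simplification: inverse-MSE weights stand in for the actual Bayesian weight model, and the two toy models below are placeholders for the 13 real ones):

```python
import numpy as np

def ensemble_forecast(train, h, models):
    """Sketch of the ensemble steps: fit each model on the training
    period, weight models by their accuracy over the next h points,
    then refit on training + h and combine with the learned weights.
    Inverse-MSE weighting stands in for the Bayesian weight model."""
    history, holdout = train[:-h], train[-h:]

    # Steps 1-2: learn on the training period, predict h ahead
    preds = np.array([m(history, h) for m in models])   # (n_models, h)
    mse = ((preds - holdout) ** 2).mean(axis=1)
    weights = (1 / mse) / (1 / mse).sum()               # normalised

    # Step 3: retrain on training + h, reuse the previous weights
    new_preds = np.array([m(train, h) for m in models])
    return weights @ new_preds

# Two toy "models": naive last value and mean of the last 3 observations
naive = lambda y, h: np.repeat(y[-1], h)
mean3 = lambda y, h: np.repeat(np.mean(y[-3:]), h)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
fc = ensemble_forecast(y, h=2, models=[naive, mean3])
```

The better-scoring model on the holdout gets the larger weight; the real post replaces the inverse-MSE step with a proper Bayesian weight model.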

It introduces four Bayesian models in Stan:

- ARMA(2, 1)
- ARMA(2, 1) with weighting of observations
- Local linear trend
- Weight model (e.g. it can estimate 13 weights on 13 X variables with only 10 time steps, which is not possible in a frequentist setup)
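
The p > n point can be illustrated outside Stan: a MAP estimate under a Gaussian prior on the weights (equivalent to ridge regression) still gives a unique solution for 13 weights from only 10 time steps, where OLS is underdetermined. The data below is simulated for illustration, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# 13 candidate forecasts (columns) over only 10 time steps:
# OLS has no unique solution here, but a Gaussian prior on the
# weights (ridge / MAP estimate) regularises the problem.
n_steps, n_models = 10, 13
X = rng.normal(size=(n_steps, n_models))
true_w = np.zeros(n_models)
true_w[:3] = [0.5, 0.3, 0.2]
y = X @ true_w + rng.normal(scale=0.05, size=n_steps)

lam = 1.0  # prior precision; larger means stronger shrinkage towards 0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(n_models), X.T @ y)
```

The prior term `lam * np.eye(n_models)` is what makes the 13x13 system invertible despite having only 10 observations.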

Note that most functions have tests around them. You need to load all scripts for `forecastEns()` to run.

# GCSE mean imputation

Many GCSE results are reported in a very compact form. I have written some R code which, via simulation, translates grade brackets into numeric grades. You can read it here. I specifically look at grades and dispersion by gender.
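
The actual R code isn't reproduced here, but the simulation idea can be sketched in Python. The points mapping and the uniform-within-bracket assumption below are illustrative assumptions, not necessarily the original's choices:

```python
import random

# Hypothetical GCSE letter-grade points (old-style A*-G scale);
# the mapping used in the original R code may differ.
points = {"A*": 8, "A": 7, "B": 6, "C": 5, "D": 4, "E": 3, "F": 2, "G": 1}

def impute_bracket(lo, hi, weights=None, n=100_000, seed=0):
    """Mean-impute a reported bracket (e.g. 'A*-C') by simulating
    individual grades inside it.  With no weights, grades within the
    bracket are assumed equally likely -- an assumption, not data."""
    order = list(points)                      # A* .. G, best first
    grades = order[order.index(hi):order.index(lo) + 1]
    rng = random.Random(seed)
    vals = [points[rng.choices(grades, weights)[0]] for _ in range(n)]
    return sum(vals) / len(vals)

mean_grade = impute_bracket(lo="C", hi="A*")   # bracket A*-C
```

Passing a `weights` vector lets you replace the uniform assumption with empirical grade frequencies, which matters when looking at dispersion by gender.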

# Matching sets/distributions

I was interested in how to match two different sets with slightly different distributions. This can be relevant when you want to test for differences between two groups.

Assume you have a long set (1) and a short set (2). First I sort set 2 by its values and estimate the mean difference between consecutive sorted values (the mean diff).

My algorithm passes once through set 1 and tries to find a match in set 2 for every set 1 item. If the current difference/distance is better than both the previous and the next, we have a match (because set 2 is sorted), and we remove the matched item from set 2. I added a condition that the difference must be within X multiples of the mean difference; this ensures that I don't match some remaining large value/distance just because few candidate items remain. I also added another break condition: if the distances keep growing, stop.
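
A minimal Python sketch of this matching pass (the original code lives on repl.it; the "stop when distances keep growing" early exit is omitted here for brevity):

```python
import bisect
from statistics import mean

def match_sets(set1, set2, x=3.0):
    """Sketch of the one-pass matching described above.  set2 is kept
    sorted; for each item of set1 the nearest remaining set2 value is
    found by bisection (the 'better than previous and next' test),
    accepted only if it lies within x multiples of the mean consecutive
    gap, and then removed so it cannot be matched twice.  x is the
    tolerance parameter ('X multiples') from the text."""
    s2 = sorted(set2)
    gaps = [b - a for a, b in zip(s2, s2[1:])]
    tol = x * mean(gaps)

    matches = []
    for v in set1:
        if not s2:
            break
        i = bisect.bisect_left(s2, v)
        # nearest of the two neighbours around the insertion point
        cand = min((j for j in (i - 1, i) if 0 <= j < len(s2)),
                   key=lambda j: abs(s2[j] - v))
        if abs(s2[cand] - v) <= tol:
            matches.append((v, s2.pop(cand)))
    return matches

pairs = match_sets([1.0, 2.1, 9.0, 4.2], [2.0, 4.0, 6.0, 1.1])
```

Because set 2 stays sorted, each lookup is a bisection rather than a full scan, which is where the large saving over n1*n2 comparisons comes from.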

I matched 93% of items in my test. With n1 = 60k and n2 = 10k, this took about 6% of the n1*n2 possible comparisons.

The resulting distribution of the matches is an average of distributions 1 and 2. This means you can no longer assume that the matched items represent the full set 1 or set 2; however, you can compare the matches to each other.

The code can be seen and run here: https://repl.it/BU9Z