Matching sets/distributions

I was interested in how to match two sets whose distributions differ slightly. This is relevant when you want to test for differences between two groups.

Assume you have a long set (set 1) and a short set (set 2). First I sort set 2 by value. I also compute the mean difference between consecutive sorted values (the mean diff).
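The preparation step could look roughly like this in Python. It is only a sketch: the function name sorted_with_mean_diff is made up for illustration and is not taken from the linked code.

```python
def sorted_with_mean_diff(values):
    """Return the sorted values and the mean gap between consecutive ones."""
    ordered = sorted(values)
    if len(ordered) < 2:
        # With fewer than two values there is no gap to average.
        return ordered, 0.0
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return ordered, sum(gaps) / len(gaps)


ordered, mean_diff = sorted_with_mean_diff([5, 1, 3, 9])
print(ordered, mean_diff)
```

Because the gaps of a sorted list add up to its range, the mean diff is also (max - min) / (n - 1), so it describes the typical spacing in set 2.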

My algorithm passes once through set 1 and tries to find a match in set 2 for every set 1 item. Because set 2 is sorted, the distance to the current item first shrinks and then grows as I walk through it, so a candidate whose distance is smaller than that of the previous and the next candidate is the match, and I remove the matched item from set 2. I added a condition that the distance has to be within X multiples of the mean diff. This ensures that I don't match some remaining item at a large distance just because few candidates remain. I also added a break condition: if the distances start getting bigger, stop searching.
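Here is a minimal Python sketch of the matching loop described above. It is my reading of the description rather than a copy of the linked code, and the names match_sets and max_multiple (the X from the text) are assumptions.

```python
def match_sets(set1, set2, max_multiple=3.0):
    """Greedily pair each set1 value with the nearest unused set2 value."""
    remaining = sorted(set2)
    gaps = [b - a for a, b in zip(remaining, remaining[1:])]
    mean_diff = sum(gaps) / len(gaps) if gaps else 0.0
    tolerance = max_multiple * mean_diff  # assumed meaning of "X multiples"

    matches = []
    for x in set1:
        best_index, best_dist = None, None
        for i, y in enumerate(remaining):
            dist = abs(x - y)
            if best_dist is not None and dist > best_dist:
                # Sorted input: distances only grow from here, so stop.
                break
            best_index, best_dist = i, dist
        if best_index is not None and best_dist <= tolerance:
            # Accept the match and remove the item so it is not reused.
            matches.append((x, remaining.pop(best_index)))
    return matches


pairs = match_sets([1.0, 2.1, 50.0, 3.9], [4.0, 2.0, 1.1], max_multiple=1.0)
print(pairs)  # [(1.0, 1.1), (2.1, 2.0), (3.9, 4.0)]
```

In this toy run 50.0 stays unmatched because its nearest candidate is further away than the tolerance. The inner loop always scans from the start of the remaining list, which is simple but not the fastest option; a binary search for the starting point would cut the number of iterations.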

In my test with n1 = 60k and n2 = 10k, I matched 93%. It took 6% of the n1*n2 possible iterations, i.e. about 36 million out of 600 million.

The resulting distribution of the matches is an average between distributions 1 and 2. This means you can no longer assume that the matched items represent the full set 1 or set 2. However, you can compare the matches to each other.
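One quick way to see how far the matched items drift from the full sets is to compare their means. The sketch below assumes the matches are stored as (set 1 value, set 2 value) pairs; summarize_matches is a hypothetical helper, not part of the linked code.

```python
from statistics import mean


def summarize_matches(set1, set2, matches):
    """Compare the means of the full sets with the means of the matched items."""
    matched1 = [a for a, _ in matches]
    matched2 = [b for _, b in matches]
    return {
        "set1_mean": mean(set1),
        "set2_mean": mean(set2),
        "matched1_mean": mean(matched1),
        "matched2_mean": mean(matched2),
    }


summary = summarize_matches(
    [1.0, 2.0, 3.0, 10.0],
    [2.0, 3.0],
    [(2.0, 2.0), (3.0, 3.0)],
)
print(summary)
```

If the matched means differ clearly from the means of the full sets, that is a sign the matches should only be compared with each other.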

The code can be seen and run here: https://repl.it/BU9Z