Many modern algorithms have been criticised for entrenching inequality and for being racially biased. This example shows how a predictive model can be made less discriminatory.

Imagine you need to build a default risk model and you have three variables: a binary default outcome, a binary indicator for black race, and a continuous credit history score. Being black influences history and hence your default risk. We will take the residuals from the model \(hist_i = a + b \cdot black_i\) and use them to estimate the model \(P(default_i = 1) = logit^{-1}(c + d \cdot hist^r_i)\), where \(r\) denotes the residual. In other words, we strip the effect of being black out of the history variable before it enters the default model.
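To see why the residuals carry no race information, here is a minimal toy sketch (with hypothetical variables `z` and `x`, not the data simulated below): OLS residuals are orthogonal to the regressors by construction, so residualising a feature on a protected attribute makes their correlation essentially zero.

```r
set.seed(1)
z <- rbinom(1000, 1, 0.3)      # protected attribute
x <- -1 + 2 * z + rnorm(1000)  # feature influenced by z
xr <- resid(lm(x ~ z))         # residualised feature
cor(xr, z)                     # essentially zero by construction
```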

We start by simulating some data (N = 2000) and use the first 90% of observations as training data.

We will see that the non-discriminatory model has lower accuracy (AUC) but does not penalise black borrowers.

library(AUC)
## AUC 0.3.0
## Type AUCNews() to see the change log and ?AUC to get an overview.
# inverse logit: maps a linear predictor to a probability in (0, 1)
inv_logit <- function(x) {
  return(1 / (1 + exp(-x)))
}


set.seed(100)
n <- 2000
train <- 1:(n * .9)
black <- 1 * (runif(n) < .3)
hist <- -1 + 2 * black + rnorm(n)
default <- 1 * (inv_logit(hist + rnorm(n)) > .9)
df <- data.frame(black, hist, default)
summary(df)
##      black             hist            default      
##  Min.   :0.0000   Min.   :-4.2070   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:-1.4156   1st Qu.:0.0000  
##  Median :0.0000   Median :-0.5401   Median :0.0000  
##  Mean   :0.2895   Mean   :-0.4264   Mean   :0.0705  
##  3rd Qu.:1.0000   3rd Qu.: 0.4605   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   : 3.9751   Max.   :1.0000
cor(df)
##             black      hist   default
## black   1.0000000 0.6838882 0.3668467
## hist    0.6838882 1.0000000 0.4784179
## default 0.3668467 0.4784179 1.0000000
boxplot(hist ~ black, main = 'boxplot of history by black')

# first stage: regress history on race; the residuals are race-free
m1 <- glm(hist ~ black, data = df)
summary(m1)
## 
## Call:
## glm(formula = hist ~ black, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1733  -0.6907   0.0130   0.6521   3.2851  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03371    0.02694  -38.37   <2e-16 ***
## black        2.09787    0.05007   41.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.031308)
## 
##     Null deviance: 3871.1  on 1999  degrees of freedom
## Residual deviance: 2060.6  on 1998  degrees of freedom
## AIC: 5741.4
## 
## Number of Fisher Scoring iterations: 2
df$r1 <- resid(m1)


# non-discriminatory model: uses the residualised history
m2 <- glm(default ~ r1, data = df[train, ], family = binomial)
summary(m2)
## 
## Call:
## glm(formula = default ~ r1, family = binomial, data = df[train, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6667  -0.3671  -0.2311  -0.1355   3.0947  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3913     0.1533  -22.12   <2e-16 ***
## r1            1.3678     0.1189   11.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 876.46  on 1799  degrees of freedom
## Residual deviance: 705.53  on 1798  degrees of freedom
## AIC: 709.53
## 
## Number of Fisher Scoring iterations: 6
p2 <- predict(m2, newdata = df[-train, ], type = 'resp')
df$p2 <- NA
df$p2[-train] <- p2
auc(roc(p2, factor(df$default[-train])))
## [1] 0.8715526
# naive model: uses the raw history
m3 <- glm(default ~ hist, data = df[train, ], family = binomial)
summary(m3)
## 
## Call:
## glm(formula = default ~ hist, family = binomial, data = df[train, 
##     ])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.87605  -0.18261  -0.07814  -0.03145   3.00217  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2794     0.2603  -16.44   <2e-16 ***
## hist          1.9713     0.1495   13.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 876.46  on 1799  degrees of freedom
## Residual deviance: 452.65  on 1798  degrees of freedom
## AIC: 456.65
## 
## Number of Fisher Scoring iterations: 8
p3 <- predict(m3, newdata = df[-train, ], type = 'resp')
df$p3 <- NA
df$p3[-train] <- p3
auc(roc(p3, factor(df$default[-train]))) # higher AUC than the non-discriminatory model
## [1] 0.9698672
t.test(p2 ~ black, df) # no significant difference between groups
## 
##  Welch Two Sample t-test
## 
## data:  p2 by black
## t = -0.85046, df = 142.46, p-value = 0.3965
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03974704  0.01583428
## sample estimates:
## mean in group 0 mean in group 1 
##      0.06048494      0.07244132
t.test(p3 ~ black, df) # significant difference between groups
## 
##  Welch Two Sample t-test
## 
## data:  p3 by black
## t = -7.2581, df = 75.209, p-value = 2.999e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2724339 -0.1550970
## sample estimates:
## mean in group 0 mean in group 1 
##     0.009746668     0.223512125