Uncovering and Mitigating Algorithmic Bias through Learned Latent Structure

Alexander Amini 1,3∗, Ava Soleimany 2∗, Wilko Schwarting 1,3, Sangeeta Bhatia 3, and Daniela Rus 1,3
1 Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
2 Biophysics Program, Harvard University
3 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
∗ Denotes co-first authors
{amini, asolei, wilkos, sbhatia, rus}@mit.edu

Abstract

Recent research has highlighted the vulnerabilities of modern machine learning-based systems to bias, especially for segments of society that are under-represented in training data. In this work, we develop a novel, tunable algorithm for mitigating the hidden, and potentially unknown, biases within training data. Our algorithm fuses the original learning task with a variational autoencoder to learn the latent structure within the dataset and then adaptively uses the learned latent distributions to re-weight the importance of certain data points while training. While our method is generalizable across various data modalities and learning tasks, in this work we use our algorithm to address the issue of racial and gender bias in facial detection systems. We evaluate our algorithm on the Pilot Parliaments Benchmark (PPB), a dataset specifically designed to evaluate biases in computer vision systems, and demonstrate increased overall performance as well as decreased categorical bias with our debiasing approach.

1 Introduction

Machine learning (ML) systems are increasingly making decisions that impact the daily lives of individuals and society in general. For example, ML and artificial intelligence (AI) are already being used to determine if a human is eligible to receive a loan (Khandani, Kim, and Lo 2010), how long a criminal should spend in prison (Berk, Sorenson, and Barnes 2016), the order in which a person is presented the news (Nalisnick et al. 2016), or even diagnoses and treatments for medical patients (Mazurowski et al. 2008). The development and deployment of fair and unbiased AI systems is crucial to prevent any unintended side effects and to ensure the long-term acceptance of these algorithms (Miller 2015; Courtland 2018).

Even the seemingly simple task of facial recognition (Zafeiriou, Zhang, and Zhang 2015) has been shown to be subject to extreme amounts of algorithmic bias among select demographics (Buolamwini and Gebru 2018). For example, Klare et al. (2012) analyzed the face detection system used by US law enforcement and discovered significantly lower accuracy for dark-skinned women between the ages of 18 and 30. This is especially concerning since these facial recognition systems are often not deployed in isolation but rather as part of a larger surveillance or criminal detection pipeline (Abdullah et al. 2017).

Figure 1: Batches sampled for training without (left) and with (right) learned debiasing. The proposed algorithm identifies, in an unsupervised manner, under-represented parts of training data and subsequently increases their respective sampling probability. The resulting batch (right) from the CelebA dataset shows increased diversity in features such as skin color, illumination, and occlusions. [Panel annotations: left, random batch sampling during standard face detection training, homogeneous skin color and pose, mean sample probability 7.57 × 10^-6; right, batch sampling during training with learned debiasing, diverse skin color, pose, and illumination, mean sample probability 1.03 × 10^-4.]
While deep learning based systems have been shown to achieve state-of-the-art performance on many of these tasks, it has also been demonstrated that algorithms trained with biased data lead to algorithmic discrimination (Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017). Recently, benchmarks quantifying discrimination (Kilbertus et al. 2017; Hardt et al. 2016) and even datasets designed to evaluate the fairness of these algorithms (Buolamwini and Gebru 2018) have emerged. However, the problem of severely imbalanced training datasets and the question of how to integrate debiasing capabilities into AI algorithms still remain largely unsolved.

In this paper, we tackle the challenge of integrating debiasing capabilities directly into a model training process that adapts automatically and without supervision to the shortcomings of the training data. Our approach features an end-to-end deep learning algorithm that simultaneously learns the desired task (e.g., facial detection) as well as the underlying latent structure of the training data. Learning the latent distributions in an unsupervised manner enables us to uncover hidden or implicit biases within the training data. Our algorithm, which is built on top of a variational autoencoder (VAE), is capable of identifying under-represented examples in the training dataset and subsequently increases the probability at which the learning algorithm samples these data points (Fig. 1).

We demonstrate how our algorithm can be used to debias a facial detection system trained on a biased dataset and to provide interpretations of the learned latent variables which our algorithm actively debiases against. Finally, we compare the performance of our debiased model to a standard deep learning classifier by evaluating racial and gender bias on the Pilot Parliaments Benchmark (PPB) dataset (Buolamwini and Gebru 2018).

The key contributions of this paper can be summarized as:
1. A novel, tunable debiasing algorithm which utilizes learned latent variables to adjust the respective sampling probabilities of individual data points while training;
2. A semi-supervised model for simultaneously learning a debiased classifier as well as the underlying latent variables governing the given classes; and
3. Analysis of our method for facial detection with biased training data, and evaluation on the PPB dataset to measure algorithmic fairness across race and gender.

The remainder of this paper is structured as follows: we summarize the related work in Sec. 2, formulate the model and debiasing algorithm in Sec. 3, describe our experimental results in Sec. 4, and provide concluding remarks in Sec. 5.

2 Related Work

Interventions that seek to introduce fairness into machine learning pipelines generally fall into one of three categories: those that use data pre-processing before training, in-processing during training, and post-processing after training. Several pre-processing and in-processing methods rely on new, artificially generated debiased data (Calmon et al. 2017) or resampling (More 2016). However, these approaches have largely focused on class imbalances, rather than variability within a class, and fail to use any information about the structure of the underlying latent features.
Learning the latent structure of data has a long-standing history in machine learning, including Expectation-Maximization (Bailey, Elkan, and others 1994), topic modelling (Blei 2012), latent-SVM (Felzenszwalb, McAllester, and Ramanan 2008), and more recently, variational autoencoders (VAE) (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014). The presented work uses a novel VAE-based approach for resampling based on the data’s underlying latent structure, debiases automatically during training, and does not require any data pre-processing or annotation prior to training or testing.

Resampling for class imbalance: Resampling approaches have largely focused on addressing class imbalances (More 2016; Zhou and Liu 2006), as opposed to biases within individual classes. For example, duplicating instances of the minority class, as in (Lu, Guo, and Feldkamp 1998), has been used as a pre-processing step for mitigating class imbalance, yet is not capable of running adaptively during training itself. Further, applying these approaches to debiasing variabilities within a class would require a priori knowledge of the latent structure of the data, which necessitates manual annotation of the desired features. On the other hand, our approach debiases variability within a class automatically during training and learns the latent structure from scratch in an unsupervised manner.

Generating debiased data: Recent approaches have utilized generative models (Sattigeri et al. 2018) and data transformations (Calmon et al. 2017) to generate training data that is more ‘fair’ than the original dataset. For example, (Sattigeri et al. 2018) used a generative adversarial network (GAN) to output a reconstructed dataset similar to the input but more fair with respect to certain attributes. Pre-processing data transformations that mitigate discrimination, as in (Calmon et al. 2017), have also been proposed, yet such methods are not learned adaptively during training nor do they provide realistic training examples. In contrast to these works, we do not rely on artificially generated data, but rather use a resampled, more representative subset of the original dataset for debiasing.

Clustering to identify bias: Supervised learning approaches have also been used to characterize biases in imbalanced data sets. Specifically, k-means clustering has been employed to identify clusters in the input data prior to training and to inform resampling the training data into a smaller set of representative examples (Nguyen, Bouzerdoum, and Phung 2008). However, this method does not extend to high-dimensional data like images or to cases where there is no notion of a data ‘cluster’, and relies on significant pre-processing. Our proposed approach overcomes these limitations by learning the latent structure using a variational approach.

3 Methodology

Problem Setup

Consider the problem of binary classification in which we are presented with a set of paired training data samples D_train = {(x^(i), y^(i))}_{i=1}^{n} consisting of features x ∈ R^m and labels y ∈ R^d. Our goal is to find a functional mapping f : X → Y parameterized by θ which minimizes a certain loss L(θ) over our entire training dataset. In other words, we seek to solve the following optimization problem:

    θ* = arg min_θ (1/n) Σ_{i=1}^{n} L_i(θ)    (1)

Given a new test example, (x, y), our classifier should ideally output ŷ = f_θ(x), where ŷ is “close” to y, with the notion of closeness being defined by the original loss function.
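As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of the empirical risk for a binary classifier; the function name and the choice of binary cross-entropy as the per-example loss L_i are illustrative assumptions on our part, not prescribed by the formulation above.

```python
import numpy as np

def empirical_risk(y_true, y_pred, eps=1e-7):
    """Eq. (1): the mean per-example loss (1/n) * sum_i L_i(theta).

    Here L_i is taken to be the binary cross-entropy between the label
    y^(i) and the classifier output y_hat^(i) = f_theta(x^(i)).
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_example = -(y_true * np.log(y_pred) +
                    (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_example.mean()

# Toy usage: three labeled examples and their predicted face probabilities.
print(empirical_risk(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```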
Now, assume that each datapoint also has an associated continuous latent vector z ∈ R^k which captures the hidden, sensitive features of the sample (Zemel et al. 2013). We can formalize the notion of a biased classifier as follows:

Definition 1 A classifier, f_θ(x), is biased if its decision changes after being exposed to additional sensitive feature inputs. In other words, a classifier is fair with respect to a set of latent features, z, if: f_θ(x) = f_θ(x, z).

For example, when deciding if an image contains a face or not, the skin color, gender, or even age of the individual are all underlying latent variables and should not impact the classifier’s decision.

To ensure fairness of a classifier across these various latent variables, the dataset should contain roughly uniform samples over the latent space. In other words, the training distribution itself should not be biased to over-represent a certain category while under-representing others. Note that this is different than claiming that our dataset should be balanced with respect to the classes (i.e., include roughly the same number of faces as non-faces in the dataset). Namely, we are saying that within a single class the unobserved latent variables should also be balanced. This would promote the notion that all instances of a single class will be treated fairly by the classifier, such that even if a latent variable was changed to the opposite extreme (e.g., skin tone from light to dark) the accuracy of the classifier would not be changed.

Furthermore, given a labeled test set across the space of sensitive latent variables, z, we can measure the bias of the classifier by computing its accuracy across each of the sensitive categories (e.g., skin tone). While the overall accuracy of the classifier is the mean accuracy over all sensitive categories, the bias is the variance in accuracies across all realizations of these categories (e.g., light vs. dark faces). For example, if a classifier performs equally well no matter the realization of a specific latent variable (e.g., skin tone), it will have zero variance in accuracy, and thus be called unbiased with respect to that variable. On the other hand, if some realizations of the latent variable cause the classifier to perform better or worse, the variance in the accuracies will increase, and thus, so will the overall bias of said classifier.

While it is possible to use a set of human-defined sensitive variables to ensure fair representation during training, this requires time-intensive manual annotation of each variable over the entire dataset. Additionally, this approach is subject to potential human bias in the selection of which variables are deemed sensitive or not. In this work, we address this problem by learning the latent variables of the class in an entirely unsupervised manner and proceed to use these learned variables to adaptively resample the dataset while training. In the following subsection, we will outline the architecture used to learn the latent variables.

Learning Latent Structure with Variational Autoencoders

To learn the latent variables in an unsupervised manner and use them to adaptively resample the dataset while training, we propose an extension of the variational autoencoder (VAE) network architecture: a debiasing-VAE (DB-VAE). The encoder portion of the VAE learns an approximation q_φ(z|x) of the true distribution of the latent variables given a data point. As opposed to classical VAE architectures, we also introduce d additional output variables, where ŷ ∈ R^d. With k latent variables and d output variables, the encoder outputs 2k + d activations corresponding to µ ∈ R^k and Σ = Diag[σ²] ≻ 0, which are used to define the distribution of z, and the d-dimensional output, ŷ. Note that, in order to still learn our original supervised learning task, we assign and explicitly supervise the d output variables. This, in turn, transforms our traditional VAE model from an entirely unsupervised model to a semi-supervised model, where some latent variables are implicitly learned by trying to reconstruct the input and the others are explicitly supervised for a specific task (e.g., classification). For example, if we originally wanted to train a binary classifier (i.e., ŷ ∈ {0, 1}), our DB-VAE model would learn a latent encoding of k latent variables (i.e., {z_i}, i ∈ {1, …, k}) as well as a single variable specifically for classification: z_0 = ŷ.

Figure 2: Debiasing Variational Autoencoder. Architecture of the semi-supervised DB-VAE for binary classification (blue region). The unsupervised latent variables are used to adaptively resample the dataset while training. [Diagram: input data x → encoder q_φ(z|x) → supervised latent variable z_0 = ŷ and unsupervised latent variables {z_i} → decoder p_θ(x|z) → reconstruction x̂; the latent variables update the positive debiasing probabilities; gradients are blocked when y = 0.]
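To make the encoder output concrete, the following is a minimal PyTorch sketch of a DB-VAE-style encoder head that emits 2k + 1 activations and splits them into the class prediction ŷ, the latent means µ, and the log-variances used to form Σ; it also draws z via the reparameterized sampling step described in the next paragraph. The class name, layer sizes, and use of log-variance outputs are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DBVAEEncoderHead(nn.Module):
    """Maps a feature vector to 2k + 1 activations: [y_hat, mu (k), log sigma^2 (k)]."""

    def __init__(self, feature_dim: int, k: int):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(feature_dim, 2 * k + 1)  # d = 1 for binary classification

    def forward(self, features: torch.Tensor):
        out = self.fc(features)
        y_logit = out[:, 0]            # supervised output z_0 = y_hat (as a logit)
        mu = out[:, 1:1 + self.k]      # means of the k unsupervised latent variables
        log_var = out[:, 1 + self.k:]  # log sigma^2, so Sigma = Diag[sigma^2] > 0
        return y_logit, mu, log_var

    def reparameterize(self, mu, log_var):
        # z = mu(x) + Sigma^(1/2)(x) * eps, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps
```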
A decoder network mirroring the encoder is then used to reconstruct the input back from the latent space by approximating p_θ(x|z). VAEs utilize reparameterization to differentiate the outputs through a sampling step, where we sample ε ∼ N(0, I) and compute z = µ(x) + Σ^(1/2)(x) ∘ ε. This decoded reconstruction enables unsupervised learning of the latent variables during training, and is thus necessary for automated debiasing of the data during training.

We train the network end-to-end using backpropagation with a three-component loss function comprising a supervised latent loss, a reconstruction loss, and a latent loss for the unsupervised variables. For a binary classification task, for example, the supervised loss L_y(y, ŷ) is given by the cross-entropy loss; the reconstruction loss L_x(x, x̂) is given by the L_p norm between the input and the reconstructed output; and the latent loss L_KL(µ, σ) is given by the Kullback-Leibler (KL) divergence. Finally, the total loss is a weighted combination of these three losses:

    L_TOTAL = c1 · [ Σ_{i∈{0,1}} y_i log(1/ŷ_i) ] + c2 · ||x − x̂||_p + c3 · [ (1/2) Σ_{j=0}^{k−1} (σ_j + µ_j² − 1 − log(σ_j)) ]    (2)

where the three terms correspond to L_y(y, ŷ), L_x(x, x̂), and L_KL(µ, σ), respectively, and c1, c2, c3 are the weighting coefficients that set the relative importance of each of the individual loss functions. For comparison, the baseline model used for the desired task has a similar architecture as the DB-VAE, without the unsupervised latent variables and decoder network, and would be trained according to only the supervised loss function.

Note that special care needs to be taken when feeding training examples from classes which we do not want to debias. For example, in the facial detection problem, we primarily care about ensuring that our positive dataset of faces is fair and unbiased, and less about debiasing the negative examples where no face is present. For these negative samples, the gradients from the decoder and latent space should be stopped and not backpropagated. This effectively means that, for these classes, we only train the encoder to improve the supervised loss.
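The sketch below puts Eq. (2) and this gradient-blocking rule together in PyTorch. The default coefficient values, tensor shapes, and the use of a label mask to stop the reconstruction and KL gradients for negative examples are illustrative choices of ours, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def db_vae_loss(y, y_logit, x, x_hat, mu, log_var, c1=1.0, c2=1.0, c3=1.0):
    """Weighted total loss of Eq. (2): c1*L_y + c2*L_x + c3*L_KL.

    Gradients from the reconstruction and KL terms are blocked for
    negative examples (y = 0) by masking those terms with the label,
    so only the supervised loss trains the encoder on non-faces.
    """
    # Supervised cross-entropy loss L_y(y, y_hat), per example.
    l_y = F.binary_cross_entropy_with_logits(y_logit, y.float(), reduction="none")

    # L2 reconstruction loss L_x(x, x_hat): sum of squared pixel errors (p = 2).
    l_x = ((x - x_hat) ** 2).flatten(1).sum(dim=1)

    # Latent loss L_KL = 0.5 * sum_j (sigma_j + mu_j^2 - 1 - log sigma_j),
    # with sigma_j the variance of latent j, as in Eq. (2).
    sigma2 = torch.exp(log_var)
    l_kl = 0.5 * (sigma2 + mu ** 2 - 1.0 - log_var).sum(dim=1)

    mask = y.float()  # 1 for faces, 0 for non-faces
    return (c1 * l_y + mask * (c2 * l_x + c3 * l_kl)).mean()
```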
Algorithm for Automated Debiasing

In this section, we present the algorithm for adaptive resampling of the training data based on the latent structure learned by our DB-VAE model. By dropping over-represented regions of the latent space according to their frequency of occurrence, we increase the probability of selecting rarer data for training. This is done adaptively as the latent variables themselves are being learned during training. Thus, our debiasing approach accounts for the complete distribution of the underlying features in the training data.

The training dataset is fed through the encoder network, which provides an estimate Q(z|X) of the latent distribution. We seek to increase the relative frequency of rare data points by increased sampling of under-represented regions of the latent space. To do so, we approximate the distribution of the latent space with a histogram Q̂(z|X) with dimensionality defined by the number of latent variables, k. To circumvent the high dimensionality of the histogram when the latent space becomes increasingly complex, we simplify further and use independent histograms to approximate the joint distribution. Specifically, we define an independent histogram, Q̂_i(z_i|X), for each latent variable z_i:

    Q̂(z|X) ∝ Π_i Q̂_i(z_i|X)    (3)

This allows us to neatly approximate Q(z|X) based on the frequency distribution of each of the learned latent variables. Finally, we introduce a single parameter, α, to tune the degree of debiasing introduced during training. We define the probability distribution of selecting a datapoint x as W(z(x)|X), parameterized by the debiasing parameter α:

    W(z(x)|X) ∝ Π_i 1 / (Q̂_i(z_i(x)|X) + α)    (4)

We provide pseudocode for training the DB-VAE in Algorithm 1. At every epoch, all inputs x from the original dataset X are propagated through the model to evaluate the corresponding latent variables z(x). The histograms Q̂_i(z_i(x)|X) are updated accordingly. During training, a new batch is drawn by keeping inputs, x, from the original dataset, X, with likelihood W(z(x)|X). Training on the debiased data batch now forces the classifier into a choice of parameters that work better in rare cases without strong deterioration of performance for common training examples. Most importantly, the debiasing is not manually specified beforehand but instead based on learned latent variables.

Algorithm 1 Adaptive re-sampling for automated debiasing of the DB-VAE architecture
Require: Training data {X, Y}, batch size b
    Initialize weights {φ, θ}
    for each epoch, E_t, do
        Sample z ∼ q_φ(z|X)
        Update Q̂_i(z_i(x)|X)
        W(z(x)|X) ← Π_i 1 / (Q̂_i(z_i(x)|X) + α)
        while iter < n/b do
            Sample x_batch ∼ W(z(x)|X)
            L(φ, θ) ← (1/b) Σ_{i∈x_batch} L_i(φ, θ)
            Update: w ← w − η ∇_{φ,θ} L(φ, θ), for w ∈ {φ, θ}
        end while
    end for

Intuitively, the parameter α can be thought of as tuning the degree of debiasing. As α → 0, the subsampled training set will tend towards uniform over the latent variables z. As α → ∞, the subsampled training set will tend towards a random uniform sample of the original training dataset (i.e., no debiasing).
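The following NumPy sketch shows how the per-latent histograms of Eq. (3) and the resampling weights of Eq. (4) might be computed from the encoder's latent means; the number of bins, the use of latent means rather than samples, and the normalization details are our own assumptions rather than the exact released implementation.

```python
import numpy as np

def resampling_probabilities(mu, alpha=0.01, n_bins=10):
    """Approximate W(z(x)|X) of Eq. (4) from the latent means of all training images.

    mu: array of shape (n, k) holding the k latent means for each of n examples.
    Returns a length-n probability vector used to draw the next training batches.
    """
    n, k = mu.shape
    weights = np.ones(n)
    for i in range(k):
        # Independent histogram Q_i(z_i|X) over latent variable i (Eq. (3)).
        counts, edges = np.histogram(mu[:, i], bins=n_bins)
        bin_idx = np.clip(np.digitize(mu[:, i], edges[1:-1]), 0, n_bins - 1)
        q_i = counts[bin_idx] / n
        # Rare latent values (small Q_i) receive larger weight; alpha tunes the strength.
        weights *= 1.0 / (q_i + alpha)
    return weights / weights.sum()

# Toy usage: 1000 examples with k = 8 latent variables; draw one debiased batch of 32.
probs = resampling_probabilities(np.random.randn(1000, 8), alpha=0.01)
batch_idx = np.random.choice(1000, size=32, replace=False, p=probs)
```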
4 Experiments

To validate our debiasing algorithm on a real-world problem with significant social impact, we learn a debiased facial detector using potentially biased training data. Here we define the facial detection problem, describe the datasets used, and outline model training, debiasing, and evaluation.

For the facial detection problem, we are given a set of paired training data samples D_train = {(x^(i), y^(i))}_{i=1}^{n}, where x^(i) are the raw pixel values of an image patch and y^(i) ∈ {0, 1} are their respective labels, indicating the presence of a face. Our goal is to ensure that the set of positive examples used to train a facial detection classifier is fair and unbiased. The positive training data may potentially be biased with respect to certain attributes such as skin tone, in that particular instances of those attributes may appear more or less frequently than other instances. Thus, in our experiments, we train a full DB-VAE model to learn the latent structure underlying the positive (face) images and use the adaptive resampling approach outlined in Algorithm 1 to debias the model with respect to facial features. For negative examples, we only train the encoder portion of our network, as described in Section 3. We evaluate the performance of our debiased models relative to standard, biased classifiers on the PPB dataset and provide estimates of the precision and bias of each model as performance metrics.

Datasets

We train our classifiers on a dataset of n = 4 × 10^5 images, consisting of 2 × 10^5 positive (images of faces) and negative (images of non-faces) examples, split 80% and 20% into training and validation sets, respectively. Positive examples were taken from the CelebA dataset (Liu et al. 2015) and cropped to a square based on the annotated face bounding box. Negative examples were taken from the ImageNet dataset (Deng et al. 2009), from a wide variety of non-human categories. All images were resized to 64 × 64.

Figure 3: Loss evolution and validation accuracy. Convergence of the total loss on the training set (left) and classification accuracy on the validation set (right) for models with varying degrees of debiasing. [Curves shown for no debiasing and α = 0.1, 0.05, 0.01, 0.001, plotted against training timestep.]

After training, we evaluate our debiasing algorithm on the PPB test dataset (Buolamwini and Gebru 2018), which consists of images of 1270 male and female parliamentarians from various African and European countries. Images are consistent in pose, illumination, and facial expression, and the dataset exhibits parity in both skin tone and gender. The gender of each face is annotated with the sex-based “Male” and “Female” labels. Skin tone annotations are based on the Fitzpatrick skin type classification system (Fitzpatrick 1988), with each image labeled as “Lighter” or “Darker”.

Training the Models

For the classical facial detection task, we train a convolutional neural network, with four sequential convolutional layers (5 × 5 filters with 2 × 2 strides) for feature extraction. Final classification is done with an additional two fully connected layers with 1000 and 1 hidden neurons, respectively. All layers in the network use ReLU activation and batch normalization (Ioffe and Szegedy 2015). Our DB-VAE architecture shares this same classification network for the encoder, except for the final fully connected layer, which now outputs an additional k latent variables for a total of 2k + 1 activations. A decoder, which mirrors the encoder with 2 fully connected layers and 4 de-convolutional layers, is then used to reconstruct the original input image. We train our models by minimizing the empirical training loss as defined in Eq. 2 with L2 reconstruction loss.
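A minimal PyTorch sketch of this classification/encoder network is given below. The text does not specify the convolutional channel widths, so the values in n_filters (and the function name) are our own guesses for illustration only.

```python
import torch
import torch.nn as nn

def make_encoder(k: int, n_filters=(16, 32, 64, 128)):
    """Classification/encoder network described in the text: four 5x5 conv layers with
    stride 2, each followed by batch norm and ReLU, then FC-1000 and a final FC layer.
    The plain classifier outputs 1 logit; the DB-VAE encoder outputs 2k + 1 activations.
    Channel widths in `n_filters` are assumed; the paper does not list them.
    """
    layers, in_ch = [], 3
    for out_ch in n_filters:
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
                   nn.BatchNorm2d(out_ch), nn.ReLU()]
        in_ch = out_ch
    layers += [nn.Flatten(),
               nn.Linear(n_filters[-1] * 4 * 4, 1000), nn.ReLU(),  # 64x64 input -> 4x4 maps
               nn.Linear(1000, 2 * k + 1)]                         # k = 0 gives the baseline
    return nn.Sequential(*layers)

# A 64x64 RGB batch passes through; with k = 32 the head emits 2*32 + 1 = 65 activations.
out = make_encoder(k=32)(torch.zeros(8, 3, 64, 64))
print(out.shape)  # torch.Size([8, 65])
```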
In our experiments, we additionally block all gradients from the decoder network when y = 0, i.e., for negative examples, as we only want to debias for positive face examples. In addition to training the standard classification network with no debiasing, we trained DB-VAE models with varying degrees of debiasing, defined by the parameter α, for 50 epochs and evaluated their performance on the validation set. Models were re-trained from scratch 5 times each for added statistical robustness of results.

Automated Debiasing of Facial Detection Systems

We explore the output of the debiasing algorithm and provide extensive evaluation of our learned models on the PPB dataset. We consider the resampling probabilities, W(z(x)|X), that arise from learning a debiased model. As shown in Fig. 4A, as the probability of resampling increases, the number of data points within the corresponding bin decreases, suggesting that those images more likely to be resampled are those characterized by ‘rare’ features. Indeed, as the probability of resampling increases, the corresponding images become more diverse, as evidenced by the four sample faces from each frequency bin in Fig. 4A. This observation is further validated by considering the ten faces in the training data with the lowest and highest resampling probabilities (Fig. 4B and 4C, respectively). The ten faces with the lowest resampling probability appear quite uniform, with consistent skin tone, hair color, forward gaze, and background color. In contrast, the ten faces with the highest resampling probability display rarer features such as headwear or eyewear, tilted gaze, shadowing, and darker skin. Taken together, these results imply that our algorithm identifies and then actively resamples those data points with rarer, more diverse features based on a learned latent representation.

Figure 4: Sampling probabilities over the training dataset. Histogram over resampling probabilities showing four samples from each bin (A). The top ten faces with the lowest (B) and highest (C) probabilities of being sampled.

We observed that the DB-VAE is able to learn facial features such as skin tone, presence of hair, and azimuth, as well as other features such as gender and age, by slowly perturbing the value of a single latent variable and feeding the resulting encoding through the decoder (Fig. 5A). This supports the hypothesis that our DB-VAE algorithm is capable of debiasing against such features, since the resampling probabilities are directly defined based on the probability distributions of individual learned latent variables (Alg. 1).

Figure 5: Increased performance and decreased categorical bias with DB-VAE. The model learns latent features such as skin color, gender, and hair (A), and demonstrates increased performance and decreased categorical bias with learned debiasing (B). [Panel A latent traversals: skin color (light to dark), gender (female to male), hair (hair to bald), azimuth (left to right). Panel B: accuracy (%) on dark male, dark female, light male, light female, and overall subjects for no debiasing and α = 0.1, 0.05, 0.01, 0.001.]

To evaluate the performance of our debiasing approach, we utilized classification accuracy (positive predictive value) as a metric, and tested our models on the PPB dataset. For this evaluation, we extracted patches from each image using sliding windows of varying dimension, and fed these extracted image patches to our trained models. We output a positive match of a face if the classifier identifies a face in any one of the sub-patches within the image.
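A minimal sketch of this sliding-window evaluation is shown below; the window sizes, stride fraction, and decision threshold are illustrative assumptions, and `model(patch)` stands in for any classifier that returns a face probability for a square patch (resized to the 64 × 64 network input), not the paper's exact evaluation settings.

```python
def contains_face(image, model, window_sizes=(64, 128, 256), stride_frac=0.5, thresh=0.5):
    """Sliding-window evaluation: an image is a positive match if the classifier
    identifies a face in any one of its sub-patches."""
    h, w = image.shape[:2]
    for size in window_sizes:
        step = max(1, int(size * stride_frac))
        for top in range(0, max(1, h - size + 1), step):
            for left in range(0, max(1, w - size + 1), step):
                patch = image[top:top + size, left:left + size]
                if model(patch) > thresh:  # face found in this sub-patch
                    return True
    return False
```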
To demonstrate debiasing against specific latent features, we quantified classification performance on individual demographics. Specifically, we considered skin tone (light/dark) and gender (male/female). We denote A as the set of classification accuracies of a model on each of the four intersectional classes. We compared the accuracy of models trained with and without debiasing on both individual demographics (race/gender) and the PPB dataset as a whole, and provide results on the effect of the debiasing parameter α on performance (Fig. 5). Recall that no debiasing corresponds to the limit α → ∞, where we uniformly sample over the original training set without learning the latent variables. Conversely, α → 0 corresponds to sampling from a uniform distribution over the latent space. Error bars (standard error of the mean) are provided to visualize statistical significance of differences between the trained models.

As shown in Fig. 5, greater debiasing power (decreasing α) significantly increased classification accuracy on “Dark Male” subjects, consistent with the hypothesis that adaptive resampling of rare instances (e.g., dark faces) in the training data results in less algorithmic discrimination. This suggests that our algorithm can debias for a qualitative feature like skin tone, which has significant social implications for its utility in improving fairness in facial detection systems.

In contrast to the trend observed with dark male faces, the classification accuracy on “Light Male” faces remained nearly constant for both the biased and debiased models. Additionally, the accuracy on light male subjects was higher than the three other groups, consistent with (Buolamwini and Gebru 2018). This suggests that our debiasing algorithm does not sacrifice performance on categories which already have high precision. Importantly, the high, near-constant accuracy suggests that an arbitrary classification model trained on the CelebA dataset may be biased towards light male subjects, and further supports the need for approaches that seek to reduce such biases.

Table 1: Accuracy and bias on PPB test dataset.

                  E[A] (Precision)    Var[A] (Measure of Bias)
  No Debiasing         95.13                28.84
  α = 0.1              95.84                25.43
  α = 0.05             96.47                18.08
  α = 0.01             97.13                 9.49
  α = 0.001            97.36                 9.43

Although the DB-VAE improved accuracy on dark males significantly, it never reached the accuracy of light males. Despite the fact that we debias our training data with respect to latent variables such as skin tone, there are inherently fewer examples of dark male faces in our data. Our model is simply limited by the infrequency of these examples, but we note that increasing the overall size of our training dataset may further mitigate this effect.

We summarize the key trends in overall performance with DB-VAE in Table 1. As confirmed by Fig. 5, the overall precision, E[A], increased with increased debiasing power (decreasing α). Additionally, we observed a decrease in the variance in accuracy between categories, indicative of decreased bias with greater debiasing. Together, these results suggest effective debiasing with DB-VAE.
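For clarity, the two metrics reported in Table 1 can be computed as below; this is a minimal sketch, and the per-category accuracies in the usage example are hypothetical numbers for illustration, not values taken from the paper.

```python
import numpy as np

def precision_and_bias(per_category_accuracy):
    """Overall precision E[A] is the mean accuracy over the sensitive categories;
    bias is measured as the variance Var[A] across those categories (Table 1)."""
    a = np.asarray(per_category_accuracy, dtype=float)
    return a.mean(), a.var()

# Hypothetical accuracies for (dark male, dark female, light male, light female);
# these illustrative numbers are not taken from the paper.
e_a, var_a = precision_and_bias([88.0, 92.0, 99.0, 97.0])
print(f"E[A] = {e_a:.2f}, Var[A] = {var_a:.2f}")
```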
5 Conclusion

In this paper, we propose a novel, tunable debiasing algorithm to adjust the respective sampling probabilities of individual data points while training. By learning the underlying latent variables in an entirely unsupervised manner, we can scale our approach to large datasets and debias for latent features without ever hand-labeling them in our training set. We apply our approach to facial detection to promote algorithmic fairness by reducing hidden biases within training data. Given a biased training dataset, our debiased models show increased classification accuracy and decreased categorical bias across race and gender, compared to standard classifiers. Finally, we provide a concrete algorithm for debiasing as well as an open source implementation of our model.

The development and deployment of fair and unbiased AI systems is crucial to prevent unintended discrimination and to ensure the long-term acceptance of these algorithms. We envision that the proposed approach will serve as an additional tool to promote systematic, algorithmic fairness of modern AI systems.

References

[Abdullah et al. 2017] Abdullah, N. A.; Saidi, M. J.; Rahman, N. H. A.; Wen, C. C.; and Hamid, I. R. A. 2017. Face recognition for criminal identification: An implementation of principal component analysis for face recognition. In AIP Conference Proceedings, volume 1891, 020002.
[Bailey, Elkan, and others 1994] Bailey, T. L.; Elkan, C.; et al. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers.
[Berk, Sorenson, and Barnes 2016] Berk, R. A.; Sorenson, S. B.; and Barnes, G. 2016. Forecasting domestic violence: A machine learning approach to help inform arraignment decisions. Journal of Empirical Legal Studies 13(1):94–115.
[Blei 2012] Blei, D. M. 2012. Probabilistic topic models. Communications of the ACM 55(4):77–84.
[Bolukbasi et al. 2016] Bolukbasi, T.; Chang, K.-W.; Zou, J. Y.; Saligrama, V.; and Kalai, A. T. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Neural Information Processing Systems, 4349–4357.
[Buolamwini and Gebru 2018] Buolamwini, J., and Gebru, T. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, 77–91.
[Caliskan, Bryson, and Narayanan 2017] Caliskan, A.; Bryson, J. J.; and Narayanan, A. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356(6334):183–186.
[Calmon et al. 2017] Calmon, F.; Wei, D.; Vinzamuri, B.; Ramamurthy, K. N.; and Varshney, K. R. 2017. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, 3992–4001.
[Courtland 2018] Courtland, R. 2018. Bias detectives: the researchers striving to make algorithms fair. Nature.
[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
[Felzenszwalb, McAllester, and Ramanan 2008] Felzenszwalb, P.; McAllester, D.; and Ramanan, D. 2008. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 1–8. IEEE.
[Fitzpatrick 1988] Fitzpatrick, T. B. 1988. The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology 124(6):869–871.
[Hardt et al. 2016] Hardt, M.; Price, E.; Srebro, N.; et al. 2016. Equality of opportunity in supervised learning. In Neural Information Processing Systems, 3315–3323.
[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[Khandani, Kim, and Lo 2010] Khandani, A. E.; Kim, A. J.; and Lo, A. W. 2010. Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance 34(11):2767–2787.
[Kilbertus et al. 2017] Kilbertus, N.; Carulla, M. R.; Parascandolo, G.; Hardt, M.; Janzing, D.; and Schölkopf, B. 2017. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, 656–666.
[Kingma and Welling 2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Klare et al. 2012] Klare, B. F.; Burge, M. J.; Klontz, J. C.; Bruegge, R. W. V.; and Jain, A. K. 2012. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security 7(6):1789–1801.
[Liu et al. 2015] Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision.
[Lu, Guo, and Feldkamp 1998] Lu, Y.; Guo, H.; and Feldkamp, L. 1998. Robust neural learning from unbalanced data samples. In IEEE Neural Networks Proceedings.
[Mazurowski et al. 2008] Mazurowski, M. A.; Habas, P. A.; Zurada, J. M.; Lo, J. Y.; Baker, J. A.; and Tourassi, G. D. 2008. Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks 21(2-3):427–436.
[Miller 2015] Miller, C. C. 2015. When algorithms discriminate. New York Times.
[More 2016] More, A. 2016. Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048.
[Nalisnick et al. 2016] Nalisnick, E.; Mitra, B.; Craswell, N.; and Caruana, R. 2016. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, 83–84.
[Nguyen, Bouzerdoum, and Phung 2008] Nguyen, G. H.; Bouzerdoum, A.; and Phung, S. L. 2008. A supervised learning approach for imbalanced data sets. In International Conference on Pattern Recognition, 1–4. IEEE.
[Rezende, Mohamed, and Wierstra 2014] Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
[Sattigeri et al. 2018] Sattigeri, P.; Hoffman, S. C.; Chenthamarakshan, V.; and Varshney, K. R. 2018. Fairness GAN. arXiv preprint arXiv:1805.09910.
[Zafeiriou, Zhang, and Zhang 2015] Zafeiriou, S.; Zhang, C.; and Zhang, Z. 2015. A survey on face detection in the wild: Past, present and future. Computer Vision and Image Understanding 138:1–24.
[Zemel et al. 2013] Zemel, R.; Wu, Y.; Swersky, K.; Pitassi, T.; and Dwork, C. 2013. Learning fair representations. In International Conference on Machine Learning, 325–333.
[Zhou and Liu 2006] Zhou, Z.-H., and Liu, X.-Y. 2006. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering 18(1):63–77.