Dataset Representativeness and Downstream Task Fairness
Abstract
Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from one or more data sources (e.g., hospitals in geographically disparate cities), which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics present in the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and group-fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers which exhibit greater bias to those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those which are specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model- and dataset-designers.
1Vanderbilt University, Nashville, TN, USA
2Washington University in St. Louis, St. Louis, MO, USA
3ByteDance Research, San Jose, CA, USA
* These authors contributed equally to this work.
1 Introduction
Representation bias, where certain subpopulations appear more or less frequently in a dataset than they do in a target population of interest, is a foundational problem. Failure to adequately diversify data can induce numerous downstream effects, such as the creation of data-based models that are unfair in their performance [27, 18, 1]. Yet, this is not a recent phenomenon. The Framingham Heart Study (FHS), initiated in 1948, provided revolutionary insight into cardiovascular disease over time. It enabled the development of disease risk prediction tools like the Framingham Risk Score that were widely applied in practice to recognize and proactively manage patients at risk for coronary heart disease [53]. However, the original study cohorts were nearly all white [30], until more racially diverse participants started to be recruited in 1994 [34]. Analyses found that applying FHS risk coefficients yielded inaccurate risk predictions for non-white populations [33, 16]. Using the same risk factors, but deriving the actual risk coefficients from racially diverse cohorts, yielded comparable predictive performance across racial groups [24]. These findings indicate that disparities in group-wise predictive accuracy stemmed from insufficient representation of minority groups in FHS. In high-stakes domains like healthcare, these inaccuracies can cause quantifiable harm to underrepresented groups [38]. Though known for some time, this phenomenon has become increasingly accentuated because of an increased societal reliance on automated systems learned via aggregated datasets. Studies of genomic datasets have shown vast differences in downstream predictive performance between highly- and underrepresented groups [44, 8, 48]. Nevertheless, the relationship between subgroup-specific representation and downstream performance has not been fully explored.
Dataset representativeness yields multiple different types of benefits. As noted above, representative datasets promote generalizability and validity of findings to the entire population of interest. Researchers often aim to discover generalizable results, while large biomedical datasets, like the All of Us Research Program, have increasingly focused on recruiting diverse populations [35]. In addition to its downstream benefits, representativeness engenders legitimacy, as seen in policymaking [3]. A putative mechanism for this effect is that representativeness supports procedural fairness, the concept of equal treatment of individuals by systems and processes [12]. Conversely, unrepresentative biomedical datasets may undermine trust in the research enterprise [38]. We measure representation intuitively through first-order information of the true population and the constructed dataset, specifically via the difference between the average occurrence of each sensitive feature in the population and in the dataset. We formulate this concept rigorously in Definition 1, where a perfectly representative dataset (i.e., one where the proportions of every group are identical in the dataset and population) would have zero difference.
We focus on the practical example of multi-site data collection, where data or individuals are sampled from a set of sites across a limited number of iterations. Multi-site projects like PCORnet and the All of Us Research Program enable unprecedented access to human subjects data and represent billions of dollars in investment [20, 49]. The response distribution at each site, affected by both underlying site demographics (which may be known or estimated a priori) and by the willingness of demographic groups to participate in the study, starts as an unknown. With each iteration, the data collector selects a site (or sites) to obtain data from, which then yields a number of examples to add to their dataset.
In this study, we address several contemporary issues surrounding multi-site dataset construction. First, we propose an algorithm to construct a representative dataset from several available sites and compare it to baselines. Then, we assess how varying group representation affects algorithmic fairness and how the multi-site framework alters the representativeness-fairness relationship. Finally, we analyze cases where more representative datasets do not yield fairer classifiers and discuss alternative approaches to improve fairness and representativeness.
Our paper is organized as follows:
• In Section 2 we formalize the problem of collecting a representative dataset via site-based sampling.
• In Section 3 we give a convex formulation of this problem and present our prior-based sampling algorithms.
• Next, in Section 4 we begin our investigation into the relationship between fairness and representativeness with a case study on single variable classifiers.
• Section 5 outlines our experimental methodology.
• Lastly, Section 6 provides our primary experimental results, showing the effectiveness of our proposed algorithm for representative sampling and investigating the relationship between representativeness and fairness.
1.1 Related Work
There have been numerous investigations into what it means for a given collection of samples to be representative or for an algorithm to be fair. Representativeness is typically defined either as 1) a statistical distance from a goal or true distribution [22, 13, 41] or 2) a measure of coverage of attribute combinations [5, 26]. [45] provide an extensive survey on methods to measure and address representation bias. When the target population is unknown, but researchers are still interested in assessing group disparities, sampling from groups equally is an efficient method [47].
When individuals may be selected according to their attributes, methods for selecting representative cohorts have been proposed for specific use cases: hiring processes, citizens’ assemblies, and record selection from a single database [23, 19, 10]. Given uncertain site-specific population distributions, our problem of representative dataset construction via sequentially sampling sites is similar to the multi-armed bandit problem [6, 11] with concave reward structure [2]. The most closely related work to ours in this regard is by [37], who utilize a bandit-based approach to achieve a desired attribute distribution in multi-site data collection when faced with uncertain site attribute distributions. This algorithm constructs a reward function with higher values for samples containing individuals from minority groups, in order to achieve a desired distribution.
As with representation, numerous definitions have been proposed for algorithmic fairness. Many definitions of fairness originate from Rawlsian theories of justice, which eschew inequalities between individuals [42]. [17] adapted this concept to ensure similar individuals receive similar algorithmic outcomes. Similarly, [21] defined fairness through equalized odds and equal opportunity, which require equalizing both true positive and false positive rates, or only true positive rates, between demographic groups, respectively. Parity-based measures of fairness now exist for every common decision and prediction measure for an algorithm [36]. However, there is little consensus on how to best measure algorithmic fairness, and different measures can be impossible to simultaneously satisfy except under trivial conditions [29]. Some definitions of fairness (e.g., worst-group performance) have been used to guide data collection [1, 46, 39]. During the data collection process, these approaches presume both the hypothesis class of the downstream model as well as the predictive task. Similarly, post-hoc subgroup re-balancing may improve algorithmic fairness in certain downstream tasks [25, 55], but post-hoc corrections may also severely impact predictive accuracy [54]. In practice, datasets are used for a multitude of model types and predictive tasks, and as such, a dataset which is fair for one combination of predictive task and model may be unfair for other types.
The relationship between representation and fairness is less explored than its two constituent concepts. On the one hand, it is well-known that classifier performance tends to be poor for underrepresented groups and that increasing representation of these groups in training data can improve performance [8, 32, 51]. Yet, these notions do not establish an optimal level of representation to best support fairness. A naïve approach may be to equalize group proportions or sample more data points, but these techniques do not necessarily improve fairness [32]. [14] propose a decomposition of discrimination — a generalization of unfairness — into bias, variance, and noise terms, each with unique remediation strategies: increasing model capacity, sampling from the disadvantaged group(s), and collecting additional features. While this is a useful and intuitive way to categorize causes of unfairness, determining which factor(s) drives discrimination relies on having a Bayes-optimal classifier, which is often computationally impractical.
2 Preliminaries
To formalize our setting, let $\mathcal{X} \times \mathcal{G} \times \mathcal{Y}$ be a domain of features $x \in \mathcal{X}$, sensitive features $g \in \mathcal{G}$, and binary labels $y \in \mathcal{Y} = \{0, 1\}$. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{G} \times \mathcal{Y}$, i.e., $\mathcal{D}$ is the true population distribution. The data collector does not know the distribution $\mathcal{D}$, but may know its mean. Let $S$ be the set of sites, where each site $s \in S$ is associated with an underlying site-specific population distribution $\mathcal{D}_s$ over $\mathcal{X} \times \mathcal{G} \times \mathcal{Y}$. Importantly, the distribution $\mathcal{D}_s$ for every site is unknown to the data collector. Over the course of $T$ timesteps, the data collector will sequentially recruit samples from sites $s_1, \ldots, s_T \in S$, with the objective of building a representative final dataset. Each sample from site $s_t$ constitutes a draw from $\mathcal{D}_{s_t}$. After $T$ rounds, the data collector has a dataset $D$. Given a target demographic vector $\alpha$, which represents the ideal mean of the sensitive features $g$, the data collector aims to sample such that the dataset's mean demographic vector $\bar{g}_D$ is as close to $\alpha$ as possible. Thus, we conceptualize representativeness as inversely proportional to the distance $d(\bar{g}_D, \alpha)$ from $\bar{g}_D$ to $\alpha$; as this distance decreases, representativeness increases. For example, suppose there are two binary features of interest: gender (Male or Female) and age (Young or Old). A target vector of $\alpha = (0.3, 0.7)$ implies that an ideal dataset is 30% Male and 70% Young. Therefore, if $d$ is an $\ell_p$-norm, then a dataset which is 25% Male and 60% Young would be at distance $\|(0.05, 0.10)\|_p$ from $\alpha$. We next formally define representativeness.
Definition 1.
(Representativeness): The representativeness of a dataset $D$ with respect to a target demographic vector $\alpha$ and distance metric $d$ is inversely proportional to $d(\bar{g}_D, \alpha)$, where $\bar{g}_D$ is the mean vector of the demographics in the dataset.
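Given target vector $\alpha$, the objective of sampling the most representative dataset can be expressed as

$$\min_{s_1, \ldots, s_T \in S} \; d(\bar{g}_D, \alpha), \tag{1}$$

where $\bar{g}_D$ is the mean demographic vector of the dataset $D$ collected from the selected sites $s_1, \ldots, s_T$ and $d$ is the distance measure. We limit $d$ to distance measures which are convex in the collected set of sensitive features, including all $\ell_p$-norms with $p \geq 1$ and KL-divergence. It should be recognized that a key challenge with representative sampling is that the objective in Problem 1 is not supermodular, even for convex $d$, as a function of the collected samples. This is due to the nonlinear nature of the average $\bar{g}_D$ with respect to samples.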
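As a small illustration of the distance used in Definition 1, the sketch below computes the distance between a dataset's mean demographic vector and a target vector for the gender/age example in this section. The helper names `mean_demographics` and `lp_distance` are hypothetical and not from the paper, and the choice of which norm to report is left open here; the values for three common norms are shown side by side.

```python
def mean_demographics(rows):
    """Column-wise mean of a list of equal-length sensitive-feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def lp_distance(u, v, p=1.0):
    """l_p distance between two vectors; p = float('inf') gives the max norm."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Example from the text: target is 30% Male / 70% Young,
# while the collected dataset is 25% Male / 60% Young.
target = [0.3, 0.7]
dataset_mean = [0.25, 0.60]

d1 = lp_distance(dataset_mean, target, p=1)               # sum of gaps
d2 = lp_distance(dataset_mean, target, p=2)               # Euclidean gap
dinf = lp_distance(dataset_mean, target, p=float("inf"))  # largest gap

# A tiny dataset of (is_male, is_young) indicator rows and its mean vector.
toy_rows = [[1, 0], [0, 1], [0, 1], [0, 1]]
toy_mean = mean_demographics(toy_rows)
```

A smaller distance corresponds to a more representative dataset; a perfectly representative dataset has distance zero under any of these norms.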
3 Convex Formulation and Prior-Based Sampling
In this section, we first demonstrate how the data collector’s sampling problem can be formulated through the framework of a multi-armed bandit with concave reward (convex loss in our case). Utilizing this particular problem structure, we present our algorithm for constructing representative datasets. Our strategy for optimizing this objective is to provide a modified form of the objective in Equation 1 which is convex with respect to the samples collected at each time step. To do this, we first note that each iteration returns $k$ data points (the convex formulation holds when, in expectation, each iteration yields $k$ data points), and thus the final dataset after $T$ iterations will consist of $Tk$ examples, and the average demographic vector of the dataset can be written as

$$\bar{g}_D = \frac{1}{Tk} \sum_{t=1}^{T} \sum_{i=1}^{k} g_{t,i} = \frac{1}{T} \sum_{t=1}^{T} \bar{g}_t, \tag{2}$$

where $g_{t,i}$ is the sensitive feature vector of the $i$-th data point collected at time $t$ and $\bar{g}_t$ is the average demographic vector present in the sample collected at time $t$.
With this fact, the data collector’s objective can be expressed as a function simply of the sum of the means from each sample,

$$\min \; d\left(\frac{1}{T} \sum_{t=1}^{T} \bar{g}_t, \; \alpha\right), \tag{3}$$

where $\alpha$ is the target demographic vector and $d$ is the distance measure.
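The step from Equation 1 to Equation 3 rests on one identity: when every iteration returns the same number of data points, the mean over the pooled dataset equals the mean of the per-iteration means. The following sketch checks this numerically on synthetic binary features. The variable names and the toy sizes (5 iterations, 4 points each, 2 features) are illustrative assumptions, not values from the paper.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


random.seed(0)
num_rounds, per_round, num_features = 5, 4, 2

# rounds[t] is the sample collected at iteration t: per_round binary vectors.
rounds = [
    [[random.randint(0, 1) for _ in range(num_features)] for _ in range(per_round)]
    for _ in range(num_rounds)
]

# Mean over the pooled dataset (left-hand side of the identity).
pooled = [row for sample in rounds for row in sample]
overall_mean = mean_vector(pooled)

# Mean of the per-iteration means (right-hand side of the identity).
per_round_means = [mean_vector(sample) for sample in rounds]
mean_of_means = mean_vector(per_round_means)

gap = max(abs(a - b) for a, b in zip(overall_mean, mean_of_means))
```

The identity fails in general if iterations return different numbers of points, which is why the formulation assumes a fixed (or fixed in expectation) sample size per iteration.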
Theorem 1.
We defer this proof to appendix C. Since the samples returned by each site at a given time step can now be thought of as a single vector (the mean demographic vector of that sample), and the loss function is convex with respect to those sample vectors, the problem of representative sampling can be naturally formulated as a multi-armed bandit problem with convex loss. We next discuss a Bayesian sampling procedure which can capitalize on both this convex formulation and site-wise prior information.
3.1 Prior-based Bayesian Representative Sampling (PBRS)
Before outlining the details of our algorithm, we first discuss the motivation behind PBRS (Alg. 1), which is twofold. First, in many real-world domains where representativeness is a salient issue, a wealth of summary data is available, which allows data collectors to form reasonably accurate priors over the distributions at each site. Second, the Bayesian nature of our approach allows for dynamic control over how aggressively the prior distributions are updated after each sample. This is particularly useful in settings where the distributions at sites may change over time (a common occurrence in the real world); such shifts are discussed in Section 5.3. The full PBRS algorithm (Alg. 3) is in appendix A.
PBRS works by maintaining an estimate of the distribution of groups at each site, which corresponds to a multinomial distribution when sensitive features are binary and a multivariate normal distribution when sensitive features are continuous. In the former case, the parameters give the probability that an individual sampled from a given site will have a given sensitive feature value. In the latter, the parameters are the mean and covariance of sensitive features at that site. In both cases, each distribution is initialized via a prior estimate of the true distribution at the site. In the case that no prior is provided, a default prior can be induced either by assigning uniform values to each parameter or by taking values from the target vector. Throughout the course of constructing the dataset, the samples obtained at each time step can be used to update these distributions to more accurately reflect the true distribution of each site. To do this, we use the conjugate prior of each distribution to iteratively update the estimate. In the case of binary group features, the conjugate prior is represented by a Dirichlet distribution, and in the case of continuous group features, the conjugate prior is represented by an inverse Wishart distribution.
At each time step, the estimated distribution for each site is induced by sampling parameters from the corresponding conjugate prior, and is then used to compute the expected improvement in the distance to the target for each site. PBRS selects the site corresponding to the maximum expected improvement. The sample from that site is then used to update the conjugate prior. To better anticipate the possibility of site bias, we incorporate a hyperparameter which modifies the procedure through which conjugate distributions are updated by increasing the strength of samples from minority groups by a factor that scales with this hyperparameter. This hyperparameter incentivizes PBRS to more aggressively search for sites which yield individuals from minority groups, thus helping to circumvent site bias towards those groups.
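The selection and update steps described above can be sketched for the simplest case of binary sensitive features. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: each binary feature gets an independent Beta posterior (the two-category case of a Dirichlet), the distance is taken to be the l1 norm, the minority value of each feature is assumed to be coded as 1, and the names `choose_site`, `update_posterior`, and `boost` are hypothetical.

```python
import random


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def choose_site(posteriors, running_sum, n_seen, k, target, rng):
    """Sample site parameters from each posterior and pick the site whose
    expected sample of size k moves the dataset mean closest to the target."""
    best_site, best_dist = None, float("inf")
    for site, params in posteriors.items():
        # One Beta(a, b) posterior per binary sensitive feature.
        p_hat = [rng.betavariate(a, b) for a, b in params]
        new_mean = [(s + k * p) / (n_seen + k) for s, p in zip(running_sum, p_hat)]
        dist = l1(new_mean, target)
        if dist < best_dist:
            best_site, best_dist = site, dist
    return best_site


def update_posterior(params, sample, boost=1.0):
    """Conjugate update; observations equal to 1 are weighted by `boost`."""
    updated = []
    for j, (a, b) in enumerate(params):
        ones = sum(row[j] for row in sample)
        zeros = len(sample) - ones
        updated.append((a + boost * ones, b + zeros))
    return updated


# Toy run: two hypothetical sites with one binary feature and uniform priors.
rng = random.Random(0)
true_rates = {"A": [0.2], "B": [0.8]}
posteriors = {site: [(1.0, 1.0)] for site in true_rates}
target, k = [0.5], 10
running_sum, n_seen = [0.0], 0
for _ in range(30):
    site = choose_site(posteriors, running_sum, n_seen, k, target, rng)
    sample = [[int(rng.random() < p) for p in true_rates[site]] for _ in range(k)]
    posteriors[site] = update_posterior(posteriors[site], sample)
    running_sum = [s + sum(row[j] for row in sample) for j, s in enumerate(running_sum)]
    n_seen += k
final_mean = [s / n_seen for s in running_sum]
```

Setting `boost` above 1 makes the posterior treat minority observations as stronger evidence, which is one plausible reading of the hyperparameter described in the text; the exact update rule in the paper may differ.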
3.2 Distributed Prior-based Bayesian Representative Sampling (D-PBRS)
D-PBRS (Alg. 3, appendix A) modifies PBRS to allow multiple sites to be sampled from simultaneously in a single timestep, while still limited to the same total number of samples per timestep. D-PBRS distributes the budget according to an allocation vector over sites, which is selected to maximally decrease the distance to the target given all previously collected samples, subject to the per-timestep budget constraint. In the sampling step, the total samples are divided among the sites according to this vector, with fractional sample allocations rounded down and the remaining samples assigned to the site that minimizes the distance to the target. For example, in the case of two sites, the allocation vector determines how many of the budgeted samples are collected from the first site and how many from the second.
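The rounding step of the budget split can be sketched as follows. This is an illustration only: it assumes the allocation is expressed as fractions of the per-timestep budget that sum to at most one, and it takes the index of the site that receives leftover samples as an input rather than computing it. The function name `allocate_budget` and the argument `fallback_site` are hypothetical.

```python
import math


def allocate_budget(fractions, budget, fallback_site):
    """Split `budget` samples across sites according to `fractions`.

    Fractional allocations are rounded down, and any samples left over
    after rounding are given to the site at index `fallback_site`.
    """
    counts = [math.floor(f * budget) for f in fractions]
    counts[fallback_site] += budget - sum(counts)
    return counts


# Two sites, 10 samples per timestep: 5.5 and 4.5 round down to 5 and 4,
# and the one leftover sample goes to the fallback site (index 0 here).
split_uneven = allocate_budget([0.55, 0.45], 10, fallback_site=0)

# When the fractions divide the budget evenly, nothing is left over.
split_even = allocate_budget([0.25, 0.75], 40, fallback_site=1)
```

In a fuller implementation, the fallback index would be chosen by evaluating which site's extra samples reduce the distance to the target the most.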
3.3 Fair Arm-Based Sampling
We introduce a third arm sampling procedure (Alg. 2), one designed to optimally improve minmax algorithmic fairness. We enact this goal by first training a classifier on the available dataset, then evaluating its group-specific performance on a set of validation data. Next, we identify the group with the lowest AUC and sample from the arm with the highest proportion of that group. This algorithm represents an adaptation of previous work by [1] and [46] to our arm-based selection process. The full fair sampling algorithm (Alg. 4) is in appendix A.
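The arm-selection rule in this procedure (find the group with the lowest AUC, then sample from the arm with the highest proportion of that group) can be sketched in a few lines. The sketch assumes group-specific AUCs and per-arm group proportions have already been computed elsewhere; the function name `pick_fair_arm` and the dictionary layout are hypothetical.

```python
def pick_fair_arm(group_auc, arm_group_share):
    """Return (worst_group, arm), where worst_group has the lowest validation
    AUC and arm is the site with the largest share of that group."""
    worst_group = min(group_auc, key=group_auc.get)
    arm = max(arm_group_share, key=lambda a: arm_group_share[a][worst_group])
    return worst_group, arm


# Hypothetical validation AUCs per group and group shares per site.
group_auc = {"g0": 0.81, "g1": 0.74}
arm_group_share = {
    "site_a": {"g0": 0.9, "g1": 0.1},
    "site_b": {"g0": 0.4, "g1": 0.6},
}
worst_group, chosen_arm = pick_fair_arm(group_auc, arm_group_share)
```

In the full procedure this step would sit inside a loop that retrains the classifier and re-evaluates group AUCs after each new sample is added.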
4 Univariate Case Study
To build intuition for the relationship between representativeness and fairness, we examine classification when the predictive features are a single variable. Note that univariate classification and multivariate classification are equivalent in the sense that the single variable can correspond to the output of a score function applied to a multidimensional feature.
We begin by demonstrating the existence of a trade-off between fairness and representativeness. This trade-off stems from the relative difficulty of learning the joint distribution of features and labels; that is, as the relative difficulty of learning the joint increases, so does the trade-off between representativeness and fairness.
To capture the difficulty of learning the joint, let the relationship between and be defined as where gives the noise of the label . Let be the distribution over features and labels for group .
Theorem 2.
Suppose there are and samples collected from groups and respectively. Let be the optimal classifiers learned on these samples (in terms of expected accuracy). Let , i.e., the difference in accuracy between groups. Then
The key takeaway from Theorem 2 is that it allows us to quantify expected unfairness in terms of both the number of samples collected from each group and the relative noisiness of each group’s labels. The expression of expected unfairness immediately yields the following result.
Proof.
We defer the proof to appendix C. ∎
Theorem 3.
Suppose the optimal classifier trained on samples from group and samples of group has an unfairness of at most , then it must be the case that
and
This theorem indicates that in order to limit the accuracy-disparity between groups to be no greater than , the sampling rates between the two groups cannot be too different (where “too different” is dictated by the relative noise levels of each group, ).
Proof.
We defer the proof to appendix C. ∎
Theorem 4.
In order to achieve an unfairness of , the sample ratio between the two groups must be .
This theorem demonstrates that achieving the level of expected unfairness in Theorem 4 may not be possible within a fixed budget of total samples. To see this, imagine a case in which one group has vastly higher noise than the other. Then, the sampler will not be able to collect enough samples to achieve the required sample ratio.
Proof.
This result follows directly from Theorem 2. ∎
5 Methodology
Dataset | Sensitive Features | Target Feature | Location | Size |
---|---|---|---|---|
Law School | Race, Gender, Age, Family Income | Pass Bar | School | 20,454 |
Lending Club | Housing Status, Occupation | Repay Loan | ZIP Code | 124,040 |
Intensive Care | Race, Gender, Age | ICU Recovery | Hospital | 48,612 |
Texas Salary | Race, Gender | Earn $75k | Office | 142,981 |
Adult Income | Race, Gender, Age | Income $50k | — | 46,447 |
Community Crime | Race Proportion | Low Crime Risk | — | 1,994 |
5.1 Datasets
We evaluated our methodology on six commonly-used datasets: 1) Law School [52], 2) Lending Club [15], 3) Intensive Care [40], 4) Texas Salary [50], 5) Adult Income [7], and 6) Community Crime [43]. Each dataset contains features that differentiate between groups of interest, as well as location-based information (Tab. 1) when available. For datasets 1-4, we partition the dataset into disjoint sets sharing the same location, inducing sites (i.e., arms). The Law School, Intensive Care, and Texas Salary datasets include location information corresponding to actual sites, such as the student’s law school. For the Lending Club dataset, we induce sites by U.S. state. The Adult Income and Community Crime datasets do not have applicable location information, so they are not used to evaluate our sampling algorithms. Nevertheless, these two datasets have well-documented algorithmic fairness limitations, making them ideal case studies for our fairness analyses. Sites with fewer than 1,000 records were excluded from analysis due to small sample size limitations.
5.2 Sampling Procedure and Algorithms
For a target demographic vector and a distance measure, we iteratively select a site (or mix of sites) and receive data points randomly sampled from the partition corresponding to that site. After repeating the process for all time steps, we combine the data points into a single dataset and compute the distance between the target demographic vector and the average demographics of the constructed dataset. To demonstrate the improved efficacy of PBRS (BY(H) and BY(L) for high- and low-noise priors) and D-PBRS (DS(H) and DS(L) for high- and low-noise priors), we compare to three baselines: 1) $\epsilon$-Greedy (GRD), which randomly selects a site with probability $\epsilon$ and otherwise selects the site which has the maximum expected decrease in error; 2) UCB-LCB [2] (UCB), which is a UCB-based algorithm [6] for solving multi-armed bandit problems with convex loss; and 3) OL-Vec [28] (VEC), which derives a one-dimensional function to approximate the distance measure and uses online convex minimization to select the site at each timestep. In addition to the aforementioned baselines, we also compare to random site selection (RND) and to OPT, a policy that has full information and selects the site corresponding to the maximum expected decrease in error. OPT serves as the best possible myopic sampling scheme when the data collector is limited to a single site per timestep. To test our representative sampling algorithms, we analyzed a setting in which there are 20 arms (achieved via either randomly subsampling or duplicating sites, depending on the number of sites in the dataset), 50 time steps, and samples of 40 individuals per time step. Based on this setup, the constructed dataset corresponds to 2,000 examples. We use a class-balanced target vector and average performance across 100 experiments. To measure how effective each sampling algorithm is at producing a representative dataset with respect to the target demographic vector, we use a norm of the difference between the dataset's mean demographic vector and the target.
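The greedy baseline described above can be sketched as a single selection function. It assumes the expected decrease in error for each site has already been estimated and is passed in as a dictionary; the function name `epsilon_greedy_site` and the example values are hypothetical.

```python
import random


def epsilon_greedy_site(expected_decrease, epsilon, rng):
    """With probability epsilon pick a site uniformly at random; otherwise
    pick the site with the largest estimated decrease in error."""
    sites = list(expected_decrease)
    if rng.random() < epsilon:
        return rng.choice(sites)
    return max(sites, key=expected_decrease.get)


# Hypothetical per-site estimates of how much the distance would shrink.
estimates = {"site_a": 0.010, "site_b": 0.035, "site_c": 0.020}
rng = random.Random(0)

greedy_pick = epsilon_greedy_site(estimates, epsilon=0.0, rng=rng)    # always exploits
explore_pick = epsilon_greedy_site(estimates, epsilon=1.0, rng=rng)   # always explores
```

With `epsilon=0.0` the rule reduces to pure exploitation of the current estimates, and with `epsilon=1.0` it reduces to random site selection.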
5.3 Site Variations
No bias is our baseline. In this setting, site response distributions are induced by the location-based partitions and do not change over time.
Response bias occurs when certain demographics appear at sites with disproportionately high (or low) frequencies compared to other groups. For example, as shown in [4] the ratio of individuals identifying as ethnic minorities is substantially lower at the majority of law schools compared to the population. Response bias can be modeled using coefficients and , where members of majority groups are -times more likely to respond at sites compared to their base response rate at those sites. For example suppose there is one binary feature (i.e., two groups), , and , then individuals from the majority group are -times more likely to appear in a sample from half of the sites. The no variation setting is recovered when . We evaluate the representativeness of the final datasets constructed by the tested algorithms across a range of from to .
Lastly, causal distribution shifts occur when demographic distributions at each site change over time as the result of the data collector’s decisions. When selection is desirable (e.g., monetary compensation for participating in trials), individuals may modify their behavior in order to be selected. Causal distribution shifts affect response probability of each individual at site with coefficient s.t. . We evaluate the representativeness of the final datasets constructed by the tested algorithms across a range of from to using such that there is a response bias to causally magnify.
5.4 Arm Sampling and Downstream Fairness
To assess data quality with respect to downstream tasks, we compare the predictive efficacy of datasets produced by optimal arm-based sampling with OPT, arm-based fair sampling, location-agnostic stratified random sampling (SRS), and fair direct sampling. Each data domain is partitioned into four folds, generating four 75%/25% train/test splits. Then, 200 desired sensitive feature group fractions linearly spaced from 0 to 1 are generated. For SRS, a 2000-record sample is selected from the training set for each sensitive feature fraction. In arm-based sampling, the training set is partitioned by site, then 2000-record samples are generated for each sensitive feature fraction using OPT. Unlike the representative sampling algorithms, the fair sampling algorithms do not target a specific sensitive feature group balance. Instead, a group balance emerges secondarily as a result of selecting records from whichever of the two sensitive feature groups has lower performance (each analysis studies only one sensitive feature at a time). Fair arm-based sampling is achieved via Alg. 2 and is repeated five times in each train/test fold. Fair direct sampling adapts the algorithm proposed by [1] to successively identify the worse-performing group and draw mini-batches of 5 examples from it. We initialize the fair direct sampling algorithm with four examples, one for each combination of the two sensitive feature groups and two labels.
Two model classes, Logistic Regression (LR) and Gradient Boosted Decision Tree Classifiers (GBC), are fit to a single binary prediction task in which all non-sensitive features are used to predict the target feature in Table 1. Models are trained on each sampled dataset using default hyperparameters in scikit-learn and weighted to be class balanced in the label because of inherent label imbalance in our training datasets. Model AUCs are computed for the population and for each of the two sensitive feature groups. We evaluate algorithmic fairness by assessing the disparity in AUC between the two groups.
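The fairness measure used here, the gap in AUC between two sensitive feature groups, can be sketched without any library dependencies. The AUC below is computed directly as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, counting ties as one half. The function names and the toy scores are illustrative assumptions; in practice a library routine for AUC would typically be used instead.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_disparity(labels, scores, groups):
    """Per-group AUCs and the gap between the best and worst group."""
    by_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        by_group[g] = auc([labels[i] for i in idx], [scores[i] for i in idx])
    values = list(by_group.values())
    return by_group, max(values) - min(values)


# Toy predictions: the model separates group 0 perfectly but not group 1.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.3]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
group_aucs, gap = auc_disparity(labels, scores, groups)
```

A gap of zero would indicate AUC parity between the groups; larger gaps indicate greater disparity.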
5.5 Fairness and Complexity Analysis
To delve further into the relationship between representation and fairness, we study three datasets with known unfairness: Law School, Adult Income, and Community Crime. We start similarly to our previous studies of arm sampling and downstream fairness, with some key modifications. Because Adult Income and Community Crime do not have locations, we do not partition these datasets into sites and only apply SRS to sample for representation. Accordingly, we lift the restriction that each site must have 1,000 records. Because this analysis is more flexible with respect to record selection, we partition the datasets into ten folds (90% train / 10% test) and average results across these folds. To accommodate large differences in record counts between these datasets, we fix the training set size to the size of the smallest sensitive feature group in the training fold. This is the largest possible training set that still allows a cohort to be made up entirely of one group. We build training sets for 21 linearly spaced group proportions from 0 to 1, representing 5% proportion increments within the training data. As before, we train GBC models on the binary prediction task. In addition to presenting population and group-specific AUCs for various group proportions, we present true positive rates (TPR, i.e., sensitivity) and true negative rates (TNR, i.e., specificity). TPR and TNR parity are widely-used measures of algorithmic fairness and supplement AUC parity for the purposes of this analysis [21, 36].
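The construction of training sets at fixed size and varying group proportion can be sketched as follows. The sketch assumes records are already split into two lists by sensitive feature group and that sampling is without replacement; the function name `sample_with_proportion` and the toy group sizes are hypothetical.

```python
import random


def sample_with_proportion(group0, group1, n, frac_group1, rng):
    """Draw n records without replacement so that a fraction frac_group1
    of them (rounded to a whole record) comes from group1."""
    n1 = round(n * frac_group1)
    n0 = n - n1
    return rng.sample(group0, n0) + rng.sample(group1, n1)


# 21 linearly spaced proportions from 0 to 1, i.e., 5% increments.
proportions = [i / 20 for i in range(21)]

# Toy record IDs: 100 records in group 0 and 40 records in group 1.
group0 = list(range(100))
group1 = list(range(100, 140))

# Fix the training set size to the smaller group so that a cohort drawn
# entirely from either group is still possible.
train_size = min(len(group0), len(group1))

rng = random.Random(0)
training_sets = [
    sample_with_proportion(group0, group1, train_size, p, rng) for p in proportions
]
group1_counts = [sum(1 for r in ts if r >= 100) for ts in training_sets]
```

Each resulting training set has the same size, so differences in downstream performance across proportions are not confounded by the amount of training data.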
When modifying group representation does not decrease a significant performance disparity between groups, other factors must be limiting fairness. We theorize that the difficulty of learning plays a significant role in driving unfairness (Thm. 2). To examine this empirically, we evaluate the fairness of classifiers differing in their ability to capture complex relationships between features and labels. The capability of individual decision trees to capture complex relationships is driven primarily by the number of internal nodes, which is in turn driven by the depth of the tree [31, 9]. To control GBC complexity, we vary both the maximum tree depth from 1 to 8 (default: 3) and the number of estimation steps from 1 to 500 (default: 100); then we assess models constrained simultaneously by both of these limits. We hypothesize that a more complex classifier can capture more difficult learning relationships, and subsequently improve AUC, TPR, and TNR parity. We measure the group difference of AUC, TPR, and TNR for each sensitive feature in each of the three datasets outlined in this section. It is well-known that certain fairness constraints on models can harm model accuracy [14], so we also assess the total test set AUC of our classifiers as we vary model complexity.
6 Experimental Results
6.1 Sampling Algorithm Evaluation
Full results for all four datasets are available in appendix D.1. In Figure 1a, we show the representativeness of the dataset constructed over time by each approach in a no-bias situation. While performance is similar between all algorithms, D-PBRS yields the most representative samples, often approaching fully informed OPT. Response bias induces increased response rates of majority groups, i.e., individuals from majority groups are more likely, by a factor given by the response bias coefficient, to appear in a sample from biased sites compared to the group’s true distribution at that site. Significant response bias in either direction harms the representativeness of the final cohort (Fig. 1b). Yet, D-PBRS, and to a lesser extent PBRS, consistently yield more representative datasets than other sampling algorithms. Figure 1c depicts dataset representativeness as a function of the causal bias; as the causal bias increases, sampling a site increases the probability that members of majority groups will appear in future samples from that site. Similar to response bias, representativeness decreases as the bias becomes more pronounced. Unlike response bias, causal bias results in distribution shifts over time, increasing the difficulty of accurately assessing which arm is best to sample. Due to this shift, the advantage of PBRS and D-PBRS over other algorithms diminishes but is still present.
6.2 Arm Sampling and Downstream Fairness
Over- and underrepresentation of particular groups in training data is a well-known cause of unfairness within models trained on that data. In figure 2 we present population and group-specific test set AUC as a function of the split for each sensitive feature in the training dataset for our Intensive Care example. In figures 2a-c, the training dataset is constructed via arm-based sampling using OPT to achieve the desired group proportions; in figures 2d-f, the training dataset is sampled from all available training data using stratified random sampling (SRS) to achieve the desired group proportions. The outlined points throughout figures 2a-c and 2d-f indicate results for fair arm-based sampling and fair direct sampling, respectively. Both variations of fair sampling achieve the desired goal of selecting a mix of groups and that minimizes the performance difference between the two groups. In the SRS case, this confirms the expected result that increasing a group’s proportion in training data will improve, or at least not hinder, that group’s performance. However, a group’s performance improvement from increased representation can be quite limited at times. Figure 2e shows how AUC increases for group as group proportion in the training data increases, but there is no significant concomitant decrease in group AUC. Moreover, age (Fig. 2e) is the only sensitive feature for which there is some group proportion that equalizes AUC for groups and in the SRS analysis. The SRS analyses of ethnicity and gender show consistently better classifier performance of groups and , respectively, regardless of the training set proportions of these groups. Thus, there must be additional factors affecting algorithmic fairness beyond group representation. Given the theoretical results from the univariate case study in section 4, this is not unexpected if the noise values of the two groups are drastically different.
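The SRS construction used for figures 2d-f can be sketched as follows. This is a minimal illustration, assuming a binary sensitive feature and uniform sampling without replacement within each group; the function name `sample_with_group_proportion`, the record layout, and the toy data are hypothetical.

```python
import random


def sample_with_group_proportion(records, group_of, proportion, n, seed=0):
    """Draw n records so that a fraction `proportion` belongs to group 1 and the rest to group 0.

    Records are drawn uniformly without replacement within each group
    (stratified random sampling on the sensitive feature).
    """
    rng = random.Random(seed)
    n_g1 = round(proportion * n)  # number of records requested from group 1
    n_g0 = n - n_g1               # remainder comes from group 0
    g1 = [r for r in records if group_of(r) == 1]
    g0 = [r for r in records if group_of(r) == 0]
    if n_g1 > len(g1) or n_g0 > len(g0):
        raise ValueError("not enough records in a group to reach the requested proportion")
    sample = rng.sample(g1, n_g1) + rng.sample(g0, n_g0)
    rng.shuffle(sample)           # remove the group ordering introduced by concatenation
    return sample


# Toy pool: 30 records from group 1 and 70 from group 0.
records = [{"id": i, "group": 1 if i < 30 else 0} for i in range(100)]
training_set = sample_with_group_proportion(records, lambda r: r["group"], 0.25, 40)
print(sum(r["group"] for r in training_set), len(training_set))
```

Sweeping `proportion` from 0 to 1 and retraining at each value yields the kind of curve shown on the horizontal axis of figure 2, subject to the number of records available in each group.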
Another key result from this analysis is that the way datasets are constructed impacts the relationship between representation and algorithmic fairness. The SRS results show the expected behavior: as proportion increases, test set AUC improves and test set AUC deteriorates, though the effects may not always be statistically significant. On the other hand, arm-based sampling breaks this trend: when looking at both ethnicity and gender as sensitive features, increasing the training set proportion beyond its test set proportion causes deterioration of classifier performance for all groups. Thus, attempting to achieve a desired group representation through adaptive sampling across multiple sites may yield unexpected downstream results. We also note little difference between sampling with OPT and D-PBRS (Fig. 2), which indicates that the site-based framework, and not the representative sampling strategy, causes the discrepancy between SRS and arm-based methods.
6.3 Fairness and Model Complexity
The Adult Income dataset shows significant AUC and TPR unfairness across all three tested sensitive features of race, age, and gender (Fig. 3). Notably, modifying the training set proportions of groups and has limited effect on subgroup performance, except at the extremes (i.e., group proportions of 0 or 1). Thus, this dataset highlights the practical case where modulating representation will not adequately address fairness concerns. We shift our attention to increasing model complexity to better capture difficult-to-learn relationships between the features and labels. We show how increasing complexity through greater tree depth and more estimation steps can reduce TPR unfairness between gender groups (Fig. 4). A more complete complexity analysis shows similar results for AUC and TNR (Fig. 5), and other sensitive features within the Adult Income dataset show similar patterns. As tree depth and estimation steps increase, disparities in AUC, TPR, and TNR generally decrease, regardless of group representation in the training data. Moreover, this decrease in unfairness through increased model complexity does not come at the expense of overall model performance. In fact, classifier accuracy tends to improve with increasing complexity (appendix D.3, Fig. 31). While the highest complexities — estimators and depth — show a moderate decrease in AUC, this is beyond the regions where we see the most substantial improvements in AUC unfairness. We attribute these simultaneous improvements in both classifier accuracy and fairness with increased model complexity to the model being able to capture more complex data relationships.
7 Discussion
Representative datasets yield several benefits such as legitimacy, validity, equity, and generalizability. In machine learning, generalizability is closely related to algorithmic fairness, a measure of prediction or performance parity between different groups. In this paper, we analyze the relationship between representation and downstream algorithmic fairness in classification tasks across several datasets. Contrary to our expectations, we find that more representative datasets rarely yield fairer classifiers. Likewise, we find that datasets constructed to promote algorithmic fairness rarely are representative of the overall population. We theorize that this tension between representativeness and fairness exists when groups differ significantly in their difficulty to learn. If a large difficulty gap exists between groups, adding data points from the more difficult group may not be sufficient to overcome the disparity in classifier performance. We show how an alternate approach, increasing model complexity, can help close this performance gap. Thus, both representation and fairness may be simultaneously achieved.
In this paper, we also expand upon existing techniques for building a representative dataset from multiple data sources (e.g., multi-site clinical trial recruitment) through a Bayesian multi-armed bandit framework. Our methods succeed at generating representative cohorts across a variety of biases and distributional shifts. However, we find that downstream classifier performance differs significantly when cohorts are selected in a multi-site procedure to achieve a certain subgroup proportion compared to stratified random sampling of all records to achieve the same proportion. The distribution of features, sensitive features, and labels over sites influences classifier fairness. Thus, it is important to consider how a dataset is constructed beyond its demographics matching a target distribution.
Despite the contributions of this work, there are some key limitations to note. Representative sampling, as we have formulated it, focuses on matching a dataset’s attribute means to a target population; however, the underlying distributions of the dataset and target population may differ substantially. When it is important to match the shape of the dataset and target distributions, alternative measures for representation may perform better. Moreover, it is important to consider what it means to match attribute means of a dataset to a ground truth population. Such matching may be intuitive for physical or biological variables like age but becomes much more complicated for social variables like race, where the notion of ground truth does not necessarily apply. Finally, it is important to note that increasing model complexity will not always substantially improve algorithmic fairness. In fact, [14] show that if a Bayes-optimal classifier is algorithmically unfair, further fairness cannot be enforced without loss of performance. While we show that including additional data points from the disadvantaged group may not improve fairness, we echo their suggestion to collect additional features, if possible, in this situation. Future work may include broader definitions of representation that are not group-centric, as well as expanding these results to additional definitions of fairness like procedural fairness as opposed to classifier parity measures.
We conclude that the relationship between dataset representativeness and downstream fairness is complicated and influenced by numerous factors. While increasing a group’s representation in a dataset sometimes improves that group’s performance substantially, the practical constraints of dataset generation may sometimes cause the opposite effect. Sometimes, changing a group’s representation in a dataset has little impact on classifier performance; as shown, this may be due to learnability differences between groups. In these cases, we suggest that one way of addressing this particular unfairness is to increase model complexity to more adequately capture complex data relationships.
References
- [1] Jacob Abernethy, Pranjal Awasthi, Matthaus Kleindessner, Jamie Morgenstern, Chris Russell, and Jie Zhang. Active sampling for min-max fairness. arXiv preprint arXiv:2006.06879, 2020.
- [2] Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 989–1006, 2014.
- [3] Sveinung Arnesen and Yvette Peters. The Legitimacy of Representation: How Descriptive, Formal, and Responsiveness Representation Affect the Acceptability of Political Decisions. Comparative Political Studies, 51(7):868–899, June 2018. Publisher: SAGE Publications Inc.
- [4] American Bar Association. ABA profile of the legal profession. https://www.abalegalprofile.com/legal-education.php, 2021.
- [5] Abolfazl Asudeh, Zhongjun Jin, and HV Jagadish. Assessing and remedying coverage for a given dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 554–565. IEEE, 2019.
- [6] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
- [7] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
- [8] Amy R Bentley, Shawneequa L Callier, and Charles N Rotimi. Evaluating the promise of inclusion of african ancestry populations in genomics. NPJ genomic medicine, 5(1):5, 2020.
- [9] Candice Bentéjac, Anna Csörgő, and Gonzalo Martínez-Muñoz. A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54(3):1937–1967, March 2021.
- [10] Victor A Borza, Ellen Wright Clayton, Murat Kantarcioglu, Yevgeniy Vorobeychik, and Bradley A Malin. A representativeness-informed model for research record selection from electronic medical record systems. In AMIA Annual Symposium Proceedings, volume 2022, page 259. American Medical Informatics Association, 2022.
- [11] Sebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [12] Kevin Burke and Steve Leben. Court Review: Volume 44, Issue 1/2 – Procedural Fairness: A Key Ingredient In Public Satisfaction. Court Review: The Journal of the American Judges Association, January 2007.
- [13] L. Elisa Celis, Vijay Keswani, and Nisheeth Vishnoi. Data preprocessing to mitigate bias: A maximum entropy based approach. In Proceedings of the 37th International Conference on Machine Learning, pages 1349–1359. PMLR, November 2020. ISSN: 2640-3498.
- [14] Irene Chen, Fredrik D Johansson, and David Sontag. Why Is My Classifier Discriminatory? In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- [15] Lending Club. Lending club, peer to peer lending, 2020.
- [16] Ralph B. D’Agostino, Sr, Scott Grundy, Lisa M. Sullivan, Peter Wilson, and for the CHD Risk Prediction Group. Validation of the Framingham Coronary Heart Disease Prediction Scores: Results of a Multiple Ethnic Groups Investigation. JAMA, 286(2):180–187, July 2001.
- [17] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, pages 214–226, New York, NY, USA, January 2012. Association for Computing Machinery.
- [18] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.
- [19] Bailey Flanigan, Paul Golz, Anupam Gupta, Brett Hennig, and Ariel D Procaccia. Fair algorithms for selecting citizens’ assemblies. Nature, 596(7873):548–552, 2021.
- [20] Christopher B. Forrest, Kathleen M. McTigue, Adrian F. Hernandez, Lauren W. Cohen, Henry Cruz, Kevin Haynes, Rainu Kaushal, Abel N. Kho, Keith A. Marsolo, Vinit P. Nair, Richard Platt, Jon E. Puro, Russell L. Rothman, Elizabeth A. Shenkman, Lemuel Russell Waitman, Neely A. Williams, and Thomas W. Carton. PCORnet® 2020: current state, accomplishments, and future directions. Journal of Clinical Epidemiology, 129:60–67, January 2021.
- [21] Moritz Hardt, Eric Price, and Nati Srebro. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
- [22] Zhe He, Patrick Ryan, Julia Hoxha, Shuang Wang, Simona Carini, Ida Sim, and Chunhua Weng. Multivariate analysis of the population representativeness of related clinical studies. Journal of biomedical informatics, 60:66, April 2016. Publisher: NIH Public Access.
- [23] Daniela Huppenkothen, Brian McFee, and Laura Noren. Entrofy your cohort: A transparent method for diverse cohort selection. Plos one, 15(7):e0231939, 2020.
- [24] Laura P. Hurley, L. Miriam Dickinson, Raymond O. Estacio, John F. Steiner, and Edward P. Havranek. Prediction of cardiovascular death in racial/ethnic minorities using Framingham risk factors. Circulation. Cardiovascular quality and outcomes, 3(2):181–187, March 2010.
- [25] Badr Youbi Idrissi, Martin Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. Simple data balancing achieves competitive worst-group-accuracy. In Conference on Causal Learning and Reasoning, pages 336–351. PMLR, 2022.
- [26] Zhongjun Jin, Mengjing Xu, Chenkai Sun, Abolfazl Asudeh, and H. V. Jagadish. MithraCoverage: A System for Investigating Population Bias for Intersectional Fairness. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 2721–2724, Portland OR USA, June 2020. ACM.
- [27] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International conference on machine learning, pages 2564–2572. PMLR, 2018.
- [28] Thomas Kesselheim and Sahil Singla. Online learning with vector costs and bandits with knapsacks. In Conference on Learning Theory, pages 2286–2305. PMLR, 2020.
- [29] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent Trade-Offs in the Fair Determination of Risk Scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. DOI: https://doi.org/10.4230/LIPIcs.ITCS.2017.43.
- [30] Paul E. Leaverton, Paul D. Sorlie, Joel C. Kleinman, Andrew L. Dannenberg, Lillian Ingster-Moore, William B. Kannel, and Joan C. Cornoni-Huntley. Representativeness of the Framingham risk model for coronary heart disease mortality: A comparison with a national cohort study. Journal of Chronic Diseases, 40(8):775–784, January 1987.
- [31] Jean-Samuel Leboeuf, Frédéric LeBlanc, and Mario Marchand. Decision trees as partitioning machines to characterize their generalization properties. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, pages 18135–18145, Red Hook, NY, USA, December 2020. Curran Associates Inc.
- [32] Nianyun Li, Naman Goel, and Elliott Ash. Data-Centric Factors in Algorithmic Fairness. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’22, pages 396–410, New York, NY, USA, July 2022. Association for Computing Machinery.
- [33] Y. Liao, D. L. McGee, and R. S. Cooper. Prediction of coronary heart disease mortality in blacks and whites: pooled data from two national cohorts. The American Journal of Cardiology, 84(1):31–36, July 1999.
- [34] Syed S Mahmood, Daniel Levy, Ramachandran S Vasan, and Thomas J Wang. The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. The Lancet, 383(9921):999–1008, March 2014.
- [35] Brandy M. Mapes, Christopher S. Foster, Sheila V. Kusnoor, Marcia I. Epelbaum, Mona AuYoung, Gwynne Jenkins, Maria Lopez-Class, Dara Richardson-Heron, Ahmed Elmi, Karl Surkan, Robert M. Cronin, Consuelo H. Wilkins, Eliseo J. Pérez-Stable, Eric Dishman, Joshua C. Denny, Joni L. Rutter, and the All of Us Research Program. Diversity and inclusion for the All of Us research program: A scoping review. PLOS ONE, 15(7):e0234962, July 2020.
- [36] Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. Algorithmic Fairness: Choices, Assumptions, and Definitions. Annual Review of Statistics and Its Application, 8(1):141–163, 2021. eprint: https://doi.org/10.1146/annurev-statistics-042720-125902.
- [37] Fatemeh Nargesian, Abolfazl Asudeh, and HV Jagadish. Tailoring data source distributions for fairness-aware data integration. Proceedings of the VLDB Endowment, 14(11):2519–2532, 2021.
- [38] National Academies of Sciences, Engineering, and Medicine; Policy and Global Affairs; Committee on Women in Science, Engineering, and Medicine; Committee on Improving the Representation of Women and Underrepresented Minorities in Clinical Trials Research; Kirsten Bibbins-Domingo; and Alex Helman. Why Diverse Representation in Clinical Research Matters and the Current State of Representation within the Clinical Research Ecosystem. In Improving Representation in Clinical Trials and Research: Building Research Equity for Women and Underrepresented Groups. National Academies Press (US), May 2022.
- [39] Laura Niss, Yuekai Sun, and Ambuj Tewari. Achieving representative data via convex hull feasibility sampling algorithms. arXiv preprint arXiv:2204.06664, 2022.
- [40] Tom J Pollard, Alistair EW Johnson, Jesse D Raffa, Leo A Celi, Roger G Mark, and Omar Badawi. The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific data, 5(1):1–13, 2018.
- [41] Miao Qi, Owen Cahan, Morgan A Foreman, Daniel M Gruen, Amar K Das, and Kristin P Bennett. Quantifying representativeness in randomized clinical trials using machine learning fairness metrics. JAMIA open, 4(3):ooab077, 2021.
- [42] John Rawls. Justice as fairness. The Philosophical Review, 67(2):164–194, 1958. Publisher: [Duke University Press, Philosophical Review].
- [43] Michael Redmond. Communities and Crime. UCI Machine Learning Repository, 2009. DOI: https://doi.org/10.24432/C53W3X.
- [44] Tabea Schoeler, Doug Speed, Eleonora Porcu, Nicola Pirastu, Jean-Baptiste Pingault, and Zoltan Kutalik. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nature Human Behaviour, pages 1–12, 2023.
- [45] Nima Shahbazi, Yin Lin, Abolfazl Asudeh, and H. V. Jagadish. Representation Bias in Data: A Survey on Identification and Resolution Techniques. ACM Computing Surveys, page 3588433, March 2023.
- [46] Shubhanshu Shekhar, Greg Fields, Mohammad Ghavamzadeh, and Tara Javidi. Adaptive sampling for minimax fair classification. Advances in Neural Information Processing Systems, 34:24535–24544, 2021.
- [47] Harvineet Singh and Rumi Chunara. Measures of Disparity and their Efficient Estimation. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, pages 927–938, New York, NY, USA, August 2023. Association for Computing Machinery.
- [48] Giorgio Sirugo, Scott M. Williams, and Sarah A. Tishkoff. The Missing Diversity in Human Genetic Studies. Cell, 177(1):26–31, March 2019.
- [49] The All of Us Research Program Investigators. The “All of Us” Research Program. New England Journal of Medicine, 381(7):668–676, August 2019. Publisher: Massachusetts Medical Society eprint: https://www.nejm.org/doi/pdf/10.1056/NEJMsr1809937.
- [50] The Texas Tribune. The texas tribune government salary dataset. https://salaries.texastribune.org, 2021.
- [51] Angelina Wang and Olga Russakovsky. Overwriting Pretrained Bias with Finetuning Data. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3934–3945, Paris, France, October 2023. IEEE.
- [52] Linda F Wightman and Henry Ramsey Jr. LSAC research report series, 1998.
- [53] P.W.F. Wilson, R.B. D’Agostino, D. Levy, A.M. Belanger, H. Silbershatz, and W.B. Kannel. Prediction of coronary heart disease using risk factor categories. Circulation, 97(18):1837–1847, 1998.
- [54] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning Non-Discriminatory Predictors. In Proceedings of the 2017 Conference on Learning Theory, pages 1920–1953. PMLR, June 2017. ISSN: 2640-3498.
- [55] Haoran Zhang, Natalie Dullerud, Karsten Roth, Lauren Oakden-Rayner, Stephen Pfohl, and Marzyeh Ghassemi. Improving the fairness of chest x-ray classifiers. In Conference on Health, Inference, and Learning, pages 204–233. PMLR, 2022.
Appendix A Full Sampling Algorithms
Algorithm 3 outlines the full procedures for PBRS and D-PBRS. Notably, the difference between these two algorithms is in the “allocateAndSample” function, which decides whether fractional allocation is allowed.
Appendix B Data Preprocessing and Computing Infrastructure
For each dataset we follow a uniform procedure when preprocessing the raw data files. Ordinal features (e.g., an individual’s income) are scaled between 0 and 1. Non-ordinal categorical features (e.g., an individual’s occupation) are one-hot encoded. Binary features (e.g., ) are encoded as 0 and 1. All sensitive features are treated as binary or categorical. Only age and family income are non-categorical features in the raw datasets. In order to binarize these features we threshold on the mean age (family income) of the dataset and define categories of Young (Low Income) and Old (High Income). All analyses presented in this work were performed on an Apple M1 Max processor. Source code was written using Python 3.10.12.
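The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: scaling to the unit interval is taken to be min-max scaling, and a value is assigned to the upper category (e.g., Old) only when it lies strictly above the mean. The helper names `min_max_scale`, `one_hot`, and `binarize_at_mean`, and the example values, are hypothetical.

```python
def min_max_scale(values):
    """Scale numeric or ordinal values linearly into [0, 1] (assumed scaling scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """One-hot encode a non-ordinal categorical feature; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def binarize_at_mean(values):
    """Threshold a numeric sensitive feature at its mean: 1 if strictly above, else 0."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]


income_scaled = min_max_scale([10, 20, 30])
occupation_categories, occupation_encoded = one_hot(["clerk", "nurse", "clerk"])
age_group = binarize_at_mean([20, 30, 40, 70])  # 1 = Old, 0 = Young

print(income_scaled)
print(occupation_categories, occupation_encoded)
print(age_group)
```

How values exactly equal to the mean are assigned is a modeling choice; the sketch places them in the lower category.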
Appendix C Proofs
We provide full proofs for the theoretical results presented in the main body.
Proof of Theorem 1.
The objective in Equation 1 is
and the objective in Equation 3 is
To first prove equivalence between these two objectives when each sample yields individuals, we restate the derivation provided in the main body
as such, we see that for any ,
and the two objectives have equal optimums.
To show the convexity of the data collector’s objective w.r.t. the samples , we note that , is convex in , and thus for any linear function , the composition is also convex in . The function is linear in the collection of samples . Thus, is convex in the samples . ∎
Proof of Theorem 2.
Let be one datapoint, i.e., a feature and label respectively. Suppose that for a given the label is induced via where . Let be a dataset of such examples from group and such examples from group . Let be the classifier with the highest accuracy on .
Then the expected unfairness of with respect to each group’s true distribution over features and labels , can be written as
In the case that is a threshold classifier acting on both groups, the classifier with the highest accuracy on data will have the property that
where is the mean value of all features in which correspond to group . Thus, each error term is proportional to the empirical mean and the true feature mean . By the Mean Absolute Difference for normal distributions, this value is
for each group. Thus the expected difference in error rates is
∎
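The key quantity in the proof of Theorem 2 is the expected absolute gap between an empirical mean and the true mean of normally distributed features. As a numerical sanity check, the sketch below compares the standard closed form for n i.i.d. draws from a normal distribution with standard deviation sigma, namely sigma * sqrt(2 / (pi * n)), against a Monte Carlo estimate. The symbols `sigma` and `n`, the function names, and the chosen values are assumptions made for this illustration and are not the notation of the theorem.

```python
import math
import random
import statistics


def expected_abs_error_of_mean(sigma, n):
    """Closed form for E|sample mean - true mean| with n i.i.d. N(mu, sigma^2) draws."""
    return sigma * math.sqrt(2.0 / (math.pi * n))


def simulated_abs_error_of_mean(mu, sigma, n, trials, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample_mean = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        errors.append(abs(sample_mean - mu))
    return statistics.fmean(errors)


closed_small = expected_abs_error_of_mean(sigma=2.0, n=10)    # small group
closed_large = expected_abs_error_of_mean(sigma=2.0, n=1000)  # large group
simulated_small = simulated_abs_error_of_mean(0.0, 2.0, 10, trials=20000)

print(closed_small, simulated_small, closed_large)
```

The error shrinks with the square root of the group's sample size and grows linearly with its noise level, which is consistent with the intuition that a group with much noisier features can remain disadvantaged even when it is well represented.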
Appendix D Experimental Results
D.1 Sampling in Other Datasets
As an extension of main paper figure 1, we show performance of the nine algorithms on all four tested datasets in figures 6, 7, and 8. In all cases, the fully-informed algorithm OPT achieves the best performance, typically followed by D-PBRS, then PBRS and UCB-LCB.
Response Bias
Recall that response bias is defined by parameters the increased probability of majority group members responding in a sample, and the number of sites which have -bias. We can convert to a proportion scaling factor through the transformation . To implement response bias for binary sensitive features , we choose to represent the larger group. For example, if there are two features, age (Old or Young) and gender (Male or Female), where 70% of individuals are Old and 60% are Female, then corresponds to an individual who is both Old and Female. When sampling from site , rather than selecting examples uniformly at random from the associated data partition, examples are selected randomly with weights proportional to . Thus an individual with features in each majority group (i.e., ) has times more sample weight than an individual with features from each minority group (i.e., ). When , then and this sum reduces to for all individuals and the no-bias setting is recovered.
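To make the response-bias mechanism concrete, the following Python sketch implements one hypothetical weighting that reproduces the two properties stated above: an individual in every majority group receives `beta` times the weight of an individual in every minority group, and all weights coincide when `beta` equals 1. The specific form used here (a scaling factor `b = beta / (1 + beta)` and a per-feature sum of `b` for majority membership and `1 - b` for minority membership) is an assumption made for illustration, as are the names `response_weight` and `biased_sample`. For simplicity, the sketch samples with replacement.

```python
import random


def response_weight(x, beta):
    """Sample weight for binary features x (1 = majority group), under the assumed form."""
    b = beta / (1.0 + beta)  # assumed proportion scaling factor
    return sum(b if xj == 1 else 1.0 - b for xj in x)


def biased_sample(individuals, beta, k, seed=0):
    """Draw k individuals (with replacement) with probability proportional to their weight."""
    rng = random.Random(seed)
    weights = [response_weight(x, beta) for x in individuals]
    return rng.choices(individuals, weights=weights, k=k)


# Two binary features, e.g., (age, gender); (1, 1) is Old and Female in the example above.
majority_weight = response_weight((1, 1), beta=3.0)
minority_weight = response_weight((0, 0), beta=3.0)
print(majority_weight / minority_weight)

# A site with equal numbers of all-majority and all-minority individuals.
site = [(1, 1)] * 50 + [(0, 0)] * 50
draws = biased_sample(site, beta=3.0, k=10000)
majority_fraction = sum(1 for x in draws if x == (1, 1)) / len(draws)
print(majority_fraction)
```

Under this assumed weighting, individuals with mixed membership, such as `(1, 0)`, receive an intermediate weight, and setting `beta=1.0` gives every individual the same weight.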
Causal Distribution Shift Bias
Recall that causal distribution shifts are defined by parameter where the response probability of each individual at site is scaled by when sampled. To implement this for binary groups, we again represent each majority group with value and minority groups with value . Similar to the case of response bias, we re-weight the sample probabilities of the data partition associated with each site. The sample probability for each individual at site , after sampling iterations, is proportional to , where is the initial response probability of the individual, determined as described in the response bias section above. As or increase, members from minority groups are less likely to appear in repeated samples from the same site.
D.2 Arm Sampling and Downstream Fairness in Other Datasets
We present analyses for the population and group-wise accuracy of classifiers trained on datasets which vary in proportion of each sensitive feature for the arm sampling data domains not included in the main body (Law School, Lending Club, and Texas Salary). Figures 9-15 show population and group-wise performance as a function of the fraction of samples in the training data which are from (shown on each plot). For each dataset we present two analyses: one with a gradient boosted classifier (GBC) and one with a logistic regression (LRG) classifier. Each figure shows three sampling strategies: OPT, D-PBRS, and SRS, paralleling the methods and results for figure 2 from the main body. As discussed in the main body, there are two key observations in these figures. First, an increase in the representation of a given group does not always significantly improve downstream performance on that group even in the SRS case, e.g., Black race in the Texas Salary dataset (Fig. 11 bottom left). However, in other cases improved representation results in better performance for that group, e.g., Sex in the Texas Salary dataset (Fig. 11 bottom right). Second, the sampling method plays a crucial role in the relationship between downstream fairness and representation. The arm-based sampling methods OPT and D-PBRS often show very different subgroup performance than SRS (Fig. 11). Overall, the analyses with logistic regression exhibit patterns similar to those for gradient boosted classifiers.
D.3 Fairness and Complexity in Other Datasets
Baseline Unfairness
We present analyses of population and group-wise accuracy, true positive rates, and true negative rates of classifiers trained on the remaining datasets known to have unfairness that are not included in the main body (Law School, Community Crime). The methodology and presentation of these results parallel main body figure 3. There is significant TPR and TNR unfairness for groups determined by race in both datasets (Figs. 16 and 17).
Complexity Analysis
We include results for complexity analyses of all sensitive features on all datasets known to have unfairness (Law School, Adult Income, Community Crime). The methodology and presentation of these results parallel main body figure 5. In general, increased model complexity yields better AUC, TPR, and/or TNR parity, and thus fairer models (Figs. 18, 19, 20, 22, and 23). However, there are a couple of cases where increasing model complexity does not significantly improve fairness (Figs. 21 and 24). Nevertheless, it does not appear that increasing model complexity harms fairness, making it at least a potentially beneficial intervention from a fairness perspective.
Performance and Complexity Analysis
As discussed in the main body, improvements in algorithmic fairness can often come at the cost of overall classifier performance. To analyze whether any fairness gains we see from increased model complexity harm classifier performance, we show the overall test set AUC of models with varying complexity. Each figure in this section parallels a figure in appendix D.3 or main body figure 5. In general, overall classifier AUC does not substantially degrade with increasing model complexity. The highest complexity levels (estimators and depth ) sometimes show moderate degradation in performance (Fig. 25). However, substantial fairness gains can be realized at lower complexity levels (matched Fig. 18).