Update 15 Mar 2021: This paper has just won the Best Paper Award at the 2021 ACM CHI Conference on Human Factors in Computing Systems

With the surge in literature focusing on the assessment and mitigation of unfair outcomes in algorithms, several open source “fairness toolkits” have recently emerged to make such methods widely accessible. However, the differences in approach and capabilities of existing fairness toolkits, and whether they are fit for purpose in commercial contexts, remain little studied. As these toolkits are to be integrated into developers' model build process, they have the potential to help improve fairness testing and mitigation at scale across domains (if and where appropriate). On the other hand, there is a risk of these toolkits being applied to an inappropriate use case, misinterpreted without considering the assumptions or limitations of the implemented methods, and/or misused (deliberately or otherwise) as a flawed certification of an algorithm's fairness.

There is currently no available general and comparative guidance on which tool is useful or appropriate for which purpose or audience. This limits the accessibility and usability of the toolkits and results in a risk that a practitioner would select a sub-optimal or inappropriate tool for their use case, or simply use the first one found without being conscious of the approach they are selecting over others.

This blog post summarises a recent paper by Lee and Singh (2020) on the “Landscape and gaps in open source fairness toolkits.” In a mixed-method study, they use a focus group, interviews, and surveys to assess the current capabilities of open source fairness toolkits and report on the key gaps and limitations that are crucial to address in order to help industry practitioners effectively test and mitigate unfair outcomes in their machine learning models.

Strengths and weaknesses of open source fairness toolkits

The table below, reproduced from the paper, compares the features of six toolkits: scikit-fairness / scikit-lego, Fairness 360, Aequitas tool, What-if tool, audit-ai, and Fairlearn. These toolkits were identified and reviewed through an Ethics DataDive with DataKind UK.

The table contains the list of toolkits and the types of models covered: regression problems, classification problems (binary only or multi-class), and/or problems with multi-class protected features. A subset of the toolkits handle regression (predicting a continuous variable, e.g. income) as well as classification (predicting a discrete variable, e.g. loan approved or denied). Some toolkits can only handle binary protected / sensitive features (e.g. male vs. female), while others support multi-class features (e.g. age or racial groups). As will be discussed in the next section, practitioners search for tools that are compatible with their model, and if working on a regression problem, two of these toolkits can be ruled out immediately. The table also contains the fairness metrics and mitigation techniques supported by the tool. The most comprehensive of them is Fairness 360 with more than 70 metrics, although its focus is on binary classification problems with some multi-class classification support and no support for regression.

One potential point of confusion is the differences in terminologies and definitions for the same metric. For example, equal opportunity difference is synonymous with false negative rate difference, and equal odds tests for both false positive and false negative rate disparities.
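To make the overlap in terminology concrete, the sketch below computes per-group false positive and false negative rates and their differences in plain Python. It is an illustration only and does not reproduce any toolkit's API: the function names and the toy labels are hypothetical. Equal opportunity difference is commonly defined as the difference in true positive rates, which is the false negative rate difference with the sign flipped, so the two carry the same information.

```python
from collections import defaultdict


def group_error_rates(y_true, y_pred, groups):
    """Return {group: (false_positive_rate, false_negative_rate)} for binary labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[group] = (fpr, fnr)
    return rates


def rate_differences(rates, group_a, group_b):
    """Differences (group_a minus group_b) in FPR, FNR and TPR."""
    fpr_a, fnr_a = rates[group_a]
    fpr_b, fnr_b = rates[group_b]
    return {
        "fpr_difference": fpr_a - fpr_b,
        "fnr_difference": fnr_a - fnr_b,
        # TPR = 1 - FNR, so this is always the negative of fnr_difference.
        "tpr_difference": (1 - fnr_a) - (1 - fnr_b),
    }


# Hypothetical toy data: 8 individuals in each of two groups.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
groups = ["a"] * 8 + ["b"] * 8

rates = group_error_rates(y_true, y_pred, groups)
diffs = rate_differences(rates, "a", "b")
print(rates)
print(diffs)
```

A model satisfying equal odds would show both `fpr_difference` and `fnr_difference` close to zero; in this toy data group "b" has the higher error rate on both counts.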

Most of these tools are also focused on group-level fairness metrics, while only What-if tool has a focus on individual-level fairness. Fairness 360 supports some individual fairness metrics (sample distortion).

The variety of fairness metrics makes it especially challenging for the user to know which metric is appropriate for each use case. The toolkits take different approaches to guiding users on which metric is appropriate for a given use case.

Gaps in open source fairness toolkits

Key gaps emerged from both the interviews and the surveys, reported below in 3 sections of user-friendliness, toolkit features, and contextualisation.

  1. Steep learning curve required to use the toolkits and limited guidance on metric selection
  2. Information overload vs. over-simplification of complex results
  3. Need for “translation” for a non-technical audience
  4. Accessibility of toolkit search process
  5. Limited coverage of the model pipeline
  6. Limited information on possible mitigation strategies
  7. Limited adaptability of existing toolkits to a customised use case
  8. Challenges in integrating the toolkit into an existing model pipeline

This blog post will provide a summary of each of these 8 identified gaps.

Gaps: User-friendliness

1. Steep learning curve required to use the toolkits and limited guidance on metric selection

In contrast with the perceived importance of guidance, the six toolkits’ average ratings for “guidance for users unfamiliar with fairness literature” ranged from 2.87 to 3.67 out of 5. The table below contains the average System Usability Scale (SUS) score out of 100 and its standard deviation. A systematic survey of the SUS found “products which are at least passable have SUS scores above 70, with better products scoring in the high 70s to upper 80s.” Almost half of the interviewees agreed or strongly agreed with: “I needed to learn a lot of things before I could get going with this [fairness toolkit] system.”
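For readers unfamiliar with how a SUS score out of 100 is derived, the sketch below applies the standard SUS scoring rule: ten items answered on a 1 to 5 scale, where odd-numbered items contribute (answer - 1), even-numbered items contribute (5 - answer), and the sum is multiplied by 2.5. It is assumed here, not confirmed by the post, that the study used this standard scoring; the function name and the three sets of responses are hypothetical.

```python
from statistics import mean, stdev


def sus_score(responses):
    """Convert ten 1-5 Likert answers into a 0-100 SUS score (standard scoring)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for index, answer in enumerate(responses):
        if not 1 <= answer <= 5:
            raise ValueError("each response must be between 1 and 5")
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += answer - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - answer
    return total * 2.5


# Hypothetical respondents, not data from the study.
respondents = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # best possible answers
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # neutral throughout
    [1, 5, 1, 5, 1, 5, 1, 5, 1, 5],  # worst possible answers
]
scores = [sus_score(r) for r in respondents]
print(scores, mean(scores), stdev(scores))
```

Averaging the per-respondent scores and taking their standard deviation gives the kind of summary the table reports for each toolkit.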

2. Information overload vs. over-simplification of complex results

In reviewing the visualisation and guidance, interviewees and survey respondents were often split in their preferences on the level of detail provided. Some respondents found the amount of information provided prohibitively complex, while others strongly preferred the detailed interface. One survey respondent said of the Fairlearn dashboard that “this makes everything look clear-cut, which it really isn't ‘in the wild.’”

Given there may be differences in the level of detail each user requires for his or her purpose, an ideal toolkit should have both (i) a number of options on the user interface that allows the user to deep-dive and slice and dice the analyses, and (ii) an easy-to-use interface that guides the user step-by-step.

3. Need for “translation” for a non-technical audience

As well as being challenging for data science practitioners with no fairness background, the toolkits were overwhelmingly rated as challenging for a non-technical user, especially in producing visualisations, guidance, and user interface that can be navigated by those without a background in math, statistics, and computer science. This results in a gap between the analysis done by the practitioners and what can be understood by the business function.

4. Accessibility of toolkit search process

Almost all interviewees claimed to “use a search engine and iterate through the results until one that meets their criteria are found, and no further search is conducted.” Only one interviewee reported to “comprehensively search for all available toolkits to compare the strengths and weaknesses before selecting the optimal tool.” All survey respondents were asked whether they had used any of the six toolkits (multiple selection allowed) and whether there were any other toolkits they were familiar with that were not listed. Only one respondent said they knew of another toolkit: FAT Forensics, released in late 2019 as the result of a collaboration between the University of Bristol and Thales. Overall, the fact that practitioners were familiar with almost no other toolkits suggests that this landscape coverage was sufficiently representative to explore such issues.

Gaps: Toolkit features 

1. Limited coverage of the model pipeline

There is an apparent focus of the toolkits on the model building and evaluation process as compared with the remaining model lifecycle. Some gaps specifically mentioned were checking whether the data set is representative of the broader population and whether there were features acting as proxies of protected features, e.g. postcode for race or occupation for gender. An interviewee claimed a major gap was in the lack of benchmarking data sets or a reference point for whether there is a selection bias in the data collection process. Another interviewee suggested there should be a way to understand which input features are potentially acting as proxies for protected features, “especially when the feature engineering has been done by a human.” A survey respondent also noted “the analysis needs to explore the idea of proxies, something we do manually today.”
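As a rough illustration of the proxy check interviewees asked for, the sketch below measures the association between one categorical input feature and a protected feature using Cramér's V (0 means no association, 1 means the feature fully determines the protected attribute). This is one possible approach and is not taken from the paper or from any of the toolkits; the function name and the postcode data are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(feature, protected):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(feature)
    if n == 0 or n != len(protected):
        raise ValueError("inputs must be non-empty and of equal length")

    joint = Counter(zip(feature, protected))
    feature_counts = Counter(feature)
    protected_counts = Counter(protected)

    # Degrees-of-freedom term; zero if either variable has a single category.
    k = min(len(feature_counts), len(protected_counts)) - 1
    if k == 0:
        return 0.0

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for f_value, f_count in feature_counts.items():
        for p_value, p_count in protected_counts.items():
            expected = f_count * p_count / n
            observed = joint.get((f_value, p_value), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Hypothetical data: postcode perfectly separates the two groups...
postcode = ["N1", "N1", "S2", "S2"]
group = ["x", "x", "y", "y"]
proxy_strength = cramers_v(postcode, group)

# ...while this second feature is unrelated to group membership.
unrelated = ["N1", "S2", "N1", "S2"]
no_proxy_strength = cramers_v(unrelated, group)

print(proxy_strength, no_proxy_strength)
```

A pairwise measure like this can only flag individual features. Combinations of features can also act as proxies, so a low score for each feature separately does not rule out proxy effects.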

2. Limited information on possible mitigation strategies

Enthusiasm was sharply divided for tools that offered “debiasing” pre-processing, in-processing, and post-processing implementations. Several interviewees were skeptical of these techniques, with one claiming it is “dangerous because it looks simple but doesn't solve any problem. It's like a gimmick… it doesn't solve the underlying issues of bias you may have.” Another interviewee noted that some of the bias mitigation tools could be inconsistent with anti-discrimination laws, especially any that explicitly use a protected feature (e.g. race) to give preferential treatment, and emphasised that mitigation “may not always be a technical solution.”

On the other hand, several other interviewees viewed these implementations favourably. One interviewee noted that some tools' “lack of mitigating action leaves a huge knowledge gap for data scientists to fill.”

Gaps: Contextualisation

1. Limited adaptability of existing toolkits to a customised use case

The strongest consensus regarding the ideal fairness toolkit was the importance of the “ability to adapt to a context-specific use case and data.” The existing toolkits were rated on this criterion at an average of 3.24 out of 5, with audit-ai scoring the lowest at 2.71 and Fairness 360 the highest at 3.73, with several interviewees noting that additional work would be needed for the toolkits to be applicable to their use cases.

2. Challenges in integrating the toolkit into an existing model pipeline

Another point of consensus was the importance of the ease of integration of a toolkit into the model building workflow and pipeline. However, the toolkits were rated an average of 3.24 in their ease of integration, with the lowest score at 2.47 for What-if tool and the highest score at 3.93 for Scikit-fairness. A part of the integration challenge lay in the requirement to load the data sets outside of the local desktop.

What-if tool’s FAQ states: “WIT [What-if tool] uses pre-trained models and runs entirely in the browser. We don't store, collect or share datasets loaded into the What-if tool. If using the tool inside TensorBoard, then access to that TensorBoard instance can be controlled through the authorized_groups command-line flag to TensorBoard. Anyone with access to the TensorBoard instance will be able to see data from the datasets that the instance has permissions to load from disk. If using WIT inside of colab, access to the data is controlled by the colab kernel, outside the scope of WIT.”

Similarly, Aequitas tool, while it has a desktop version available, also has a web-based application through which a user can upload a data set, with the caption: “Data you upload is used to generate the audit report. While the data is deleted, we host the audit report in perpetuity. If your data is private and sensitive, we encourage you to use the desktop version of the audit tool.”

This was seemingly a barrier for many interviewees and survey respondents. One survey respondent said any toolkit with any processing off-premise, even if the data set is not stored, “would need a very large amount of governance and security validation to be allowed to be used with corporate data.”

Additional considerations

For the focus group and interviews, practitioners with expertise in fairness were purposefully recruited and sampled; therefore, the results are only representative of those with a pre-existing understanding of the typical fairness challenges. However, the fact that both these stages found gaps and limitations, especially in user-friendliness and interpretability of the toolkits and their guidance, suggests that the learning curve may actually be much steeper for an average practitioner with more limited exposure to fairness metrics.

It was clear that a user interface with a one-size-fits-all tailoring toward practitioners with prior understanding of fairness limits the accessibility of these toolkits. Different users have varying preferences and needs from their interface. A key example of this is the high standard deviation in the survey ranking of the importance of mathematical definitions in a toolkit's guidance (mean: 4.04/8, standard deviation: 2.79) and the ranking of the importance of visualisations that are helpful for a non-technical audience (mean: 4.48/8, standard deviation: 2.57). As flagged in the interviews, some practitioners with a background in statistics may want a detailed mathematical definition, while those looking for a quick proof-of-concept may want a simple user interface for business stakeholders' review.

It is also important to consider whether the toolkits with necessarily reductionist definitions of fairness are appropriate and beneficial from a societal standpoint. Several academics have objected to the “automation” of fairness assessments because these tools fail to consider the socio-technical system, the nuanced philosophical and ethical debates, and the legal context of what it means to be fair. For Fairness 360, in answering whether the tool should be used at all, the guidance warns that the tool “applies to limited settings and is intended as a starting point for wider discussion.” The practitioners in the interview and survey were generally positive in their reaction to the notion of a fairness toolkit to help navigate an extremely complex issue, but several expressed concern about “fairness gerrymandering” (or “fairness washing”), that is, selecting metrics based on which ones the model happens to satisfy, and about the false confidence the toolkits may give to the model developer based on an incomplete or partial assessment of fairness. Future work could examine in depth the disclaimers and limitations described for each of the toolkits and whether they align with the academic understanding of the suitability of each implemented method.

Conclusion

Fairness toolkits are a recent phenomenon, emerging particularly over the last two years, and several interviewees were surprised to learn about their availability and diversity. Only 54% of survey respondents had used any open source fairness toolkit before, despite the study's sampling of groups with likely exposure to fairness-related concerns. With the growing attention on issues of fairness, it is important that any fairness toolkits are accessible, usable and fit for purpose. This paper can help inform future tool development in order to bridge the gap between the introduction of methodologies in academia and their applicability in real-life industry contexts.

Industry practitioners are still struggling with finding a way to identify and mitigate potential unfairness in their models and systems. Only by keeping close to the practitioners' requirements and preferences can the open source developers ensure widespread adoption of their toolkits. The toolkits were developed to encourage model developers to be more cognisant of the potential ethical implications of their algorithms in relation to their impact on societal inequalities. An effective fairness toolkit could foster the culture among practitioners to consider and assess unfair outcomes in their models, while a poorly framed or designed toolkit could engender false confidence in flawed algorithms. Future development of toolkits should remain vigilant to ensure their adoption is aligned to the over-arching goal: to ensure our algorithms reflect our ethical values of non-discrimination and fairness.

If you have any questions, contact michellealee@deloitte.co.uk