Beyond black box AI: Pitfalls in machine learning interpretability
New research exposes the hidden dangers of black box AI models, highlighting their vulnerability to manipulation and raising concerns about fairness and transparency
As artificial intelligence and machine learning models become more prevalent in business decision-making, there is growing concern about their lack of transparency. Amazon, for example, began developing an AI tool in 2014 to help streamline its hiring process. By 2015, the company had discovered that the tool was biased against women applying for technical roles such as software development. The model had been trained on resumes submitted to Amazon over a 10-year period, most of which came from men, reflecting the gender imbalance in the tech industry.
When Amazon tried to edit the algorithm to treat gender-related terms neutrally, it found it could not be sure the system would not find other ways to discriminate. The case demonstrates the challenge of detecting and correcting bias in black-box AI models (artificial intelligence systems whose internal workings and decision-making processes are not transparent), even for tech giants with significant resources. Amazon eventually abandoned the tool, but the episode raises questions about how such a model might have been interpreted or explained to hiring managers.
New research conducted by UNSW Business School and the University of Pennsylvania highlights the widespread use of black-box AI models in critical decision-making processes and the dangers of relying on interpretation methods that can be manipulated. The research examined how partial dependence plots – a common tool for interpreting machine learning models – can be distorted through adversarial attacks. The findings raise important questions about the reliability of AI interpretations and their use in high-stakes business decisions, according to one of the research paper’s co-authors, Dr Fei Huang, Senior Lecturer in the School of Risk and Actuarial Studies at UNSW Business School and Lead – Data and AI Tech at UNSW Business AI Lab.
“The inspiration for this research emerged while examining current regulations around fairness and interpretability in high-stakes decision-making processes, such as insurance pricing,” Dr Huang explained. “We noticed a significant gap in guidance regarding the appropriate models and tools that could effectively meet these fairness and transparency requirements. This led us to investigate the vulnerabilities inherent in popular interpretation methods, particularly permutation-based techniques (such as partial dependence plots) used for explaining black-box machine learning and AI models.”
Dr Huang said the goal of the research was not only to highlight the risks these methods pose, but also to offer valuable insights and recommendations for regulators and practitioners aiming to promote responsible, data-driven decision-making.
Read more: Resilience in AI leadership: In conversation with Stela Solar
The hidden dangers of black box models
Many industries have rapidly adopted complex AI models to enhance efficiency and accuracy in decision-making. While these models can deliver significant benefits, their inner workings are often opaque, according to the authors of the research paper, Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots. Co-authored by Dr Huang with Xi Xin, a PhD candidate in the School of Risk and Actuarial Studies at UNSW Business School, and Giles Hooker, Professor of Statistics and Data Science at The Wharton School of the University of Pennsylvania, the paper noted that the lack of transparency in black-box AI models has raised concerns from both regulators and consumers, particularly when such models are used for critical decisions.
They warned that this lack of transparency becomes especially problematic in sectors like finance, insurance, and healthcare, where decisions can have profound impacts on individuals’ lives. To address these transparency issues, organisations frequently employ interpretation methods from the growing field of interpretable machine learning. Tools like partial dependence plots aim to provide insights into how model inputs relate to outputs.
The research paper noted that regulators and practitioners in industries such as insurance and finance have recommended using such tools when implementing black-box models. These interpretation methods are seen as a way to bridge the gap between the complexity of advanced AI models and the need for explainable decision-making processes. However, the study suggests that this approach may provide a false sense of security, as these interpretation tools themselves can be subject to manipulation.
Read more: Building socially intelligent AI: Insights from the trust game
“Black-box AI models, by their nature, are often opaque, making it difficult to fully understand how decisions are being made,” Dr Huang said. “This lack of transparency, combined with their vulnerability to manipulation, can lead to significant risks. In the insurance industry, for instance, relying on black-box models without clear explanations and proper auditing can result in unfair premium calculations or unjust denial of claims. In finance, black-box models may inaccurately assess creditworthiness, potentially leading to biased lending decisions and unequal access to financial services.”
Manipulating model interpretations in the auto insurance industry
The study focused specifically on partial dependence plots, which display the marginal effect of a feature on model predictions. The researchers developed an adversarial framework that could manipulate these plots to conceal discriminatory behaviours while preserving most of the original model’s predictions. “Our results show that it is possible to intentionally hide the discriminatory behaviour of a predictor and make the black-box model appear neutral through interpretation tools like partial dependence plots while retaining almost all the predictions of the original black-box model,” the authors stated. This finding is particularly concerning as it suggests that organisations could potentially use these techniques to make biased models appear fair when scrutinised.
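To make clear what is being manipulated, it helps to see how a partial dependence plot is actually computed: the model’s predictions are averaged while one feature is swept across a grid of values, holding the rest of each record fixed. The sketch below illustrates that generic mechanic with placeholder data and a placeholder model; it is not the authors’ adversarial framework.

```python
# Minimal sketch of how a partial dependence plot is computed (generic PDP
# mechanics with placeholder data; not the paper's adversarial framework).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data standing in for, e.g., insurance claims features.
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid):
    """Average prediction over the data with `feature` fixed at each grid value."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # replace the feature for every row
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pdp = partial_dependence_curve(model, X, feature=0, grid=grid)
```

Because the plot only reports these averages, a model can be adjusted so that the averages look neutral even when individual predictions are not, which is the kind of weakness the study exploits.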
Using real-world datasets including auto insurance claims and criminal offender data, the researchers demonstrated how their framework could effectively fool partial dependence plots. The auto insurance claims dataset is particularly relevant given recent controversies in the insurance industry. For example, in 2023, the Illinois Department of Insurance fined Allstate $1.25 million for using an algorithm that resulted in higher auto insurance premiums for minority policyholders and those in minority neighbourhoods.
Insurance companies often use complex models to determine premiums, and regulators rely on interpretations of these models to detect unfair practices. The research suggests that even if an insurer had provided partial dependence plots or other interpretations to regulators, these could potentially have been manipulated to hide discriminatory pricing while maintaining the underlying biased model.
Implications for businesses and regulators
These research findings have significant implications for organisations relying on AI interpretations for decision-making or regulatory compliance. The ability to manipulate interpretation tools calls into question their reliability for detecting unfair or discriminatory practices. Businesses that have invested heavily in AI systems and rely on these interpretation methods to ensure compliance and fairness may need to re-evaluate their approaches. The researchers cautioned: “We advise against employing partial dependence plots as a means to validate the fairness or non-discrimination of sensitive attributes. This is particularly important in adversarial scenarios where the stakeholders providing and utilising interpretation methods have opposing interests and incentives.”
Read more: How insurers can mitigate the discrimination risks posed by AI
For businesses using black-box models, the study highlights the importance of not over-relying on a single interpretation method. The authors recommended using multiple complementary tools to gain a more holistic understanding of model behaviour. This multifaceted approach can help uncover potential biases or inconsistencies that might be hidden when relying on a single interpretation technique. They also emphasised that relying on post-hoc interpretation of black-box models should be treated as a last resort: “We argue that resorting to the third strategy – relying on interpretation methods to explain black-box models – should only be considered if accuracy is critical and the interpretation requirement is relatively low in the application context.”
Mitigating risks and enhancing model transparency
To address the vulnerabilities identified, the researchers proposed strategies for practitioners and regulators. These recommendations provide a roadmap for organisations looking to enhance the transparency and fairness of their AI systems:
- Enhance partial dependence plots with individual conditional expectation curves to help identify anomalies (see the sketch after this list). This approach provides a more granular view of how the model behaves for individual data points, making it harder to conceal discriminatory patterns.
- Carefully assess dependencies between features in the data before applying interpretation methods. Understanding these relationships can help identify potential areas where manipulation might occur and inform more robust interpretation strategies.
- Consider using inherently interpretable models like generalised additive models when possible, rather than relying on post-hoc interpretations of black-box models. These models offer a balance between predictive power and interpretability, reducing the need for complex explanation tools.
- Adopt hybrid approaches that leverage black-box models solely for feature engineering while using interpretable models for predictions. This strategy allows organisations to benefit from the power of advanced AI for data processing while maintaining transparency in the final decision-making process.
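As a companion to the first recommendation above, the following sketch shows how individual conditional expectation (ICE) curves relate to a partial dependence plot: the PDP is simply the average of the per-record ICE curves, so plotting the individual curves can expose heterogeneity that the average conceals. The model and data are placeholders, not the study’s datasets.

```python
# Minimal sketch of ICE curves: one prediction curve per data point rather than
# only the average (the PDP). Heterogeneous or crossing ICE curves that average
# out to a flat PDP are a warning sign that the aggregate plot may be hiding
# subgroup behaviour. Data and model are placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def ice_curves(model, X, feature, grid):
    """One curve per row: vary `feature` over `grid`, holding other features fixed."""
    curves = np.empty((len(X), len(grid)))
    for j, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = v
        curves[:, j] = model.predict(X_mod)
    return curves

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
ice = ice_curves(model, X, feature=0, grid=grid)
pdp = ice.mean(axis=0)    # the PDP is just the average of the ICE curves
spread = ice.std(axis=0)  # large spread flags heterogeneity the PDP averages away
```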
“Practitioners should avoid the overuse of black-box models if interpretable models can achieve the same level of model performance,” the authors advised. They noted that in many cases, alternatives like generalised additive models can provide similar predictive accuracy with greater transparency. This insight challenges the common assumption that more complex models are always better and encourages businesses to critically evaluate whether the benefits of black-box models truly outweigh the costs in terms of interpretability and potential fairness issues.
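For teams weighing that advice, the sketch below shows one way to build a GAM-style additive model with standard tooling: per-feature spline bases fed into a penalised linear model. The libraries, data and parameters here are illustrative assumptions, not the tooling used in the study; dedicated GAM libraries such as pygam in Python or mgcv in R are alternatives.

```python
# Minimal sketch of an interpretable, GAM-like additive model using only
# scikit-learn. Because the spline basis is expanded feature by feature, the
# fitted model is additive: prediction = f1(x1) + f2(x2) + ..., and each
# per-feature effect can be inspected directly. Data and settings are placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=0)

gam_like = make_pipeline(
    SplineTransformer(n_knots=8, degree=3),          # smooth basis per feature
    RidgeCV(alphas=np.logspace(-3, 3, 13)),          # penalised linear combination
).fit(X, y)

print("R^2 on training data:", round(gam_like.score(X, y), 3))
```

In practice, the point is to compare this kind of transparent model against the black-box candidate on held-out data before defaulting to the more complex option.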
For regulators, the study underscores the need for caution in relying on interpretation tools to assess model fairness. The researchers suggested examining correlations between features and using multiple complementary interpretation methods when evaluating black-box models. This more comprehensive approach to model evaluation could help prevent the concealment of discriminatory practices and ensure more robust regulatory oversight of AI systems.
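The suggestion to examine correlations between features can start with something as simple as a correlation screen: strongly dependent features are exactly where permutation-based methods such as partial dependence plots extrapolate into unrealistic data points. A minimal sketch, assuming the features sit in a pandas DataFrame of numeric columns (the function name and threshold are placeholders):

```python
# Minimal sketch: flag strongly correlated feature pairs before relying on
# permutation-based interpretation methods. Assumes numeric feature columns.
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return sorted(pairs, key=lambda t: -t[2])
```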
Towards more responsible AI adoption
As AI becomes increasingly central to business operations, ensuring transparency and fairness in automated decision-making is critical. This study highlights that current interpretation tools are not a panacea for black-box AI and may themselves be vulnerable to manipulation. For organisations leveraging AI, the findings emphasise the importance of a holistic approach to model interpretability and fairness. Rather than relying solely on post-hoc interpretations, businesses should carefully consider the trade-offs between model complexity, accuracy and transparency from the outset.
“Responsible AI is about balancing innovation with integrity,” said Dr Huang. “Companies need to go beyond accuracy and performance metrics and ask: can we explain this model’s decisions to regulators, customers, and ourselves? Auditing AI and any data-driven decision-making regularly for bias and vulnerabilities is critical. Just as businesses have checks and balances in their financial systems, the same rigour should apply to AI models that drive critical business decisions.”
In practical terms, the research suggested businesses should consider implementing the following steps to enhance their AI governance:
- Conduct regular audits of AI models using multiple interpretation methods to cross-validate findings and identify potential inconsistencies.
- Invest in training for data scientists and decision-makers on the limitations of interpretation tools and the importance of holistic model evaluation.
- Develop clear guidelines for when to use black-box models versus more interpretable alternatives, based on the specific context and stakes of the decision-making process.
- Engage with regulators and industry peers to develop best practices for AI transparency and fairness that go beyond reliance on potentially vulnerable interpretation tools.
- Consider establishing external advisory boards or partnering with academic institutions to provide independent oversight and validation of AI systems and interpretation methods.
By taking these steps, the research paper suggested, organisations can move towards a more responsible and robust approach to AI adoption, one that balances the power of advanced analytics with the critical need for transparency, fairness, and accountability in decision-making processes.
Dr Fei Huang is Senior Lecturer in the School of Risk and Actuarial Studies at UNSW Business School and Lead – Data and AI Tech at UNSW Business AI Lab, which partners with organisations to discover new business opportunities, overcome challenges to AI adoption and accelerate the next generation of leaders and entrepreneurs. For more information, please contact Professor Mary-Anne Williams.