Why We Keep Demanding Explainable AI in All the Wrong Places

Updated: December 18, 2025


In 1912, the Titanic sank because its watertight compartments weren't sealed at the top. The ship's designers believed they'd engineered safety through compartmentalization – seal off one flooded section and the rest stays dry. But water rose above the bulkhead walls and spilled over, flooding compartment after compartment. The safety mechanism became irrelevant once the failure mode exceeded its design parameters.

A century later, we're building AI safety mechanisms with the same conceptual flaw. Regulators demand explainability – show us why the model decided X instead of Y. But explainability tools are compartments designed for one failure mode (biased decisions we can audit) while the actual risks increasingly involve different failure modes entirely (models that perform well on benchmarks but fail catastrophically in edge cases we never anticipated). We're demanding transparency in contexts where it can't actually prevent the harms we care about, while underinvesting in it where it would matter most.

The explainability-performance tradeoff isn't a technical constraint we'll eventually engineer away. It's a fundamental tension that reveals something deeper about how organizations actually use AI versus how they pretend to use it. Understanding where this tension comes from – and when it actually matters – separates companies building robust AI systems from those building compliance theater.

Start with the economic reality. A model that's 2% more accurate at predicting customer churn is worth millions to a subscription business. A model that can explain why it flagged a particular transaction as fraudulent but catches 5% fewer fraudulent transactions is worth... well, it depends who you ask. The compliance team loves it. The CFO calculating losses from missed fraud hates it.

This creates a ratchet effect. Once you've deployed a model achieving 94% accuracy, switching to one with 89% accuracy feels like going backward, even if the new model offers perfect explainability. Revenue teams revolt. Product managers get nervous. The CEO asks uncomfortable questions about why we're deliberately choosing worse performance.

Banks learned this the hard way with credit scoring. In the 1990s, credit decisions used relatively simple scorecards you could print on a single page – twenty variables, weights you could explain to a loan officer, clear thresholds. Then ensemble methods arrived: random forests, gradient boosted trees, neural networks. Each generation of models approved 3-4% more good loans while rejecting the same bad ones, translating to tens of millions in additional revenue for large lenders.

But these models became progressively harder to explain. A random forest might use 500 trees, each considering different variable combinations. Sure, you could extract "feature importance" metrics, but try explaining to a rejected applicant why the 347th decision tree voted against them. The mathematical operations producing the decision are transparent – you can trace every calculation – but the reasoning remains opaque. We know how the machine reached its conclusion but not why that conclusion makes sense.
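To make that concrete, here is a minimal sketch of the gap, using scikit-learn on synthetic data (the dataset and feature names are stand-ins, not any bank's actual model). You can extract global feature importances from a forest of 500 trees in a few lines; what you can't extract is a human-readable account of why one particular applicant was rejected.

```python
# Minimal sketch: a random forest is fully transparent at the level of
# arithmetic and still opaque at the level of reasoning.
# Synthetic data and generic feature names; illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Global feature importances: which variables the forest leans on overall.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")

# What this does not give you: a loan-officer-style explanation of why
# tree #347 voted against one specific applicant. Each of the 500 trees
# has its own splits and its own decision path.
print(model.estimators_[346].get_depth())
```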

Most banks chose performance. They had to. Their competitors were approving more loans with lower default rates. The business logic overwhelmed the explainability concern.

The shape of the tradeoff makes this dynamic particularly sticky. Moving from a simple linear model to a complex ensemble might boost accuracy by 5-8%, and moving back sacrifices that entire gain. There's no middle ground where you get 90% of the performance with 90% of the explainability: near the high-performance end, modest gains in interpretability cost a lot of accuracy, and near the high-interpretability end, large sacrifices in interpretability buy only modest accuracy. The curve bows toward the worst of both worlds.

This explains why regulatory requirements for explainability so often become compliance theater. Banks subject to fair lending regulations now deploy complex models and bolt on post-hoc explanation tools – LIME, SHAP, counterfactual explanations. These tools generate something that looks like an explanation: "Your loan was denied because your debt-to-income ratio (impact: -12 points) and recent credit inquiries (impact: -8 points) lowered your score."

But these post-hoc explanations aren't how the model actually reasons. They're simplified stories reverse-engineered from the model's behavior. SHAP values tell you which features, if changed, would most alter the output. That's useful information, but it's not the same as understanding the model's internal logic. The model doesn't "think" in terms of SHAP values any more than your brain thinks in terms of fMRI activations.
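Here is roughly what generating that kind of explanation looks like, as a minimal sketch with the shap library on a toy model (the feature names are invented for illustration, and shap's exact return shapes vary a little between versions):

```python
# Minimal sketch of a post-hoc SHAP explanation for one scored applicant.
# Toy model, synthetic data, invented feature names; not a production credit model.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import shap

X, y = make_classification(n_samples=5000, n_features=6, n_informative=4, random_state=0)
features = ["debt_to_income", "recent_inquiries", "utilization",
            "credit_history_length", "income", "late_payments"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
applicant = X[:1]                                  # one applicant's feature vector
contributions = explainer.shap_values(applicant)[0]

# The "explanation" the applicant sees: per-feature nudges on the model's score.
for name, value in sorted(zip(features, contributions),
                          key=lambda pair: abs(pair[1]), reverse=True):
    print(f"{name}: {value:+.2f}")

# These numbers attribute the output to inputs by probing the model's behavior.
# They describe what would change the score, not how the model "reasons".
```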

The push for explainable AI rests on an implicit theory: transparent models are safer models. If we can see why the model decided something, we can catch biased or erroneous decisions before they cause harm.

This theory holds in some contexts and completely fails in others. Understanding which is which requires thinking clearly about failure modes.

Consider three scenarios:

Scenario 1: Resume Screening

A model reviewing job applications flags fewer candidates from certain universities. An explainability tool reveals the model weights "university ranking" heavily. HR reviews this, recognizes the bias (plenty of great candidates come from non-elite schools), and adjusts the model. Explainability caught a harmful pattern.

Scenario 2: Cancer Diagnosis

A radiology model achieves 96% accuracy detecting lung nodules – 4% better than the previous model. Radiologists trust it. Then a researcher discovers it's partially keying off metadata like patient positioning rather than actual pathology. The high performance masked a fundamental flaw. Post-hoc explainability tools like attention maps would have shown "the model focuses on the right regions of the image," missing the metadata dependency entirely.

Scenario 3: Autonomous Vehicle

A self-driving car navigates millions of miles safely, then suddenly accelerates through a red light, causing a fatal accident. Investigators examine the model. They can trace every calculation. They can see which sensors triggered which responses. They can identify the exact parameters that led to the decision. None of this explains why the model misinterpreted the situation or prevents the next unpredicted failure mode.

Explainability helped in Scenario 1 because the failure mode was simple bias that showed up in feature weights. It failed in Scenarios 2 and 3 because the failure modes involved either spurious correlations (cancer diagnosis) or unpredicted edge cases (autonomous vehicle) that looked fine when you examined model behavior.

This pattern keeps repeating. Explainability tools catch the failures they're designed to catch – biased feature weights, obviously problematic decision rules. They miss the failures that actually cause catastrophic harm – models that work brilliantly in the training distribution but fail unpredictably outside it.

Medical AI offers the clearest examples. A model trained to detect diabetic retinopathy from retinal images achieved excellent performance in trials. Doctors trusted it. Then hospitals discovered it performed poorly on images from different cameras. The model had partly learned to recognize camera artifacts rather than pathology.

Could explainability have caught this? In principle, yes – attention maps might have shown inappropriate focus on image borders or artifacts. In practice, no – these patterns only became obvious after the deployment failure, when researchers knew what to look for. Before deployment, the attention maps looked plausible enough. The model appeared to focus on relevant anatomical regions.

This reveals the core limitation. Post-hoc explainability tools show you what the model is doing but not whether what it's doing is right. They catch failures that violate your prior expectations (biased weights on protected attributes) but miss failures that look reasonable until they suddenly aren't (focusing on camera artifacts that happen to correlate with disease in the training data).

Watch what's actually happening in AI deployment, and you'll notice two separate trajectories emerging. They're diverging fast.

Track 1: Auditable AI for High-Stakes Human Decisions

These systems assist human decision-makers in regulated domains: credit, hiring, criminal justice, medical diagnosis. Performance matters, but so does defending decisions to regulators, courts, and affected individuals. Banks need to explain denied loans. Hospitals need to justify treatment recommendations. Employers need to defend hiring decisions.

Here, explainability isn't optional – it's the product. A credit model that can't explain its decisions isn't deployable, regardless of performance. JPMorgan Chase and Wells Fargo both maintain simpler, more explainable models for certain lending decisions despite having more accurate alternatives, because the regulatory and litigation risk overwhelms the performance gain.

This creates a ceiling on model complexity. The ceiling isn't technical – these organizations could deploy larger neural networks tomorrow. The ceiling is institutional. Legal teams, compliance officers, and risk managers constrain model choices because they're accountable for decisions they can't explain.

Track 2: Autonomous AI for Operational Decisions

Meanwhile, models deciding which warehouse items to restock, how to route network traffic, which ads to serve, or how to optimize chemical reactions face no such constraints. Nobody sues because they saw the wrong advertisement. No regulator demands an explanation for why the datacenter routing algorithm chose path A over path B.

Performance is everything. If a more complex model increases ad revenue by 0.3%, you deploy it. If a larger neural network reduces datacenter costs by 2%, you deploy it. Explainability is irrelevant because there's no institutional need to explain decisions.

These systems are becoming genuinely autonomous – making millions of micro-decisions with minimal human oversight. Recommendation algorithms shape what billions of people see online. Trading algorithms execute billions of dollars in transactions every day. Manufacturing AI optimizes production processes across global supply chains.

The complexity ceiling doesn't exist here. Models grow as large as they need to be for performance gains, constrained only by computational costs and engineering feasibility. GPT-4 reportedly has over a trillion parameters; OpenAI has never disclosed the figure. Nobody can explain why it generates a particular sentence. Nobody needs to. Users evaluate outputs directly – does this response help me or not?

This divergence will accelerate. Auditable systems face mounting regulatory pressure for transparency. The EU AI Act mandates explainability for high-risk applications. US agencies increasingly require algorithmic impact assessments. These requirements will push Track 1 toward simpler, more interpretable models even as Track 2 systems grow exponentially more complex.

We're heading toward a world where the AI making your loan decision is simpler and more explainable than it was five years ago, while the AI deciding what you see online, optimizing logistics networks, or controlling industrial processes is orders of magnitude more complex and less interpretable.

Forget regulatory compliance for a moment. When does explainability actually make your AI system more robust? It depends on your failure modes and correction mechanisms.

Explainability helps when:

You're auditing for bias in feature weights. If your model shouldn't consider race, gender, or zip code, inspecting feature importance catches this immediately. Simple, effective, worth doing.

You're debugging systematic errors. A model consistently fails on a subset of inputs. Explainability tools help identify what these inputs have in common – maybe they all have missing values for certain features, or fall outside the training distribution in some way (see the sketch below).

You're building trust with domain experts. Radiologists won't adopt a black-box cancer detection system, regardless of accuracy. Showing them that the model focuses on the same image regions they find important builds warranted trust. This isn't just psychology – it enables domain experts to catch errors the model makes by recognizing when its reasoning diverges from medical knowledge.
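For the debugging case above, the useful move is usually less glamorous than any explanation algorithm: slice your validation errors by whatever metadata you have and look for a slice that fails far more often than the rest. A minimal sketch, assuming a validation table with predictions, labels, and a few metadata columns (all of the column names here are hypothetical):

```python
# Minimal sketch: find what a model's failures have in common by slicing
# validation errors across metadata. All column names are hypothetical.
import pandas as pd

df = pd.read_csv("validation_with_predictions.csv")   # assumed to exist
df["error"] = (df["prediction"] != df["label"]).astype(int)

# Error rate per slice: acquisition source, missingness flags, customer segment, etc.
for col in ["camera_model", "income_missing", "customer_segment"]:
    by_slice = df.groupby(col)["error"].agg(["mean", "count"])
    print(by_slice.sort_values("mean", ascending=False), "\n")

# A slice with a sharply higher error rate (say, one camera model) points at a
# systematic failure that aggregate accuracy hides.
```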

Explainability doesn't help when:

You're trying to prevent unknown unknowns. The autonomous vehicle running the red light, the cancer detector keying off camera artifacts, the trading algorithm causing a flash crash – these failures emerged from complexity interacting with edge cases. Post-hoc explanations can't prevent them because you don't know what to look for until after the failure.

The task is irreducibly complex. Human radiologists can't fully explain how they detect certain cancers; they "just see it" after thousands of cases. Demanding explainability here means demanding we handicap AI to match human cognitive limitations rather than letting it develop its own effective strategies.

You're operating in an adversarial environment. Fraud detection, spam filtering, security systems: as soon as you explain how the model works, adversaries learn to evade it. The explainability requirement directly undermines security. Banks don't explain exactly how their fraud detection works for good reason.

The sophisticated approach recognizes these distinctions. Google's medical AI team deploys highly interpretable models for clinical decision support (where doctors need to understand recommendations) but less interpretable models for image preprocessing and quality checks (where robustness testing matters more than understanding individual decisions).

The naive approach treats explainability as universally good and demands it everywhere, regardless of whether it actually improves outcomes. This is how you get compliance theater – organizations checking an explainability box without actually using those explanations to make better decisions.

What actually makes deployed AI systems safer? Ruthless testing in deployment conditions. Continuous monitoring for distribution shift. Rapid rollback mechanisms. Genuine accountability for failures.

These things are boring. They're not intellectually exciting. They don't involve clever algorithms or fascinating research papers. They're engineering discipline, institutional processes, operational excellence.

But they work. Netflix doesn't make its recommendation algorithm explainable – it A/B tests relentlessly and monitors engagement metrics in real-time. Google doesn't explain why Search ranks results the way it does – it measures clicks, dwell time, and user satisfaction across billions of queries daily. These systems get safer through empirical feedback loops, not transparency.
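What does that boring infrastructure look like? Here is a minimal sketch of one piece of it: a distribution-shift check on an input feature using a two-sample Kolmogorov-Smirnov test, wired to a rollback hook. The threshold, the sample windows, and the rollback function are all placeholder assumptions, not anyone's production system.

```python
# Minimal sketch of a distribution-shift monitor with a rollback hook.
# Thresholds, sample sizes, and the rollback function are placeholder assumptions.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, live: np.ndarray, p_threshold: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test: do live inputs still look like
    the data the model was trained and validated on?"""
    _, p_value = ks_2samp(reference, live)
    return p_value < p_threshold

def rollback_to_previous_model():
    # Hypothetical hook into a model registry or feature-flag system.
    print("ALERT: distribution shift detected, rolling back to last known-good model")

# In a scheduled monitoring job: compare a training-time reference window
# against the most recent traffic for each important input feature.
reference = np.random.normal(0.0, 1.0, size=10_000)   # stand-in for training data
live = np.random.normal(0.4, 1.0, size=2_000)         # stand-in for shifted traffic

if has_drifted(reference, live):
    rollback_to_previous_model()
```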

Meanwhile, organizations demanding explainability often skip the hard work. They want explanation tools that let them audit models quarterly, check boxes for compliance, and move on. They're not building systems that detect when models degrade in production, automatically roll back harmful deployments, or maintain human oversight for high-stakes decisions.

Explainability often substitutes for accountability. It's easier to demand "show me why the model did X" than to build robust testing infrastructure, establish clear ownership for model failures, or maintain genuine human oversight of consequential decisions. Explainability feels like safety without requiring the institutional changes that would actually create safety.

This isn't an argument against explainability. It's an argument for deploying it strategically where it actually prevents harm, while investing equally in the unglamorous engineering work that catches the failures explainability misses.

The financial sector is slowly learning this. After algorithmic trading caused several flash crashes, regulations didn't focus on making algorithms explainable – they focused on circuit breakers, testing requirements, and "kill switches" for rapid intervention. These mechanisms don't explain why failures occur, but they limit damage when they inevitably do.
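The circuit-breaker idea generalizes beyond trading. A minimal sketch of the pattern: track a running measure of realized harm and cut over to human review once it crosses a limit. The harm metric, window size, and threshold here are all hypothetical.

```python
# Minimal sketch of a circuit breaker around an automated decision loop.
# The harm metric, window size, and limit are hypothetical.
from collections import deque

class CircuitBreaker:
    def __init__(self, max_loss_per_window: float, window: int = 100):
        self.max_loss = max_loss_per_window
        self.recent = deque(maxlen=window)   # losses from the last N decisions
        self.tripped = False

    def record(self, loss: float) -> None:
        self.recent.append(loss)
        if sum(self.recent) > self.max_loss:
            self.tripped = True              # kill switch: stop acting automatically

    def allow_automated_decision(self) -> bool:
        return not self.tripped              # once tripped, escalate to humans

breaker = CircuitBreaker(max_loss_per_window=50_000.0)

# In the serving loop: act automatically only while the breaker is closed.
# breaker.record(observed_loss)
# if not breaker.allow_automated_decision():
#     escalate_to_human_review()            # hypothetical escalation path
```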

Autonomous vehicle developers are learning it too. Waymo doesn't primarily focus on explaining individual driving decisions. They focus on simulation testing across millions of scenarios, sensor redundancy, conservative behavior in uncertain situations, and maintaining human operators who can intervene. The safety comes from robust engineering and conservative deployment, not from understanding every model decision.

Five years from now, we'll have AI systems you can fully explain and AI systems you can't explain at all. The gap between them will be enormous – not because of technology, but because of diverging institutional needs.

In regulated domains touching individual rights – credit, employment, criminal justice, healthcare – expect simpler models, mandatory explanations, and slower adoption of cutting-edge techniques. The regulatory ratchet only tightens. Every algorithmic bias scandal produces new requirements for transparency. These requirements will increasingly constrain model complexity.

Banks might eventually maintain two parallel systems: complex models for risk assessment and portfolio management (Track 2, where explainability doesn't matter), and simpler models for individual loan decisions (Track 1, where it does). The complex models will be substantially more accurate, but they'll never touch individual credit decisions because they can't meet explainability requirements.

Meanwhile, in operational domains where AI systems operate with minimal human oversight, expect exponential growth in model complexity. These systems will make millions of decisions daily that nobody examines individually. Their safety will come from empirical performance metrics, robust testing, and rapid correction mechanisms – not from explainability.

The interesting frontier is systems that span both tracks. An AI assistant that helps doctors diagnose diseases (Track 1, needs explainability) but also manages hospital resource allocation (Track 2, doesn't need it). A hiring system that screens resumes (Track 1, heavily regulated) but also optimizes job posting placement (Track 2, not regulated). These systems will need institutional boundaries deciding which decisions require explanations and which don't.

Organizations that navigate this successfully will recognize explainability as a tool, not a virtue. They'll demand it where it prevents real harm or meets genuine institutional needs. They'll skip it where it constrains performance without improving outcomes. They'll invest equally in the boring infrastructure that catches failures explainability misses.

Those that don't will build compliance theater – models that generate impressive-looking explanations nobody actually uses, while cutting corners on the unglamorous work that would actually make systems safer. And when those systems fail catastrophically, they'll point to their explainability tools and wonder why transparency didn't prevent disaster.

The answer will be the same as it always was: transparency shows you what happened, but only robust systems prevent it from happening in the first place. We keep confusing the two because demanding transparency feels like taking action without requiring the hard institutional work that actual safety demands.