On January 27, I wrote about the "clinician's black box." The tension was simple. We keep asking clinicians to trust systems that cannot explain themselves in clinical language. The output may be 95% accurate. But the path from input to output stays hidden. In practice, this creates three reactions: blind trust, quiet rejection, or that uncomfortable feeling of being overruled by a machine you do not understand.

Since January, I have been thinking less about explainability as a technical patch and more about argumentation as a design principle. What if we stop trying to make models perfectly transparent? What if we make them arguable instead?

An arguable system does not claim to be right. It presents a claim, the evidence it used, and its own doubts. Then it steps back and lets the clinician agree, push back, or say "you missed something." That is collaboration. That is not automation.

The Double Black Box Is Real

Federation Opacity:

We are piling complexity onto complexity. Federated learning protects privacy, yes. But as Hatherley, Søgaard, Ballantyne, and Pauwels pointed out in their 2025 paper, it also creates "federation opacity." You now have a black-box model trained on invisible data across invisible sites. The clinician sees a prediction for an ECG or a dengue patient but has no idea what distribution the model learned from. That is not one black box. That is two.

When a model flags "High Risk" and the patient looks stable, there is no middle ground. You either trust a system you cannot see into, or you ignore it completely. Neither is safe.

What Makes a System Arguable

Three things, from what I have tested in my own work.

Three Pillars of Arguable Systems:

1. Evidence you can inspect. The model states its claim and the findings that drove it, in clinical language rather than feature weights.
2. Rule alignment. A symbolic layer checks the neural output against the guidelines and local protocols clinicians already trust, and reports agreement or conflict.
3. Uncertainty reporting. The system names its own doubts, including where its training data was thin, instead of hiding behind a single score.

A Small Example from My Dengue Work

Rule-Augmented Dengue Triage Chatbot:

Let me walk you through something we actually built.

We had a symptom triage chatbot for dengue in Bangladesh. The first version was a simple decision tree. It followed WHO warning signs directly: platelet thresholds, vomiting, abdominal pain, bleeding. The rules were transparent but brittle. The model missed subtle combinations that local clinicians spotted immediately.
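To make that brittleness concrete, the first version was essentially a handful of hard-coded branches. The function below is a simplified sketch, not our exact rule set, and the platelet cutoff is a placeholder rather than the precise WHO threshold:

```python
# Sketch of the rules-only v1 triage: transparent, but blind to any
# symptom combination outside the hard-coded branches.
def triage_v1(platelets_per_ul: int, persistent_vomiting: bool,
              abdominal_pain: bool, mucosal_bleeding: bool) -> str:
    # Placeholder cutoff; substitute the guideline value in use locally.
    if mucosal_bleeding or platelets_per_ul < 100_000:
        return "warning signs: refer for clinical assessment"
    if persistent_vomiting and abdominal_pain:
        return "warning signs: refer for clinical assessment"
    return "no warning signs: advise monitoring and follow-up"
```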

So we tested a rule-augmented version. The neural network learned embeddings from patient-reported symptoms and basic lab trends across thousands of cases. The symbolic layer still enforced the exact rules that clinicians expect. When the hybrid system flagged a case, it showed four things together:

Claim: Moderate risk of dengue with warning signs.
Evidence from model: Platelet trend downward over 48 hours. Persistent fever day 4.
Rule alignment: Matches WHO criteria for warning signs. Does not meet criteria for severe dengue (no bleeding, no plasma leakage).
Uncertainty: Moderate. The patient's description of "bone pain" in Bangla colloquial terms was underrepresented in our training data.
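
For concreteness, here is roughly how that output can be represented as a single structured record. The class and field names below are illustrative, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class ArguableOutput:
    claim: str            # the risk assessment, stated in clinical language
    evidence: list[str]   # the findings that drove the prediction
    rule_alignment: str   # agreement or conflict with guideline criteria
    uncertainty: str      # qualitative confidence, with the reason named

output = ArguableOutput(
    claim="Moderate risk of dengue with warning signs",
    evidence=["Platelet trend downward over 48 hours",
              "Persistent fever, day 4"],
    rule_alignment=("Matches WHO warning-sign criteria; does not meet "
                    "criteria for severe dengue"),
    uncertainty=("Moderate: colloquial Bangla phrasing for 'bone pain' "
                 "was underrepresented in training data"),
)
```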

Now the clinician can argue. One reviewer looked at that output and said: "In my experience, that phrasing plus this patient's address in an endemic urban slum raises concern for early plasma leakage. I am moving this to high risk."

The system did not argue back. It simply accepted the override and logged the disagreement for retraining. That is the difference between an oracle and a colleague. The model made its logic available for challenge. The final decision stayed with the human.
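
Logging that disagreement is mechanically trivial, which is part of the point. A minimal sketch, assuming a flat JSONL store; the field names are hypothetical:

```python
import json
from datetime import datetime, timezone

def log_override(case_id: str, model_claim: str, clinician_claim: str,
                 rationale: str, path: str = "overrides.jsonl") -> None:
    """Append one clinician override to the log reviewed before retraining."""
    record = {
        "case_id": case_id,
        "model_claim": model_claim,
        "clinician_claim": clinician_claim,
        "rationale": rationale,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Nothing here argues back. The record simply preserves who disagreed, about what, and why.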

From Our Fairness-Aware ECG Work:

I have seen the same pattern in our fairness-aware ECG work. The neural representation learning picks up fine waveform patterns. The rule layers enforce age- and sex-specific thresholds from local guidelines. Uncertainty reporting highlights when skin tone or motion artifacts degrade signal quality. The clinician decides whether to trust the output or order a confirmatory test. The model stays open to scrutiny.
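
The rule layer in that system is, again, nothing exotic. A sketch with placeholder cutoffs, not the actual guideline values we deploy:

```python
def ecg_rule_check(age: int, sex: str, qtc_ms: float,
                   signal_quality: float) -> dict:
    """Apply age- and sex-specific QTc thresholds plus a quality gate."""
    # Placeholder cutoffs; substitute the local guideline table.
    if age < 15:
        threshold_ms = 460.0
    else:
        threshold_ms = 450.0 if sex == "male" else 470.0
    flags = []
    if signal_quality < 0.7:  # e.g. motion artifact or poor contact
        flags.append("signal degraded: consider a confirmatory 12-lead ECG")
    return {"qtc_prolonged": qtc_ms > threshold_ms,
            "threshold_used_ms": threshold_ms,
            "flags": flags}
```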

Why This Matters Right Now

Recent literature backs this up from multiple angles.

Evidence from Recent Research:

Parsons, Zuiderwijk, and Orchard published a review of qualitative studies on Task-Technology Fit in BMC Medical Informatics and Decision Making. Their finding was straightforward: AI tools fail when they do not leave room for human expertise. The friction becomes unproductive. An arguable system fits because it mirrors how clinicians already think.

Rosenbacke's PhD work from Copenhagen Business School looked at cognitive challenges in human-AI teams. He found that unexamined reliance leads to new errors. Clinicians develop heuristics around when to trust and when to ignore. Those heuristics are often wrong. But when systems invite argument, they keep the clinician's critical thinking engaged.

Even the survey work from Ojha and colleagues in ACM Transactions on Computing for Healthcare points in the same direction. Trustworthiness is not just about accuracy. It is about seeing uncertainty and being able to engage with the reasoning behind the result.

We Need Different Success Metrics

If we adopt this framework, we have to change how we measure success. Accuracy on a held-out test set is no longer enough.

Responsibility-Preserving Performance:

We should track what I call responsibility-preserving performance. Does the system leave clinicians better equipped to defend their decisions? Does it surface the right moments for human override? When the AI and the clinician disagree, does that disagreement lead to better outcomes or just more frustration?

These are harder metrics. They take longer to collect. But they match what actually happens at the bedside.
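
These metrics can fall out of the same override log described above. A sketch, assuming each record eventually gains an outcome annotation; the field names are hypothetical:

```python
import json

def responsibility_metrics(log_path: str = "overrides.jsonl") -> dict:
    """Summarize how often clinicians override and whether it helps."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f]
    overrides = [r for r in records
                 if r["model_claim"] != r["clinician_claim"]]
    improved = [r for r in overrides if r.get("outcome") == "improved"]
    return {
        "override_rate": len(overrides) / max(len(records), 1),
        "overrides_improving_outcome":
            len(improved) / max(len(overrides), 1),
    }
```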

A Simple Pipeline to Build Arguable Systems

1. Train your base neural model normally. Track where it struggles.
2. Identify clinically meaningful rules. Pull them from guidelines, local protocols, or expert interviews. Do not guess. Go talk to the clinicians who will use the system.
3. Build a symbolic layer that checks agreement and conflict between the neural output and those rules (see the sketch after this list).
4. Force the system to produce structured output: claim, evidence, rule alignment, and uncertainty type.
5. Test with real clinicians. Watch how they interpret the output. Watch where they argue. Fix those places.
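
Step 3 can start as small as this. A minimal sketch of the agreement check, with an illustrative rule set and decision threshold:

```python
def check_alignment(neural_risk: float, reported: set[str]) -> str:
    """Compare the neural score with a simple warning-sign rule (sketch)."""
    warning_signs = {"persistent_vomiting", "abdominal_pain",
                     "mucosal_bleeding"}
    rule_flags = bool(reported & warning_signs)
    model_flags = neural_risk >= 0.5  # illustrative threshold
    if model_flags and rule_flags:
        return "agreement: both flag warning signs"
    if model_flags != rule_flags:
        return "conflict: surface for clinician review"
    return "agreement: neither flags warning signs"
```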

The last step is the one everyone skips. A model that is slightly less accurate but more arguable will almost always lead to better real-world decisions.

Closing Thought

We do not need perfectly transparent models. That goal is probably impossible anyway. Neural networks are messy. So are human brains. The difference is that clinicians are used to arguing with each other. They present evidence. They admit uncertainty. They change their minds when new data arrives.

We should build AI that does the same thing.

The black box is not going away. But we can wrap it in rules, structure, and uncertainty reporting that makes it contestable. An arguable system does not ask for blind trust. It invites scrutiny. And in a dengue ward or an ICU or a primary care clinic, that is exactly what we should want.

References

Hatherley, J., Søgaard, A., Ballantyne, A., & Pauwels, R. (2025). Federated learning, ethics, and the double black box problem in medical AI. arXiv:2504.20656.
Parsons, C.S., Zuiderwijk, A., Orchard, N.A., et al. (2025). Task-Technology Fit of Artificial Intelligence-based clinical decision support systems: a review of qualitative studies. BMC Medical Informatics and Decision Making, 25, 397.
Rosenbacke, R. (2025). Cognitive Challenges in Human-AI Collaboration: A Study on Trust, Errors, and Heuristics in Clinical Decision-Making. PhD Series No. 04.2025, Copenhagen Business School.
Ojha, J., Presacan, O., Lind, P.G., Monteiro, E., & Yazidi, A. (2025). Navigating Uncertainty: A User-Perspective Survey of Trustworthiness of AI in Healthcare. ACM Transactions on Computing for Healthcare, 6(3), 1-32.
Thamson, K., & Panahi, O. (2025). Bridging the Gap: AI as a Collaborative Tool between Clinicians and Researchers. Journal of Biomedical Advancement Scientific Research, 1(2), 1-8.