Technical 9 min read

How to Design a PHI Redaction System for Clinical AI

A clinical AI tool that sends patient names to an external API is a regulatory problem looking for an incident. PHI redaction is not a feature you add to a clinical AI product — it is part of the architecture. This is what the literature says it should look like, and how we built it for ClientJourney.

A psychologist drafts a progress note. It contains their client’s name, date of birth, employer, GP, partner’s name, and four sessions of clinical reasoning about a complex presentation. They click “Generate progress letter.” Two minutes later they have a clean, well-structured document ready to send to the referring GP.

The question that matters: where did the names go?

If the answer is “they were sent to an AI provider in plaintext along with the rest of the note,” that practitioner has a problem. Not a hypothetical problem — a regulatory one. AHPRA and the Psychology Board’s Code of Conduct require clinicians to protect client information across every system they use. The Australian Psychological Society’s recent guidance on integrating AI into practice is explicit that the choice of tool, including its data handling, is the practitioner’s responsibility.

PHI redaction is the architectural answer to that responsibility. It is not a feature you add to a clinical AI product. It is part of how the product is built, or it is missing.

This post is a design framework for clinical PHI redaction — what the components are, why they need to work together, and how we applied the framework to ClientJourney, the CBT-committed clinical AI tool we are preparing for UAT.

Why this matters more for clinical AI than other AI

There are three reasons clinical AI is a harder case than the generic “send sensitive data to ChatGPT” problem most businesses are working through.

The regulatory bar is higher. AHPRA and the Psychology Board do not care whether your AI tool has a strong privacy policy. They care whether client information was handled appropriately. A privacy policy is a promise. Architecture is a fact. Practitioners cannot point to a contract with an AI provider as evidence of compliance — they need to be able to demonstrate that PHI did not leave the boundary in identifiable form.

The downstream content is generative. A clinical AI tool generates an entire document — a progress letter, an intake assessment, a discharge summary — that the practitioner then reviews and sends. If the AI hallucinates a fact, attaches a real name to it, and the practitioner does not catch the error, the consequences compound. Hallucination plus identity is doubly harmful in clinical contexts.

The trust gap is wider. Clinicians are, correctly, sceptical of AI. The tools that earn trust in this space are the ones that respect the practitioner’s professional obligations rather than asking them to take privacy on faith. The audience for clinical AI reads documentation before they sign up.

What the literature says about redaction

We commissioned a literature review on local PHI masking before we built ClientJourney’s redaction layer. The review is published alongside this post as a resource. Seven principles emerged consistently across the academic record (Uzuner et al., 2007; Dehghan et al., 2015; Dernoncourt et al., 2017; Moore et al., 2023).

1. Layered systems outperform single-technique systems. No single approach (pattern matching, statistical models, neural language models) covers every category of identifying information well. The strongest systems combine multiple techniques because different categories of identifying information fail in different ways.
2. Catching the whole name matters, not just most of it. A system that masks the first name but misses the surname has still leaked a name. Privacy-critical evaluation has to measure whole-entity success, not partial credit. “Got most of the letters” is not the same as “got the name.”
3. Performance has to be broken down by category, not averaged. A system can report 95% accuracy overall and still miss half the names if names are a small share of total identifying information. Per-category breakdown (names separately, dates separately, organisations separately) is the only way to see where the system is actually weak.
4. Better to over-mask than under-mask. A missed name leaves identifying information in the text. An over-masked word removes content that did not need to be removed. Privacy-critical systems prioritise catching everything, even at the cost of occasionally masking words that did not need it.
5. The system has to track recurring people. The same person mentioned by full name in the intake assessment is the same person referred to by first name in the progress note three months later. Without that connection, later mentions leak through.
6. Replacements should preserve structure. Blanking every name out destroys the grammar and relationships in the text. Replacing names with consistent, role-aware codes maintains the structure the AI needs to produce useful output, and lets the system distinguish between the patient, their partner, their GP, and their psychiatrist even though the AI never sees any actual names.
7. Real-world testing matters more than benchmark scores. Note styles vary between practitioners, regions, and disciplines. A system that performs well on a published benchmark may not perform well on Western Australian psychology notes. Production-grade systems have to be validated against the documents they will actually see.

These are not engineering preferences. They are the consensus of two decades of clinical de-identification research, and they describe what a system that takes its job seriously looks like.
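The whole-entity and per-category principles are easy to make concrete. The sketch below is illustrative only: the function name, the span format, and the synthetic corpus are assumptions, not part of ClientJourney or of any cited evaluation. It scores recall per category and gives credit only when a predicted mask covers the entire gold entity.

```python
from collections import defaultdict

def whole_entity_recall(gold, predicted):
    """Return (per-category recall, overall recall).

    gold: list of (start, end, category) spans from a hand-labelled note.
    predicted: list of (start, end) spans the system masked.
    A gold entity counts as caught only if one predicted span covers it
    completely; masking part of a name earns no credit.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for g_start, g_end, category in gold:
        total[category] += 1
        fully_masked = any(p_start <= g_start and g_end <= p_end
                           for p_start, p_end in predicted)
        if fully_masked:
            caught[category] += 1
    per_category = {c: caught[c] / total[c] for c in total}
    overall = sum(caught.values()) / sum(total.values())
    return per_category, overall

# Hypothetical corpus: 90 dates and 10 names, as (start, end, category).
gold = [(i * 20, i * 20 + 10, "DATE") for i in range(90)]
gold += [(2000 + i * 20, 2000 + i * 20 + 10, "NAME") for i in range(10)]

# The system masks every date, five names in full, and only the first
# half of a sixth name (which earns no credit).
predicted = [(s, e) for s, e, c in gold if c == "DATE"]
predicted += [(s, e) for s, e, c in gold[90:95]]
predicted.append((gold[95][0], gold[95][0] + 5))

per_category, overall = whole_entity_recall(gold, predicted)
print(overall)        # 0.95
print(per_category)   # {'DATE': 1.0, 'NAME': 0.5}
```

With these invented numbers the headline figure is 95% while half the names leak, which is the failure mode that only a per-category breakdown exposes.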

The components of a working system

Translated into architecture, a PHI redaction system designed against those principles has five distinct components, running in a specific order.

1. Structured pre-pass — the practitioner is in charge

The most reliable layer is the one that does not need to guess. Most clinical software already holds key identifying information in structured fields — client name, DOB, contact details, employer, GP, psychiatrist, emergency contact. Before any text analysis happens, the redaction system walks those fields and extracts every known identifier directly. The system does not need to figure out whether “Sarah” is a name when the client record says the client’s name is Sarah.

This layer does the majority of the work, and it works because the practitioner controls it. ClientJourney’s client details screen captures the standard identifying fields automatically. A separate relationships screen lets the practitioner add the people who come up repeatedly in therapy: the partner, parents, siblings, manager, key team members, anyone whose name appears regularly in session content. Each entry is tagged with its relationship to the client, so the system can replace the name with a code that still tells the AI what role the person plays (partner, mother, manager) without revealing who they are.

This is deliberate. The practitioner knows their clients better than any model does. Giving them a place to declare the recurring named people in a client’s life means those mentions are caught reliably, every time, without depending on a language model to recognise the name.
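A minimal sketch of a structured pre-pass might look like the following. It is written in Python for readability and is not ClientJourney’s code, which runs in the browser. The field names, the `KnownIdentifier` type, and the sample record are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class KnownIdentifier:
    value: str  # the exact text to mask, e.g. a name
    role: str   # what the person or organisation is to the client

def collect_known_identifiers(client_record, relationships):
    """Walk the structured fields; no guessing is involved at this layer."""
    field_roles = {"name": "CLIENT", "employer": "EMPLOYER", "gp": "GP"}
    known = []
    for field_name, role in field_roles.items():
        value = client_record.get(field_name)
        if value:
            known.append(KnownIdentifier(value, role))
    for person in relationships:
        known.append(
            KnownIdentifier(person["name"], person["relationship"].upper())
        )
    return known

def mask_known(text, known):
    """Replace every declared identifier with its role tag in one pass."""
    by_value = {k.value.lower(): k.role for k in known}
    if not by_value:
        return text
    # Longest alternatives first so a full name wins over a shorter value.
    alternatives = sorted(by_value, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"[{by_value[m.group(0).lower()]}]", text)

# Hypothetical client record and relationships register.
client_record = {"name": "Sarah Nguyen", "employer": "Harbour Logistics",
                 "gp": "Dr Patel"}
relationships = [{"name": "Tom", "relationship": "partner"}]
known = collect_known_identifiers(client_record, relationships)

note = ("Sarah Nguyen reports conflict with Tom about her hours at "
        "Harbour Logistics. Letter to Dr Patel pending.")
masked = mask_known(note, known)
print(masked)
```

Because every value comes from a declared field, this layer has no false negatives for the identifiers it knows about; its coverage is limited only by what the practitioner has entered.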

Every report generated by ClientJourney also shows what was redacted and what was not. The practitioner sees the audit trail. If a name slipped through, they can add it to the relationships screen and regenerate. The system gets stronger with use, and the practitioner is the one driving it.
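One way to picture such an audit is a summary of which placeholders were used, plus a check that no declared identifier survives in the outgoing text. The sketch below assumes bracketed upper-case placeholders and a hypothetical `audit` helper; the real report format is not described in this post.

```python
import re
from collections import Counter

def audit(redacted_text, known_values):
    """Summarise placeholders used and flag known values still present."""
    placeholders = Counter(re.findall(r"\[[A-Z_0-9]+\]", redacted_text))
    leaked = [
        value for value in known_values
        if re.search(r"\b" + re.escape(value) + r"\b",
                     redacted_text, re.IGNORECASE)
    ]
    return {"masked": dict(placeholders), "leaked": leaked}

# Hypothetical outgoing text in which one name was missed.
outgoing = "[CLIENT] argued with [PARTNER]. [CLIENT] later phoned Desmond."
report = audit(outgoing, ["Sarah Nguyen", "Tom", "Desmond"])
print(report)
```

A check like this can only confirm that declared names are gone. A name nobody has declared still needs the practitioner’s eye, which is why the report is shown to them rather than acted on silently.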

2. Rule-based recognisers

The next layer catches PHI with predictable surface forms: phone numbers, email addresses, Medicare numbers, ABNs, AHPRA registration numbers, ICD-10 codes, context-anchored dates of birth, addresses. These categories are stable across institutions. A phone number looks like a phone number. The literature is clear that for highly regular PHI classes, rule-based recognisers outperform statistical models because the patterns are deterministic.

This is also where Australian-specific identifiers earn their keep. Generic clinical AI tools built for the US market do not recognise Medicare numbers or AHPRA registration numbers as PHI. A clinical AI tool intended for Australian practitioners has to know what identifiers exist in Australian clinical text.
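To illustrate the rule-based layer, here is a small sketch with four recognisers. The patterns are simplified assumptions about common surface forms (for example, a Medicare number written as ten digits in a 4-5-1 grouping, and an AHPRA registration number as three letters followed by ten digits). They are not a validated rule set and not ClientJourney’s rules, and every sample value is invented.

```python
import re

# Illustrative patterns only. They deliberately lean toward over-matching.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Australian numbers starting with 0 or +61, then nine more digits.
    "PHONE": re.compile(r"(?<!\d)(?:\+61[ -]?|0)[2-478](?:[ -]?\d){8}(?!\d)"),
    # Ten digits, first digit 2-6, optionally grouped 4-5-1.
    "MEDICARE": re.compile(r"(?<!\d)[2-6]\d{3}[ -]?\d{5}[ -]?\d(?!\d)"),
    # Three upper-case letters followed by ten digits.
    "AHPRA": re.compile(r"\b[A-Z]{3}\d{10}\b"),
}

def find_rule_based(text):
    """Return sorted (start, end, label) spans for every pattern hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)

def mask(text, hits):
    """Replace each span with its label, skipping overlapping hits."""
    out, cursor = [], 0
    for start, end, label in hits:
        if start < cursor:
            continue
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# All values below are made up for the example.
sample = ("Call 0412 345 678 or email sarah@example.com. "
          "Medicare 2123 45670 1, AHPRA PSY0001234567.")
masked_sample = mask(sample, find_rule_based(sample))
print(masked_sample)
```

A production rule set would need more formats (bracketed area codes, dates of birth anchored to context words, street addresses) and testing against real notes, but the shape stays the same: deterministic patterns, one label per class.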

3. Named entity recognition — the safety net

By the time the first two layers have run, the majority of identifying information is already masked. The structured pre-pass has handled the client, their relationships, and their care team. The rule-based layer has handled phones, emails, dates, and registration numbers. What is left is the residue: the people mentioned once in passing. The shopkeeper named in a session about social anxiety. The uncle mentioned in a single line of family history. The previous therapist named once in an intake.

This is where a small language model runs locally on the practitioner’s device. Its only job is spotting proper nouns the earlier layers missed — names of people, organisations, hospitals, places. It does no clinical reasoning, no diagnosis, no analysis. One model, one task, running in the browser.

Running the model on the practitioner’s device is the critical architectural choice. A redaction system that performs this step on a remote server has already sent the data off the practitioner’s machine. The privacy boundary has been crossed before the redaction begins. ClientJourney runs this layer inside the browser, which means the entire redaction process — including the language model — happens before any data leaves the practitioner’s machine.
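The safety-net role can be sketched as a merge step: spans from the earlier layers take precedence, and the recogniser only contributes spans that do not overlap them. The sketch below does not include a real model. `toy_recogniser` is a crude capitalisation heuristic standing in for the local NER model, and all names in it are hypothetical.

```python
import re

def toy_recogniser(text):
    """Stand-in for a local NER model: flags capitalised words that do
    not start a sentence. A real model would replace this function."""
    spans = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        before = text[:match.start()].rstrip()
        if before and before[-1] not in ".!?":
            spans.append((match.start(), match.end(), "NAME"))
    return spans

def add_residue_spans(existing, candidates):
    """Keep candidate spans that do not overlap a span already found."""
    merged = list(existing)
    for start, end, label in candidates:
        overlaps = any(start < e and s < end for s, e, _ in merged)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)

text = ("Sarah avoided the bakery after Halloran laughed. "
        "Her uncle Desmond called Sarah later.")

# Spans the earlier layers already produced (here: the declared client).
existing = [(m.start(), m.end(), "CLIENT")
            for m in re.finditer(r"\bSarah\b", text)]

merged = add_residue_spans(existing, toy_recogniser(text))
labelled = [(text[s:e], label) for s, e, label in merged]
print(labelled)
```

Giving earlier layers precedence keeps the role information they carry: the second “Sarah” stays tagged as the client rather than being downgraded to a generic name.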

4. Propagation

Once entities are identified, the system sweeps the entire note set for any remaining mentions of values it has already found. The patient mentioned by full name in the intake assessment is the same patient referred to by first name in the progress note. Without propagation, those later mentions leak through.

Propagation is what the literature calls a second-pass patient-specific dictionary — a temporary, in-memory map built from high-confidence detections in the first pass, used to catch residual mentions in the second pass. Dehghan et al. (2015) attribute measurable recall gains to this technique. It costs almost nothing computationally and substantially reduces residual leakage.
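As a rough sketch of the idea, assuming first-pass detections arrive as (text, label, confidence) tuples and that 0.9 is a reasonable confidence cut-off (both assumptions made for illustration), a patient-specific dictionary and second pass could look like this:

```python
import re

def build_patient_dictionary(detections, min_confidence=0.9):
    """Build a temporary in-memory map from surface form to label."""
    dictionary = {}
    for text, label, confidence in detections:
        if confidence < min_confidence:
            continue
        dictionary[text.lower()] = label
        # Register each part of a multi-word value, such as a first name.
        # Short tokens like "Dr" are skipped to limit over-masking.
        for part in text.split():
            if len(part) > 2:
                dictionary.setdefault(part.lower(), label)
    return dictionary

def propagate(notes, dictionary):
    """Second pass: find every mention of a known value in every note."""
    if not dictionary:
        return []
    alternatives = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in alternatives) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for note_index, note in enumerate(notes):
        for m in pattern.finditer(note):
            label = dictionary[m.group(0).lower()]
            hits.append((note_index, m.start(), m.end(), label))
    return hits

# Hypothetical note set and first-pass detections.
notes = [
    "Intake: Sarah Nguyen was referred by her GP.",
    "Progress: Sarah reports improved sleep. Nguyen family visit went well.",
]
detections = [("Sarah Nguyen", "CLIENT", 0.99), ("Harbour", "ORG", 0.4)]

dictionary = build_patient_dictionary(detections)
hits = propagate(notes, dictionary)
found = [notes[i][s:e] for i, s, e, _ in hits]
print(found)
```

Splitting names into parts will sometimes mask an ordinary word that happens to match a name part. That trade is consistent with preferring over-masking to under-masking.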

5. Replacement that preserves meaning

The final component decides what the redacted text actually looks like. The naive approach — replacing every name with [REDACTED] — destroys the structure the AI needs to do useful work. The patient and their partner both become [REDACTED]. The GP and the psychiatrist both become [REDACTED]. The AI loses the ability to tell anyone apart.

The right approach replaces each name with a consistent code that carries the person’s role. The patient becomes one code, the partner becomes another, the GP becomes a third — and the same person always gets the same code, every time they appear, across every report. The AI sees the same grammar and the same relationships it would see in the original text. It can still tell who is who. It just never sees an actual name. When the AI response comes back, the practitioner’s browser swaps the codes for the real names before anything appears on screen.

This is the layer that lets a redacted progress letter still read like a progress letter — and lets the practitioner trust that the document was generated from a coherent understanding of who is who, even though the AI never saw a real name.
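A minimal sketch of role-preserving replacement and the reverse swap follows. The code format (`[ROLE_n]`), the function names, and the sample people are assumptions for illustration; the post does not specify what ClientJourney’s codes look like.

```python
import re

def build_codebook(people):
    """people: list of (role, [names]); the first name is the canonical
    form restored later. Every alias of a person maps to the same code."""
    counters, alias_to_code, code_to_name = {}, {}, {}
    for role, names in people:
        counters[role] = counters.get(role, 0) + 1
        code = f"[{role}_{counters[role]}]"
        code_to_name[code] = names[0]
        for alias in names:
            alias_to_code[alias.lower()] = code
    return alias_to_code, code_to_name

def redact(text, alias_to_code):
    """Swap every alias for its code before text leaves the device."""
    if not alias_to_code:
        return text
    alternatives = sorted(alias_to_code, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: alias_to_code[m.group(0).lower()], text)

def restore(text, code_to_name):
    """Swap codes back to real names; unknown codes are left untouched."""
    return re.sub(r"\[[A-Z_]+_\d+\]",
                  lambda m: code_to_name.get(m.group(0), m.group(0)), text)

# Hypothetical people for one client.
people = [("CLIENT", ["Sarah Nguyen", "Sarah"]),
          ("PARTNER", ["Tom"]),
          ("GP", ["Dr Patel"])]
alias_to_code, code_to_name = build_codebook(people)

note = "Sarah Nguyen argued with Tom. Sarah will see Dr Patel."
redacted = redact(note, alias_to_code)
print(redacted)

# A made-up model reply that refers to people only by code.
reply = "[CLIENT_1] reports conflict with [PARTNER_1]; copy to [GP_1]."
restored = restore(reply, code_to_name)
print(restored)
```

Keeping codes stable across reports means the codebook has to persist per client. In this sketch it is rebuilt from the same ordered list each time, which yields the same codes as long as the list does not change.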

Why ClientJourney does it this way

ClientJourney is a drafting tool, not a system of record. The practitioner’s official notes still live in their primary practice management system. ClientJourney’s job is to take clinical content the practitioner has already captured and synthesise it into the 17 clinical documents practitioners draft repeatedly, including intake assessments, case formulations, treatment plans, progress reviews, SOAP/BIRP/DAP notes, discharge summaries, referrer letters, and supervision reflections.

That positioning lowers the regulatory bar — ClientJourney is not the source of truth — but raises the trust bar. The practitioner has to be confident that nothing identifying leaves their device, because the alternative is not using the tool at all.

The five-component architecture above is how that confidence is earned. The structured pre-pass walks the schema. Rule-based recognisers handle Australian identifiers. The local language model catches the residue. Propagation links recurring mentions of the same person. Role-preserving replacement maintains the structure the AI needs to produce a useful document. Names never reach the AI provider, and the entire redaction pass happens before the data leaves the practitioner’s machine.

An earlier version of this architecture has been running in production since April 2026 for CoachIQ, our executive coaching platform: the same on-device redaction pipeline, applied to a different professional context. The clinical version refines two things specifically. First, the structured pre-pass is elevated to a first-class layer with practitioner control over the relationships register. Second, the AHPRA-relevant identifier set (Medicare numbers, AHPRA registration numbers, Australian provider numbers) is built into the rule-based layer rather than added as a domain extension. We wrote about the original architecture in “How We Built On-Device De-Identification So AI Never Sees Real Names.”

This is grounded in the literature, not in marketing copy. We commissioned the lit review before we wrote the redaction code because we wanted the architecture to follow the evidence, not the other way around.

What we are doing next

ClientJourney is entering UAT shortly. Two further evaluations will follow, documented to the same standard as the literature review:

1. Redaction performance under realistic clinical text. How does the five-component system perform on real Australian psychology notes rather than benchmark corpora? We will report recall on rare names, professions, and organisations, with performance broken down by category.
2. Document fidelity under redaction. Do the 17 clinical reports degrade in quality when generated from redacted input compared to unmasked input? Where does redaction hurt the output, and where doesn’t it?

These may be published together as a single paper or separately, depending on how the findings sit alongside each other. Either way, both will be fact-checked before publication using the same methodology applied to the lit review itself — every numerical claim traceable to its source, every limitation acknowledged in the open.

If you are a psychologist, counsellor, or allied mental health practitioner with clinical experience and would like to participate in ClientJourney’s UAT, the call for testers is on the portfolio page. Testers receive early access during testing and a free 12-month subscription at launch.

The literature review on local PHI masking is available as a resource — it is the source material for everything in this post and the foundation under ClientJourney’s redaction architecture.
