XML Redaction API | Schema-Aware PII Detection

XML Data Protection at Scale

XML remains the backbone of enterprise data exchange, healthcare systems (HL7, CCD), financial services (SWIFT, FpML), and government communications. RedactionAPI provides schema-aware XML processing that detects and removes PII while guaranteeing structural validity, which is essential for systems where malformed XML causes failures.

Understanding XML Redaction Challenges

XML's hierarchical structure, namespace complexity, and strict validation requirements make PII redaction more challenging than flat text processing. Simply replacing text can break schema validation, damage document structure, or corrupt cross-references.

Key Technical Challenges

Schema Validation

XML schemas (XSD) define strict rules for element types, lengths, and patterns. Redaction must produce output that remains valid against these schemas.

Namespace Complexity

Documents may use multiple namespaces with prefixes, default namespaces, and inheritance. XPath queries must be namespace-aware.

Cross-References

XML ID/IDREF attributes create document-internal references. Redacting an ID without updating references causes validation failures.
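
The cross-reference problem can be sketched in Python with the standard library. This is a minimal illustration, not RedactionAPI's implementation: the `pseudonymize_ids` helper and the attribute names `id` and `ref` are hypothetical, and a real document declares its ID/IDREF attributes in its DTD or schema.

```python
import xml.etree.ElementTree as ET

def pseudonymize_ids(root, id_attr="id", ref_attr="ref"):
    """Replace every ID value with an opaque token and rewrite matching references."""
    mapping = {}
    # First pass: assign a replacement to each ID value.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            mapping[value] = f"id-{len(mapping) + 1}"
            elem.set(id_attr, mapping[value])
    # Second pass: point each reference at the replacement of its target.
    for elem in root.iter():
        value = elem.get(ref_attr)
        if value in mapping:
            elem.set(ref_attr, mapping[value])
    return mapping

doc = ET.fromstring(
    "<Accounts>"
    '<Customer id="ssn-123-45-6789"/>'
    '<Order ref="ssn-123-45-6789"/>'
    "</Accounts>"
)
mapping = pseudonymize_ids(doc)
redacted = ET.tostring(doc, encoding="unicode")
```

Running both passes before serializing keeps every reference pointing at an ID that still exists in the output.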

Document Size

Enterprise XML files can be gigabytes in size. DOM-based processing loads the entire document into memory and can exhaust it; streaming parsers avoid this but require careful state management.

Schema-Aware Processing

When you provide an XSD schema, RedactionAPI validates input and ensures redacted output remains compliant:

Schema-Aware Request

{
    "document": "<?xml version=\"1.0\"?><Customer xmlns=\"http://example.com/customer\">...</Customer>",
    "document_type": "xml",
    "schema": {
        "type": "xsd",
        "content": "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">...</xs:schema>"
    },
    "validation": {
        "validate_input": true,
        "validate_output": true,
        "fail_on_invalid": true
    },
    "pii_types": ["name", "ssn", "email", "phone", "address"]
}
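
When assembling this request in code, it is usually safer to let a JSON library escape the embedded XML than to escape quotes by hand. A minimal Python sketch follows; the field names come from the request above, while the `build_schema_request` helper is hypothetical and no network call is made.

```python
import json

def build_schema_request(xml_text, xsd_text, pii_types):
    """Assemble the schema-aware request body; json.dumps escapes the XML quotes."""
    payload = {
        "document": xml_text,
        "document_type": "xml",
        "schema": {"type": "xsd", "content": xsd_text},
        "validation": {
            "validate_input": True,
            "validate_output": True,
            "fail_on_invalid": True,
        },
        "pii_types": list(pii_types),
    }
    return json.dumps(payload)

body = build_schema_request(
    '<?xml version="1.0"?><Customer xmlns="http://example.com/customer"/>',
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>',
    ["name", "ssn"],
)
```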

Type-Aware Replacement

Our schema processor understands XSD data types and generates appropriate replacements:

XSD Type	Original	Redacted	Strategy
xs:string	John Smith	[REDACTED]	Placeholder text
xs:date	1985-03-15	1900-01-01	Valid date placeholder
xs:integer	123456789	000000000	Zero-filled
Enumeration	Male	Unknown	Neutral enum value
Pattern (SSN)	123-45-6789	XXX-XX-XXXX	Pattern-preserving mask
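
The strategies in the table can be sketched as a small Python function. This is an illustration of the idea only: the `replacement_for` helper and its type labels are hypothetical, and a real processor would read the types, enumerations, and patterns from the XSD itself.

```python
import re

def replacement_for(xsd_type, original, neutral_enum="Unknown"):
    """Return a replacement value that stays valid for the given XSD type."""
    if xsd_type == "xs:date":
        return "1900-01-01"                   # still a valid xs:date
    if xsd_type == "xs:integer":
        return re.sub(r"\d", "0", original)   # zero-filled, same length
    if xsd_type == "enumeration":
        return neutral_enum                   # must be a value the schema allows
    if xsd_type == "pattern":
        return re.sub(r"\d", "X", original)   # keep separators, mask digits
    return "[REDACTED]"                       # xs:string and anything else

masked_ssn = replacement_for("pattern", "123-45-6789")
```

The pattern-preserving mask only remains schema-valid if the pattern in the XSD accepts the mask character, so that is worth checking against the schema in use.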

XPath Targeting

XPath expressions provide surgical precision in specifying which elements and attributes to redact:

XPath Configuration

{
    "document": "<Order>...</Order>",
    "document_type": "xml",
    "xpath_targets": [
        {
            "xpath": "//Customer/Name",
            "pii_type": "name",
            "action": "redact"
        },
        {
            "xpath": "//Customer/SSN",
            "pii_type": "ssn",
            "action": "mask"
        },
        {
            "xpath": "//BillingAddress/*",
            "pii_type": "address",
            "action": "redact"
        },
        {
            "xpath": "//@email",
            "pii_type": "email",
            "action": "tokenize"
        },
        {
            "xpath": "//Notes[contains(., 'CONFIDENTIAL')]",
            "action": "remove_element"
        }
    ]
}

XPath Examples

XPath	Targets
//SSN	All SSN elements anywhere in document
/Order/Customer/Name	Specific path from root
//Customer[@type='individual']/SSN	SSN only for individual customers
//@ssn	All attributes named "ssn"
//Person[Age < 18]/Name	Names of minors only
//ns:Patient/ns:MRN	Namespaced elements
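
To try expressions like these locally, Python's standard `xml.etree.ElementTree` module is enough for simple cases. Note that it implements only a subset of XPath: paths must be relative (hence the leading `.`), and expressions such as `//@ssn` or `Age < 18` need a full XPath engine. The sample document below is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Order>"
    "<Customer type='individual'><Name>John Smith</Name><SSN>123-45-6789</SSN></Customer>"
    "<Customer type='business'><Name>Acme</Name><SSN>98-7654321</SSN></Customer>"
    "<Contact ssn='111-22-3333'/>"
    "</Order>"
)

# SSN elements of individual customers only.
for ssn in doc.findall(".//Customer[@type='individual']/SSN"):
    ssn.text = "[SSN]"

# ElementTree has no attribute axis, so select the elements carrying the attribute.
for elem in doc.findall(".//*[@ssn]"):
    elem.set("ssn", "[SSN]")

values = [e.text for e in doc.iter("SSN")]
```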

Namespace Handling

XML namespaces require careful handling to correctly target elements:

Namespace Configuration

{
    "document": "<Patient xmlns=\"urn:hl7-org:v3\" xmlns:ext=\"urn:example:extension\">...</Patient>",
    "document_type": "xml",
    "namespaces": {
        "hl7": "urn:hl7-org:v3",
        "ext": "urn:example:extension"
    },
    "xpath_targets": [
        {
            "xpath": "//hl7:Patient/hl7:name",
            "pii_type": "name"
        },
        {
            "xpath": "//ext:SSN",
            "pii_type": "ssn"
        }
    ]
}
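
The prefix mapping works the same way in most XPath libraries: the prefix used in the query is resolved through the mapping you supply, not through the prefixes in the document. A minimal Python sketch with the standard library, using a made-up Patient document:

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"hl7": "urn:hl7-org:v3", "ext": "urn:example:extension"}

doc = ET.fromstring(
    '<Patient xmlns="urn:hl7-org:v3" xmlns:ext="urn:example:extension">'
    "<name>John Smith</name>"
    "<ext:SSN>123-45-6789</ext:SSN>"
    "</Patient>"
)

# "hl7:" matches elements in the document's default namespace because the
# mapping binds that prefix to the same namespace URI.
names = doc.findall("hl7:name", NAMESPACES)
ssns = doc.findall(".//ext:SSN", NAMESPACES)

# An unqualified name does not match a namespaced element.
unqualified = doc.findall("name")
```

This is why a query such as `//Patient/name` silently matches nothing on a document that declares a default namespace.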

Namespace-Agnostic Queries

For documents where namespaces vary or are unknown, use the local-name() function:

// Match any element named "SSN" regardless of namespace
//*[local-name()='SSN']

// Match SSN in any namespace under any Patient element
//*[local-name()='Patient']/*[local-name()='SSN']

// Match any attribute named "email" regardless of element
//@*[local-name()='email']
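
The same local-name matching can be sketched in Python. The standard `xml.etree.ElementTree` module stores qualified tags as `{namespace}name`, so the hypothetical `local_name` helper below strips that prefix before comparing.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]

doc = ET.fromstring(
    '<Records xmlns:a="urn:a" xmlns:b="urn:b">'
    "<a:SSN>111-11-1111</a:SSN><b:SSN>222-22-2222</b:SSN><SSN>333-33-3333</SSN>"
    "</Records>"
)

# Matches SSN elements in either namespace and in no namespace.
matches = [e for e in doc.iter() if local_name(e.tag) == "SSN"]
for elem in matches:
    elem.text = "[SSN]"
```

On Python 3.8 and later, `doc.findall(".//{*}SSN")` is a shorter way to express the same wildcard match.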

Processing Examples

Basic XML Redaction

Input Document

<?xml version="1.0" encoding="UTF-8"?>
<Customer>
    <Name>John Smith</Name>
    <SSN>123-45-6789</SSN>
    <Email>john.smith@example.com</Email>
    <Phone>(555) 123-4567</Phone>
    <Address>
        <Street>123 Main Street</Street>
        <City>Springfield</City>
        <State>IL</State>
        <Zip>62701</Zip>
    </Address>
    <Notes>Customer mentioned SSN 987-65-4321 during call.</Notes>
</Customer>

Redacted Output

<?xml version="1.0" encoding="UTF-8"?>
<Customer>
    <Name>[NAME]</Name>
    <SSN>[SSN]</SSN>
    <Email>[EMAIL]</Email>
    <Phone>[PHONE]</Phone>
    <Address>
        <Street>[ADDRESS]</Street>
        <City>[CITY]</City>
        <State>IL</State>
        <Zip>[ZIP]</Zip>
    </Address>
    <Notes>Customer mentioned SSN [SSN] during call.</Notes>
</Customer>
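
Note that the output redacts both the dedicated SSN element and the SSN embedded in the free-text Notes element. A minimal Python sketch of that second case follows, assuming a simple regular expression for the SSN format; the `scrub_text_nodes` helper is hypothetical, and production detection would need more than one pattern.

```python
import re
import xml.etree.ElementTree as ET

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub_text_nodes(root, pattern=SSN_PATTERN, placeholder="[SSN]"):
    """Replace pattern matches in element text and tail, leaving markup untouched."""
    for elem in root.iter():
        if elem.text:
            elem.text = pattern.sub(placeholder, elem.text)
        if elem.tail:
            elem.tail = pattern.sub(placeholder, elem.tail)

doc = ET.fromstring(
    "<Customer><SSN>123-45-6789</SSN>"
    "<Notes>Customer mentioned SSN 987-65-4321 during call.</Notes></Customer>"
)
scrub_text_nodes(doc)
```

Working on parsed text nodes rather than on the serialized string means a match can never damage a tag or attribute name.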

HL7 Clinical Document

Healthcare XML Example

<ClinicalDocument xmlns="urn:hl7-org:v3">
    <recordTarget>
        <patientRole>
            <id extension="12345678" root="2.16.840.1.113883.4.1"/>
            <patient>
                <name>
                    <given>John</given>
                    <family>Smith</family>
                </name>
                <birthTime value="19850315"/>
                <administrativeGenderCode code="M"/>
            </patient>
        </patientRole>
    </recordTarget>
    <component>
        <structuredBody>
            <!-- Clinical content -->
        </structuredBody>
    </component>
</ClinicalDocument>

For HL7/CCD documents, we provide pre-configured profiles that understand healthcare-specific PII locations. The request below selects the hl7_ccd profile with the safe_harbor redaction level, which corresponds to HIPAA Safe Harbor de-identification:

{
    "document": "...",
    "document_type": "xml",
    "profile": "hl7_ccd",
    "redaction_level": "safe_harbor"
}
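
To give a feel for what a healthcare profile has to locate, here is a rough Python sketch that touches three of the PII locations in the clinical document above. It is illustrative only and is not the behavior of the hl7_ccd profile or a complete Safe Harbor implementation; the `deidentify_patient` helper and the placeholder values are assumptions.

```python
import xml.etree.ElementTree as ET

HL7 = {"hl7": "urn:hl7-org:v3"}

def deidentify_patient(root):
    """Illustrative only: blank name parts, keep birth year, mask the patient id."""
    for part in root.findall(".//hl7:patient/hl7:name/*", HL7):
        part.text = "[NAME]"
    for birth in root.findall(".//hl7:patient/hl7:birthTime", HL7):
        birth.set("value", birth.get("value", "")[:4])  # keep the year only
    for ident in root.findall(".//hl7:patientRole/hl7:id", HL7):
        ident.set("extension", "[ID]")

doc = ET.fromstring(
    '<ClinicalDocument xmlns="urn:hl7-org:v3"><recordTarget><patientRole>'
    '<id extension="12345678" root="2.16.840.1.113883.4.1"/>'
    "<patient><name><given>John</given><family>Smith</family></name>"
    '<birthTime value="19850315"/><administrativeGenderCode code="M"/></patient>'
    "</patientRole></recordTarget></ClinicalDocument>"
)
deidentify_patient(doc)
```

Much of the PII here lives in attributes (`extension`, `value`) rather than in element text, which is why text-only scanning is not sufficient for this format.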

Streaming Large Files

For files too large for memory-based processing, use streaming mode:

Streaming Configuration

# Using curl with streaming
curl -X POST https://api.redactionapi.com/v1/redact/stream \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/xml" \
    -H "X-Document-Type: xml" \
    -H "X-Streaming: true" \
    -H "X-PII-Types: name,ssn,email,phone,address" \
    --data-binary @large_file.xml \
    -o redacted_file.xml

Streaming Best Practices

Specify Element Boundaries: Tell us which elements contain independent records (e.g., each <Customer> is complete) for optimal memory usage.
Use XPath Targeting: XPath targets are evaluated incrementally during streaming, avoiding full document scans.
Consider Chunked Processing: For very large files, split by record and process in parallel, then reassemble.
Monitor Progress: Use our progress callback webhooks for long-running stream operations.
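
The element-boundary idea can be sketched locally with Python's incremental parser. This is a simplified illustration rather than the service's streaming implementation; the `stream_redact` helper, the record tag, and the field tag are assumptions.

```python
import io
import xml.etree.ElementTree as ET

def stream_redact(source, record_tag="Customer", field_tag="SSN"):
    """Yield each record as a redacted XML string as soon as it has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            for field in elem.iter(field_tag):
                field.text = "[SSN]"
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # release the record's children and text

data = io.BytesIO(
    b"<Customers>"
    b"<Customer><SSN>123-45-6789</SSN></Customer>"
    b"<Customer><SSN>987-65-4321</SSN></Customer>"
    b"</Customers>"
)
records = list(stream_redact(data))
```

Because each record is emitted and cleared when its closing tag arrives, the content held in memory stays close to the size of one record rather than the whole file.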

XSLT Integration

Incorporate redaction into XSLT transformation pipelines:

XSLT Extension Function

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:redact="http://redactionapi.com/xslt">

    <xsl:template match="Customer/Name">
        <Name><xsl:value-of select="redact:redact(., 'name')"/></Name>
    </xsl:template>

    <xsl:template match="Customer/SSN">
        <SSN><xsl:value-of select="redact:mask(., 'ssn')"/></SSN>
    </xsl:template>

    <xsl:template match="@email">
        <xsl:attribute name="email">
            <xsl:value-of select="redact:tokenize(., 'email')"/>
        </xsl:attribute>
    </xsl:template>

    <!-- Identity transform for everything else -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Industry-Specific XML Standards

We provide pre-built profiles for common XML standards:

Healthcare

HL7 v3/CDA: Clinical documents
CCD: Continuity of Care Documents
FHIR XML: Modern healthcare interchange
DICOM SR: Imaging reports

Financial

SWIFT/ISO 20022: Payment messages
FpML: Derivatives trading
XBRL: Financial reporting
FIX/FIXML: Trading messages

Government

NIEM: National Information Exchange
GJXDM: Justice XML
HR-XML: Human resources
UBL: Universal Business Language

E-Commerce

ebXML: Business processes
cXML: Commerce XML
OAGIS: Open Applications Group
RosettaNet: Supply chain

SDK Examples

Python SDK

from redactionapi import RedactionClient

client = RedactionClient(api_key="your_api_key")

# Read XML file
with open("customers.xml", "r", encoding="utf-8") as f:
    xml_content = f.read()

# Redact with XPath targeting
result = client.redact_xml(
    document=xml_content,
    xpath_targets=[
        {"xpath": "//Customer/SSN", "pii_type": "ssn"},
        {"xpath": "//Customer/Name", "pii_type": "name"},
        {"xpath": "//@email", "pii_type": "email"}
    ],
    preserve_formatting=True
)

# Save redacted output
with open("customers_redacted.xml", "w", encoding="utf-8") as f:
    f.write(result.redacted_document)

Start Processing XML Documents

RedactionAPI provides enterprise-grade XML processing with schema validation, namespace support, and streaming capabilities. Protect PII in your XML data while maintaining structural integrity.

