HTML Redaction API | Web Content PII Protection

Structure-Preserving HTML Document Redaction

HTML documents present unique redaction challenges: sensitive information can appear in text content, attributes, embedded scripts, and structured data. Simple find-and-replace approaches can break document structure, invalidate markup, and introduce security vulnerabilities. RedactionAPI provides DOM-aware processing that maintains valid HTML while protecting sensitive data in all of these locations.

Whether you're processing web page exports, HTML email archives, customer portal content, or web-scraped data, our API understands HTML's nested structure and context-dependent encoding requirements. Redacted documents remain fully functional—links work, styles apply correctly, and the content renders properly across browsers while sensitive information is protected.

HTML PII Location Categories

Our processor detects sensitive data across all HTML locations:

Text Content Locations

• Paragraph and heading text
• List items and table cells
• Form labels and help text
• Link anchor text
• Figure captions
• Comment nodes

Attribute Locations

• href and src URLs
• data-* custom attributes
• alt, title, and aria-* text
• Form input values/placeholders
• Meta tag content
• Microdata and RDFa attributes

Embedded Data

• Inline JavaScript objects
• JSON-LD structured data
• CSS content properties
• SVG text elements
• Data URIs
• Server-side include markers

Email-Specific Locations

• Preheader/preview text
• Encoded data in email headers
• Tracking pixel URLs
• MSO conditional content
• Forwarded message headers
• Signature blocks

DOM-Aware Redaction Processing

Our HTML processor builds a complete Document Object Model representation before making any modifications. This ensures structural integrity throughout the redaction process—tag pairs remain matched, attributes stay properly quoted, and character entities are correctly encoded for their context.

Processing Pipeline

1. Parse: Build DOM tree with error recovery for malformed HTML
2. Detect: Scan text nodes and attributes for PII patterns
3. Redact: Apply context-appropriate redaction to each location
4. Serialize: Output valid, properly escaped HTML document
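
The four stages above can be sketched with Python's standard library. This is an illustrative sketch only, not RedactionAPI's implementation: the class and function names are hypothetical, a single email regex stands in for real PII detection, and `html.parser` is a tolerant tokenizer rather than a full HTML5 tree builder, so it does not repair improper nesting.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical detector: one email pattern stands in for real PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
RAW_TEXT_TAGS = {"script", "style"}  # their content must not be entity-escaped


class RedactingParser(HTMLParser):
    """Parse, detect, redact, and re-serialize in a single pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def _redact(self, text):
        return EMAIL.sub("[EMAIL_REDACTED]", text)

    def _open_tag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # boolean attribute such as "disabled"
                parts.append(name)
            else:  # values arrive unescaped, so re-quote and re-escape them
                parts.append(f'{name}="{escape(self._redact(value), quote=True)}"')
        return "<" + " ".join(parts) + ">"

    def handle_starttag(self, tag, attrs):
        if tag in RAW_TEXT_TAGS:
            self.raw_depth += 1
        self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit the tag once, no end tag.
        self.out.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        if tag in RAW_TEXT_TAGS and self.raw_depth:
            self.raw_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = self._redact(data)
        self.out.append(text if self.raw_depth else escape(text, quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{self._redact(data)}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def redact_html(source):
    parser = RedactingParser()
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Calling `redact_html('<a href="mailto:jane@example.com">jane@example.com</a>')` redacts both the attribute value and the anchor text while keeping the `mailto:` scheme, because redaction runs on parsed values rather than on the raw markup.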

Context-Aware Encoding

Different HTML contexts require different character encoding. Our processor automatically applies the correct encoding based on where redacted content appears:

Context-Specific Encoding Examples


<a href="mailto:[email protected]">Contact John Smith</a>
<img src="/photos/john-smith.jpg" alt="John Smith, CEO">
<script>
  var userData = { name: "John Smith", email: "[email protected]" };
</script>


<a href="mailto:[EMAIL_REDACTED]">Contact [NAME_REDACTED]</a>
<img src="/photos/[REDACTED].jpg" alt="[NAME_REDACTED], CEO">
<script>
  var userData = { name: "[NAME_REDACTED]", email: "[EMAIL_REDACTED]" };
</script>

// Encoding applied:
// - Text content: HTML entities (&amp;, &lt;, &gt;, etc.)
// - URLs: Percent encoding (%40 for @, %20 for space)
// - JavaScript: Unicode escapes (\u0040) or JSON escaping
// - Attribute values: HTML attribute encoding
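
To make the four contexts concrete, here is a minimal Python sketch of choosing an encoder by context. The function name and the context labels are assumptions made for illustration; only standard-library encoders are used, and a production encoder would cover more contexts (CSS values, for example).

```python
import json
from html import escape
from urllib.parse import quote


def encode_for_context(value, context):
    """Encode a replacement string for the place it will be written to."""
    if context == "text":
        # Text nodes: escape &, < and >
        return escape(value, quote=False)
    if context == "attribute":
        # Quoted attribute values: also escape " and '
        return escape(value, quote=True)
    if context == "url":
        # A single URL component: percent-encode everything reserved
        return quote(value, safe="")
    if context == "js_string":
        # JSON string syntax is valid JavaScript; "<\/" keeps a literal
        # "</script>" in the value from closing the script element.
        return json.dumps(value).replace("</", "<\\/")
    raise ValueError(f"unknown context: {context}")
```

For example, `encode_for_context("john smith@example.com", "url")` yields `john%20smith%40example.com`, matching the percent-encoding note above.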

Malformed HTML Recovery

Real-world HTML is often malformed—unclosed tags, improper nesting, missing quotes. Our parser implements HTML5's error recovery algorithms to build a valid DOM even from broken input, then produces well-formed output. The redacted document is often structurally cleaner than the original while maintaining visual fidelity.

HTML Attribute Redaction

HTML attributes frequently contain sensitive information that simple text scanning misses. Email addresses in href="mailto:" links, names in image alt text, user data in custom data-* attributes, and PII in URL parameters all require detection and appropriate handling. RedactionAPI scans all attributes with configurable rules for different attribute types.

URL Attributes

Special handling for URL-containing attributes:

• href="mailto:[email protected]" → Redact email, preserve mailto: protocol
• src="/users/john-smith/profile.jpg" → Detect names in path segments
• href="?user=jsmith&email=..." → Parse and redact query parameters
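
The query-parameter case can be illustrated with Python's `urllib.parse`. This is a sketch under assumptions: the parameter names in `SENSITIVE_PARAMS` and the `[REDACTED]` marker are placeholders, and matching is by parameter name only, not by value.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameter names treated as sensitive.
SENSITIVE_PARAMS = {"user", "email", "uid"}


def redact_query(url, sensitive=SENSITIVE_PARAMS, marker="[REDACTED]"):
    """Replace the values of sensitive query parameters, keep everything else."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    redacted = [(k, marker if k.lower() in sensitive else v) for k, v in pairs]
    # urlencode percent-encodes the marker, so the result is still a valid URL.
    return urlunsplit(parts._replace(query=urlencode(redacted)))
```

The returned URL still has to be attribute-encoded (`&` written as `&amp;`) when it is placed back into an href or src attribute.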

Data Attributes

Processing custom data-* attributes:

• data-user-name="John Smith" → Detect by attribute name pattern
• data-config='{"email":"..."}' → Parse embedded JSON, redact values
• data-analytics-user-id="12345" → Configurable ID redaction rules
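
For the embedded-JSON case, one possible approach is to try parsing the attribute value as JSON and redact by key name. The key-to-marker table and the function name below are assumptions for illustration; nested objects and value-based detection are left out to keep the sketch short.

```python
import json

# Hypothetical mapping from JSON key names to redaction markers.
KEY_MARKERS = {
    "email": "[EMAIL_REDACTED]",
    "name": "[NAME_REDACTED]",
    "phone": "[PHONE_REDACTED]",
}


def redact_data_attribute(raw):
    """Redact known keys when the attribute value is a JSON object."""
    try:
        data = json.loads(raw)
    except ValueError:
        return raw  # not JSON: leave it for plain-text detection
    if not isinstance(data, dict):
        return raw
    for key, value in data.items():
        marker = KEY_MARKERS.get(key.lower())
        if marker and isinstance(value, str):
            data[key] = marker
    # Re-serialize so the attribute still holds valid JSON.
    return json.dumps(data, separators=(",", ":"))
```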

Form Field Processing

HTML forms often contain pre-filled PII in value attributes, placeholder text, and autocomplete hints:

Form Field Redaction

// Original form with PII
<form>
  <input type="text" name="fullName" value="John Smith"
         placeholder="Enter your name">
  <input type="email" name="email" value="[email protected]"
         autocomplete="email">
  <input type="tel" name="phone" value="555-123-4567">
  <input type="hidden" name="user_ssn" value="123-45-6789">
  <textarea name="address">123 Main St, Anytown USA</textarea>
</form>

// Redacted form
<form>
  <input type="text" name="fullName" value="[NAME_REDACTED]"
         placeholder="Enter your name">
  <input type="email" name="email" value="[EMAIL_REDACTED]"
         autocomplete="email">
  <input type="tel" name="phone" value="[PHONE_REDACTED]">
  <input type="hidden" name="user_ssn" value="[SSN_REDACTED]">
  <textarea name="address">[ADDRESS_REDACTED]</textarea>
</form>

// Configuration options:
{
  "form_fields": {
    "redact_values": true,        // Clear pre-filled data
    "preserve_placeholders": true, // Keep instructional text
    "redact_hidden_fields": true,  // Process type="hidden"
    "clear_autocomplete": false    // Optionally remove autocomplete hints
  }
}
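
How these options might interact for a single input element is sketched below in Python. The option semantics are inferred from the comments in the configuration above and are assumptions, as is the simplification that every pre-filled value is replaced with a generic marker instead of a type-specific one.

```python
def redact_input(attrs, config, marker="[REDACTED]"):
    """Apply form_fields options to one <input>'s attributes (dict in, dict out)."""
    out = dict(attrs)
    is_hidden = (out.get("type") or "text").lower() == "hidden"

    if config.get("redact_values") and "value" in out:
        # Hidden fields are only touched when redact_hidden_fields is on.
        if not is_hidden or config.get("redact_hidden_fields"):
            out["value"] = marker

    if not config.get("preserve_placeholders") and "placeholder" in out:
        out["placeholder"] = marker

    if config.get("clear_autocomplete"):
        out.pop("autocomplete", None)

    return out
```

With the configuration shown above, an input such as `type="hidden" name="user_ssn"` has its value replaced, while placeholder text and autocomplete hints are kept.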

Embedded Script and Structured Data Processing

Modern HTML documents embed significant data in script blocks—JavaScript configuration objects, JSON-LD for SEO, analytics initialization, and application state. This embedded data frequently contains PII that text-based redaction misses. RedactionAPI parses and processes these embedded formats while maintaining valid syntax.

JSON-LD Structured Data

Schema.org structured data often contains personal information in author, contact, and organization schemas:

// Original JSON-LD with author PII
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Understanding Data Privacy",
  "author": {
    "@type": "Person",
    "name": "Dr. Jane Smith",
    "email": "[email protected]",
    "telephone": "+1-555-123-4567"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Privacy Institute",
    "contactPoint": {
      "telephone": "+1-555-987-6543",
      "email": "[email protected]"
    }
  }
}
</script>

// Redacted JSON-LD preserving schema structure
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Understanding Data Privacy",
  "author": {
    "@type": "Person",
    "name": "[NAME_REDACTED]",
    "email": "[EMAIL_REDACTED]",
    "telephone": "[PHONE_REDACTED]"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Privacy Institute",
    "contactPoint": {
      "telephone": "[PHONE_REDACTED]",
      "email": "[EMAIL_REDACTED]"
    }
  }
}
</script>
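
A schema-aware walk over the parsed JSON-LD is one way to get the result shown above, where the Person name is redacted but the Organization name is kept. The following Python sketch is an assumption about how such a walk could look, not the API's actual rule set; the function names and the small field table are hypothetical.

```python
# Hypothetical field table: contact fields are redacted wherever they appear.
CONTACT_MARKERS = {"email": "[EMAIL_REDACTED]", "telephone": "[PHONE_REDACTED]"}


def _types(node):
    """Return the node's @type as a list (JSON-LD allows a string or a list)."""
    value = node.get("@type", [])
    return value if isinstance(value, list) else [value]


def redact_jsonld(node):
    """Return a redacted copy of a parsed JSON-LD value; the input is not modified."""
    if isinstance(node, list):
        return [redact_jsonld(item) for item in node]
    if not isinstance(node, dict):
        return node
    is_person = "Person" in _types(node)
    result = {}
    for key, value in node.items():
        if key in CONTACT_MARKERS and isinstance(value, str):
            result[key] = CONTACT_MARKERS[key]
        elif key == "name" and is_person and isinstance(value, str):
            result[key] = "[NAME_REDACTED]"  # only names of Person nodes
        else:
            result[key] = redact_jsonld(value)
    return result
```

The script body would be parsed with `json.loads`, passed through `redact_jsonld`, and written back with `json.dumps`, so the output stays valid JSON and keeps its schema structure.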

Inline JavaScript Processing

JavaScript blocks may contain user data in variable assignments, configuration objects, or template literals:

Detection Targets

• Variable assignments with PII values
• Object literals with user data
• Array elements containing PII
• Template literal interpolations
• Function call arguments

Preservation Requirements

• Valid JavaScript syntax
• Proper string escaping
• Object/array structure
• Variable references
• Function call semantics

CSS Content Processing

CSS content properties and generated content can contain PII, particularly in ::before and ::after pseudo-elements used for decorative text or accessibility labels. Our processor scans inline styles and embedded stylesheets for content declarations containing sensitive data.
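
As a rough illustration of scanning content declarations, the Python sketch below finds quoted `content:` strings with a regular expression and redacts emails inside them. It is a simplification under stated assumptions: a real implementation would use a CSS parser, and this pattern does not handle escaped quotes inside the string. The names are hypothetical.

```python
import re

# Hypothetical detector: one email pattern stands in for real PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Matches content: "..." or content: '...' (no escaped-quote support).
CONTENT_DECL = re.compile(r"""(content\s*:\s*)(["'])(.*?)\2""", re.IGNORECASE)


def redact_css_content(css):
    """Redact emails inside quoted CSS content strings, leave other CSS as-is."""

    def fix(match):
        prefix, quote_char, text = match.groups()
        cleaned = EMAIL.sub("[EMAIL_REDACTED]", text)
        return prefix + quote_char + cleaned + quote_char

    return CONTENT_DECL.sub(fix, css)
```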

HTML Email Processing

HTML emails have unique constraints that our processor handles specially. Inline CSS is required for compatibility, table-based layouts remain common, and various email clients have different rendering quirks. RedactionAPI processes HTML emails while maintaining their cross-client rendering fidelity.

Email-Specific Features

Preheader text: Hidden preview text containing recipient names
Inline styles: Preserved during content redaction
Table layouts: Structure maintained for Outlook compatibility
MSO conditionals: Office-specific code processed correctly
Tracking pixels: URL parameters with user IDs redacted

Client Compatibility

Gmail: Inline CSS preserved, data-* attributes maintained
Outlook: VML/Word ML conditionals processed
Apple Mail: WebKit-specific styles preserved
Mobile clients: Responsive breakpoints maintained
Plain text: Option to generate redacted text version

Email HTML Processing Example

// Process HTML email with email-specific rules
POST /api/v1/html/redact
{
  "content": "<!DOCTYPE html><html>...email HTML...</html>",
  "document_type": "html_email",
  "options": {
    "preserve_inline_styles": true,
    "process_preheader": true,
    "redact_tracking_urls": {
      "enabled": true,
      "preserve_base_url": true,
      "redact_parameters": ["uid", "email", "subscriber_id"]
    },
    "process_mso_conditionals": true,
    "output_format": "multipart"  // Include text/plain version
  }
}

// Response includes both HTML and plain text versions
{
  "html_content": "<!DOCTYPE html>...redacted email...</html>",
  "text_content": "...redacted plain text version...",
  "redactions_applied": 23,
  "locations": {
    "body_text": 12,
    "preheader": 1,
    "tracking_urls": 8,
    "alt_attributes": 2
  }
}

Security and XSS Prevention

Processing untrusted HTML requires careful attention to security. Our redaction process not only protects PII but also ensures the output is safe to render—properly escaped to prevent cross-site scripting attacks. Whether the original content was benign or contained malicious payloads, the redacted output is safe for browser display.

XSS Prevention Measures

Input Sanitization

• Strip javascript: and data: URIs from links
• Remove event handler attributes (onclick, onerror)
• Sanitize SVG content for script injection
• Normalize Unicode to prevent homograph attacks

Output Encoding

• Context-appropriate entity encoding
• URL parameter encoding for href/src
• JavaScript string escaping in scripts
• CSS value encoding in style attributes

Content Security Policy Compatibility

Our HTML output is compatible with strict Content Security Policy headers. Redaction markers use plain text that doesn't require unsafe-inline for scripts or styles. Organizations can serve redacted documents with restrictive CSP headers while maintaining full functionality.

Security Options

// Security-focused HTML redaction configuration
{
  "security": {
    "sanitize_output": true,
    "strip_scripts": false,           // Keep scripts, sanitize content
    "strip_event_handlers": true,     // Remove onclick, onerror, etc.
    "strip_javascript_urls": true,    // Remove javascript: hrefs
    "strip_data_urls": false,         // Keep data: for images (optional)
    "encode_entities": true,          // Always encode special chars
    "normalize_unicode": true,        // Prevent homograph attacks
    "validate_urls": true,            // Check URL format validity
    "csp_compatible": true            // Ensure CSP-friendly output
  }
}
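
A sketch of what strip_event_handlers, strip_javascript_urls, and strip_data_urls could do at the attribute level is shown below in Python. The function name, the set of URL-bearing attributes, and the exact matching rules are assumptions for illustration; a production sanitizer would follow the full URL parsing rules.

```python
import re

# Hypothetical set of attributes whose values are treated as URLs.
URL_ATTRS = {"href", "src", "action", "formaction"}


def sanitize_attrs(attrs, strip_data_urls=False):
    """Drop event handlers and script-bearing URLs from (name, value) pairs."""
    clean = []
    for name, value in attrs:
        lname = name.lower()
        if lname.startswith("on"):
            continue  # onclick, onerror, ... (conservative: any on* attribute)
        if lname in URL_ATTRS and value is not None:
            # Remove whitespace and control characters before checking the
            # scheme, since obfuscated schemes may embed tabs or newlines.
            compact = re.sub(r"[\x00-\x20]+", "", value).lower()
            if compact.startswith("javascript:"):
                continue
            if strip_data_urls and compact.startswith("data:"):
                continue
        clean.append((name, value))
    return clean
```

The attribute pairs have the same shape as those produced by Python's `html.parser`, so a check like this could run between parsing and serialization.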

HTML Redaction API Usage

Our API accepts HTML content and returns redacted HTML with comprehensive metadata about detected and redacted PII. You can customize redaction behavior for different element types, attribute categories, and embedded data formats.

Complete API Example

// POST /api/v1/document/html
curl -X POST https://api.redactionapi.com/v1/document/html \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "<!DOCTYPE html><html><head><title>User Profile</title></head><body><h1>Welcome, John Smith</h1><p>Email: [email protected]</p><p>Phone: 555-123-4567</p><a href=\"mailto:[email protected]\">Contact</a><img src=\"/avatars/jsmith.jpg\" alt=\"John Smith profile photo\"><script>var user = {name: \"John Smith\", id: 12345};</script></body></html>",
    "options": {
      "detect_in_text": true,
      "detect_in_attributes": true,
      "detect_in_scripts": true,
      "redaction_style": "placeholder",
      "preserve_structure": true
    }
  }'

// Response
{
  "success": true,
  "redacted_content": "<!DOCTYPE html><html><head><title>User Profile</title></head><body><h1>Welcome, [NAME_REDACTED]</h1><p>Email: [EMAIL_REDACTED]</p><p>Phone: [PHONE_REDACTED]</p><a href=\"mailto:[EMAIL_REDACTED]\">Contact</a><img src=\"/avatars/[REDACTED].jpg\" alt=\"[NAME_REDACTED] profile photo\"><script>var user = {name: \"[NAME_REDACTED]\", id: 12345};</script></body></html>",
  "statistics": {
    "total_redactions": 7,
    "by_type": {
      "name": 3,
      "email": 2,
      "phone": 1,
      "path_segment": 1
    },
    "by_location": {
      "text_content": 3,
      "href_attribute": 1,
      "alt_attribute": 1,
      "src_attribute": 1,
      "script_content": 1
    }
  },
  "valid_html": true,
  "processing_time_ms": 45
}
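
On the client side, a quick consistency check on the returned statistics can catch integration mistakes early. The Python sketch below assumes only the response shape shown above; the function name is hypothetical, and whether the API guarantees that the per-type and per-location counts add up to the total is an assumption.

```python
def check_statistics(response):
    """Return a list of problems found in the response's redaction counts."""
    stats = response["statistics"]
    total = stats["total_redactions"]
    problems = []
    for field in ("by_type", "by_location"):
        subtotal = sum(stats[field].values())
        if subtotal != total:
            problems.append(f"{field} sums to {subtotal}, expected {total}")
    return problems
```

An empty list means both breakdowns agree with total_redactions; anything else is worth logging before the redacted HTML is stored or served.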

Protect Your Web Content

Start redacting sensitive data from HTML documents with structure-preserving, XSS-safe processing. Web pages, emails, and exported content protected.

