Structure-Preserving HTML Document Redaction
HTML documents present distinct redaction challenges: sensitive information can appear in text content, attributes, embedded scripts, and structured data. Simple find-and-replace approaches can break document structure, invalidate markup, and introduce security vulnerabilities. RedactionAPI provides DOM-aware processing that keeps the HTML valid while protecting sensitive data wherever it appears.
Whether you're processing web page exports, HTML email archives, customer portal content, or web-scraped data, our API understands HTML's nested structure and context-dependent encoding requirements. Redacted documents remain fully functional—links work, styles apply correctly, and the content renders properly across browsers while sensitive information is protected.
HTML PII Location Categories
Our processor detects sensitive data across all HTML locations:
Text Content Locations
- Paragraph and heading text
- List items and table cells
- Form labels and help text
- Link anchor text
- Figure captions
- Comment nodes
Attribute Locations
- href and src URLs
- data-* custom attributes
- alt, title, and aria-* text
- Form input values/placeholders
- Meta tag content
- Microdata and RDFa attributes
Embedded Data
- Inline JavaScript objects
- JSON-LD structured data
- CSS content properties
- SVG text elements
- Data URIs
- Server-side include markers
Email-Specific Locations
- Preheader/preview text
- Email header encoded data
- Tracking pixel URLs
- MSO conditional content
- Forwarded message headers
- Signature blocks
DOM-Aware Redaction Processing
Our HTML processor builds a complete Document Object Model representation before making any modifications. This ensures structural integrity throughout the redaction process—tag pairs remain matched, attributes stay properly quoted, and character entities are correctly encoded for their context.
Processing Pipeline
- Parse: Build DOM tree with error recovery for malformed HTML
- Detect: Scan text nodes and attributes for PII patterns
- Redact: Apply context-appropriate redaction to each location
- Serialize: Output valid, properly escaped HTML document
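The four stages can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not RedactionAPI's implementation: the `RedactingParser` class, the `redact_html` helper, and the single email pattern are all assumptions made for the example, and `html.parser` is an event-based tokenizer rather than a full HTML5 tree builder.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately narrow pattern; real PII detection is far broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
RAW_TEXT_TAGS = {"script", "style"}  # content here must not be entity-escaped


class RedactingParser(HTMLParser):
    """Parse -> detect -> redact -> serialize, one parser event at a time."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.in_raw_text = False

    @staticmethod
    def _redact(value):
        return EMAIL.sub("[EMAIL_REDACTED]", value)

    def _emit_tag(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # boolean attribute such as "disabled"
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(self._redact(value), quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self.in_raw_text = tag in RAW_TEXT_TAGS
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.in_raw_text = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = self._redact(data)
        # Text nodes are re-escaped; script/style content is passed through raw.
        self.out.append(text if self.in_raw_text else escape(text, quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{self._redact(data)}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def redact_html(source):
    parser = RedactingParser()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling `redact_html('<p title="a@b.com">Mail a@b.com</p>')` redacts the address in both the attribute and the text node while re-escaping each for its own context.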
Context-Aware Encoding
Different HTML contexts require different character encoding. Our processor automatically applies the correct encoding based on where redacted content appears:
Context-Specific Encoding Examples
// Original
<a href="mailto:[email protected]">Contact John Smith</a>
<img src="/photos/john-smith.jpg" alt="John Smith, CEO">
<script>
var userData = { name: "John Smith", email: "[email protected]" };
</script>
// Redacted
<a href="mailto:[EMAIL_REDACTED]">Contact [NAME_REDACTED]</a>
<img src="/photos/[REDACTED].jpg" alt="[NAME_REDACTED], CEO">
<script>
var userData = { name: "[NAME_REDACTED]", email: "[EMAIL_REDACTED]" };
</script>
// Encoding applied:
// - Text content: HTML entities (&amp;, &lt;, &gt;, etc.)
// - URLs: Percent encoding (%40 for @, %20 for space)
// - JavaScript: Unicode escapes (\u0040) or JSON escaping
// - Attribute values: HTML attribute encoding
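As a rough illustration of the four encodings listed above, the following Python sketch maps each context to a standard-library encoder. The `encode_for_context` function and its context names are hypothetical and chosen only for this example; they are not part of the RedactionAPI interface.

```python
import json
from html import escape
from urllib.parse import quote


def encode_for_context(value, context):
    """Encode a replacement string for the HTML context it will be written into."""
    if context == "text":
        # Text nodes: escape &, < and >.
        return escape(value, quote=False)
    if context == "attribute":
        # Quoted attribute values: also escape double and single quotes.
        return escape(value, quote=True)
    if context == "url":
        # URL components: percent-encode everything outside the unreserved set.
        return quote(value, safe="")
    if context == "javascript":
        # A JSON string literal is a valid JS string literal. "<" is escaped
        # as well so the value can never spell "</script>" inside a script block.
        return json.dumps(value).replace("<", "\\u003c")
    raise ValueError(f"unknown context: {context}")
```

The same input therefore yields different output per context: `"a&b"` becomes `a&amp;b` in text, but `a%26b` inside a URL component.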
Malformed HTML Recovery
Real-world HTML is often malformed—unclosed tags, improper nesting, missing quotes. Our parser implements HTML5's error recovery algorithms to build a valid DOM even from broken input, then produces well-formed output. The redacted document is often structurally cleaner than the original while maintaining visual fidelity.
HTML Attribute Redaction
HTML attributes frequently contain sensitive information that simple text scanning misses. Email addresses in href="mailto:" links, names in image alt text, user data in custom data-* attributes, and PII in URL parameters all require detection and appropriate handling. RedactionAPI scans all attributes with configurable rules for different attribute types.
URL Attributes
Special handling for URL-containing attributes:
href="mailto:[email protected]"
→ Redact email, preserve mailto: protocol
src="/users/john-smith/profile.jpg"
→ Detect names in path segments
href="?user=jsmith&email=..."
→ Parse and redact query parameters
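The mailto and query-parameter cases can be sketched with `urllib.parse`. This is an illustrative approximation only: the `redact_url` helper and the `SENSITIVE_PARAMS` set are assumptions, and name detection in path segments is left out because it needs a real PII detector.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameter names treated as sensitive.
SENSITIVE_PARAMS = {"user", "email"}


def redact_url(url):
    """Redact the address of a mailto: link or the values of sensitive query parameters."""
    parts = urlsplit(url)
    if parts.scheme == "mailto":
        # Keep the protocol (and any ?subject=... query), replace the address.
        suffix = "?" + parts.query if parts.query else ""
        return "mailto:[EMAIL_REDACTED]" + suffix
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    redacted = [
        (key, "[REDACTED]" if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in pairs
    ]
    # urlencode percent-encodes the placeholder, so the result stays a valid URL.
    return urlunsplit(parts._replace(query=urlencode(redacted)))
```

Note that the placeholder's brackets come out percent-encoded (`%5BREDACTED%5D`) in query strings, which is the URL-context encoding described earlier.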
Data Attributes
Processing custom data-* attributes:
data-user-name="John Smith"
→ Detect by attribute name pattern
data-config='{"email":"..."}'
→ Parse embedded JSON, redact values
data-analytics-user-id="12345"
→ Configurable ID redaction rules
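A simplified sketch of the first two data-attribute rules follows: try the value as embedded JSON first, then fall back to the attribute name. The `redact_data_attribute` function, the key table, and the `data-user-` prefix rule are hypothetical stand-ins for whatever rules a deployment configures.

```python
import json

# Hypothetical mapping of JSON keys to placeholders.
SENSITIVE_KEYS = {"email": "[EMAIL_REDACTED]", "name": "[NAME_REDACTED]"}
# Hypothetical attribute-name prefix that marks a value as user data.
SENSITIVE_PREFIX = "data-user-"


def redact_data_attribute(name, value):
    """Return the redacted value for one data-* attribute."""
    try:
        parsed = json.loads(value)
    except ValueError:
        parsed = None
    if isinstance(parsed, dict):
        # Embedded JSON object: redact sensitive keys, keep the rest intact.
        for key in parsed:
            if key.lower() in SENSITIVE_KEYS:
                parsed[key] = SENSITIVE_KEYS[key.lower()]
        return json.dumps(parsed)
    if name.lower().startswith(SENSITIVE_PREFIX):
        # Plain value detected by attribute name pattern.
        return "[REDACTED]"
    return value
```

The returned string still needs attribute-context escaping before it is written back into the document, since serialized JSON contains double quotes.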
Form Field Processing
HTML forms often contain pre-filled PII in value attributes, placeholder text, and autocomplete hints:
Form Field Redaction
// Original form with PII
<form>
<input type="text" name="fullName" value="John Smith"
placeholder="Enter your name">
<input type="email" name="email" value="[email protected]"
autocomplete="email">
<input type="tel" name="phone" value="555-123-4567">
<input type="hidden" name="user_ssn" value="123-45-6789">
<textarea name="address">123 Main St, Anytown USA</textarea>
</form>
// Redacted form
<form>
<input type="text" name="fullName" value="[NAME_REDACTED]"
placeholder="Enter your name">
<input type="email" name="email" value="[EMAIL_REDACTED]"
autocomplete="email">
<input type="tel" name="phone" value="[PHONE_REDACTED]">
<input type="hidden" name="user_ssn" value="[SSN_REDACTED]">
<textarea name="address">[ADDRESS_REDACTED]</textarea>
</form>
// Configuration options:
{
"form_fields": {
"redact_values": true, // Clear pre-filled data
"preserve_placeholders": true, // Keep instructional text
"redact_hidden_fields": true, // Process type="hidden"
"clear_autocomplete": false // Optionally remove autocomplete hints
}
}
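To make the interaction between these options concrete, here is a small Python sketch that applies a `form_fields` configuration to the attributes of a single `<input>`. The `redact_input` function is hypothetical, it chooses placeholders by input type only, and treating `preserve_placeholders: false` as "drop the attribute" is an assumption rather than documented behavior.

```python
# Hypothetical placeholder choice based on the input type alone.
PLACEHOLDERS = {"email": "[EMAIL_REDACTED]", "tel": "[PHONE_REDACTED]"}


def redact_input(attrs, config):
    """Apply form_fields options to one input's attributes; returns a new dict."""
    result = dict(attrs)
    input_type = result.get("type", "text").lower()
    is_hidden = input_type == "hidden"

    if "value" in result and config.get("redact_values", True):
        # Hidden fields are only touched when redact_hidden_fields is on.
        if not is_hidden or config.get("redact_hidden_fields", True):
            result["value"] = PLACEHOLDERS.get(input_type, "[REDACTED]")

    if not config.get("preserve_placeholders", True):
        # Assumption: non-preserved placeholders are simply removed.
        result.pop("placeholder", None)

    if config.get("clear_autocomplete", False):
        result.pop("autocomplete", None)

    return result
```

With the configuration shown above, an email input keeps its `autocomplete` hint and placeholder while its `value` is replaced.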
Embedded Script and Structured Data Processing
Modern HTML documents embed significant data in script blocks—JavaScript configuration objects, JSON-LD for SEO, analytics initialization, and application state. This embedded data frequently contains PII that text-based redaction misses. RedactionAPI parses and processes these embedded formats while maintaining valid syntax.
JSON-LD Structured Data
Schema.org structured data often contains personal information in author, contact, and organization schemas:
// Original JSON-LD with author PII
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Understanding Data Privacy",
"author": {
"@type": "Person",
"name": "Dr. Jane Smith",
"email": "[email protected]",
"telephone": "+1-555-123-4567"
},
"publisher": {
"@type": "Organization",
"name": "Privacy Institute",
"contactPoint": {
"telephone": "+1-555-987-6543",
"email": "[email protected]"
}
}
}
</script>
// Redacted JSON-LD preserving schema structure
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Understanding Data Privacy",
"author": {
"@type": "Person",
"name": "[NAME_REDACTED]",
"email": "[EMAIL_REDACTED]",
"telephone": "[PHONE_REDACTED]"
},
"publisher": {
"@type": "Organization",
"name": "Privacy Institute",
"contactPoint": {
"telephone": "[PHONE_REDACTED]",
"email": "[EMAIL_REDACTED]"
}
}
}
</script>
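One way to get this structure-preserving behavior is a recursive walk over the parsed JSON that replaces values by key and leaves every key and nesting level in place. The sketch below is an assumption-laden illustration: `redact_jsonld` and its placeholder table are hypothetical, and it treats `name` as personal only on nodes whose `@type` is `Person`, which is how the organization name survives in the example above.

```python
# Hypothetical mapping of schema.org properties to placeholders.
FIELD_PLACEHOLDERS = {
    "email": "[EMAIL_REDACTED]",
    "telephone": "[PHONE_REDACTED]",
}


def redact_jsonld(node):
    """Return a redacted copy of parsed JSON-LD without mutating the input."""
    if isinstance(node, dict):
        is_person = node.get("@type") == "Person"
        out = {}
        for key, value in node.items():
            if key in FIELD_PLACEHOLDERS and isinstance(value, str):
                out[key] = FIELD_PLACEHOLDERS[key]
            elif key == "name" and is_person and isinstance(value, str):
                out[key] = "[NAME_REDACTED]"
            else:
                # Recurse so nested objects such as contactPoint are covered.
                out[key] = redact_jsonld(value)
        return out
    if isinstance(node, list):
        return [redact_jsonld(item) for item in node]
    return node
```

In use, the script block's text would be parsed with `json.loads`, passed through `redact_jsonld`, and written back with `json.dumps`, so the output is always syntactically valid JSON.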
Inline JavaScript Processing
JavaScript blocks may contain user data in variable assignments, configuration objects, or template literals:
Detection Targets
- Variable assignments with PII values
- Object literals with user data
- Array elements containing PII
- Template literal interpolations
- Function call arguments
Preservation Requirements
- Valid JavaScript syntax
- Proper string escaping
- Object/array structure
- Variable references
- Function call semantics
CSS Content Processing
The CSS content property can carry PII through generated content, particularly in ::before and ::after pseudo-elements used for decorative text or accessibility labels. Our processor scans inline styles and embedded stylesheets for content declarations containing sensitive data.
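A regular-expression sketch of this scan is shown below. It is only an approximation under stated assumptions: `redact_css_content` and both patterns are hypothetical, only quoted `content` strings are inspected, and escaped quotes inside a CSS string are not handled, which a real CSS tokenizer would cover.

```python
import re

# Matches content: "..." or content: '...' (no support for escaped quotes).
CONTENT_DECL = re.compile(r"""(content\s*:\s*)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)
# Hypothetical, deliberately narrow PII pattern.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_css_content(css):
    """Redact emails inside quoted CSS content values, leaving other CSS untouched."""

    def replace(match):
        prefix, quote_char, text = match.groups()
        redacted = EMAIL.sub("[EMAIL_REDACTED]", text)
        return prefix + quote_char + redacted + quote_char

    return CONTENT_DECL.sub(replace, css)
```

Because only the string inside the declaration is rewritten, selectors, other properties, and the original quote style are preserved.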
HTML Email Processing
HTML emails have unique constraints that our processor handles specially. Inline CSS is required for compatibility, table-based layouts remain common, and various email clients have different rendering quirks. RedactionAPI processes HTML emails while maintaining their cross-client rendering fidelity.
Email-Specific Features
- Preheader text: Hidden preview text containing recipient names
- Inline styles: Preserved during content redaction
- Table layouts: Structure maintained for Outlook compatibility
- MSO conditionals: Office-specific code processed correctly
- Tracking pixels: URL parameters with user IDs redacted
Client Compatibility
- Gmail: Inline CSS preserved, data-* attributes maintained
- Outlook: VML/WordML conditionals processed
- Apple Mail: WebKit-specific styles preserved
- Mobile clients: Responsive breakpoints maintained
- Plain text: Option to generate redacted text version
Email HTML Processing Example
// Process HTML email with email-specific rules
POST /api/v1/html/redact
{
"content": "<!DOCTYPE html><html>...email HTML...</html>",
"document_type": "html_email",
"options": {
"preserve_inline_styles": true,
"process_preheader": true,
"redact_tracking_urls": {
"enabled": true,
"preserve_base_url": true,
"redact_parameters": ["uid", "email", "subscriber_id"]
},
"process_mso_conditionals": true,
"output_format": "multipart" // Include text/plain version
}
}
// Response includes both HTML and plain text versions
{
"html_content": "<!DOCTYPE html>...redacted email...</html>",
"text_content": "...redacted plain text version...",
"redactions_applied": 23,
"locations": {
"body_text": 12,
"preheader": 1,
"tracking_urls": 8,
"alt_attributes": 2
}
}
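Once both versions come back, they can be assembled into a standard multipart/alternative message. The sketch below uses Python's `email.message.EmailMessage` and assumes only the `html_content`, `text_content`, `redactions_applied`, and `locations` fields shown in the response above; the helper names are hypothetical.

```python
from email.message import EmailMessage


def build_redacted_message(response, subject):
    """Build a multipart/alternative email from a redaction response dict."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # The plain text part goes first, then the HTML alternative.
    msg.set_content(response["text_content"])
    msg.add_alternative(response["html_content"], subtype="html")
    return msg


def location_counts_match(response):
    """Check that per-location counts add up to redactions_applied."""
    return sum(response["locations"].values()) == response["redactions_applied"]
```

The consistency check is a cheap sanity test before archiving: for the sample response, 12 + 1 + 8 + 2 equals the reported 23 redactions.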
Security and XSS Prevention
Processing untrusted HTML requires careful attention to security. Our redaction process not only protects PII but also ensures the output is safe to render—properly escaped to prevent cross-site scripting attacks. Whether the original content was benign or contained malicious payloads, the redacted output is safe for browser display.
XSS Prevention Measures
Input Sanitization
- Strip javascript: and data: URIs from links
- Remove event handler attributes (onclick, onerror)
- Sanitize SVG content for script injection
- Normalize Unicode to prevent homograph attacks
Output Encoding
- Context-appropriate entity encoding
- URL parameter encoding for href/src
- JavaScript string escaping in scripts
- CSS value encoding in style attributes
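Two of the input-sanitization measures, removing event handler attributes and stripping javascript: URLs, can be sketched as a filter over a tag's attribute list. This is an illustrative fragment and not a complete sanitizer: `sanitize_attrs` and `URL_ATTRS` are hypothetical names, and production code would normally rely on a vetted sanitization library.

```python
import re

# Whitespace and control characters are removed before the scheme check so
# obfuscated values such as " JaVa\tScript:" are still caught (conservative).
_SCHEME_NOISE = re.compile(r"[\x00-\x20]+")
# Hypothetical set of attributes that carry URLs.
URL_ATTRS = {"href", "src", "action", "formaction"}


def sanitize_attrs(attrs):
    """Drop event handlers and javascript: URLs from (name, value) pairs."""
    clean = []
    for name, value in attrs:
        lowered = name.lower()
        if lowered.startswith("on"):
            continue  # onclick, onerror, onload, ...
        if lowered in URL_ATTRS and value is not None:
            compact = _SCHEME_NOISE.sub("", value).lower()
            if compact.startswith("javascript:"):
                continue
        clean.append((name, value))
    return clean
```

The filter takes the same `(name, value)` pairs that `html.parser` hands to `handle_starttag`, so it slots in before attributes are re-serialized.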
Content Security Policy Compatibility
Our HTML output is compatible with strict Content Security Policy headers. Redaction markers use plain text that doesn't require unsafe-inline for scripts or styles. Organizations can serve redacted documents with restrictive CSP headers while maintaining full functionality.
Security Options
// Security-focused HTML redaction configuration
{
"security": {
"sanitize_output": true,
"strip_scripts": false, // Keep scripts, sanitize content
"strip_event_handlers": true, // Remove onclick, onerror, etc.
"strip_javascript_urls": true, // Remove javascript: hrefs
"strip_data_urls": false, // Keep data: for images (optional)
"encode_entities": true, // Always encode special chars
"normalize_unicode": true, // Prevent homograph attacks
"validate_urls": true, // Check URL format validity
"csp_compatible": true // Ensure CSP-friendly output
}
}
HTML Redaction API Usage
Our API accepts HTML content and returns redacted HTML with comprehensive metadata about detected and redacted PII. You can customize redaction behavior for different element types, attribute categories, and embedded data formats.
Complete API Example
// POST /v1/document/html
curl -X POST https://api.redactionapi.com/v1/document/html \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"content": "<!DOCTYPE html><html><head><title>User Profile</title></head><body><h1>Welcome, John Smith</h1><p>Email: [email protected]</p><p>Phone: 555-123-4567</p><a href=\"mailto:[email protected]\">Contact</a><img src=\"/avatars/jsmith.jpg\" alt=\"John Smith profile photo\"><script>var user = {name: \"John Smith\", id: 12345};</script></body></html>",
"options": {
"detect_in_text": true,
"detect_in_attributes": true,
"detect_in_scripts": true,
"redaction_style": "placeholder",
"preserve_structure": true
}
}'
// Response
{
"success": true,
"redacted_content": "<!DOCTYPE html><html><head><title>User Profile</title></head><body><h1>Welcome, [NAME_REDACTED]</h1><p>Email: [EMAIL_REDACTED]</p><p>Phone: [PHONE_REDACTED]</p><a href=\"mailto:[EMAIL_REDACTED]\">Contact</a><img src=\"/avatars/[REDACTED].jpg\" alt=\"[NAME_REDACTED] profile photo\"><script>var user = {name: \"[NAME_REDACTED]\", id: 12345};</script></body></html>",
"statistics": {
"total_redactions": 7,
"by_type": {
"name": 3,
"email": 2,
"phone": 1,
"path_segment": 1
},
"by_location": {
"text_content": 3,
"href_attribute": 1,
"alt_attribute": 1,
"src_attribute": 1,
"script_content": 1
}
},
"valid_html": true,
"processing_time_ms": 45
}
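For callers working in Python rather than curl, the same request can be built with the standard library. This sketch reuses the endpoint, headers, and option names from the curl example; the `build_request` and `totals_consistent` helpers are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.redactionapi.com/v1/document/html"


def build_request(html, api_key):
    """Build (but do not send) the POST request for HTML redaction."""
    payload = {
        "content": html,
        "options": {
            "detect_in_text": True,
            "detect_in_attributes": True,
            "detect_in_scripts": True,
            "redaction_style": "placeholder",
            "preserve_structure": True,
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def totals_consistent(statistics):
    """Check that by_type and by_location both add up to total_redactions."""
    total = statistics["total_redactions"]
    return (
        sum(statistics["by_type"].values()) == total
        and sum(statistics["by_location"].values()) == total
    )
```

Sending is then a matter of passing the request to `urllib.request.urlopen` and decoding the JSON body; `totals_consistent(body["statistics"])` gives a quick check that the two breakdowns describe the same set of redactions.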
Protect Your Web Content
Start redacting sensitive data from HTML documents with structure-preserving, XSS-safe processing. Web pages, emails, and exported content protected.