Korean Language Redaction API | Hangul PII Detection

Understanding Korean Language PII Detection

Korean presents unique challenges for automated PII detection due to its Hangul writing system, complex honorific structures, and national identifier formats. RedactionAPI's Korean language support provides comprehensive detection capabilities that understand the nuances of Korean text while ensuring compliance with South Korea's Personal Information Protection Act (PIPA).

The Complexity of Korean Text Processing

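The Korean language uses Hangul, a featural alphabet invented by King Sejong in 1443. Unlike Chinese characters, which represent meanings, or Japanese, which mixes two syllabaries with kanji, Hangul consists of jamo (letters) combined into syllable blocks. There are 24 basic jamo: 14 consonants and 10 vowels. Together with the doubled consonants and compound vowels built from them, they yield 11,172 possible modern syllable blocks (19 initial consonants × 21 vowels × 28 final positions, counting "no final consonant" as one option).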

For PII detection, this structure creates both opportunities and challenges. Korean text has clear syllable boundaries, making word segmentation more predictable than in Chinese or Japanese. However, the language lacks spaces within compound nouns, uses complex honorific systems that modify names, and often mixes Hangul with Hanja (Chinese characters) in formal contexts.
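The block structure described above can be illustrated with the syllable arithmetic from the Unicode Standard. This is a minimal sketch, not RedactionAPI's implementation; the function name `decomposeSyllable` and the three lookup arrays are introduced here for illustration only.

```javascript
// Jamo tables in Unicode order (compatibility jamo are used for readability).
const INITIALS = [...'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'];                  // 19 initial consonants
const MEDIALS = [...'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'];                // 21 vowels
const FINALS = ['', ...'ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ'];   // 27 finals + "none"

// Decompose one precomposed syllable (U+AC00..U+D7A3) into its jamo.
// Returns null for anything that is not a precomposed Hangul syllable.
function decomposeSyllable(ch) {
    const index = ch.codePointAt(0) - 0xAC00;
    if (index < 0 || index >= 19 * 21 * 28) return null;
    const initial = INITIALS[Math.floor(index / (21 * 28))];
    const medial = MEDIALS[Math.floor((index % (21 * 28)) / 28)];
    const tail = FINALS[index % 28];
    // Drop the empty string when the syllable has no final consonant.
    return [initial, medial, tail].filter(Boolean);
}

console.log(decomposeSyllable('김')); // [ 'ㄱ', 'ㅣ', 'ㅁ' ]
```

Because every block maps to an index in the range 0 to 11,171, decomposition is pure integer arithmetic and needs no dictionary.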

Korean Name Detection Strategies

Korean names typically follow the pattern of a single-syllable family name followed by a two-syllable given name, though variations exist. Approximately 45% of Koreans share just three family names: Kim (김), Lee/Yi (이), and Park (박). Our detection system leverages this concentrated distribution while accounting for the remaining 283+ family names documented in Korean records.

Korean Name Pattern Examples

Standard Patterns

김민준 - Three syllables (family + given)
이서연 - Common structure
박지훈 - Family name Park

With Honorifics

김민준 님 - With honorific suffix
이서연 씨 - Common address form
박 부장님 - Title with family name
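The patterns above can be sketched as regular expressions that use honorifics and titles as cues. This is an illustrative sketch only: the `FAMILY_NAMES` and `TITLES` lists are tiny hypothetical samples, and a production detector would need a full family-name list plus context scoring, since syllables such as 이 also occur as ordinary grammar particles.

```javascript
// Hypothetical sample lists, for illustration only.
const FAMILY_NAMES = ['김', '이', '박', '최', '정'];
const HONORIFICS = ['님', '씨'];
const TITLES = ['부장', '과장', '대리'];

const family = `(?:${FAMILY_NAMES.join('|')})`;
// Family name + 1-2 syllable given name, followed by an honorific (김민준 님, 이서연 씨).
const NAME_WITH_HONORIFIC = new RegExp(`(${family}[가-힣]{1,2})\\s?(${HONORIFICS.join('|')})`, 'gu');
// Family name alone, followed by a job title and optional 님 (박 부장님).
const FAMILY_WITH_TITLE = new RegExp(`(${family})\\s?(${TITLES.join('|')})님?`, 'gu');

// Return every name candidate together with the cue that triggered it.
function findNameCandidates(text) {
    const results = [];
    for (const m of text.matchAll(NAME_WITH_HONORIFIC)) {
        results.push({ name: m[1], cue: m[2], start: m.index });
    }
    for (const m of text.matchAll(FAMILY_WITH_TITLE)) {
        results.push({ name: m[1], cue: m[2], start: m.index });
    }
    return results;
}

console.log(findNameCandidates('이서연 씨'));
console.log(findNameCandidates('박 부장님'));
```

Capturing the cue separately from the name lets the caller redact only the name while leaving the honorific or title in place.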

Korean Resident Registration Number (RRN) Detection

The Resident Registration Number (주민등록번호) is Korea's primary national identifier, assigned to all citizens at birth and to foreign residents upon registration. This 13-digit number contains encoded personal information and is heavily protected under Korean law.

RRN Structure and Validation

The RRN follows the format YYMMDD-GNNNNNN, where:

YYMMDD - Date of birth (year, month, day)
G - Gender and century indicator:
- 1 = Male born 1900-1999
- 2 = Female born 1900-1999
- 3 = Male born 2000-2099
- 4 = Female born 2000-2099
- 5-8 = Foreign residents
NNNNN - Registration location code and sequence (5 digits)
N - Check digit calculated via weighted modulo 11
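The field layout above can be decoded as in the following sketch. The function name `parseRrnFields` and its return shape are hypothetical. The table does not give a century for codes 5-8, so the sketch assumes 5-6 mean born 1900-1999 and 7-8 mean born 2000-2099, with odd codes male and even codes female, mirroring codes 1-4.

```javascript
// Decode birth date, sex, and residency type from an RRN-shaped string.
// Returns null when the shape, the G digit, or the calendar date is invalid.
function parseRrnFields(rrn) {
    const m = /^(\d{2})(\d{2})(\d{2})-?(\d)\d{6}$/.exec(rrn);
    if (!m) return null;
    const [, yy, mm, dd, g] = m;
    const code = Number(g);
    if (code < 1 || code > 8) return null; // outside the table above

    // Assumption: 1, 2, 5, 6 => 1900s; 3, 4, 7, 8 => 2000s.
    const century = [1, 2, 5, 6].includes(code) ? 1900 : 2000;
    const year = century + Number(yy);
    const month = Number(mm);
    const day = Number(dd);

    // Reject impossible dates such as month 13 or February 30.
    const date = new Date(Date.UTC(year, month - 1, day));
    if (date.getUTCFullYear() !== year || date.getUTCMonth() !== month - 1 || date.getUTCDate() !== day) {
        return null;
    }

    return {
        birthDate: `${year}-${mm}-${dd}`,
        sex: code % 2 === 1 ? 'male' : 'female',
        foreignResident: code >= 5,
    };
}

console.log(parseRrnFields('850315-1234567'));
// { birthDate: '1985-03-15', sex: 'male', foreignResident: false }
```

A date check like this is a cheap way to discard 13-digit strings that merely look like RRNs before running the checksum.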

RRN Checksum Algorithm

Our validation uses the official weighted checksum algorithm mandated by the Korean government. Each of the first 12 digits is multiplied by the corresponding weight in the sequence [2,3,4,5,6,7,8,9,2,3,4,5] and the products are summed. The check digit is then (11 - (sum mod 11)) mod 10, and it must equal the 13th digit.

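// RRN validation: 13 digits, the last of which is a weighted check digit
function validateKoreanRRN(rrn) {
    // Accept either 13 bare digits or the hyphenated YYMMDD-GNNNNNN form
    const digits = String(rrn).trim().replace(/^(\d{6})-(\d{7})$/, '$1$2');
    if (!/^\d{13}$/.test(digits)) return false;

    const weights = [2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5];
    let sum = 0;

    // Weighted sum over the first 12 digits
    for (let i = 0; i < 12; i++) {
        sum += Number(digits[i]) * weights[i];
    }

    // The outer "% 10" maps intermediate results of 10 and 11 to 0 and 1
    const checkDigit = (11 - (sum % 11)) % 10;
    return checkDigit === Number(digits[12]);
}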
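To make the arithmetic concrete, the sketch below computes the check digit on its own and walks through one made-up input. The helper name `rrnCheckDigit` and the sample digits 900101-100000 are illustrative assumptions, not real data.

```javascript
const RRN_WEIGHTS = [2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5];

// Return the expected check digit for the first 12 digits of an RRN.
function rrnCheckDigit(first12) {
    if (!/^\d{12}$/.test(first12)) throw new Error('expected exactly 12 digits');
    const sum = [...first12].reduce((acc, d, i) => acc + Number(d) * RRN_WEIGHTS[i], 0);
    return (11 - (sum % 11)) % 10;
}

// Worked example for the made-up digits 900101-100000:
//   9*2 + 0*3 + 0*4 + 1*5 + 0*6 + 1*7 + 1*8 + 0*9 + 0*2 + 0*3 + 0*4 + 0*5 = 38
//   38 mod 11 = 5, and (11 - 5) mod 10 = 6
console.log(rrnCheckDigit('900101100000')); // 6
```

So the string 900101-1000006 would pass the checksum, while any other final digit would fail it.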

Korean Address Recognition

South Korea uses two parallel address systems, and many documents and databases contain addresses in either or both formats. Our parser handles the complexity of Korean administrative divisions and building identification.

Road Name Address System (도로명주소)

The sole official system since January 2014, road-name addresses follow a logical structure based on streets rather than land parcels:

서울특별시 강남구 테헤란로 152, 강남파이낸스센터 12층

서울특별시 - Metropolitan city (Seoul Special City)
강남구 - District (Gangnam-gu)
테헤란로 152 - Road name and building number
강남파이낸스센터 12층 - Building name and floor

Land Lot Address System (지번주소)

The traditional system based on land parcel numbers remains in common use, particularly in older documents and rural areas:

서울특별시 강남구 역삼동 737

서울특별시 - Metropolitan city
강남구 - District
역삼동 - Neighborhood (dong)
737 - Land lot number
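The two formats can be told apart by their distinguishing unit, as in this rough sketch. The patterns and the `classifyAddress` name are assumptions made for illustration; real addresses include sub-roads, basement markers, and other variants that these two expressions do not cover.

```javascript
// Road-name addresses contain a road ending in 대로, 로, or 길 followed by a building number.
const ROAD_NAME = /([가-힣0-9]+(?:대로|로|길))\s?(\d+(?:-\d+)?)/u;
// Land-lot addresses contain a 동, 리, or 가 unit followed by a lot number.
const LAND_LOT = /([가-힣0-9]+(?:동|리|가))\s?(\d+(?:-\d+)?)/u;

// Classify an address string and extract its unit name and number.
function classifyAddress(text) {
    let m = ROAD_NAME.exec(text);
    if (m) return { system: 'road-name', unit: m[1], number: m[2] };
    m = LAND_LOT.exec(text);
    if (m) return { system: 'land-lot', unit: m[1], number: m[2] };
    return null;
}

console.log(classifyAddress('서울특별시 강남구 테헤란로 152, 강남파이낸스센터 12층'));
// { system: 'road-name', unit: '테헤란로', number: '152' }
console.log(classifyAddress('서울특별시 강남구 역삼동 737'));
// { system: 'land-lot', unit: '역삼동', number: '737' }
```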

Korean Phone Number Detection

Korean phone numbers follow specific patterns that our system recognizes across multiple formatting variations:

Phone Number Formats

Mobile Numbers

010-1234-5678 - Standard mobile
010.1234.5678 - Dot separated
01012345678 - No separators

Landline Numbers

02-1234-5678 - Seoul
031-123-4567 - Gyeonggi
051-234-5678 - Busan
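The formatting variations listed above can be matched and normalized with a single expression. This is a sketch under stated assumptions: the `KR_PHONE` pattern and `findKoreanPhones` helper are hypothetical, and the area-code range (02, or 03x to 06x with a final digit of 1 to 5) is a simplification.

```javascript
// Prefix (mobile 01x, Seoul 02, or a three-digit area code), then a 3-4 digit
// exchange and a 4 digit line number, with optional "-", ".", or space separators.
const KR_PHONE = /(?<!\d)(01[016789]|02|0[3-6][1-5])[-. ]?(\d{3,4})[-. ]?(\d{4})(?!\d)/g;

// Find phone numbers and rewrite each one in hyphenated form.
function findKoreanPhones(text) {
    return [...text.matchAll(KR_PHONE)].map((m) => ({
        raw: m[0],
        normalized: `${m[1]}-${m[2]}-${m[3]}`,
        kind: m[1].startsWith('01') ? 'mobile' : 'landline',
        start: m.index,
    }));
}

console.log(findKoreanPhones('연락처: 010.1234.5678 / 02-1234-5678'));
```

The digit lookarounds keep the pattern from firing inside longer digit runs such as account numbers.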

PIPA Compliance Requirements

South Korea's Personal Information Protection Act (개인정보 보호법) is one of the world's most stringent data protection regulations. Enacted in 2011 and significantly amended in 2020, PIPA imposes strict requirements on organizations handling Korean personal data.

Unique Identifiers Under PIPA

PIPA designates certain identifiers as "Unique Identifiers" (고유식별정보) requiring enhanced protection. Collection and processing of these identifiers requires explicit consent or legal basis:

PIPA Unique Identifiers

Resident Registration Number (주민등록번호)
Passport Number (여권번호)
Driver's License Number (운전면허번호)
Alien Registration Number (외국인등록번호)

Sensitive Information

Health and medical information
Genetic and biometric data
Race and ethnicity
Political opinions and union membership

RRN Processing Restrictions

Since 2014, PIPA has prohibited the collection of RRNs except in cases specifically authorized by law. Organizations must implement technical measures to ensure RRNs are not collected without legal basis. RedactionAPI helps organizations identify and protect RRNs that may have been collected historically or appear in documents unexpectedly.

API Integration for Korean Text

Processing Korean text through our API is straightforward. The system automatically detects Korean content and applies appropriate detection rules:

{
    "text": "고객님 김민준 님의 주민등록번호 850315-1234567로 본인확인이 완료되었습니다. 연락처: 010-9876-5432",
    "language": "ko",
    "pii_types": ["name", "rrn", "phone"],
    "redaction_style": "mask"
}

Response:

{
    "redacted_text": "고객님 [NAME] 님의 주민등록번호 [RRN]로 본인확인이 완료되었습니다. 연락처: [PHONE]",
    "entities": [
        {
            "type": "name",
            "value": "김민준",
            "position": {"start": 4, "end": 7},
            "confidence": 0.96
        },
        {
            "type": "rrn",
            "value": "850315-1234567",
            "position": {"start": 18, "end": 32},
            "confidence": 0.99,
            "validation": "checksum_invalid"
        },
        {
            "type": "phone",
            "value": "010-9876-5432",
            "position": {"start": 54, "end": 67},
            "confidence": 0.98
        }
    ]
}
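The `position` offsets in a response like the one above can be used to apply or verify redactions on the client side. The sketch below assumes `start` is inclusive and `end` is exclusive, counted in characters (which equal UTF-16 code units for Hangul); the `maskEntities` helper and the short sample string are hypothetical.

```javascript
// Replace each detected span with a [TYPE] placeholder.
function maskEntities(text, entities) {
    // Work from the end of the string so that earlier offsets stay valid.
    const ordered = [...entities].sort((a, b) => b.position.start - a.position.start);
    let result = text;
    for (const { type, position } of ordered) {
        result = result.slice(0, position.start) + `[${type.toUpperCase()}]` + result.slice(position.end);
    }
    return result;
}

const sample = '담당자 김민준 010-1234-5678';
const masked = maskEntities(sample, [
    { type: 'name', position: { start: 4, end: 7 } },
    { type: 'phone', position: { start: 8, end: 21 } },
]);
console.log(masked); // 담당자 [NAME] [PHONE]
```

Slicing the original text with each entity's offsets and comparing the result to its `value` field is also a quick sanity check on how offsets are counted.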

Handling Mixed Script Content

Korean documents frequently contain mixed content including English, Hanja (Chinese characters), and numbers. Our system maintains accuracy across these mixed contexts:

Mixed Content Examples

Korean-English: "담당자: John Kim (김존)" - Detects both name representations
Hanja Names: "李成桂 (이성계)" - Recognizes Hanja with Hangul reading
Formal Documents: "株式會社" vs "주식회사" - Company designation variants
Dates: "2024년 3월 15일" vs "2024.03.15" - Multiple date formats
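One way to approach mixed content like the examples above is to split text into runs by script before applying script-specific rules. The following is a minimal sketch using Unicode property escapes; the `scriptRuns` helper and its labels are illustrative assumptions rather than a description of the product's pipeline.

```javascript
// Ordered list of script tests; the first match wins.
const SCRIPTS = [
    ['hangul', /\p{Script=Hangul}/u],
    ['hanja', /\p{Script=Han}/u],
    ['latin', /\p{Script=Latin}/u],
    ['digit', /\p{Nd}/u],
];

// Split text into runs of consecutive characters that share a script.
function scriptRuns(text) {
    const runs = [];
    for (const ch of text) {
        const found = SCRIPTS.find(([, re]) => re.test(ch));
        const script = found ? found[0] : 'other';
        const last = runs[runs.length - 1];
        if (last && last.script === script) last.text += ch;
        else runs.push({ script, text: ch });
    }
    return runs;
}

console.log(scriptRuns('李成桂 (이성계)'));
// [ { script: 'hanja', text: '李成桂' }, { script: 'other', text: ' (' },
//   { script: 'hangul', text: '이성계' }, { script: 'other', text: ')' } ]
```

Adjacent Hanja and Hangul runs separated only by brackets, as here, are a useful signal that both refer to the same name.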

Additional Korean Identifiers

Korean Driver's License Numbers

Korean driver's license numbers have 12 digits: regional code (2 digits) + year (2 digits) + sequence (6 digits) + check digits (2 digits). Example: 11-12-345678-01. Older licenses print the region name in Hangul in place of the numeric code, as in 서울-12-345678-01.
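A format-only matcher for this layout might look like the sketch below. It assumes the region may appear either as two digits or as a two-syllable Hangul region name such as 서울, and it does not verify the check digits. The names `KR_DRIVER_LICENSE` and `parseDriverLicense` are hypothetical.

```javascript
// Region (2 digits or 2 Hangul syllables), year, 6-digit sequence, 2 check digits.
// Hyphens are optional; lookarounds stop matches inside longer digit runs.
const KR_DRIVER_LICENSE = /(?<![\d가-힣])(\d{2}|[가-힣]{2})-?(\d{2})-?(\d{6})-?(\d{2})(?!\d)/u;

// Return the four fields of the first license-shaped string, or null.
function parseDriverLicense(text) {
    const m = KR_DRIVER_LICENSE.exec(text);
    if (!m) return null;
    return { region: m[1], year: m[2], sequence: m[3], check: m[4] };
}

console.log(parseDriverLicense('서울-12-345678-01'));
// { region: '서울', year: '12', sequence: '345678', check: '01' }
```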

Business Registration Numbers

Korean business registration numbers (사업자등록번호) follow a 10-digit format (XXX-XX-XXXXX) with regional office codes and check digit validation.
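The check digit can be verified with the commonly published algorithm sketched below. The text above does not spell the algorithm out, so the weights [1,3,7,1,3,7,1,3,5] and the extra carry term are an assumption here; the function name and the synthetic number 123-45-67891 are likewise illustrative.

```javascript
const BRN_WEIGHTS = [1, 3, 7, 1, 3, 7, 1, 3, 5];

// Validate a 10-digit business registration number (hyphens optional).
function validateBusinessRegistrationNumber(brn) {
    const digits = String(brn).replace(/^(\d{3})-(\d{2})-(\d{5})$/, '$1$2$3');
    if (!/^\d{10}$/.test(digits)) return false;
    const d = [...digits].map(Number);

    // Weighted sum over the first nine digits.
    let sum = 0;
    for (let i = 0; i < 9; i++) sum += d[i] * BRN_WEIGHTS[i];
    // The ninth digit's product (d[8] * 5) also contributes its tens digit.
    sum += Math.floor((d[8] * 5) / 10);

    return (10 - (sum % 10)) % 10 === d[9];
}

// Synthetic example: weighted sum 165, carry 4, total 169, so the check digit is 1.
console.log(validateBusinessRegistrationNumber('123-45-67891')); // true
```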

Korean Passport Numbers

Korean passports use a format starting with M or R followed by 8 digits, with validation based on ICAO standards.

Performance Optimizations for Korean

Our Korean processing pipeline includes several optimizations:

Syllable Block Indexing: Pre-indexed lookup tables for common Hangul patterns enable sub-millisecond pattern matching
Jamo Decomposition: When needed for fuzzy matching, we decompose syllables into constituent jamo without performance penalty
Name Database Optimization: Bloom filters provide rapid elimination of non-name patterns before detailed analysis
Regional Code Caching: Cached mappings for area codes, postal codes, and regional identifiers accelerate address parsing
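To illustrate the Bloom filter idea mentioned above, here is a small self-contained sketch. It is not the production data structure: the class, its sizing defaults, and the FNV-1a style hash with double hashing are assumptions chosen for brevity.

```javascript
class BloomFilter {
    constructor(bits = 1024, hashes = 3) {
        this.bits = bits;
        this.hashes = hashes;
        this.data = new Uint8Array(Math.ceil(bits / 8));
    }

    // FNV-1a style hash over UTF-16 code units, with a seed mixed into the basis.
    static hash(str, seed) {
        let h = (0x811c9dc5 ^ seed) >>> 0;
        for (let i = 0; i < str.length; i++) {
            h ^= str.charCodeAt(i);
            h = Math.imul(h, 0x01000193) >>> 0;
        }
        return h;
    }

    // Derive the bit positions by double hashing: (h1 + i * h2) mod bits.
    positions(item) {
        const h1 = BloomFilter.hash(item, 0);
        const h2 = (BloomFilter.hash(item, 0x9e3779b9) | 1) >>> 0;
        const out = [];
        for (let i = 0; i < this.hashes; i++) out.push((h1 + i * h2) % this.bits);
        return out;
    }

    add(item) {
        for (const p of this.positions(item)) this.data[p >> 3] |= 1 << (p & 7);
    }

    // false => definitely absent; true => possibly present (false positives can occur).
    mightContain(item) {
        return this.positions(item).every((p) => (this.data[p >> 3] & (1 << (p & 7))) !== 0);
    }
}

const knownNames = new BloomFilter();
['김민준', '이서연', '박지훈'].forEach((name) => knownNames.add(name));
console.log(knownNames.mightContain('김민준')); // true
```

A negative answer is always reliable, which is what makes the filter useful for discarding non-name strings before a more expensive lookup.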

Enterprise Use Cases

Financial Services

Korean banks and fintech companies use our API to redact RRNs and account information in customer communications, audit logs, and internal documents while maintaining PIPA compliance.

Healthcare

Medical institutions process patient records through our system to remove identification while preserving clinical value for research and analytics.

E-commerce

Korean online retailers redact customer data in order histories, support tickets, and shipping records to minimize data exposure.

Telecommunications

Mobile carriers process call records and customer data through our API to support regulatory compliance and internal data governance requirements.

Start Processing Korean Text Today

RedactionAPI provides the most comprehensive Korean language PII detection available, with native support for Hangul, Korean identifiers, and PIPA compliance. Process thousands of documents per minute with enterprise-grade accuracy.

