Of all the data quality problems we see across customer databases, phone number normalization is simultaneously the most universal and the most underestimated. Every company that collects phone numbers — which is nearly every company — ends up with a patchwork of formats that silently breaks SMS campaigns, caller ID lookups, CRM deduplication, and voice integrations.
This guide is the most comprehensive treatment of phone normalization we're aware of. It covers every major format you'll encounter, the edge cases that trip up most normalization scripts, and the step-by-step logic for building a reliable normalization function.
Why Phone Formats Are So Inconsistent
Phone numbers are inconsistent because there's no universal standard for how humans write them, and because data is collected across many different touchpoints — web forms, manual entry, CSV imports, third-party integrations — each of which applies its own formatting conventions. Here are the formats we see most commonly for a US number like 555-123-4567:
- 5551234567: raw digits, no formatting
- (555) 123-4567: North American standard
- 555.123.4567: period-separated
- 555-123-4567: hyphen-separated
- +1 555 123 4567: international format with spaces
- +15551234567: E.164 (compact, no separators)
- 1-555-123-4567: with country code prefix
- +1 (555) 123-4567: country code plus national format
- 5551234567 x123: with extension
And that's just for US numbers. International numbers add an additional layer of complexity with varying digit counts, country codes, and national formatting conventions.
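To make the mismatch concrete, here is a small Python sketch that compares the variants listed above (the extension example is left out). The helper name `digits_only` is illustrative and not taken from any library.

```python
import re

# The same US number written in the formats listed above.
VARIANTS = [
    "5551234567",
    "(555) 123-4567",
    "555.123.4567",
    "555-123-4567",
    "+1 555 123 4567",
    "+15551234567",
    "1-555-123-4567",
    "+1 (555) 123-4567",
]

def digits_only(raw: str) -> str:
    """Drop every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", raw)

# Plain string comparison treats every variant as a different value.
distinct_raw = set(VARIANTS)

# Removing punctuation helps, but the optional leading 1 still splits the set.
distinct_digits = {digits_only(v) for v in VARIANTS}
```

Stripping punctuation alone still leaves two values (with and without the leading 1), which is why country code handling has to be part of normalization.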
The Three Goals of Normalization
Before writing any normalization logic, it helps to be clear about what you're trying to achieve. Phone normalization has three distinct goals that can sometimes conflict:
- Deduplication readiness. Two phone numbers that reach the same subscriber should normalize to the same string so they match in deduplication logic.
- System compatibility. Your downstream systems — SMS platforms, voice providers, CRM lookup APIs — may require a specific format. Normalize to whatever they expect.
- Human readability. In customer-facing displays and exports for human review, a formatted number like (555) 123-4567 is easier to read than 5551234567.
For deduplication purposes, storing an E.164 compact format (+15551234567) is the strongest choice because it's unique, unambiguous, and compatible with most APIs. For human display, apply a local format on top of the stored E.164 value at render time.
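As a rough illustration of formatting at render time, the sketch below turns a stored E.164 value into the familiar North American layout. The function name `format_nanp_for_display` is hypothetical, and the sketch only handles +1 numbers; anything else is returned unchanged.

```python
def format_nanp_for_display(e164: str) -> str:
    """Render a stored +1XXXXXXXXXX value as (XXX) XXX-XXXX.

    Values that are not 10-digit +1 numbers are returned unchanged,
    since this sketch does not know other national formats.
    """
    is_nanp = (
        len(e164) == 12
        and e164.startswith("+1")
        and e164[1:].isdigit()
    )
    if not is_nanp:
        return e164
    area, exchange, line = e164[2:5], e164[5:8], e164[8:]
    return f"({area}) {exchange}-{line}"
```

Because the display string is derived from the stored value, changing the display convention later does not require touching the stored data.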
Step-by-Step Normalization Logic
Step 1 — Strip all non-numeric characters (except leading +)
Remove parentheses, hyphens, spaces, periods, and slashes. Preserve a leading + sign, as it indicates an international prefix. After this step, +1 (555) 123-4567 becomes +15551234567 and (555) 123-4567 becomes 5551234567.
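A minimal Python sketch of this step might look like the following. The name `strip_formatting` is illustrative, and the sketch assumes any extension has already been removed from the input.

```python
import re

def strip_formatting(raw: str) -> str:
    """Keep ASCII digits only, plus a leading '+' if one was present."""
    raw = raw.strip()
    has_plus = raw.startswith("+")
    digits = re.sub(r"[^0-9]", "", raw)
    return "+" + digits if has_plus else digits
```

Matching on `[^0-9]` rather than `\D` is a deliberate choice here: in Python, `\d` also matches non-ASCII digit characters, which would otherwise survive the stripping.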
Step 2 — Extract and store extensions separately
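Before stripping non-numeric characters, extract any extension. This has to run ahead of Step 1; otherwise the extension digits get merged into the main number and 5551234567 x123 becomes the 13-digit string 5551234567123. Extensions are typically signaled by "x", "ext", "ext.", or "extension" followed by digits. Store the extension in a separate field. Do not include it in the normalized phone number.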
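One possible way to pull the extension off the end of the raw input is sketched below. The pattern and the name `split_extension` are assumptions for illustration; real data may use other markers (such as "#") that this does not cover.

```python
import re
from typing import Optional, Tuple

# An extension marker must directly follow the main number (hence the
# look-behind for a digit), so the "x" in a word like "Fax" is ignored.
EXTENSION_RE = re.compile(
    r"(?<=\d)\s*(?:extension|ext\.?|x)\.?\s*(\d+)\s*$",
    re.IGNORECASE,
)

def split_extension(raw: str) -> Tuple[str, Optional[str]]:
    """Return (main_number_text, extension_or_None)."""
    match = EXTENSION_RE.search(raw)
    if match is None:
        return raw.strip(), None
    return raw[: match.start()].strip(), match.group(1)
```

The main number text that comes back is still unformatted input, so it would go through the stripping step next.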
Step 3 — Apply country code logic
After stripping, you have a sequence of digits, possibly with a leading +. Apply the following rules to infer the country code:
- If the number starts with + (preserved in Step 1), it already has a country code. Parse it against the ITU-T E.164 country code table.
- If the number has 11 digits starting with 1, it's a NANP (North American) number with country code. Treat the 10 digits after the leading 1 as the national number and store the result as +1 followed by those 10 digits.
- If the number has exactly 10 digits and you know your data is US/Canada-only, prepend +1.
- If the number has fewer than 7 digits or more than 15 digits, mark it as invalid.
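The rules above can be sketched in Python roughly as follows. The function name `apply_country_code` and the `assume_nanp` flag are hypothetical, and the sketch only checks length for numbers that already carry a +; it does not look the country code up in the E.164 table.

```python
from typing import Optional

def apply_country_code(stripped: str, assume_nanp: bool = True) -> Optional[str]:
    """Return an E.164 string, or None when the number should be flagged.

    `stripped` is digits with an optional leading '+', as produced by Step 1.
    """
    if stripped.startswith("+"):
        digits = stripped[1:]
        # Already has a country code; only the overall length is checked here.
        if digits.isdigit() and 7 <= len(digits) <= 15:
            return stripped
        return None
    if not stripped.isdigit() or not 7 <= len(stripped) <= 15:
        return None
    if len(stripped) == 11 and stripped.startswith("1"):
        # NANP number written with its country code but no '+'.
        return "+" + stripped
    if len(stripped) == 10 and assume_nanp:
        # Only safe when the dataset is known to be US/Canada-only.
        return "+1" + stripped
    # Length is plausible but the country cannot be inferred.
    return None
```

Returning `None` instead of guessing keeps ambiguous records visible so they can be reviewed rather than silently stored with the wrong country code.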
Step 4 — Validate the result
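A valid E.164 number has a valid country code prefix and at most 15 digits in total, country code included. The standard sets no minimum, so the 7-digit floor used here is a practical lower bound rather than a rule from the specification. After normalization, validate against these criteria. Numbers that don't pass should be flagged, not silently discarded, because you want to know which records have unusable phone fields.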
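A minimal sketch of flag-don't-discard validation is shown below. The names `validation_error` and `partition` are illustrative, and the check is purely structural: it does not confirm that the country code is actually assigned.

```python
import re
from typing import List, Optional, Tuple

# '+', a non-zero first digit, then 6 to 14 more digits (7 to 15 in total).
E164_SHAPE = re.compile(r"\+[1-9][0-9]{6,14}")

def validation_error(candidate: str) -> Optional[str]:
    """Return None for a plausible E.164 string, otherwise a reason."""
    if not candidate.startswith("+"):
        return "missing + and country code"
    if not E164_SHAPE.fullmatch(candidate):
        return "expected 7 to 15 digits with a non-zero first digit"
    return None

def partition(candidates: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split candidates into valid values and (value, reason) pairs."""
    valid: List[str] = []
    flagged: List[Tuple[str, str]] = []
    for candidate in candidates:
        reason = validation_error(candidate)
        if reason is None:
            valid.append(candidate)
        else:
            flagged.append((candidate, reason))
    return valid, flagged
```

Keeping the reason next to each flagged value makes it easier to report which records need attention and why.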
Common Edge Cases That Break Naive Scripts
Toll-free numbers
US toll-free prefixes (800, 888, 877, 866, 855, 844, 833; 822 is reserved for future toll-free use) are valid 10-digit NANP numbers but often can't be used for SMS or outbound calling, depending on your platform. Flag these separately from regular numbers.
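Flagging toll-free numbers can be as simple as a set lookup on the area code, as in this sketch. The name `is_toll_free` is hypothetical; the set holds the prefixes in service, so add 822 if your policy also treats reserved codes as toll-free.

```python
# Toll-free area codes in service. 822 is left out of this sketch on the
# assumption that reserved-but-unused codes should not match.
TOLL_FREE_AREA_CODES = frozenset(
    {"800", "888", "877", "866", "855", "844", "833"}
)

def is_toll_free(e164: str) -> bool:
    """True when a +1 number's area code is a toll-free prefix."""
    return (
        len(e164) == 12
        and e164.startswith("+1")
        and e164[2:5] in TOLL_FREE_AREA_CODES
    )
```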
Premium rate numbers
Numbers starting with 900 (US), 0900 (various EU countries), or 09 (UK) are premium rate numbers that most businesses should never be calling or texting. Flag and exclude these automatically.
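A hedged sketch of the premium-rate check on E.164 values follows. It covers only the US 900 and UK 09 cases named above; the 0900 ranges in other countries would need per-country prefixes that this sketch does not attempt to list. The name `is_premium_rate` is illustrative.

```python
def is_premium_rate(e164: str) -> bool:
    """Flag US 900 numbers and UK 09 numbers in E.164 form."""
    # US/Canada: +1 900 XXX XXXX
    if len(e164) == 12 and e164.startswith("+1900"):
        return True
    # UK: the national trunk 0 is dropped in E.164, so 09... becomes +449...
    if e164.startswith("+449"):
        return True
    return False
```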
Fake test numbers
Common fake numbers like 555-0100 through 555-0199 (the "fictional" range in US/Canada), 07700 900000–900999 (UK fictional range), or simple patterns like 5555555555 should be flagged as likely invalid.
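The checks described here could be sketched as below, working on E.164 values. The function name `looks_fake` is an assumption, and the repeated-digit test is applied to +1 numbers only to keep the example short.

```python
import re

# +1, any area code, then 555-01XX (the reserved fictional block).
NANP_FICTIONAL = re.compile(r"\+1[0-9]{3}55501[0-9]{2}")
# UK 07700 900000 to 900999 in E.164 form (trunk 0 dropped).
UK_FICTIONAL = re.compile(r"\+447700900[0-9]{3}")

def looks_fake(e164: str) -> bool:
    """Flag reserved fictional ranges and single-digit repeats."""
    if NANP_FICTIONAL.fullmatch(e164) or UK_FICTIONAL.fullmatch(e164):
        return True
    if len(e164) == 12 and e164.startswith("+1"):
        national = e164[2:]
        # Ten identical digits, e.g. 5555555555.
        return len(set(national)) == 1
    return False
```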
International numbers in US-only datasets
If your data is supposed to be US-only but you encounter an 11- or 12-digit number starting with 44 (the UK country code), it's likely a UK number that was entered without the + prefix. A 10-digit number starting with 44 is a different case: it is more likely a US number in an area code such as 440 or 443. Your normalization logic should handle the ambiguous cases, either by prompting for confirmation or by applying a heuristic based on the country code probability.
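One simple heuristic, sketched below, keys on length as well as prefix. It assumes UK numbers written with their country code but no + come out at 11 or 12 digits, and it returns a label for review rather than rewriting the number. The name `guess_region` and the labels are hypothetical.

```python
def guess_region(digits: str) -> str:
    """Classify a digits-only string that arrived without a '+'."""
    if not digits.isdigit():
        return "unknown"
    if digits.startswith("44") and len(digits) in (11, 12):
        # Too long for a bare NANP number; probably UK missing its '+'.
        return "likely-uk"
    if len(digits) == 10:
        # Includes US area codes that begin with 44, such as 440 or 443.
        return "likely-nanp"
    if len(digits) == 11 and digits.startswith("1"):
        return "likely-nanp"
    return "unknown"
```

Records labeled "likely-uk" or "unknown" in a US-only dataset are good candidates for a confirmation prompt or manual review.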
Putting It Into Practice
The cleanest implementation of phone normalization stores two fields per phone number in your database: the E.164 canonical value (used for deduplication, API calls, and indexing) and the display-formatted value (used in UI and exports). The canonical value should never change unless you're correcting an error; the display format can be regenerated at any time from the canonical.
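As a sketch of the two-field layout, the record below stores both values and derives the display string from the canonical one. The class and helper names are illustrative, and the display helper only knows the +1 format.

```python
from dataclasses import dataclass

def _display_for(e164: str) -> str:
    """Derive a display string; only 10-digit +1 numbers are reformatted."""
    if len(e164) == 12 and e164.startswith("+1"):
        return f"({e164[2:5]}) {e164[5:8]}-{e164[8:]}"
    return e164

@dataclass(frozen=True)
class PhoneRecord:
    e164: str      # canonical value: deduplication, API calls, indexing
    display: str   # derived value: UI and exports

    @classmethod
    def from_e164(cls, e164: str) -> "PhoneRecord":
        """Build a record whose display value is regenerated from e164."""
        return cls(e164=e164, display=_display_for(e164))
```

Making the record immutable and building it only through `from_e164` keeps the two fields from drifting apart.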
Phone normalization is not a one-time cleanup task. It's an ongoing validation that should run at every entry point. A normalized database that receives un-normalized inputs will be dirty again within weeks.
If you have an existing dataset to clean, our free tool handles phone normalization automatically as part of the cleaning pass — no code required. For ongoing normalization as new records enter your system, our Growth plan's scheduled job feature can apply normalization rules on any incoming feed automatically.
About the author

Dan Okafor
Senior Data Engineer
Dan specializes in CRM data hygiene and B2B sales data architecture. He has helped over 50 companies clean and unify their contact databases.
