Product catalog deduplication is one of the most underestimated problems in e-commerce operations. Unlike contact deduplication — where you're looking for two records that represent the same person — catalog deduplication requires you to recognize that the same physical product can appear in your system under dozens of different identifiers, names, and formats.
This guide covers four proven strategies in order of increasing sophistication, with concrete examples of when to use each. Whether you're managing a 500-SKU boutique catalog or an aggregated 2-million-row marketplace feed, at least one of these approaches will apply.
Why Product Duplicates Are So Damaging
Before diving into strategies, it's worth understanding the specific ways catalog duplicates harm your business. It's not just messy data — each duplicate creates compounding downstream problems:
- Inventory inflation. If a product exists under two internal IDs, your inventory management system may report twice the stock you actually have. This causes overselling and broken fulfillment promises.
- Pricing inconsistency. Two records for the same product often carry different prices, especially when loaded from multiple supplier feeds. Customers who discover price inconsistencies lose trust immediately.
- Broken analytics. When sales are attributed to two different product IDs, your category performance reports, reorder forecasting, and margin analysis all produce incorrect numbers.
- SEO damage. Duplicate product pages with identical or near-identical content hurt search rankings and split link equity.
Strategy 1 — Exact SKU Matching
The simplest and most reliable deduplication strategy is exact matching on a canonical identifier — typically a SKU, GTIN, UPC, or EAN barcode. If two records share the same identifier, they are the same product by definition.
This sounds obvious, but implementation has a few subtleties that teams consistently miss:
- SKUs are often stored with inconsistent leading zeros. 0012345 and 12345 may be the same product. Normalize to a consistent numeric format before matching.
- GTIN-13 and GTIN-12 (UPC) can both identify the same product. A GTIN-12 is a GTIN-13 with the leading zero stripped. Your matching logic needs to handle this.
- Some suppliers pad or truncate barcodes in their feeds. Strip all non-numeric characters and normalize length before comparison (see the sketch after this list).
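As a concrete first pass, here is a minimal normalization sketch in Python. The helper name normalize_gtin and the exact rules are assumptions for illustration; the essential moves are stripping non-numeric characters and padding 12-digit codes so GTIN-12 and GTIN-13 forms compare equal. Plain SKUs may need a different rule (for example, stripping leading zeros instead).

```python
def normalize_gtin(raw: str) -> str | None:
    """Illustrative normalizer: strip non-digits, pad GTIN-12 to GTIN-13."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if not digits:
        return None  # no usable identifier in this field
    if len(digits) == 12:
        digits = "0" + digits  # GTIN-12 (UPC) = GTIN-13 minus its leading zero
    return digits

# Both variants collapse to the same canonical key:
print(normalize_gtin("0012345000058"))    # '0012345000058'
print(normalize_gtin(" 012345-000058 "))  # '0012345000058' (12 digits, padded)
```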
Exact SKU matching is fast, deterministic, and auditable. It should always be your first pass. The limitation is that it only catches duplicates where the canonical identifier is present and consistent — which in practice is never 100% of your catalog.
Strategy 2 — Normalized Title Matching
When SKUs are missing or inconsistent, product title matching is the next line of defense. But raw title matching fails almost immediately because product titles vary enormously across suppliers and internal data entry:
- "Nike Air Max 270 Men's Running Shoe Size 10 Black/White"
- "NIKE AIRMAX 270 - BLACK WHITE - SZ 10 - MENS"
- "Nike Air Max 270 (Black, White) - Men 10"
These are clearly the same product. A raw string comparison gives you zero matches. Normalized title matching involves three pre-processing steps before comparison:
- Lowercase and strip punctuation. Remove all hyphens, slashes, parentheses, and special characters. Convert to lowercase.
- Expand or normalize common abbreviations. "sz" becomes "size", "mens" becomes "men's", and brand-specific abbreviations are expanded.
- Extract and separate structured attributes. Pull out size, color, and gender into separate fields so they can be compared independently from the base product name.
After normalization, a simple token similarity score (Jaccard similarity or cosine similarity on word tokens) will identify most title duplicates with high precision.
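Here is a minimal sketch of those three steps plus a Jaccard comparison, in Python. The abbreviation table is a toy assumption; a real one would be built from your catalog's vocabulary:

```python
import re

# Hypothetical abbreviation table; a production catalog needs a much larger,
# domain-specific mapping.
ABBREVIATIONS = {"sz": "size", "mens": "men"}

def title_tokens(title: str) -> set[str]:
    """Lowercase, strip punctuation, expand abbreviations, return word tokens."""
    text = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    # Drop single-character noise tokens such as the "s" left over from "men's".
    return {ABBREVIATIONS.get(tok, tok) for tok in text.split() if len(tok) > 1}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared tokens over all distinct tokens."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

t1 = title_tokens("Nike Air Max 270 Men's Running Shoe Size 10 Black/White")
t2 = title_tokens("NIKE AIRMAX 270 - BLACK WHITE - SZ 10 - MENS")
print(round(jaccard(t1, t2), 2))  # ~0.58: the fused token "airmax" depresses the score
```

Note that the Nike pair scores well below 1.0 because "airmax" and "air max" tokenize differently; folding fused brand terms into the abbreviation table is part of the normalization work, and where you set the match threshold determines your precision.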
Strategy 3 — Attribute-Based Fuzzy Matching
For catalogs where titles vary significantly — particularly multi-supplier feeds where each supplier has its own naming conventions — normalized title matching often isn't enough. Attribute-based matching treats the product as a structured record with individual comparable fields, rather than a single string to compare.
The matching key becomes a composite of extracted attributes:
- Brand name (normalized)
- Model or product line name
- Color (mapped to a canonical color set)
- Size (normalized to a common scale)
- Material or variant descriptor
Two records are candidate duplicates if they match on brand + model + color + size, regardless of how the title is phrased. This approach requires more upfront schema work but produces much higher precision on large, multi-source catalogs.
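To make the composite key concrete, here is a short Python sketch. It assumes attribute extraction has already happened; the Product fields are hypothetical names for the extracted attributes listed above:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    record_id: str
    brand: str   # normalized, e.g. "nike"
    model: str   # normalized product line, e.g. "air max 270"
    color: str   # mapped to a canonical color set, e.g. "black/white"
    size: str    # normalized to a common scale, e.g. "us-10"

def candidate_duplicates(products: list[Product]) -> list[list[Product]]:
    """Group records that share the composite brand+model+color+size key."""
    buckets: defaultdict[tuple, list[Product]] = defaultdict(list)
    for p in products:
        buckets[(p.brand, p.model, p.color, p.size)].append(p)
    return [group for group in buckets.values() if len(group) > 1]
```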
The practical challenge is attribute extraction — pulling "Black/White" out of a free-text title and mapping it to a canonical color value. This is where AI-assisted extraction tools have become genuinely useful in the last two years.
Strategy 4 — Embedding-Based Semantic Matching
For catalogs with highly variable product descriptions — particularly when importing from marketplaces, wholesale directories, or scraped supplier websites — the previous three strategies may still leave a long tail of undetected duplicates. Embedding-based semantic matching is the most powerful approach, but also the most computationally intensive.
The idea is simple: convert each product record (title + description + attributes) into a high-dimensional vector using a text embedding model. Products that are semantically similar will have vectors that are close together in that space, regardless of the specific words used.
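Here is a minimal end-to-end sketch using the open-source sentence-transformers library. The model name and the 0.80 cutoff are assumptions for illustration (the cutoff mirrors the cascade tiers later in this guide):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Model choice is illustrative; any general-purpose text embedding model works similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "Nike Air Max 270 Men's Running Shoe Size 10 Black/White",
    "NIKE AIRMAX 270 - BLACK WHITE - SZ 10 - MENS",
    "Adidas Ultraboost 22 Women's Size 8 Core Black",
]

# normalize_embeddings=True returns unit-length vectors, so a dot product
# between two rows is their cosine similarity.
vectors = model.encode(records, normalize_embeddings=True)
similarity = vectors @ vectors.T

# Pairs above a tuned threshold become candidates for review.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if similarity[i, j] > 0.80:
            print(f"{records[i]} <-> {records[j]}: {similarity[i, j]:.2f}")
```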
In practice, embedding-based matching works well for:
- Identifying the same product across different language variants in multilingual catalogs
- Matching products where one description is very detailed and another is minimal
- Catching near-duplicate variant products (same base product, different minor configuration)
The trade-off: embedding matching surfaces many candidate pairs that require human review to confirm. It's best used as a second pass after exact and fuzzy matching have handled the high-confidence cases.
Building a Practical Deduplication Pipeline
In a production catalog, these four strategies work best when applied in sequence as a cascade:
- Exact SKU/GTIN match → auto-merge with no review needed
- Normalized title match with similarity > 0.95 → auto-merge
- Attribute-based match with score > 0.85 → review queue
- Embedding similarity with score > 0.80 → manual review
Each tier has a confidence level that determines whether the merge is automatic or requires a human decision. This keeps your auto-merge precision high while still surfacing the hard cases for human judgment.
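In code, the cascade reduces to a sequence of threshold checks. This sketch assumes the four scores for each candidate pair have already been computed by the strategies above; the names and dataclass shape are illustrative:

```python
from dataclasses import dataclass

@dataclass
class PairScores:
    """Precomputed signals for one candidate pair, one per strategy."""
    same_gtin: bool        # Strategy 1: exact identifier match
    title_sim: float       # Strategy 2: normalized title similarity
    attribute_sim: float   # Strategy 3: composite attribute score
    embedding_sim: float   # Strategy 4: embedding cosine similarity

def route(s: PairScores) -> str:
    """Cascade: the highest-confidence signal decides the outcome."""
    if s.same_gtin:
        return "auto-merge"
    if s.title_sim > 0.95:
        return "auto-merge"
    if s.attribute_sim > 0.85:
        return "review-queue"
    if s.embedding_sim > 0.80:
        return "manual-review"
    return "no-match"

print(route(PairScores(False, 0.97, 0.60, 0.90)))  # -> auto-merge
```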
The goal of catalog deduplication isn't perfection — it's reducing the error rate to a level where the business impact becomes negligible. For most catalogs, getting from 85% accuracy to 98% accuracy delivers the vast majority of the operational benefit.
Getting Started
If you're approaching catalog deduplication for the first time, start with a simple analysis: export your full product list with titles, SKUs, and GTINs to a CSV, then run it through our free duplicate detection tool. It will give you an instant read on your duplicate rate and show you which records are the most likely candidates — without requiring any engineering work on your end.
About the author

Maya Chen
Head of Data Partnerships
Maya has spent 10 years building data pipelines for e-commerce and retail brands. She writes about practical data quality strategies for operations teams.
