How it works

Every run is two stages: a deterministic matching engine clusters likely duplicates, then an AI agent reviews each cluster and explains its verdict.

The two stages are kept separate deliberately. The deterministic engine does the matching, so that part is repeatable and fast; the AI agent does the judgment and the explaining, so that is where the "why" comes from.

Stage 1 — deterministic matching

When your CSV lands, the matching engine (GoldenMatch) groups rows that look like the same real-world entity. It does not run a blind, zero-config pass. Instead, it reads your columns and builds a curated, column-aware config for your file:

  • Field weights. Discriminating fields (surname, email, national identifiers) count for more than weak ones (city, company). Provenance columns like record_id, id, uuid, and a source label are excluded from matching entirely.
  • Multi-pass blocking. Rather than compare every row to every other row, the engine builds candidate blocks several ways at once (a phonetic pass on the strongest name field, exact passes on strong identifiers like email and postcode) and unions them, so a typo in one field doesn't drop a true match.
  • A scored threshold. Each candidate pair gets a similarity score; pairs over the acceptance threshold are linked into the same group.

The output is a set of clusters. Groups with two or more rows are the proposed duplicates; everything else is unique.
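
The three ingredients above can be sketched in a few lines of Python. This is an illustration only, not GoldenMatch's actual code: the weights, the 0.8 threshold, the crude phonetic_key helper, and the column names are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights: discriminating fields count for more than weak ones.
# record_id is not listed, so it never takes part in matching.
WEIGHTS = {"surname": 0.4, "email": 0.4, "city": 0.2}
THRESHOLD = 0.8  # assumed acceptance threshold

def phonetic_key(name):
    """Crude stand-in for a phonetic code: first letter plus consonant skeleton."""
    s = name.strip().lower()
    return s[:1] + "".join(c for c in s[1:] if c.isalpha() and c not in "aeiouyhw")

def candidate_pairs(rows):
    """Union of several blocking passes, so a typo in one field cannot hide a match."""
    passes = [
        lambda r: phonetic_key(r["surname"]),   # phonetic pass on the name field
        lambda r: r["email"].strip().lower(),   # exact pass on a strong identifier
    ]
    pairs = set()
    for key_of in passes:
        blocks = defaultdict(list)
        for i, row in enumerate(rows):
            key = key_of(row)
            if key:
                blocks[key].append(i)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs

def similarity(field, a, b):
    a, b = a.strip().lower(), b.strip().lower()
    if field == "email":                        # identifiers are compared exactly
        return 1.0 if a and a == b else 0.0
    return SequenceMatcher(None, a, b).ratio()  # fuzzy comparison for free text

def score(a, b):
    return sum(w * similarity(f, a[f], b[f]) for f, w in WEIGHTS.items())

def cluster(rows):
    """Link pairs at or above the threshold; return groups of two or more rows."""
    parent = list(range(len(rows)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in candidate_pairs(rows):
        if score(rows[i], rows[j]) >= THRESHOLD:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(rows)):
        groups[find(i)].append(i)
    return sorted(g for g in groups.values() if len(g) >= 2)

rows = [
    {"record_id": "a1", "surname": "Smith", "email": "j.smith@example.com", "city": "Leeds"},
    {"record_id": "b7", "surname": "Smyth", "email": "j.smith@example.com", "city": "Leeds"},
    {"record_id": "c3", "surname": "Smith", "email": "anna.smith@example.org", "city": "Leeds"},
    {"record_id": "d9", "surname": "Jones", "email": "t.jones@example.com", "city": "York"},
]
print(cluster(rows))  # -> [[0, 1]]
```

In this toy data, rows 0 and 1 are linked despite the surname typo because the email pass and the phonetic pass both propose the pair and the weighted score clears the threshold. Row 2 shares a surname and a city with them, but those weaker signals alone do not reach 0.8, so it stays unique.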

Why curated, not zero-config

A blind zero-config pass tends to over-merge multi-source spreadsheets (it leans on things like an exact phone match or the source label and collapses distinct people). Routing every run through one curated, column-aware config is what keeps precision high on real, messy data.

Stage 2 — the AI review agent

The clusters from stage 1 are good, but they're still a machine's guess. Stage 2 hands each duplicate group to an AI agent (Gemini) that reviews it like a careful human would and returns a structured verdict per group:

  • a verdict: confirmed, uncertain, or rejected,
  • a confidence: high, medium, or low,
  • a one-sentence, plain-language explanation.

rejected is the important one: it's the agent overriding the engine when it thinks two genuinely different entities were merged (for example, two different people who merely share a last name and a city). Confirmed merges, low-confidence groups, and rejected over-merges are all surfaced so you can trust the result instead of taking it on faith. See AI merge review for the full detail.
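
For readers who want to picture the per-group verdict as data, here is a minimal sketch. The class name, field names, and payload shape are assumptions for illustration; only the three verdict values, the three confidence levels, and the one-sentence explanation come from the description above.

```python
from dataclasses import dataclass

VERDICTS = ("confirmed", "uncertain", "rejected")
CONFIDENCES = ("high", "medium", "low")

@dataclass(frozen=True)
class GroupVerdict:
    rows: tuple        # row indices of the group proposed by stage 1
    verdict: str       # one of VERDICTS
    confidence: str    # one of CONFIDENCES
    explanation: str   # one plain-language sentence

    def __post_init__(self):
        # Reject anything outside the documented vocabularies.
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if self.confidence not in CONFIDENCES:
            raise ValueError(f"unknown confidence: {self.confidence!r}")

def parse_verdict(payload):
    """Build a validated verdict from a dict (hypothetical payload shape)."""
    return GroupVerdict(
        rows=tuple(payload["rows"]),
        verdict=payload["verdict"],
        confidence=payload["confidence"],
        explanation=payload["explanation"].strip(),
    )

def by_verdict(verdicts):
    """Bucket verdicts so confirmed, uncertain, and rejected groups can be shown separately."""
    buckets = {name: [] for name in VERDICTS}
    for v in verdicts:
        buckets[v.verdict].append(v)
    return buckets

example = parse_verdict({
    "rows": [0, 2],
    "verdict": "rejected",
    "confidence": "high",
    "explanation": "Same last name and city, but different first names and emails.",
})
print(by_verdict([example])["rejected"][0].explanation)
```

Validating the two vocabularies at construction time means a malformed agent response fails loudly instead of being displayed as if it were a real verdict.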

How the two stages fit together

your.csv
   │
   ▼
[ Stage 1: GoldenMatch ]  → duplicate clusters (groups of row indices)
   │
   ▼
[ Stage 2: AI agent ]     → per-cluster verdict + confidence + explanation
   │
   ▼
clean.csv  (built in your browser — one survivor per group)

Stage 2 is a progressive enhancement: the clusters render as soon as stage 1 finishes, and the agent's verdicts fill in a moment later. If the AI layer is unavailable, you still get the engine's groups and a clean file — just without the per-group explanations.
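
The fallback behaviour can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: the run and build_clean functions are hypothetical, the match and review callables stand in for the two stages, and keeping the lowest row index as the survivor is an assumed rule, since this page does not say how the survivor is chosen.

```python
def build_clean(rows, groups):
    """Keep one survivor per group (assumed here: the lowest row index)."""
    dropped = {i for group in groups for i in sorted(group)[1:]}
    return [row for i, row in enumerate(rows) if i not in dropped]

def run(rows, match, review=None):
    """Stage 1 always runs; stage 2 is optional and allowed to fail."""
    groups = match(rows)                 # deterministic clusters
    verdicts = None
    if review is not None:
        try:
            verdicts = review(rows, groups)
        except Exception:
            verdicts = None              # AI layer unavailable: keep going without it
    return {
        "groups": groups,
        "verdicts": verdicts,            # None means "no explanations available"
        "clean": build_clean(rows, groups),
    }

def failing_review(rows, groups):
    raise RuntimeError("AI layer unavailable")

rows = ["Ann Lee", "Bob Ray", "Ann  Lee"]
result = run(rows, match=lambda r: [[0, 2]], review=failing_review)
print(result["clean"])     # -> ['Ann Lee', 'Bob Ray']
print(result["verdicts"])  # -> None
```

The clean file depends only on the engine's groups in this sketch, which is why it is still produced when the review step raises.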
