How it works
Every run is two stages: a deterministic matching engine clusters likely duplicates, then an AI agent reviews each cluster and explains its verdict.
A deterministic engine does the matching, and an AI agent does the judging and the explaining. Keeping these separate is deliberate: the matching is repeatable and fast, and the AI layer is where the "why" comes from.
Stage 1 — deterministic matching
When your CSV lands, the matching engine (GoldenMatch) groups rows that look like the same real-world entity. It does not run a blind, zero-config pass. Instead, it reads the columns and builds a curated, column-aware config for your file:
- Field weights. Discriminating fields (surname, email, national identifiers) count for more than weak ones (city, company). Provenance columns like record_id, id, uuid, and a source label are excluded from matching entirely.
- Multi-pass blocking. Rather than compare every row to every other row, the engine builds candidate blocks several ways at once (a phonetic pass on the strongest name field, exact passes on strong identifiers like email and postcode) and unions them, so a typo in one field doesn't drop a true match.
- A scored threshold. Each candidate pair gets a similarity score; pairs over the acceptance threshold are linked into the same group.
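To make the three ideas above concrete, here is a minimal sketch in plain Python. It is not GoldenMatch's code or API: the weights, the 0.85 threshold, the column names, and the crude `phonetic_key` helper are all assumptions chosen for illustration, and `difflib` stands in for whatever similarity measure the engine really uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights and threshold, for illustration only.
# Provenance columns (record_id, id, uuid, source) get no weight at all,
# so they never influence a score.
WEIGHTS = {"surname": 0.4, "email": 0.4, "city": 0.1, "company": 0.1}
THRESHOLD = 0.85


def phonetic_key(name: str) -> str:
    """Crude stand-in for a phonetic code: first letter plus consonant skeleton."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    for c in letters[1:]:
        if c not in "aeiouyhw" and c != out[-1]:
            out.append(c)
    return "".join(out)


def candidate_pairs(rows):
    """Union of several blocking passes, returned as pairs of row indices."""
    passes = [
        lambda r: phonetic_key(r.get("surname", "")),              # phonetic pass
        lambda r: r.get("email", "").strip().lower(),              # exact pass
        lambda r: r.get("postcode", "").replace(" ", "").upper(),  # exact pass
    ]
    pairs = set()
    for key_of in passes:
        blocks = defaultdict(list)
        for i, row in enumerate(rows):
            key = key_of(row)
            if key:  # empty values never form a block
                blocks[key].append(i)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


def score(a, b) -> float:
    """Weighted average of string similarity over fields both rows fill in."""
    total = weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        x = a.get(field, "").strip().lower()
        y = b.get(field, "").strip().lower()
        if x and y:
            total += weight * SequenceMatcher(None, x, y).ratio()
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def linked_pairs(rows):
    """Candidate pairs whose score reaches the acceptance threshold."""
    return sorted(
        (i, j) for i, j in candidate_pairs(rows)
        if score(rows[i], rows[j]) >= THRESHOLD
    )
```

Note how the postcode pass puts rows that merely share a postcode into the same block, but the score still has to clear the threshold before they are linked. Blocking only decides which pairs get compared, not which pairs match.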
The output is a set of clusters. Groups with two or more rows are the proposed duplicates; everything else is unique.
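One common way to turn linked pairs into clusters is a union-find pass, sketched below. Whether GoldenMatch links pairs transitively like this is an assumption; the function name and the return shape are hypothetical.

```python
def cluster(n_rows, pairs):
    """Group row indices connected by linked pairs (union-find).

    Returns (duplicates, unique): duplicates is a list of groups with two
    or more row indices, unique is a list of row indices left on their own.
    """
    parent = list(range(n_rows))

    def find(i):
        # Walk to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as the root so output is stable.
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n_rows):
        groups.setdefault(find(i), []).append(i)

    duplicates = [g for g in groups.values() if len(g) >= 2]
    unique = [g[0] for g in groups.values() if len(g) == 1]
    return duplicates, unique
```

For example, `cluster(5, [(0, 1), (1, 3)])` puts rows 0, 1, and 3 in one proposed duplicate group and leaves rows 2 and 4 as unique, even though rows 0 and 3 were never linked directly. That transitive behaviour is one reason a second, judgment-based review of each group is useful.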
A blind zero-config pass tends to over-merge multi-source spreadsheets (it leans on things like an exact phone match or the source label and collapses distinct people). Routing every run through one curated, column-aware config is what keeps precision high on real, messy data.
Stage 2 — the AI review agent
The clusters from stage 1 are good, but they're still a machine's guess. Stage 2 hands each duplicate group to an AI agent (Gemini) that reviews it like a careful human would and returns a structured verdict per group:
- a verdict: confirmed, uncertain, or rejected,
- a confidence: high, medium, or low,
- a one-sentence, plain-language explanation.
rejected is the important one: it's the agent overriding the engine when it thinks two genuinely different entities were merged (for example, two different people who merely share a last name and a city). Confirmed merges, low-confidence groups, and rejected over-merges are all surfaced so you can trust the result instead of taking it on faith. See AI merge review for the full detail.
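If you want to picture the per-group verdict as data, the sketch below models it and validates a JSON response. The field names, the `group_id` key, and the JSON shape are assumptions for illustration; only the three verdict values and three confidence levels come from the description above.

```python
import json
from dataclasses import dataclass

VERDICTS = {"confirmed", "uncertain", "rejected"}
CONFIDENCES = {"high", "medium", "low"}


@dataclass(frozen=True)
class GroupVerdict:
    group_id: int      # hypothetical identifier for the cluster under review
    verdict: str       # one of VERDICTS
    confidence: str    # one of CONFIDENCES
    explanation: str   # one plain-language sentence

    @property
    def overrides_engine(self) -> bool:
        """True when the agent thinks the engine merged different entities."""
        return self.verdict == "rejected"


def parse_verdict(raw: str) -> GroupVerdict:
    """Parse one JSON verdict, rejecting values outside the allowed sets."""
    data = json.loads(raw)
    if data.get("verdict") not in VERDICTS:
        raise ValueError(f"unknown verdict: {data.get('verdict')!r}")
    if data.get("confidence") not in CONFIDENCES:
        raise ValueError(f"unknown confidence: {data.get('confidence')!r}")
    explanation = str(data.get("explanation", "")).strip()
    if not explanation:
        raise ValueError("missing explanation")
    return GroupVerdict(
        group_id=int(data["group_id"]),
        verdict=data["verdict"],
        confidence=data["confidence"],
        explanation=explanation,
    )
```

Validating against closed sets keeps a malformed model response from silently showing up as a verdict.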
How the two stages fit together
your.csv
    │
    ▼
[ Stage 1: GoldenMatch ]  →  duplicate clusters (groups of row indices)
    │
    ▼
[ Stage 2: AI agent ]     →  per-cluster verdict + confidence + explanation
    │
    ▼
clean.csv  (built in your browser, one survivor per group)
Stage 2 is a progressive enhancement: the clusters render as soon as stage 1 finishes, and the agent's verdicts fill in a moment later. If the AI layer is unavailable, you still get the engine's groups and a clean file — just without the per-group explanations.
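The last step and the fallback behaviour can be sketched as follows. The real clean file is built in your browser, so this Python version only illustrates the logic. Keeping the lowest row index as the survivor is an assumption (the actual survivor rule is not described here), and the function names and the `verdicts` mapping are hypothetical.

```python
import csv
import io


def build_clean_rows(rows, groups):
    """Keep one survivor per duplicate group plus every unique row.

    Assumption: the survivor is the row with the lowest index in its group.
    """
    dropped = {i for group in groups for i in sorted(group)[1:]}
    return [row for i, row in enumerate(rows) if i not in dropped]


def to_csv(rows) -> str:
    """Serialize rows (dicts sharing the same keys) as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def group_summaries(groups, verdicts=None):
    """Pair each group with its explanation, or None if the AI layer is absent.

    `verdicts` maps a group's position in `groups` to an explanation string.
    """
    verdicts = verdicts or {}
    return [
        {"rows": group, "explanation": verdicts.get(n)}
        for n, group in enumerate(groups)
    ]
```

Because `build_clean_rows` depends only on the groups from stage 1, the clean file can be produced whether or not any verdicts ever arrive.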