Dedupe endpoint

POST /api/dedupe — upload a CSV and get back duplicate clusters plus a summary. Multipart request, optional auth, structured 402 when a file is over the plan row cap.

The dedupe endpoint takes a CSV upload and returns the duplicate clusters the engine found plus a small summary. It's synchronous and in-process: the response comes back in the same request. This is the same endpoint the dedupe tool calls.

POST/api/dedupeoptional auth

Upload a CSV; get back clusters and a summary.

Authentication

Optional. Send a Clerk session token as a bearer header to be billed at your plan's row cap:

Authorization: Bearer <clerk_session_token>

Without a token the caller is treated as anonymous and gets the Free cap. An authenticated Pro caller gets the Pro cap.

Request

A multipart/form-data body with a single field, file, containing the CSV.

curl -X POST https://api.bensevern.dev/api/dedupe \
  -F "file=@customers.csv"
FieldTypeRequiredNotes
filefile (CSV)yesThe list to dedupe. Header row expected.

Row caps. The per-run row cap is enforced after parsing: Free (and anonymous) is 1,000 rows, Pro is 100,000. Regardless of plan, the endpoint will not pull more than 100,000 rows into a single run; a larger file is truncated to that ceiling and a notice is added to warnings.

Rate limit. 30 requests per hour.

Response — 200

{
  "summary": {
    "total_rows": 1000,
    "duplicate_groups": 87,
    "records_in_groups": 214,
    "duplicates_removed": 127,
    "wall_ms": 412,
    "sample_merges": [
      { "cluster_id": 0, "member_count": 3, "members": [4, 19, 240] }
    ]
  },
  "clusters": [
    { "cluster_id": 0, "members": [4, 19, 240] },
    { "cluster_id": 1, "members": [7, 88] }
  ],
  "golden": [
    { "name": "Acme Inc", "email": "info@acme.com" }
  ],
  "warnings": []
}
FieldTypeMeaning
summary.total_rowsintRows the engine processed (post-truncation). Also the export bound.
summary.duplicate_groupsintNumber of multi-row clusters (the duplicate groups).
summary.records_in_groupsintTotal rows that fell into some duplicate group.
summary.duplicates_removedintRows removed in the cleaned file (records_in_groups − duplicate_groups).
summary.wall_msintEngine wall-clock time in milliseconds.
summary.sample_mergesarrayUp to 20 sample groups for a quick eyeball.
clustersarrayThe duplicate groups. Each is {cluster_id, members}. Capped at 200 in the payload.
clusters[].membersint[]0-based row indices into the processed CSV, in original order.
goldenarrayOne golden record per cluster from the engine, capped at 500.
warningsstring[]Human-readable notices, e.g. a truncation message.

Only multi-row clusters are returned — singletons (unique rows) are not listed. members are row indices, not values; the client maps them back to rows it already parsed.

Errors

StatusBodyWhen
400{ "detail": "<message>" }The upload couldn't be read as a CSV.
402see belowThe file is over the caller's plan row cap.
500{ "detail": "Dedupe failed" }Unexpected engine failure.

A 402 carries a structured quota object so a client can prompt an upgrade instead of failing generically:

{
  "detail": {
    "error": "quota_exceeded",
    "gate": "max_dedupe_rows",
    "limit": 1000,
    "current": 5230,
    "plan": "free"
  }
}

To review and explain the clusters this endpoint returns, pass them to the explain endpoint.

Was this page helpful?
Edit this page on GitHub