Dedupe endpoint

POST /api/dedupe — upload a CSV and get back duplicate clusters plus a summary. Multipart request, optional auth, structured 402 when a file is over the plan row cap.

The dedupe endpoint takes a CSV upload and returns the duplicate clusters the engine found plus a small summary. It's synchronous and in-process: the response comes back in the same request. This is the same endpoint the dedupe tool calls.

Upload a CSV; get back clusters and a summary.

Authentication

Optional. Send a Clerk session token as a bearer header to be billed at your plan's row cap:

Authorization: Bearer <clerk_session_token>

Without a token the caller is treated as anonymous and gets the Free cap. An authenticated Pro caller gets the Pro cap.

Request

A multipart/form-data body with a single field, file, containing the CSV.

curl -X POST https://api.bensevern.dev/api/dedupe \
  -F "file=@customers.csv"

Field	Type	Required	Notes
`file`	file (CSV)	yes	The list to dedupe. Header row expected.

Row caps. The per-run row cap is enforced after parsing: Free (and anonymous) is 1,000 rows, Pro is 100,000. Regardless of plan, the endpoint will not pull more than 100,000 rows into a single run; a larger file is truncated to that ceiling and a notice is added to warnings.

Rate limit. 30 requests per hour.

Response — 200

{
  "summary": {
    "total_rows": 1000,
    "duplicate_groups": 87,
    "records_in_groups": 214,
    "duplicates_removed": 127,
    "wall_ms": 412,
    "sample_merges": [
      { "cluster_id": 0, "member_count": 3, "members": [4, 19, 240] }
    ]
  },
  "clusters": [
    { "cluster_id": 0, "members": [4, 19, 240] },
    { "cluster_id": 1, "members": [7, 88] }
  ],
  "golden": [
    { "name": "Acme Inc", "email": "info@acme.com" }
  ],
  "warnings": []
}

Field	Type	Meaning
`summary.total_rows`	int	Rows the engine processed (post-truncation). Also the export bound.
`summary.duplicate_groups`	int	Number of multi-row clusters (the duplicate groups).
`summary.records_in_groups`	int	Total rows that fell into some duplicate group.
`summary.duplicates_removed`	int	Rows removed in the cleaned file (`records_in_groups − duplicate_groups`).
`summary.wall_ms`	int	Engine wall-clock time in milliseconds.
`summary.sample_merges`	array	Up to 20 sample groups for a quick eyeball.
`clusters`	array	The duplicate groups. Each is `{cluster_id, members}`. Capped at 200 in the payload.
`clusters[].members`	int[]	0-based row indices into the processed CSV, in original order.
`golden`	array	One golden record per cluster from the engine, capped at 500.
`warnings`	string[]	Human-readable notices, e.g. a truncation message.

Only multi-row clusters are returned — singletons (unique rows) are not listed. members are row indices, not values; the client maps them back to rows it already parsed.

Errors

Status	Body	When
`400`	`{ "detail": "<message>" }`	The upload couldn't be read as a CSV.
`402`	see below	The file is over the caller's plan row cap.
`500`	`{ "detail": "Dedupe failed" }`	Unexpected engine failure.

A 402 carries a structured quota object so a client can prompt an upgrade instead of failing generically:

{
  "detail": {
    "error": "quota_exceeded",
    "gate": "max_dedupe_rows",
    "limit": 1000,
    "current": 5230,
    "plan": "free"
  }
}

To review and explain the clusters this endpoint returns, pass them to the explain endpoint.

Was this page helpful?

Edit this page on GitHub

NextExplain endpoint