Dedupe endpoint
POST /api/dedupe — upload a CSV and get back duplicate clusters plus a summary. Multipart request, optional auth, structured 402 when a file is over the plan row cap.
The dedupe endpoint takes a CSV upload and returns the duplicate clusters the engine found plus a small summary. It's synchronous and in-process: the response comes back in the same request. This is the same endpoint the dedupe tool calls.
Authentication
Optional. Send a Clerk session token as a bearer header to have your plan's row cap applied:
Authorization: Bearer <clerk_session_token>
Without a token the caller is treated as anonymous and gets the Free cap. An authenticated Pro caller gets the Pro cap.
Request
A multipart/form-data body with a single field, file, containing the CSV.
curl -X POST https://api.bensevern.dev/api/dedupe \
  -F "file=@customers.csv"
| Field | Type | Required | Notes |
|---|---|---|---|
| file | file (CSV) | yes | The list to dedupe. Header row expected. |
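For callers not using curl, the same request can be sketched with Python's standard library. This is a minimal illustration, not the service's official client: the helper name `build_dedupe_request`, the `text/csv` part type, and the default filename are assumptions, while the URL, the `file` field, and the optional bearer header come from the text above.

```python
import uuid
import urllib.request
from typing import Optional

DEDUPE_URL = "https://api.bensevern.dev/api/dedupe"  # from the curl example


def build_dedupe_request(csv_bytes: bytes,
                         filename: str = "customers.csv",
                         token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a multipart POST with a single `file` field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: text/csv\r\n\r\n",  # assumed part type for a CSV upload
        csv_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token:
        # Optional: leave the token out to call anonymously (Free cap).
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(DEDUPE_URL, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` sends the upload; the response body is the JSON documented under Response.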
Row caps. The per-run row cap is enforced after parsing: Free (and anonymous)
is 1,000 rows, Pro is 100,000. Regardless of plan, the endpoint will not pull more
than 100,000 rows into a single run; a larger file is truncated to that ceiling
and a notice is added to warnings.
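A client can estimate the outcome before uploading. The sketch below is a guess at the rules as described, not the server's code: it assumes truncation to the 100,000-row ceiling happens before the plan cap is checked, that the Pro plan is identified as `"pro"` (only `"free"` appears in this page), and the function name `precheck` is hypothetical.

```python
FREE_CAP = 1_000        # Free and anonymous callers
PRO_CAP = 100_000       # Pro callers
RUN_CEILING = 100_000   # per-run ceiling, regardless of plan

PLAN_CAPS = {"free": FREE_CAP, "pro": PRO_CAP}  # "pro" key is an assumption


def precheck(row_count: int, plan: str = "free") -> str:
    """Predict the result for a CSV with `row_count` data rows."""
    # Assumed order: the file is truncated first, then the plan cap applies.
    processed = min(row_count, RUN_CEILING)
    if processed > PLAN_CAPS[plan]:
        return "quota_exceeded"   # expect a 402
    if row_count > RUN_CEILING:
        return "truncated"        # expect a 200 with a notice in warnings
    return "ok"
```

Under these assumptions a Pro caller never sees a 402 for row count alone, because the ceiling and the Pro cap are the same number.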
Rate limit. 30 requests per hour.
Response — 200
{
  "summary": {
    "total_rows": 1000,
    "duplicate_groups": 87,
    "records_in_groups": 214,
    "duplicates_removed": 127,
    "wall_ms": 412,
    "sample_merges": [
      { "cluster_id": 0, "member_count": 3, "members": [4, 19, 240] }
    ]
  },
  "clusters": [
    { "cluster_id": 0, "members": [4, 19, 240] },
    { "cluster_id": 1, "members": [7, 88] }
  ],
  "golden": [
    { "name": "Acme Inc", "email": "info@acme.com" }
  ],
  "warnings": []
}
| Field | Type | Meaning |
|---|---|---|
| summary.total_rows | int | Rows the engine processed (post-truncation). Also the export bound. |
| summary.duplicate_groups | int | Number of multi-row clusters (the duplicate groups). |
| summary.records_in_groups | int | Total rows that fell into some duplicate group. |
| summary.duplicates_removed | int | Rows removed in the cleaned file (records_in_groups − duplicate_groups). |
| summary.wall_ms | int | Engine wall-clock time in milliseconds. |
| summary.sample_merges | array | Up to 20 sample groups for a quick eyeball. |
| clusters | array | The duplicate groups. Each is {cluster_id, members}. Capped at 200 in the payload. |
| clusters[].members | int[] | 0-based row indices into the processed CSV, in original order. |
| golden | array | One golden record per cluster from the engine, capped at 500. |
| warnings | string[] | Human-readable notices, e.g. a truncation message. |
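The three group counters are related by simple arithmetic: each group keeps one row, so duplicates_removed is records_in_groups minus duplicate_groups (214 − 87 = 127 in the sample). As an illustration, a hypothetical helper `summarize` can recompute them from a cluster list; it only matches the server's summary when the payload's 200-cluster cap was not reached.

```python
def summarize(clusters):
    """Recompute the group counters from a list of {cluster_id, members}."""
    groups = [c for c in clusters if len(c["members"]) > 1]
    duplicate_groups = len(groups)
    records_in_groups = sum(len(c["members"]) for c in groups)
    return {
        "duplicate_groups": duplicate_groups,
        "records_in_groups": records_in_groups,
        # One row survives per group, so the rest are removed.
        "duplicates_removed": records_in_groups - duplicate_groups,
    }
```

For the two clusters in the sample response this gives 2 groups, 5 records in groups, and 3 duplicates removed.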
Only multi-row clusters are returned — singletons (unique rows) are not listed.
members are row indices, not values; the client maps them back to rows it
already parsed.
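Mapping indices back to rows might look like the sketch below. It assumes index 0 refers to the first data row after the header, which this page does not state explicitly, and the helper name `rows_for_clusters` is hypothetical.

```python
import csv
import io


def rows_for_clusters(csv_text, clusters):
    """Return {cluster_id: [row dicts]} using the members indices."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)   # header row is expected by the endpoint
    rows = list(reader)     # assumed: index 0 is the first data row
    resolved = {}
    for cluster in clusters:
        resolved[cluster["cluster_id"]] = [
            dict(zip(header, rows[i]))
            for i in cluster["members"]
            if i < len(rows)  # skip indices outside the locally parsed file
        ]
    return resolved
```

If the upload was truncated, parse only the first summary.total_rows rows locally so the indices line up with what the engine processed.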
Errors
| Status | Body | When |
|---|---|---|
| 400 | { "detail": "<message>" } | The upload couldn't be read as a CSV. |
| 402 | see below | The file is over the caller's plan row cap. |
| 500 | { "detail": "Dedupe failed" } | Unexpected engine failure. |
A 402 carries a structured quota object so a client can prompt an upgrade instead
of failing generically:
{
  "detail": {
    "error": "quota_exceeded",
    "gate": "max_dedupe_rows",
    "limit": 1000,
    "current": 5230,
    "plan": "free"
  }
}
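A client might turn that object into an upgrade prompt along these lines. This is a sketch: the function name `quota_message` and the wording of the message are illustrative, and only the fields shown in the 402 body are read.

```python
import json


def quota_message(status, body_text):
    """Return an upgrade prompt for a quota 402, or None for anything else."""
    if status != 402:
        return None
    detail = json.loads(body_text).get("detail")
    # 400 and 500 responses carry a plain string in detail, not an object.
    if not isinstance(detail, dict) or detail.get("error") != "quota_exceeded":
        return None
    return (
        f"This file has {detail['current']:,} rows; the {detail['plan']} plan "
        f"allows {detail['limit']:,} per run."
    )
```

Returning None for every other case lets the caller fall through to its generic error handling.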
To review and explain the clusters this endpoint returns, pass them to the explain endpoint.