Duplicates

Find near-duplicate documents by content coverage, then promote a canonical or disband the group.

Detect duplicate documents

Find near-duplicate clusters in a collection.

Detection compares documents by content coverage rather than exact hash, so different versions of the same packet cluster together. It runs once and persists the resulting groups; re-run it after large uploads. New documents that join a duplicate group are excluded from retrieval until you promote one document in that group as canonical.

dewey duplicates detect my-docs
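
To illustrate why coverage-based comparison catches revisions that an exact hash misses, here is a minimal Python sketch. It is not Dewey's implementation: the word-shingle approach, the `coverage` and `cluster` functions, the shingle size `k`, and the `0.8` threshold are all assumptions chosen for illustration.

```python
import hashlib
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def coverage(a: str, b: str, k: int = 3) -> float:
    """Fraction of the smaller document's shingles that also occur in the other."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def cluster(docs: dict[str, str], threshold: float = 0.8) -> list[set[str]]:
    """Group document ids whose pairwise coverage meets the threshold (union-find)."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if coverage(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    # Only clusters with more than one member count as duplicate groups.
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical documents: v2 is v1 plus an appendix, so the exact hashes differ.
docs = {
    "v1": "the quarterly report covers revenue costs and hiring plans for the year",
    "v2": "the quarterly report covers revenue costs and hiring plans for the year with a new appendix",
    "other": "meeting notes about the office move and parking arrangements",
}

hashes_match = (
    hashlib.sha256(docs["v1"].encode()).hexdigest()
    == hashlib.sha256(docs["v2"].encode()).hexdigest()
)
duplicate_groups = cluster(docs)
```

In this sketch the two versions hash differently, yet every shingle of the shorter version appears in the longer one, so they land in the same group while the unrelated document stays out.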

Promote a canonical or disband

Pick a canonical and exclude the rest, or break up a duplicate group.

Each group is a cluster of documents that share substantial content. Promote one document as canonical to make it the version that retrieval and research return; the others stay in the collection but are excluded from query results. Disband the group entirely if the documents were clustered by mistake.

# Promote a specific doc as the canonical for its group
dewey duplicates resolve grp_xyz doc_a1

# Or break up a mis-clustered group
dewey duplicates dismiss grp_xyz
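
The effect of the two commands above on query results can be modeled with a small Python sketch. This is an illustrative model only, not Dewey's data model or API: `DuplicateGroup`, `promote`, `dismiss`, `query_visible`, and the ids `doc_b2` and `doc_c3` are hypothetical, and the sketch deliberately does not model how unresolved groups are treated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DuplicateGroup:
    members: set[str]
    canonical: Optional[str] = None  # None until a document is promoted


def promote(groups: dict[str, DuplicateGroup], group_id: str, doc_id: str) -> None:
    """Mark doc_id as the canonical document of its group."""
    group = groups[group_id]
    if doc_id not in group.members:
        raise ValueError(f"{doc_id} is not a member of {group_id}")
    group.canonical = doc_id


def dismiss(groups: dict[str, DuplicateGroup], group_id: str) -> None:
    """Disband a group; its documents are no longer treated as duplicates."""
    del groups[group_id]


def query_visible(doc_ids: list[str], groups: dict[str, DuplicateGroup]) -> list[str]:
    """Drop non-canonical members of resolved groups from a result list."""
    hidden: set[str] = set()
    for group in groups.values():
        # Assumption: groups without a canonical are ignored in this sketch.
        if group.canonical is not None:
            hidden |= group.members - {group.canonical}
    return [doc_id for doc_id in doc_ids if doc_id not in hidden]


groups = {"grp_xyz": DuplicateGroup(members={"doc_a1", "doc_b2"})}
all_docs = ["doc_a1", "doc_b2", "doc_c3"]

promote(groups, "grp_xyz", "doc_a1")
after_promote = query_visible(all_docs, groups)

dismiss(groups, "grp_xyz")
after_dismiss = query_visible(all_docs, groups)
```

After promotion only the canonical member of the group remains in the results; after the group is disbanded, all three documents are returned again.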
