
Brainspace

Duplicate Detection Methods

Brainspace uses three levels of duplicate detection. In decreasing order of strictness, they are:

  • Strict Duplicate Detection (SDD)

    • Two documents are considered strict duplicates of each other if and only if they are identical on all fields that are either (a) marked as analyzed in the Brainspace schema, or (b) specified as usedForExactDup in the Brainspace schema. A strict duplicate group (SDG) consists of all documents in a data set that are strict duplicates of a designated pivot document (see below).

  • Exact Duplicate Detection (EDD)

    • Two documents are considered exact duplicates of each other if they are identical on all fields specified as usedForExactDup in the schema. An exact duplicate group (EDG) consists of all documents in a data set that are exact duplicates of a designated pivot document.

  • Near Duplicate Detection (NDD)

    • A near duplicate group (NDG) is a group of documents where each document in the group has high similarity to the pivot document of the NDG, based on XXXX fields.

Both SDD and EDD use MD5 hashing to test whether two documents are identical. The two methods differ only in the set of fields used to create the MD5 hash. The following characteristics follow from the properties of this identity test:

  • Equivalence: Any two documents in an SDG (or EDG) are identical (on the fields used for duplicate detection) not only to the group’s pivot, but also to each other.

  • Pivot Independence: The pivot document of an SDG (or EDG) is used as a representative of all documents in the group for some purposes. However, the choice of the pivot document is arbitrary in the sense that the pivot does not determine which documents are in the group.

  • Context Independence: If two documents end up in the same SDG (or EDG) when a build is done on any data set, then those two documents would be in the same SDG (or EDG) in any data set with the same schema.

  • Order Independence: SDGs and EDGs are stable across rebuilds. For a particular schema, a given data set will have the same set of SDG and EDG groupings (though not necessarily the same group IDs) regardless of the build history of the data set. For instance, it does not matter whether all data was loaded in a single build or through multiple Incremental Analytics with Ingest operations on portions of the data.
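These properties can be seen in a small sketch. Brainspace's actual field serialization and hashing pipeline are internal, so the Python below is only an illustration (the names `duplicate_key` and `group_by_key` are made up here): an MD5 fingerprint computed over a fixed field set yields groupings that depend only on field values, not on pivot choice, context, or input order.

```python
import hashlib
from collections import defaultdict

def duplicate_key(doc: dict, fields: list) -> str:
    """MD5 fingerprint over the given fields: documents identical on
    those fields get identical keys, regardless of any other fields."""
    h = hashlib.md5()
    for name in sorted(fields):
        # A separator byte between name and value avoids accidental
        # collisions between adjacent fields.
        h.update(name.encode("utf-8") + b"\x00")
        h.update(str(doc.get(name, "")).encode("utf-8") + b"\x00")
    return h.hexdigest()

def group_by_key(docs, fields):
    """Group (doc_id, doc) pairs by fingerprint.  The resulting groups
    depend only on field values, not on input order or on which member
    is later designated the pivot."""
    groups = defaultdict(list)
    for doc_id, doc in docs:
        groups[duplicate_key(doc, fields)].append(doc_id)
    # Sort member lists so that group composition, not encounter
    # order, is what gets compared.
    return {key: sorted(ids) for key, ids in groups.items()}
```

Running `group_by_key` over the same documents in any order produces the same set of groups, mirroring the Order Independence property above.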

NDD operates differently from SDD and EDD, and is in some ways more similar to clustering than to duplicate detection. The NDD algorithm iteratively selects pivot documents and builds NDGs around those pivots. All documents in an NDG have a minimum level (by default 80%) of similarity to the NDG’s pivot document, as computed by a shingle-based algorithm (https://help.revealdata.com/en/Email-Threading-Overview.html#Grouping-Messages-into-Threads). NDD therefore has very different properties from SDD and EDD:

  • Non-Equivalence: While all documents in an NDG have a specified minimum similarity to the pivot, they may not have that degree of similarity to each other.

  • Pivot Dependence: Which documents are chosen as pivots by the NDD algorithm affects which documents are grouped together in NDGs. Thus anything that affects the choice of pivots will affect the composition of the NDGs.

  • Context Dependence: The fact that two documents occur in the same NDG for a given data set does not mean that they will occur in the same NDG for some other data set that contains the two documents.

  • Order Dependence: Because the choice of pivots affects the composition of NDGs, the order in which documents are added to a data set affects NDG membership. In particular, loading all documents in a single build versus running multiple Incremental Analytics with Ingest builds may lead to different NDGs.
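The contrast with SDD and EDD can be illustrated with a toy version of near-duplicate grouping. Brainspace's actual NDD implementation is not published; the sketch below uses Jaccard similarity over word shingles and a greedy pivot-selection loop, both of which are assumptions made here purely to show why NDG composition depends on the order in which pivots are chosen.

```python
def shingles(text: str, k: int = 3) -> set:
    """The set of k-word shingles of a document's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def build_ndgs(docs: dict, threshold: float = 0.8) -> list:
    """Greedy near-duplicate grouping over {doc_id: shingle_set}.

    Each ungrouped document in turn becomes a pivot and collects every
    remaining document whose similarity to that pivot meets the
    threshold.  Because earlier pivots claim documents first, the input
    order changes which documents become pivots and hence the groups.
    """
    ungrouped = list(docs.items())
    groups = []
    while ungrouped:
        pivot_id, pivot_sh = ungrouped.pop(0)
        members = [pivot_id]
        remaining = []
        for doc_id, sh in ungrouped:
            if similarity(pivot_sh, sh) >= threshold:
                members.append(doc_id)
            else:
                remaining.append((doc_id, sh))
        ungrouped = remaining
        groups.append(members)
    return groups
```

Feeding the same documents to `build_ndgs` in a different order can change both the pivots and the resulting groups: a document 80% similar to the pivot may be well under 80% similar to another group member, so promoting that other member to pivot splits the group. This is the Non-Equivalence and Order Dependence behavior described above.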