Email Threading Overview
Overview
The purpose of email threading is to:
Identify email messages and attachments in the dataset.
Identify duplicate email messages.
Find messages that belong to the same email thread.
Mark which messages contain unique content not present in any other message.
Determine the hierarchy and sort order of messages within each thread.
Field Mapping
Email threading relies on certain document metadata fields during processing, such as the From, To and CC email headers. Prior to initial processing, it is important to examine the data to be processed and properly configure the field map settings so that these metadata fields are available to the processing engine.
Fields are assigned field categories by the schema, which then controls how the contents of those fields are treated. These categories give the field contents a role in email threading:
Field category | Description |
---|---|
id | Primary key field. |
parent_id | If document A is an attachment to document B, document A will have this field set to the primary key of document B (or be empty if A is not an attachment). For correct handling of attachments, a data set should use either an attachment field or (more commonly) a parent_id field, not both. |
from | Email From: header field. Documents with a non-empty value in this field are considered messages. |
to | Email To: header. |
cc | Email CC: header. |
bcc | Email BCC: header. |
date_sent | Email Date: header. Used as part of determining the sort order of a thread’s messages. |
title | Email Subject: header. |
bodytext | A field containing the text that is used to check for inclusion of message bodies. |
attachment | A field that contains a semicolon-separated list of the primary keys of the document’s attachments. |
Identifying email
Processing begins by scanning the metadata of each document in the dataset to determine which documents are email messages or attachments. Any documents referenced in the attachment field, or which have a non-empty parent_id field, are classified as email attachments. Documents with a non-empty from field that are not attachments are classified as email messages.
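The classification rule above can be sketched in a few lines of Python. This is only an illustration, not the engine's actual code: it assumes each document is a plain dictionary keyed by the field categories from the table, and the function name `classify_documents` and the "other" label are hypothetical.

```python
def classify_documents(docs):
    """Map each document id to 'attachment', 'message', or 'other'."""
    # Collect every primary key listed in any document's attachment field.
    referenced = set()
    for doc in docs:
        for key in (doc.get("attachment") or "").split(";"):
            if key.strip():
                referenced.add(key.strip())

    kinds = {}
    for doc in docs:
        if doc["id"] in referenced or doc.get("parent_id"):
            # Attachment status takes priority over a non-empty from field.
            kinds[doc["id"]] = "attachment"
        elif doc.get("from"):
            kinds[doc["id"]] = "message"
        else:
            kinds[doc["id"]] = "other"
    return kinds
```

Note that a document with a non-empty from field is still classified as an attachment when another document references it, which matches the order of the rules in the text.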
Identifying duplicates
Duplicate email messages are detected by the same process that creates exact duplicate groups for the entire data set. The set of fields used for exact duplicate detection is configurable at data set creation time. The attachments of a document are taken into account in exact duplicate detection, but only the contents of the attachments are compared, not their keys or filenames.
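One way to picture exact duplicate grouping is as a hash over the configured fields plus the attachment contents. The sketch below is an assumption about how such a key could be built, not a description of the engine: the use of SHA-256, the sorting of attachments, and the names `duplicate_key` and `group_duplicates` are all hypothetical.

```python
import hashlib
from collections import defaultdict


def duplicate_key(doc, fields, attachment_texts):
    """Hash the chosen fields and the attachment contents of one document."""
    h = hashlib.sha256()
    for name in fields:
        h.update(name.encode("utf-8") + b"\x00")
        h.update(str(doc.get(name, "")).encode("utf-8") + b"\x00")
    # Only attachment contents contribute, never keys or filenames.
    # Sorting (an assumption) makes the key independent of attachment order.
    for text in sorted(attachment_texts):
        h.update(hashlib.sha256(text.encode("utf-8")).digest())
    return h.hexdigest()


def group_duplicates(docs, fields, attachments_by_doc):
    """Return lists of document ids that share the same duplicate key."""
    groups = defaultdict(list)
    for doc in docs:
        texts = attachments_by_doc.get(doc["id"], [])
        groups[duplicate_key(doc, fields, texts)].append(doc["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Two messages with identical field values but different attachment contents end up in different groups, as the text requires.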
Grouping Messages into Threads
The next step finds all messages that belong together in the same email thread. Two messages are considered part of the same thread if either (a) Conversation Index information is available indicating that they are in the same thread, or (b) they have a similar subject and share a significant portion of their body text, as described next.
For efficiency, messages are first separated into groups with the same normalized subject. The normalized subject is the original subject with prefixes such as Re: and Fw: removed. Messages with the same normalized subject are then compared by their body content.
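Subject normalization and grouping could look like the following sketch. The prefix list is an assumption: the text names only Re: and Fw: as examples, and Fwd: is added here for illustration. The function names are hypothetical, and no case folding is applied because the text does not mention it.

```python
import re
from collections import defaultdict

# Assumed prefix list; the engine's actual list is not documented here.
_PREFIX = re.compile(r"^\s*(?:re|fw|fwd)\s*:\s*", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading reply/forward prefixes until none remain."""
    previous = None
    while previous != subject:
        previous = subject
        subject = _PREFIX.sub("", subject)
    return subject.strip()


def group_by_subject(messages):
    """messages: iterable of (message_id, subject) pairs."""
    groups = defaultdict(list)
    for message_id, subject in messages:
        groups[normalize_subject(subject)].append(message_id)
    return dict(groups)
```

Only messages that land in the same group need to be compared by body content, which keeps the number of pairwise comparisons down.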
To compare the body content of messages, an approach called shingling is used. In this approach, each document is represented as a set of unique shingles, where shingles are the n-grams (runs of n consecutive words) found in the body text.
As an example, the list of 3-grams for the document "a rose is a rose is a rose" would be as follows:
“a rose is”
“rose is a”
“is a rose”
“a rose is”
“rose is a”
“is a rose”
Removing duplicates leaves the following set of unique shingles:
{ “a rose is”, “rose is a”, “is a rose” }
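The example above can be reproduced with a short function. This is a minimal sketch that assumes whitespace tokenization and word-level n-grams, as in the example; the function name `shingles` and the default n=3 are illustrative, and the engine's actual tokenizer and n are not specified here.

```python
def shingles(text, n=3):
    """Return the set of unique word n-grams in text."""
    words = text.split()
    # A text with fewer than n words yields an empty set.
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


print(sorted(shingles("a rose is a rose is a rose")))
# → ['a rose is', 'is a rose', 'rose is a']
```

Because the result is a set, the repeated 3-grams in the list above collapse automatically.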
Once documents are represented as sets of shingles, set intersection operations are used to measure the percentage of a message’s content that is contained within another message.
When comparing two messages (A and B), the engine determines the percentage of message A’s shingles that are contained within message B. If this percentage meets or exceeds the containment threshold (100% by default), the messages are assigned to the same thread and message B is marked as a response to message A.
The following property can be used to fine-tune an email threading session:
Containment-Threshold - Sets the percentage of an email’s shingles that must be contained within another email for the first email to be considered contained within the second. To make inclusiveness more conservative, set the containment-threshold to 1.0 (100%).
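The containment check can be written as a set intersection. The sketch below assumes word 3-gram shingles and a threshold expressed as a fraction between 0 and 1; the function names are hypothetical, and treating a message with no shingles as having zero containment is an assumption made only to avoid dividing by zero.

```python
def shingles(text, n=3):
    """Return the set of unique word n-grams in text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a_shingles, b_shingles):
    """Fraction of A's shingles that also occur in B."""
    if not a_shingles:
        return 0.0  # assumption: an empty message is contained in nothing
    return len(a_shingles & b_shingles) / len(a_shingles)


def is_response(original_body, candidate_body, threshold=1.0, n=3):
    """True if candidate_body contains enough of original_body's shingles."""
    score = containment(shingles(original_body, n), shingles(candidate_body, n))
    return score >= threshold
```

Containment is not symmetric: a reply that quotes the original in full contains all of the original's shingles, while the original contains only part of the reply's.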
Identifying unique messages
After assigning messages to threads, the next step is to mark each message as unique if it contains content that is not contained in any other message in the thread.
A message is marked as a unique message when its body is compared to the bodies of the other messages in the thread and none of them meets the containment threshold. A message is marked as a unique attachment if it has an attachment that is not contained in any of the responses to the message. When comparing attachments, the actual content of the attachments is used, not the filenames or document keys.
Note
Messages are always marked unique unless there is another message (that is, a reply or forward) that contains the text of that message. When a response edits the original text inline, both the original and the response are marked unique unless the edits are very small or trivial. Date is not used for any unique/nonunique identification.
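The unique message rule can be sketched as a pairwise containment check within one thread. This illustration assumes word 3-gram shingles, a threshold given as a fraction, and that exact duplicates have already been handled by the separate duplicate detection step; the function `mark_unique` and its input shape are hypothetical, and attachment uniqueness is not covered.

```python
def shingles(text, n=3):
    """Return the set of unique word n-grams in text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a_shingles, b_shingles):
    """Fraction of A's shingles that also occur in B (0.0 if A is empty)."""
    if not a_shingles:
        return 0.0
    return len(a_shingles & b_shingles) / len(a_shingles)


def mark_unique(thread, threshold=1.0, n=3):
    """thread maps message id to body text; returns message id -> bool."""
    sets = {mid: shingles(body, n) for mid, body in thread.items()}
    unique = {}
    for mid, own in sets.items():
        # A message is non-unique if any other message in the thread
        # contains its body at or above the threshold. Dates play no role.
        contained = any(
            other_id != mid and containment(own, other) >= threshold
            for other_id, other in sets.items()
        )
        unique[mid] = not contained
    return unique
```

In a thread where a reply quotes the original in full, only the reply is marked unique; a response that rewrites the quoted text leaves both messages unique.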
Assigning hierarchy and sort order
Messages within a thread are assigned an overall sort order and an indentation level. The indentation level of each message is first computed based on the response relationships of the messages. Messages have an indentation level one greater than the message that they are a response to. Messages that are not a response to any other message have an indentation level of 0.
Messages are sorted in a hierarchical manner such that the message with the lowest indentation level and earliest date comes first, followed by all messages that are a response to that message. A second sort order is available that instead puts more inclusive messages earlier in the ranking.
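The indentation and primary sort order described above amount to a depth-first walk over the response relationships. The following sketch assumes each message records at most one message that it responds to and that the relationships contain no cycles; the input shape and the function name `order_thread` are hypothetical, and the second, inclusiveness-based sort order is not shown.

```python
from datetime import datetime


def order_thread(messages):
    """messages maps id -> {"date": datetime, "parent": id or None}.

    Returns a list of (id, indent_level) pairs in display order.
    """
    # Index messages by the message they respond to (None for roots).
    children = {}
    for mid, msg in messages.items():
        children.setdefault(msg["parent"], []).append(mid)

    def by_date(ids):
        return sorted(ids, key=lambda m: messages[m]["date"])

    ordered = []

    def visit(mid, indent):
        ordered.append((mid, indent))
        # Responses follow their parent, one level deeper, earliest first.
        for child in by_date(children.get(mid, [])):
            visit(child, indent + 1)

    for root in by_date(children.get(None, [])):
        visit(root, 0)
    return ordered
```

Each response appears directly under the message it answers, so a late reply to an early message still sorts ahead of unrelated later branches.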
Note
Attachments also get their own threading information, such as ThreadId, Sort Order, and Indent Level.