Mining Unstructured Data in Forensic Accounting Investigations – Part 2: Components of Text Mining

In Part 1, I discussed the benefits analyzing unstructured data brings to a forensic investigation or litigation, and Jeremy Clopton and I dove deeper into the concept in a recent article in Fraud Magazine.

I can’t emphasize enough the importance of considering unstructured data in investigations. The oft-quoted stat that 80 percent of an organization’s data is unstructured isn’t marketing hype; it’s a reality, and it means an investigation plan that ignores this data covers only about 20 percent of what is available. The power of being able to capture not just the topics and content of communications, but also the emotional state of the participants, is staggering, even more so when incorporating social network theory and leveraging it with artificial intelligence-assisted tools.

Text Mining Components

This is a conceptual overview of the processes comprising the core of text mining. Together, these components encompass the science of natural language processing as well as the related concepts of latent semantic analysis and concept searching, among others. Experience shows these components, when working together, are an effective tool set for identifying relevant evidence in a forensic investigation.

Predictive coding uses artificial intelligence (AI) to help find related and similar documents in a massive collection of text. Because the AI can determine the underlying concepts in a document or email, predictive coding can be performed independently of traditional methods that rely on keyword searches. Perhaps more important than the ability to rapidly find highly relevant content is that predictive coding can reduce the volume of material the investigator must review by as much as 95 percent. This is possible because the AI and the human analyst leverage each other’s strengths to achieve augmented intelligence.
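
To make the human-plus-machine loop concrete, here is a minimal sketch of the idea, not any vendor's implementation. It assumes a reviewer has already coded a handful of documents as relevant or not relevant, and it uses a small naive Bayes classifier to rank the unreviewed documents so the likeliest hits surface first. Commercial predictive coding tools work with concepts rather than raw word counts; the bag-of-words model, the sample messages and the function names below are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Count words per class from (text, is_relevant) pairs coded by a reviewer."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_score(text, model):
    """Log-odds that a document is relevant (naive Bayes, Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    totals = {label: sum(word_counts[label].values()) for label in (True, False)}
    score = math.log(doc_counts[True] / doc_counts[False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in the seed set carry no evidence
        p_rel = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_not = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_rel / p_not)
    return score


# Hypothetical seed set: the human analyst's contribution
seed = [
    ("please move the invoice off the books before the audit", True),
    ("keep this payment between us and delete the email", True),
    ("lunch menu for the team meeting on friday", False),
    ("reminder to submit your timesheet by friday", False),
]
model = train(seed)

# Hypothetical unreviewed documents: the machine ranks these for review
unreviewed = [
    "friday team lunch is moved to noon",
    "delete the invoice email before the audit starts",
]
ranked = sorted(unreviewed, key=lambda d: relevance_score(d, model), reverse=True)
```

In practice the loop repeats: the reviewer codes the top-ranked documents, the model is retrained, and the ranking improves, which is where the reduction in review volume would come from.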

Part-of-speech (POS) tagging is the process of a computer program breaking text into grammatical parts.

By leveraging this function to dissect communications into their grammatical subcomponents, two of the more useful and exciting types of analysis—topic maps and word clouds—are possible. The following example illustrates “a tale of two finance departments”; it doesn’t take much imagination to tell which department may have some issues.

[Figure: two finance departments, “balance” and “struggle”]

Graphics can be drawn from overall concepts as expressed through nouns or adjectives. Higher-quality systems also incorporate colors to distinguish between positive and negative emotions or events, and the date/time email element allows the investigator to explore the evolution of topics and emotions. Additional analytical leverage is gained by pairing noun topics with their descriptive adjectives to assess the emotional context of a topic.
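
The mechanics behind a word cloud and a noun/adjective pairing can be sketched in a few lines. The example below is a toy under stated assumptions: a real system would use a trained part-of-speech tagger, whereas this sketch substitutes a tiny hand-built lexicon (`POS_LEXICON`, hypothetical) so it runs with no dependencies. The sample messages are invented.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for a real POS tagger's output
POS_LEXICON = {
    "budget": "NOUN", "audit": "NOUN", "invoice": "NOUN", "report": "NOUN",
    "late": "ADJ", "missing": "ADJ", "urgent": "ADJ", "final": "ADJ",
}


def tag(text):
    """Return (word, tag) pairs; words outside the lexicon are tagged OTHER."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [(w, POS_LEXICON.get(w, "OTHER")) for w in words if w]


def topic_counts(messages):
    """Noun frequencies: the raw numbers a word cloud would be drawn from."""
    return Counter(w for m in messages for w, t in tag(m) if t == "NOUN")


def adjective_noun_pairs(messages):
    """Count each adjective that immediately precedes a noun."""
    pairs = Counter()
    for m in messages:
        tagged = tag(m)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1 == "ADJ" and t2 == "NOUN":
                pairs[(w1, w2)] += 1
    return pairs


# Invented sample messages
messages = [
    "The missing invoice is a problem.",
    "Where is the missing invoice?",
    "The final report covers the budget.",
]

topics = topic_counts(messages)
pairs = adjective_noun_pairs(messages)
```

Here `topics` shows which nouns dominate the conversation, and `pairs` shows how those topics are being described, which is the pairing of noun topics with their adjectives described above.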

Tone detection uses adjectives, idioms and phrases to assess the emotional tone of the communication. This ability has powerful implications: an investigator can quickly home in on red flags without having an initial theory or starting point. Common tones that can be analyzed include tense, vague, nervous, low esteem and conspiratorial, among others.
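
A very simple form of tone detection is phrase matching against per-tone cue lists. The sketch below assumes three of the tones named above and a few invented cue phrases per tone; production tools rely on far larger, validated lexicons and on context, so treat `TONE_CUES` and the sample emails as placeholders only.

```python
from collections import Counter

# Hypothetical cue phrases; a real tool would use much larger, validated lists
TONE_CUES = {
    "conspiratorial": ["between us", "off the books", "delete this", "don't tell"],
    "nervous": ["worried", "afraid", "anxious", "hope nobody"],
    "vague": ["sort of", "somehow", "that thing", "you know what"],
}


def detect_tones(message):
    """Count cue-phrase hits per tone; tones with no hits are omitted."""
    text = message.lower()
    hits = Counter()
    for tone, cues in TONE_CUES.items():
        hits[tone] = sum(text.count(cue) for cue in cues)
    return +hits  # unary plus drops entries whose count is zero


# Invented sample emails keyed by a hypothetical message ID
emails = {
    "msg-1": "Keep this between us. I'm worried the auditors will ask about that thing.",
    "msg-2": "Quarterly numbers attached for your review.",
}

# Messages with at least one tone hit become candidates for a closer look
flagged = [msg_id for msg_id, body in emails.items() if detect_tones(body)]
```

Because the flagging needs no search terms from the investigator, it illustrates how red flags can surface without an initial theory.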

Because text mining tools can identify grammatical components, they are adept at identifying proper names, places and events. This process, called named entity extraction (NEE), gives the investigator a powerful analytical tool. Because names and events can be pulled from email communications, NEE is useful in relationship mapping, which graphically represents relationships among the various subjects of an investigation. For example, without NEE, a relationship map may show only the relationship between the sender and recipient of an email communication. By adding extracted topics, names and places from the message, the relationship map takes on a new dimension. Some maps become extremely complex, as illustrated here.
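
The difference between a sender/recipient map and an NEE-enriched map can also be sketched in code. This example is an assumption-laden toy: entity extraction is reduced to a lookup against a short list of known names (`KNOWN_ENTITIES`, hypothetical) rather than a trained extractor, and the people, organizations and emails are invented.

```python
from collections import defaultdict

# Hypothetical gazetteer standing in for a trained entity extractor
KNOWN_ENTITIES = {"Acme Supply": "ORG", "Zurich": "PLACE", "Dana Reyes": "PERSON"}


def extract_entities(text):
    """Return the known entity names that appear in the text."""
    return sorted(name for name in KNOWN_ENTITIES if name in text)


def build_map(emails, use_entities):
    """Build an undirected relationship map as {node: set of neighbors}."""
    graph = defaultdict(set)
    for e in emails:
        # Baseline: link sender and recipient only
        graph[e["from"]].add(e["to"])
        graph[e["to"]].add(e["from"])
        if use_entities:
            # Enrichment: link both parties to every entity in the body
            for name in extract_entities(e["body"]):
                for person in (e["from"], e["to"]):
                    graph[person].add(name)
                    graph[name].add(person)
    return graph


# Invented sample emails
emails = [
    {"from": "alice", "to": "bob",
     "body": "The Acme Supply payment was routed through Zurich."},
    {"from": "carol", "to": "dave",
     "body": "Did Acme Supply send the revised contract?"},
]

basic = build_map(emails, use_entities=False)
enriched = build_map(emails, use_entities=True)
```

In `basic`, alice/bob and carol/dave are two unrelated pairs. In `enriched`, the extracted entity Acme Supply connects all four people, a link that sender and recipient data alone would never reveal.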

In Part 3 of this series, we’ll delve deeper into the relationship mapping aspect of text mining. We’ll discuss how it goes beyond simple graphs and incorporates unique mathematical principles to help shed light on relationship networks and define their characteristics. We’ll also examine how this is useful in Foreign Corrupt Practices Act and anti-bribery and corruption investigations.


Lanny has experience in computer forensics and electronic data discovery assisting attorneys in litigation and disputes by uncovering electronic data to be admitted into evidence. He performs forensic image copying of computer media, as well as mining, analyzing and reporting on the recovered data.

Lanny Morrow – who has written posts on BKD Forensics.
