In a recent blog post, Jeremy Clopton discussed the benefits of a “mash-up” between traditional forensic data mining and the relatively new area of unstructured data mining. This is the first of a series of posts that will more deeply explore the concepts, tools and techniques related to unstructured data mining, also called “text mining.”
Mining structured data has become an established component of fraud and forensic accounting investigations. As the name implies, structured data is any kind of data with a consistent, reliable structure; this includes sources such as spreadsheets, databases and most data in accounting information systems. Robust tools such as ACL and IDEA (among others) can handle this data—not just on a sample basis but on the entire population of available data.
Contrast this with unstructured data, which includes everything else—text messages, email, documents, social media, audio, video and many forms of Web-based content. According to an Ernst & Young study, unstructured data accounts for about 80 percent of all available data. An investigation that fails to account for unstructured data therefore addresses only 20 percent of the total available population of data. Furthermore, that ignored 80 percent is rich in human-generated, contextual and even emotion-laden content.
Sources of Text Used in Forensic Investigations
The graphic above identifies some of the most commonly generated unstructured data available in an investigation. Many other sources abound, but these are the most common in corporate investigations.
Chief among textual sources in an investigation is email. This source of evidence not only contains word-for-word communications, but also possesses a date/time element, metadata and even emotional tone as expressed through various idioms, phrases and adjectives. Another often overlooked source of rich unstructured data is the computer hard drive, which not only contains email, documents, audio and video, but also caches of Internet activity, discarded IM and chat sessions, deleted content, and easily forgotten backup and temporary copies of items. Computer forensics technologies can preserve, identify and produce these more obscure items.
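To make the email point concrete, here is a minimal sketch of how the date/time element and header metadata can be pulled from a message using Python's standard-library email parser. The message itself is hypothetical; real investigations ingest entire PST or MBOX archives, but every message yields the same who/when/what building blocks of a timeline.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A hypothetical message for illustration; real evidence comes from
# preserved mail archives, not hand-typed strings.
raw = """From: alice@example.com
To: bob@example.com
Date: Tue, 03 Mar 2015 14:22:05 -0600
Subject: Q4 adjustments

Bob - please revisit the reserve entries before the auditors arrive.
"""

msg = message_from_string(raw)

# Header metadata: who, when, and what -- the raw material of a timeline.
sender = msg["From"]
sent_at = parsedate_to_datetime(msg["Date"])  # timezone-aware datetime
subject = msg["Subject"]
body = msg.get_payload()

print(sender, sent_at.isoformat(), subject)
```

The parsed, timezone-aware timestamp is what lets thousands of such messages be sorted into a single chronology across custodians and mail systems.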
Text Mining Tools & Processes
Handling the sheer volume and complexity of unstructured data requires special tools and processes. Since the majority of useful, relevant material is human communication, analysis should not be limited to mere keyword searches; it should also include the extraction of meaning and topics, the emotional tone of conversations, and the creation of relationship networks that visualize how key players and topics interact, influence one another and evolve over time.
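The contrast between a bare keyword search and a relationship network can be sketched in a few lines. The message set below is entirely hypothetical; the point is that the same corpus supports both the baseline search and a simple "who talks to whom, how often" network built from sender/recipient pairs.

```python
from collections import Counter

# Hypothetical corpus: (sender, recipient, text) triples. In practice
# these would be extracted from email archives or chat exports.
messages = [
    ("alice", "bob",   "shift the reserve before quarter close"),
    ("bob",   "alice", "reserve moved, keep this off email"),
    ("carol", "bob",   "lunch on friday?"),
]

# 1) Simple keyword search -- the baseline most reviews start with.
hits = [m for m in messages if "reserve" in m[2].lower()]

# 2) A crude relationship network: count messages between each pair,
# ignoring direction (frozenset makes alice->bob == bob->alice).
edges = Counter(frozenset((s, r)) for s, r, _ in messages)

print(len(hits), edges[frozenset(("alice", "bob"))])
```

Even this toy network already surfaces something the keyword list alone does not: the alice–bob channel is twice as active as any other, which is exactly the kind of signal a visualized relationship network makes obvious at scale.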
The visual I like to use accurately portrays the analysis of text as a central concept, tactically accomplished via a family of tools and processes that work together to tell a story and supplement the more traditional structured data component of an investigation.
The broad categories encompass the science of “natural language processing” and related concepts of latent semantic analysis and concept searching, among others. Experience has shown these components—working together—to be an effective toolset in the identification of relevant evidence in a forensic investigation.
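Latent semantic analysis, one of the components named above, is worth a small illustration. It applies a truncated singular value decomposition to a term-document matrix so that documents sharing related vocabulary land near each other in a low-rank "concept" space. The tiny matrix below is a made-up example, not investigation data.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Docs 0 and 1 share billing vocabulary; doc 2 is about something else.
terms = ["invoice", "payment", "vendor", "soccer", "match"]
A = np.array([
    [2, 1, 0],   # invoice
    [1, 2, 0],   # payment
    [1, 1, 0],   # vendor
    [0, 0, 2],   # soccer
    [0, 0, 1],   # match
], dtype=float)

# Latent semantic analysis: keep only the top-k singular components,
# projecting each document into a k-dimensional concept space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents about the same concept score high; unrelated ones near zero.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

This is why concept searching can retrieve a document about "payments to a vendor" from a query about "invoices" even when the exact query terms never appear in the document.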
Coming Up Next
In Part 2 of this series, we’ll introduce each component in the family of text mining tools; in subsequent sections, we’ll delve more deeply into each. We’ll supplement the information with case studies from our investigations and explore the idea of incorporating all the analysis elements into a broader time series analysis to examine the evolution of topics, emotions and relationships among key players. We also will see in later installments how data visualization tools and techniques are essential to managing, understanding and reporting the results of text analysis.