http://fredgibbs.net/extras/text-mining-digital-poster

a gentle introduction to

historical data analysis

introduction

what is text mining?

It has become difficult to have a conversation about digital humanities without hearing of text mining and its potential to reveal hidden patterns, suggest new research questions, and confirm or counter intuition. Yet it is often discussed in technical and intimidating terms.

While it may be mathematically sophisticated, text mining can also be done in myriad ways that don't involve huge datasets or complex statistical analysis.

There is no right way to "do" text mining. Fundamentally, it allows historians to view, explore, and play with their sources. For best results, use a variety of techniques that work on a variety of scales.

isn't it too hard?

Bleeding-edge text mining work in the humanities, usually described in terms of automation and algorithmic approaches to huge datasets, makes text mining seem like computer science hammers in search of humanities nails.

But the basic premise behind text mining--sketching out the contours of large or unorganized amounts of text--is something humanists do all the time. There's no way to make it entirely non-technical, but it does not have to be intimidating.

Text mining is neither a singular process nor a single methodology, but rather a general approach to inquiry and exploration. It requires neither complex tools nor complex visualizations (though both can help when the dataset gets very large).

& it's not humanities!

Isn't text mining fundamentally antithetical to the close reading that fuels humanistic inquiry and interpretation?

How do we REALLY know what the computer is doing? How sure can we be that we've set it up to notice something worth noticing?

Counting words does not answer historical questions! True, but neither does reading or digging through archives.

A complex hermeneutic process creates evidence from data. Text mining simply helps you analyze historical sources (and more of them) from multiple points of view. Text mining tells us about data--not history--and the extent to which statistical analysis aids in historical analysis must be explained clearly (this is often missing!).

using complex tools simply

multi-faceted reading

Although text mining is often understood to mean getting the computer to tell you what's going on with a set of texts, it doesn't have to mean only that.

The machine's tolerance for mindless drudgery can be exploited with a highly guided approach that reduces the black-boxiness of an algorithmic approach.

Rather than relying on the computer to outline themes or similarities, another approach is to isolate particular phenomena: highly targeted reading across a vast amount of text. In other words, you can use a highly mediated form of text mining, an iterative approach that uses repeating cycles of searching and reading to isolate the juicy bits.

targeted exploration

It's surprisingly easy to use tools to explore texts and greatly improve research efficiency and open new research doors. The following techniques are incredibly useful for a small to intermediate amount of text. These techniques do not scale up to handle huge amounts of data, but then again most historians don't work with huge amounts of data.

One example is using Voyant to explore a single text or set of texts. Let's say I want to explore the use of poison in the 19th century.

First, we need digitized source material that might tell us something. In this case, we can use the Old Bailey Online. We can collect trial documents by downloading cases we're interested in (you can get them 10 at a time from the API). Then, use a ZIP utility to bundle them together; this makes it easier to upload them to Voyant.
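As a sketch of that bundling step (the filenames here are hypothetical, and any ZIP utility will do; Python's built-in zipfile command is shown because it works the same on Mac and Windows):

```shell
# Hypothetical downloads: a folder of Old Bailey trial transcripts.
mkdir -p trials
echo "sample trial transcript" > trials/t18540101-1.txt
echo "another trial transcript" > trials/t18540101-2.txt

# Bundle the folder into a single zip, ready to upload to Voyant.
python3 -m zipfile -c trials.zip trials/
```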

Look for words that might be informative. In this case, I searched for drink, drunk, ate, eaten, and et. Zeroing in on those contexts, I see that "coffee" is one of the most common terms, and in fact one of the most common vehicles for poison. And it's VERY easy to see that the notion of poison as reflected in the Old Bailey differs from a broad analysis of Victorian toxicology.
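The same kind of targeted word search can also be sketched outside Voyant at the command line; the folder and file below are hypothetical stand-ins for downloaded trial records:

```shell
# Hypothetical corpus: downloaded trial transcripts in a trials/ folder.
mkdir -p trials
echo "the prisoner drank the coffee, then ate the bread" > trials/t0001.txt

# Pull out ingestion-related words with up to 40 characters of context
# on each side, so possible poisoning vehicles can be scanned quickly.
grep -o -E ".{0,40}(drink|drunk|ate|eaten).{0,40}" trials/*.txt
```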

Hardly a groundbreaking discovery, but it's a quick, targeted approach to making sense of documents on a scale not possible without computer mediation, and one that doesn't expect the machine to find interesting patterns on its own.

While it is possible for more advanced methodologies to find such a correlation without much intervention, a more mediated, exploratory approach lessens the distance between reading and interpreting, and provides finer-grained control.

power of simple scripting

zero in

Of course you'll eventually want to do something that isn't possible with a generic tool. This is when knowing a little bit of programming (or perhaps more accurately, scripting) can be useful.

Create a research corpus: for instance, download texts from the Internet Archive. To cheer us up after the poisonings, let's tackle the nature of love in Jane Austen. It's easy to get all her novels, but for this I downloaded only Northanger Abbey.

A simple tool, grep, can help direct your research not just by finding relevant words or phrases, but also by making it easy to compare them in context. Mac users can run it from the Terminal application; Windows users will need to download a free grep application (like WINgrep).

Learning how to create search expressions (like the one below) is no harder than navigating bibliographic databases or learning research languages.

grep -o ".\{0,50\}love.\{0,50\}" northanger.txt

Writing simple code is just like following a recipe (get a cheat sheet):

grep is the program that will search for us.
-o = display only the matching text, not entire lines (the default)
"." = any character
\{0,50\} = matches 0-50 occurrences of the preceding character (here, the ".", giving up to 50 characters of context on each side)
"love" = the characters we are searching for
northanger.txt = the file to search (here, the downloaded novel)

The output of this command collates all instances of "love". Across many texts, this is a huge time saver. NOTE: you still need to read and interpret! (and probably do more searching)
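Across many files, the same pattern plus grep's -H flag (print the filename with each hit) turns the raw output into a rough concordance; the austen/ folder and its contents below are hypothetical:

```shell
# Hypothetical corpus: plain-text novels in an austen/ folder.
mkdir -p austen
echo "she was half in love with him already" > austen/northanger.txt
echo "no second attachment, no thought of love" > austen/persuasion.txt

# -H prefixes every match with its source file, so instances of "love"
# from different books can be compared side by side.
grep -o -H ".\{0,50\}love.\{0,50\}" austen/*.txt
```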

reformat and re-view

What was sinful in the 19th century?

Above, the corpus was well-defined. If a research corpus ranges more widely, another useful approach is to reformat and reorganize the data. In this case, we'll gather texts and isolate contexts around occurrences of "sinful" in Victorian literature. Though possible with the scripting technique above, I got help from Google by getting snippets from over a million books that contained one of many words of interest (like sin). The problem with these results was that they were more or less randomized, making a useful reading of so much data virtually impossible.

A simple PHP script to reformat the results can help take randomized data (often the way it is most efficiently collected or retrieved) and make it more readable. Is it reasonable to expect a historian to create this? Rather than create one from scratch, one just needs to find solutions to similar problems. Most steps helpful in text mining have been solved already. The historian's chore is to apply the ones most appropriate to facilitating interpretation.
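The reformatting itself need not be PHP. As a sketch, suppose each snippet were saved as a tab-separated line of year, title, and excerpt (a hypothetical format for the collected results); a standard command-line sort then puts the randomized data into chronological order, grouped by book:

```shell
# Hypothetical input: one snippet per line, as year<TAB>title<TAB>excerpt.
printf '1867\tBook B\ta sinful extravagance\n'  >  snippets.tsv
printf '1843\tBook A\tsuch sinful waste\n'      >> snippets.tsv
printf '1843\tBook A\this sinful pride\n'       >> snippets.tsv

# Sort numerically by year, then by title, so the scattered results
# can be read chronologically and book by book.
sort -t "$(printf '\t')" -k1,1n -k2,2 snippets.tsv
```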

The formatted results are ugly but informative: organized here chronologically and by book.

conclusions

imagine, don't imitate

For too many historians, technology adoption follows only well-trodden paths. As a result, both real and hoped-for results seem unsatisfying because different sources and questions require slightly different techniques. Approaching the process of text mining with an open mind helps us imagine how to use the various appropriate methodologies most effectively.

  • Text mining is not necessarily about big data and complex algorithms.
  • Text mining is not meant to be a substitute for reading.
  • Text mining is not about having the computer do historical analysis for you.
  • Historical sources are data as much as text. Data can be queried in different and profitable ways.
  • Visualization is more about finding new questions than answering traditional questions.
  • Enhanced searching through basic tools like grep can help you figure out what or where to read.
  • Learning the basics of a scripting language (e.g. PHP) or command-line tools (e.g. grep) facilitates the same kind of research that historians already do and provides entirely new perspectives.
Fred Gibbs | fgibbs@gmu.edu | @fredgibbs