This page presents a list of resources potentially useful for anyone who is relatively new to text mining, wants to see what’s possible, what’s not, and wants to do a bit of playing around with it. This is not a comprehensive list of everything written about text mining. It is geared toward non-technical novices. Articles that deal with particular algorithms or complicated statistics concepts have been omitted to encourage sanity and experimental play.
It’s grouped by what seem to be the most pervasive topics and themes relevant to text mining in general.
A few general articles to explain the big picture:
Ted Underwood, Where to Start with Text Mining.
John Burrows, Textual Analysis from A Companion to Digital Humanities.
Stanford Literary Lab, Quantitative Formalism: an Experiment.
Fred Gibbs, Learning to Read. Again.
Dan Cohen, Searching for the Victorians.
D. Sculley and B. M. Pasanek, “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities,” Literary and Linguistic Computing 23, no. 4 (September 29, 2008): 409–24.
Andrew Goldstone and Ted Underwood, “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” forthcoming in New Literary History.
To see some of the techniques presented in an open and accessible way:
Some traditional-looking articles with text mining as a core method:
Lauren Klein and Jacob Eisenstein, “Reading Thomas Jefferson with TopicViz: Towards a Thematic Method for Exploring Large Cultural Archives,” Scholarly and Research Communications 4, no. 3 (2013).
One of the easiest ways to get started is to use an online tool, which saves you the step of installing software.
If you want to get a bit more serious about text mining, or if you get tired of having to do everything online, there are a few excellent tools that give you tremendous text mining power, but very little in the way of visual interfaces.
One of the scariest conceptual leaps that humanists must make is moving from printed archival materials to digital files that you can’t touch or mimeograph. This shift requires a new way of thinking not only about the texts themselves, but also about how to organize them, how to access them, and how they relate to each other.
Your local library may know of repositories for digital data as well.
One of the neat advantages of text mining is the way you can combine multiple data sources at larger than usual scales. Usually this will mean finding data from multiple sources, and of course they won’t be in the same format. In order to make use of all your texts, you’ll need to make sure the data are suitable for machine processing: standardized formats, no stray characters, etc.
Getting your data to look nice takes FAR LONGER than you want it to, than you think it should, than you think it deserves to. It is arguably the most difficult and crucial part of the process. Exploring your corpus algorithmically is quite fun, but unless you like to geek out with python scripts and regular expressions, cleaning data is decidedly the unfun part of text mining. But the key to success here is to remind yourself that even when preparing data, you are doing history, not just getting ready to do it. It’s the digital version of combing through the archives. Not glamorous, but necessary, and there certainly are both art and science aspects to it.
For more on the basic concepts of organizing data: Hadley Wickham, Tidy Data.
Open Refine is an excellent tool for normalizing data. Begin with the three tutorial videos to get a sense of what it can do.
Seth van Hooland, Ruben Verborgh, Max De Wilde, Cleaning Data with OpenRefine
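To make the cleaning step concrete, here is a minimal sketch of the kind of normalization a python script can do before your texts go into any tool. The function name and the specific substitutions are illustrative assumptions, not a standard recipe; you would adapt the rules to whatever stray characters actually appear in your corpus.

```python
import re
import unicodedata

def normalize_text(raw):
    """Tidy a raw text string for machine processing."""
    # Normalize Unicode so visually identical characters compare equal
    text = unicodedata.normalize("NFKC", raw)
    # Replace curly quotes with straight ones
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    # Replace control characters (form feeds, tabs, newlines) with spaces
    text = re.sub(r"[\x00-\x1f]", " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_text("A  \u201cstray\u201d\x0ccharacter\n here"))
```

Running functions like this over every file in a corpus is exactly the sort of repetitive chore where a few lines of code beat hours of manual fixing.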
If you want to really use text mining methods, you’ll need some facility with data beyond what pre-packaged tools give you. They simply can’t account for all the circumstances you’ll encounter.
But you’re in luck! There is a great tool for getting your data into any format you need! It’s called python, and it’s a programming language that is easy to learn and very powerful in terms of manipulating data. To get started:
Command Line Crash Course
Once you get a better feel for the basics, see what else you can do at The Programming Historian.
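As a taste of what “getting your data into any format you need” looks like in practice, here is a short sketch that turns loosely structured lines into CSV, which most text mining tools can ingest. The “Author - Title (Year)” input format and the function name are hypothetical, chosen only for the example; the point is the pattern of parse-with-a-regex, then write-with-the-csv-module.

```python
import csv
import io
import re

# Hypothetical input format: one record per line, "Author - Title (Year)"
RECORD = re.compile(r"^(?P<author>.+?) - (?P<title>.+?) \((?P<year>\d{4})\)$")

def records_to_csv(lines):
    """Parse loosely structured lines into a CSV string."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["author", "title", "year"])
    for line in lines:
        match = RECORD.match(line.strip())
        if match:  # silently skip lines that don't fit the expected pattern
            writer.writerow([match["author"], match["title"], match["year"]])
    return out.getvalue()

print(records_to_csv(["Herman Melville - Moby-Dick (1851)"]))
```

In real projects the regex rarely matches everything on the first try; printing the lines that fail to match is a quick way to discover the inconsistencies hiding in your sources.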
Almost all digital text that isn’t born digital goes through some kind of OCR process, which can yield inaccuracies in the transcription. Depending on the scale of your analysis and what you’re looking for, these may or may not cause difficulties.
If you are just starting out with text mining techniques, these errors are not terribly important.
If you want to do more careful literary or linguistic study, these errors might become sufficiently annoying that they should be corrected.
Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers.
Ted Underwood, Basic OCR Correction.
Ted Underwood, A Half-Decent OCR Normalizer for English Texts after 1700.
Laura Turner O’Hara, Cleaning OCR’d text with Regular Expressions.
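In the spirit of the resources above, here is a minimal sketch of rule-based OCR correction with regular expressions. The particular confusion pairs below are illustrative assumptions (e.g. the long s being misread, “tbe” for “the”); a real correction list would be built by inspecting the errors that actually recur in your own corpus.

```python
import re

# Illustrative OCR confusions; build your own list from your corpus
OCR_FIXES = [
    (re.compile(r"\btbe\b"), "the"),
    (re.compile(r"\bfaid\b"), "said"),    # long s misread as f
    (re.compile(r"\bwbich\b"), "which"),
]

def correct_ocr(text):
    """Apply rule-based corrections for known, recurring OCR errors."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text

print(correct_ocr("tbe court, it is faid, ruled..."))
# → "the court, it is said, ruled..."
```

Word-boundary anchors (`\b`) matter here: without them, a rule like “tbe → the” would mangle words that merely contain those letters.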
You might have lots of photos of documents that are begging to be turned into digital text. But you need to work at scale, which means having some kind of process to deal with your images in bulk rather than one at a time.
Miriam Posner, Batch-processing photos from your archive trip.
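One simple bulk operation, sketched below under assumed conventions (JPEG files, a made-up naming scheme like `archive_0001.jpg`), is copying a folder of camera photos into uniformly named, sortable files before OCR or further processing. The function name and prefix are hypothetical; the idea is just that a loop handles a thousand images as easily as ten.

```python
import pathlib
import shutil

def batch_rename(src_dir, dest_dir, prefix):
    """Copy every JPEG in src_dir to dest_dir with uniform, sortable names."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    renamed = []
    # sorted() keeps the camera's original ordering stable
    for i, photo in enumerate(sorted(pathlib.Path(src_dir).glob("*.jpg")), start=1):
        new_name = f"{prefix}_{i:04d}.jpg"   # e.g. archive_0001.jpg
        shutil.copy2(photo, dest / new_name)  # copy2 preserves timestamps
        renamed.append(new_name)
    return renamed
```

Copying rather than renaming in place is a deliberately cautious choice: your originals from the archive trip stay untouched if the script has a bug.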
To do any meaningful text analysis, you’re going to work with non-trivial amounts of texts, documents, and files. You don’t need big data! But if it’s more than you can really keep in your head or possibly read in a reasonable amount of time, or you would like to be more efficient at figuring out what to zero in on, you’ll need to stay organized so that you know what you’ve done to which texts.
If you are going to be processing files directly, either to improve images, do OCR, or subject them to computational methods, you’ll want to be able to keep your files organized on your filesystem, meaning a directory / folder structure that makes sense.
You can also use Zotero or similar organizational tools, but this creates some distance between you and the actual files, which can be annoying if you often work with them directly, whether uploading them to online tools or feeding them into tools you’ve downloaded to use on your own machine.
William J Turkel, Workflows and Digital Sources (and explore the links!).
One of the techniques most in vogue at the moment is topic modeling. These articles cover the fundamental concepts, most with lots of links to various kinds of explanations (most are very accessible, but some are probably more technical or mathematical than you want).
Megan R. Brett, Topic Modeling: A Basic Introduction.
Scott Weingart, Topic Modeling for Humanists: A Guided Tour.
Elijah Meeks and Scott Weingart, The Digital Humanities Contribution to Topic Modeling. Follow the links in this brief introduction!