Plants, Places, and Metadata

Metadata is a cruel mistress. When she has your eye, she’s exhilarating, full of possibilities. When she turns away, or can’t be found, you’re plunged into the melancholic despair of hopeless frustration.

One current project, funded by the Andrew W. Mellon Foundation, to help map the botanical specimens held in the JSTOR plants database (plants.jstor.org) has presented several interesting metadata challenges. More important than the mapping itself, the project has suggested some interesting possibilities for improving metadata when it’s insufficient or simply missing, without saddling an already overworked archival staff with an enormous additional burden. I hope a brief summary of our experience so far will be useful to anyone contemplating similar issues, both for improving metadata in general and for mapping historic placenames in particular. It should be pointed out that the metadata I’m talking about here was created by herbaria all over the world, not by JSTOR itself. That JSTOR was able to collect this data and make it available is a tremendous asset to the research community.

First, the basic premise: the prototype (not quite ready for public use) retrieves the metadata from JSTOR, geolocates the collection localities based on whatever description the collector recorded on the specimen sheet (not always precisely echoed in the metadata), and displays markers on a Google map. CHNM developer Jim Safley did tremendous work to quickly enable map users to sort, filter, and see the markers change over time. The idea is to get a broader view of collection patterns, otherwise invisible, that suggest new and interesting questions about why the patterns look the way they do.
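To make that premise concrete, here is a minimal sketch of the pipeline in Python. The record format is invented for illustration, and geopy’s Nominatim geocoder stands in for whatever geolocating service the prototype actually uses.

```python
# Minimal sketch: pull a record's locality, geolocate it, emit a map marker.
# geopy/Nominatim is a stand-in geocoder; the record fields are invented.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="plant-specimen-mapper")  # hypothetical app name

def marker_for_record(record):
    """Turn one specimen record into a (lat, lon, label) map marker, or None."""
    locality = record.get("locality", "").strip()
    if not locality:
        return None  # nothing to geolocate; flag for researchers instead
    location = geolocator.geocode(locality)
    if location is None:
        return None
    return (location.latitude, location.longitude, record.get("collector", ""))

# Example record, loosely modeled on the metadata fields described above:
print(marker_for_record({"locality": "Fouta Djallon", "collector": "Heudelot"}))
```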

Trying to get longitude and latitude coordinates for an arbitrary placename (that is, geolocating) is tricky business, even with freely available web services that do just that. One researcher on the project, Hanni Jalil (working with Gabriela Soto Laveaga at UCSB), noticed that stripping certain descriptive phrases out of the locality field can noticeably improve geolocating results. The kinds of words we encounter most often in our dataset of the Caribbean and west Africa are perhaps somewhat specific to our geographic foci, but the list of stop words we’ve compiled could help create more sophisticated algorithms for filtering them out of any dataset. We hope to make such a list available for general use.
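A rough sketch of that kind of stop-phrase filtering follows. The phrases shown are illustrative stand-ins, not the project’s actual compiled list.

```python
import re

# Illustrative stop phrases only; the project's actual compiled list
# is not yet published, so these are placeholders.
STOP_PHRASES = ["vicinity of", "near", "along the road to", "banks of"]

def clean_locality(locality):
    """Strip descriptive phrases that tend to confuse geolocating services."""
    cleaned = locality
    for phrase in STOP_PHRASES:
        cleaned = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", cleaned,
                         flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip(" ,")

print(clean_locality("Vicinity of Freetown"))  # -> "Freetown"
```

The cleaned string can then be handed to the geocoder as in the pipeline sketch above.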

Even after some basic processing, the vagaries of location data create geolocation difficulties. One example was noticed by Megan Raby (working with Gregg Mitman at UW-Madison): many Panama specimens have localities described as something like “summit of X,” where X is some well-known geographic feature. But there is also a place in Panama known as “Summit Gardens,” and geolocating services typically place the summit of a mountain (one they could otherwise geolocate by name) at Summit Gardens instead. We can strip “summit of” from the locality field, but the locality description might say something slightly different that creates a similar problem. Similarly, Hanni noticed that the metadata for Heudelot’s specimens, collected during his 1837 Voyage dans la Senegambie, lists Senegal as the country and Fouta Djallon as the locality. But Fouta Djallon is a highland region in Guinea. The real problem, as Hanni noted, is that the specimen sheets read “Fouta Djallon, Senegambie,” and were produced when the region of Senegambia included modern-day Senegal, Gambia, and parts of Guinea. She speculates that the creators of the metadata simply picked Senegal as its modern-day equivalent. This discrepancy is easily recognized by someone who knows the region, so wouldn’t it be nice if they could suggest a quick emendation? This is not to say we want to overwrite the original locality descriptors with modern names (and thus detract from their value as historic witnesses), but rather to provide researchers with a standard way of resolving historic place names. It also underscores how subjective metadata can be.
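For the “summit of” problem specifically, a targeted prefix strip may be safer than a general stop word. Again, this is only a sketch: the pattern list here is illustrative and would need to grow as researchers report variants.

```python
import re

# Variants of the "summit of" prefix; illustrative, not exhaustive.
SUMMIT_PREFIXES = [r"^summit of\b", r"^top of\b", r"^cima de\b"]

def strip_summit(locality):
    """Drop 'summit of'-style prefixes so the named feature geolocates,
    rather than the unrelated 'Summit Gardens' in Panama."""
    locality = locality.strip()
    for pattern in SUMMIT_PREFIXES:
        stripped = re.sub(pattern, "", locality, flags=re.IGNORECASE).strip()
        if stripped != locality:
            return stripped
    return locality

print(strip_summit("Summit of Cerro Campana"))  # -> "Cerro Campana"
```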

All this is to say that the process of mapping also helps us easily visualize (and act on) places where the metadata is wrong, and this has turned out to be one of the great virtues of the tool. It is one area in which digital humanities tools might add considerable value beyond their primary purpose (in this case, mapping the specimens). For this project, errors in the metadata, both particular and systematic, could be readily identified because the mapping tool facilitated microviews of a huge dataset, views whose accuracy the researchers could easily judge. Thus the impossibly large task of correcting difficult metadata can be outsourced in manageable chunks to people with subject-matter expertise.

I would suggest that tools designed to visualize data should be built not only with the assumption that the data will need some fixing, but also with the ability for researchers to play a crucial role in improving the metadata, taking advantage of how such a tool makes it easier to identify what’s wrong with the data, or at least how it could be improved for historical research. This would not be merely an act of good will on the researchers’ part, but a way of making the data more useful for their own research while performing an invaluable service to the research community: a kind of indirect collaboration, helping to make data more usable for everyone simply by working on one’s own project.

Subject-matter expertise is crucial here. Much parsing can be done algorithmically, and it is. But it’s obvious that for even a modest dataset (~900k records), there must be a balance between human intervention and automation. A not uncommon problem with plants metadata is a specimen that has a collection date or locality written on it that is simply missing from the database. A researcher who knows the collection locations and habits of particular collectors can see at a glance that the metadata is wrong and do something about it. Sure, the data’s curators could view missing fields at a glance, but then what? Allowing researchers who know what the data should be to intervene can make tools like this much more useful, not only for finding and fixing data errors, but also for training the tool to find anomalies, thus minimizing the need for manual intervention.
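As a sketch of what surfacing such gaps might look like, the snippet below groups records with missing required fields by collector, so they can be routed to the researcher who knows that collector best. The record fields are the same invented ones used in the earlier sketches.

```python
from collections import defaultdict

def missing_field_report(records, required=("collection_date", "locality")):
    """Group records with missing required fields by collector."""
    report = defaultdict(list)
    for record in records:
        gaps = [f for f in required if not record.get(f)]
        if gaps:
            report[record.get("collector", "unknown")].append((record.get("id"), gaps))
    return dict(report)

records = [
    {"id": "EX-1", "collector": "Heudelot", "locality": "Fouta Djallon"},
    {"id": "EX-2", "collector": "Heudelot", "collection_date": "1837",
     "locality": "Fouta Djallon"},
]
print(missing_field_report(records))
# -> {'Heudelot': [('EX-1', ['collection_date'])]}
```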

Of course, this requires the difficult work of building an infrastructure and interface so specialists can help, and in a way that lets the collective effort of correction be reused on other datasets. Yes, there is a semantic-web metadata fantasy happening in my description here, but even a few steps in this direction would be a huge improvement. Our idea is to store such proposed changes in our own database, where we manage the geolocations, and then send them in an agreed format to JSTOR for moderation and ingestion. Of course, questions of ownership and terms of use are very real issues, but hopefully data owners will recognize the vast potential of a free labor force working to correct their metadata.
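What might such a proposed change look like on the wire? The exchange format hasn’t been agreed upon, so every field name below is an assumption; the point is simply that the original value is preserved alongside the proposal, the proposer, and a rationale for moderation.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date

# A sketch of a proposed-correction record; all field names are assumptions,
# not an agreed format with JSTOR.
@dataclass
class ProposedCorrection:
    specimen_id: str      # identifier of the record at plants.jstor.org
    field: str            # which metadata field is being corrected
    original_value: str   # preserved, never overwritten
    proposed_value: str
    proposed_by: str      # who suggested it, for moderation and credit
    rationale: str
    proposed_on: str

correction = ProposedCorrection(
    specimen_id="EXAMPLE-123",
    field="country",
    original_value="Senegal",
    proposed_value="Guinea",
    proposed_by="hjalil",
    rationale="Fouta Djallon is a highland region in modern-day Guinea; "
              "the sheet reads 'Fouta Djallon, Senegambie.'",
    proposed_on=str(date.today()),
)
print(json.dumps(asdict(correction), indent=2))
```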

Facilitating such metadata correction will require transparency and reciprocity between historians and archivists to create a viable process for improving and normalizing metadata. If researchers are allowed to suggest or make corrections, even with moderation, there needs to be a way of keeping track of who has changed what. For a historic placename, it’s not enough to know that place X equaled place Y; we also need to know the timeframe for which that was true. Clearly, there is more technical infrastructure required here than a historian will typically care about. But it’s a new kind of collaboration between data providers and their constituents, one that could help make historical data considerably more usable for researchers across a variety of disciplines that are increasingly experimenting with mapping historical data, either as a quick-and-dirty method of analysis or as scholarship in itself.
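A gazetteer entry for a historic placename might then carry a validity window and provenance alongside the equivalence itself. This is a sketch under assumed field names, with placeholder dates; the actual extent and dates of Senegambia would need a historian’s verification.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of a gazetteer entry recording not just that X equaled Y,
# but when that held and who asserted it. Field names are assumptions.
@dataclass
class HistoricPlace:
    historic_name: str
    modern_equivalents: list  # may span several modern countries
    valid_from: int           # year the name/extent applied from
    valid_to: Optional[int]   # None if still current
    asserted_by: str          # provenance: who made/changed the claim

senegambia = HistoricPlace(
    historic_name="Senegambie",
    modern_equivalents=["Senegal", "Gambia", "parts of Guinea"],
    valid_from=1765,   # placeholder dates, not researched values
    valid_to=1889,
    asserted_by="hjalil",
)

def resolve(place, year):
    """Return modern equivalents only if the historic name applied in `year`."""
    if place.valid_from <= year and (place.valid_to is None or year <= place.valid_to):
        return place.modern_equivalents
    return []

print(resolve(senegambia, 1837))  # Heudelot's voyage falls in the valid range
```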