NB: A version of this post was presented during the English and Film Studies Postdoc Talk Series at the University of Alberta.
The UAL’s digitization of The Western Home Monthly lets us do a lot more than just post hilarious old ads and pictures of children preserving things. In fact, most of the work we’ve been doing with this rich resource hasn’t involved looking at pictures at all.
When Digital Initiatives Application Librarian Weiwei Shi generously shared the digitized files with us, our first challenge was figuring out what to do with the sheer quantity of data. Peel’s Prairie Provinces outsources some of its digitization to Backstage Library Works, which provides “preservation-quality digital images” as well as OCR and archival-standard metadata markup. The drive we received with the digital version of The Western Home Monthly contained 348 PDFs and METS files (one for each complete issue) and 24,170 JPEG 2000 and ALTO files (one for each individual magazine page). That’s more than 50,000 files total, with around 314,000,000 lines of XML code. Not sure what METS and ALTO mean? Neither were we.
The METS standard is a flexible schema for describing a complex digital object (like a digitized newspaper issue). METS describes the structure of the object but does not encode the actual textual content of the object. The ALTO standard fills this void by encoding the textual content of a digitized page in great detail, including styles and layouts. As well as encoding the digitized text itself ALTO encodes the spatial coordinates of every column, line, and word as it appears on the page.
The combination of METS and ALTO (often written METS/ALTO) is the current industry standard for newspaper digitization used by hundreds of modern, large-scale newspaper digitization projects (and lots of smaller projects too!)
(Source: What is METS/ALTO?)
With the help of EMiC UA collaborator Matt Bouchard, Nick van Orden wrote a script that pulls out everything encoded as “CONTENT” (the attribute in which ALTO stores each recognized word), producing 348 plain text files. We’ve taken two approaches thus far: Nick has been experimenting with word frequencies across the magazine’s 32-year run, while I’ve been exploring the possibilities of topic models.
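For the curious, here is a minimal sketch of what that kind of extraction can look like. It is not Nick’s script: the sample XML is a made-up fragment, the function names are my own, and it assumes the common ALTO layout in which each recognized word is a String element inside a TextLine.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment, invented for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Western" HPOS="10" VPOS="20" WIDTH="80" HEIGHT="12"/>
      <SP/>
      <String CONTENT="Home" HPOS="95" VPOS="20" WIDTH="50" HEIGHT="12"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Monthly" HPOS="10" VPOS="36" WIDTH="85" HEIGHT="12"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Return the page text, one output line per ALTO TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Keep only the CONTENT attribute of each String; coordinates are ignored.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


print(alto_to_text(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a deliberate shortcut, so the same sketch would cope with files that declare a different ALTO version. Everything except the words themselves is thrown away here, including the position attributes.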
Let’s start with word frequencies, which are a little more intuitive. Using an analytics platform called RapidMiner, Nick has produced a single spreadsheet that tracks 70,644 words (some of them OCR-produced gibberish, most of them real) and tells us how many times each one appears in each year. EMiC UA collaborator and DH wizard Harvey Quamen then very kindly wrote us a script that translates this spreadsheet into a format usable by R, a statistical computing program that makes it incredibly easy to produce data visualizations. Here’s an example.
This graph demonstrates a marked reversal, around 1920, in the frequency of references to Manitoba and Ontario–a reversal that matches a shift in the magazine’s mandate from local/provincial to increasingly national. In the July 1920 editorial “The Power of the National Magazine,” the editors of The Western Home Monthly insist that the emergent nation is in need of “a strong national sentiment, a sentiment cohesive enough to cement all the parts from Halifax to Vancouver” and that magazines can do this work of generating a community based in shared sentiment more effectively than “any other medium” including newspapers.
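The counting behind a graph like this one is simple enough to sketch. What follows is an illustration only, not the RapidMiner workflow or Harvey’s script: the three tiny “issues” are invented, the function names are hypothetical, and dividing by each year’s total word count is my own assumption about how you might compare years of different sizes.

```python
import re
from collections import Counter

# Invented stand-ins for the plain text files: (year, text) pairs.
ISSUES = [
    (1919, "Manitoba farmers met in Winnipeg. Manitoba wheat sold well."),
    (1919, "News from Ontario and from Manitoba."),
    (1921, "Ontario mills and Ontario schools; one letter from Manitoba."),
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def counts_by_year(issues):
    """Add up word counts for every issue published in the same year."""
    totals = {}
    for year, text in issues:
        totals.setdefault(year, Counter()).update(tokenize(text))
    return totals


def relative_frequency(totals, word):
    """Share of each year's words accounted for by `word`."""
    return {
        year: counter[word] / sum(counter.values())
        for year, counter in sorted(totals.items())
    }


totals = counts_by_year(ISSUES)
for word in ("manitoba", "ontario"):
    print(word, relative_frequency(totals, word))
```

A tokenizer this crude would pass OCR gibberish straight through, which is one reason a real word list ends up with so many non-words in it.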
Topic modelling is a little less straightforward. I’ve been using MALLET, an application that applies a statistical algorithm to a body of texts in order to define a set of topics. The researcher can choose how many texts to use, which stop words to remove, how many topics to generate, and how many words to list for each topic, but the algorithm determines the content of the topics themselves. A topic, according to MALLET, is a group of words that occur together in a statistically meaningful way: words that have a high probability of appearing in the same document. Let me illustrate with a favourite example, the first topic identified:
95 0.09086 september october vinegar sc tomatoes jars joy onions mother drain peppers sugar apples jimmy seal ivan spices hair cucumbers
If you’re familiar with my penchant for preserves, you’ll recognize this right away.
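A line like that is easy to pull apart if you want to work with topics programmatically. Here is a small sketch, with names of my own invention: it assumes the line is whitespace-separated, with the topic number first, then a numeric weight (MALLET’s documentation describes it as the topic’s Dirichlet parameter), then the topic’s words.

```python
from typing import NamedTuple


class Topic(NamedTuple):
    """One topic line: its number, its weight, and its list of words."""
    topic_id: int
    weight: float
    words: list


def parse_topic_line(line):
    """Split one whitespace-separated topic line into its three parts."""
    topic_id, weight, *words = line.split()
    return Topic(int(topic_id), float(weight), words)


# The preserving topic quoted above, copied verbatim.
PRESERVES = (
    "95 0.09086 september october vinegar sc tomatoes jars joy onions "
    "mother drain peppers sugar apples jimmy seal ivan spices hair cucumbers"
)

topic = parse_topic_line(PRESERVES)
print(topic.topic_id, topic.weight, topic.words[:5])
```

Running the same function over every line of the topic list would give you something you can search, for instance to find every topic that mentions “radio.”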
MALLET provides you with two files. One is a list of topics, each formatted like the preserves example above. The other is a spreadsheet of the relative frequency of each of those topics within each document. In our case that was 100 topics across 348 individual magazine issues: another unreadable spreadsheet. Harvey came to our rescue again, writing another PHP script that reformats our data so we could use R to take a look at it, and here’s what we get:
In this graph, I’ve charted seven topics across the magazine’s 32 years. This is an exploratory visualization, one I’m using to try to discover something about my enormous archive, in this case about how, as a serial text, The Western Home Monthly deals with what Debra Rae Cohen calls the “competing logics of media address and presence, time and space” (“Intermediality and the Problem of the Listener” 571). Along the top are topics that recur throughout the full magazine run, and that tend to focus on the seriality of the magazine itself (monthly) and on overarching concerns with time, gender, and labour, along with nation and home; along the bottom you see the rising and falling topics of new media as they emerge and, in some cases, disappear again: phonographs, radio, and then film. This, in visual form, is one of my central research questions: what is the interplay between the traditional and the modern, the everyday and the extraordinary, on the pages of this magazine?
As with the magazine page itself, there is an enormous amount of information encoded into the files the library has shared with us, information that we’ve only begun to interpret. As my literacy continues to lag behind my curiosity, my questions vastly outnumber my answers: could we use METS/ALTO’s location metadata to track a shift in how advertising is used across the magazine’s 32 years? Could the style and layout information help us to quantify what is embedded in a magazine’s bibliographic codes, the regularity of formal features in which the serial text’s identity is grounded?
Right now, having more data than we know what to do with feels like a pretty good position to be in.