Are You Sure This is a Good Idea? If So, Why?

By Matt Lavin

June 10, 2019

In my last post, I discussed Paul Silvia's How to Write a Lot and promised to describe my book project in greater detail, specifically addressing my central research question, the central dataset and subsets I'm working with, and the methods I want to employ. Here's what I have so far:

Research Question

In my book, I plan to revisit a core question in book history and the sociology of literature: how hierarchies of cultural value and taste were constructed in the late 19th and early 20th centuries. There's no shortage of work on this subject; it's a crucial concern of Pierre Bourdieu, Ellen Gruber Garvey, Lawrence Levine, Janice Radway, Joan Shelley Rubin, Warren Susman, and many others. The late nineteenth century is a focal point for these kinds of inquiries because so much growth and change were happening all at once, in book publishing and beyond.

I'm especially keen to tackle the range of roles periodicals played in the construction of cultural taste. My previous work focuses on intermediaries and mediation, but, as I mentioned in a previous post, I think there's something crucial to say about the broader role of periodicals that hasn't yet been covered. My work will speak to the lexical and rhetorical patterns that are best understood by looking at large-scale trends. I will focus on periodicals in the United States between 1880 and 1925 (approximately) because this is where my training is, and because the dataset I'm working with represents this period better than others.

Data

Speaking of my dataset and the subsets I'm developing, I began (more than two years ago) by requesting access to the PDF and XML files (page images, metadata, and OCRed text) representing the American Periodical Series, which contains 11.5 million records from 1,887 different periodicals. It took about a year to negotiate the terms, pay ProQuest's licensing fees, and receive a hard disk in the mail with all the files on it. In total, despite being compressed, the data on the hard disk filled more than 8 terabytes!

This digital collection is so large that conventional programming techniques take unreasonably long to work through the data. A conventional, iterative block of code that might take a matter of minutes with a hundred thousand records could take days to run if you don't revise your initial thinking. However, writing scalable programs often involves learning new software, as well as new programming concepts, and I don't have unlimited time for either. My goal is to save time, not merely to reallocate all that time to a slightly different task.

As a result, I've settled on an approach that seems to be working well for now: I've indexed the entire digital collection using ElasticSearch, and I've learned enough about how it works to query it from Python and generate a data subset. This calls for using ElasticSearch very modestly, as in the following example:
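
The sketch below approximates my workflow rather than reproducing my exact script; the index name ("american_periodicals") and the record_type field are stand-ins, not the actual ProQuest schema:

```python
import json
import sqlite3


def fetch_reviews(es_client, index="american_periodicals"):
    """Stream every record whose (assumed) record_type field is "review".
    Requires a running Elasticsearch server; the scan helper pages through
    results so the full set of hits never sits in memory at once."""
    from elasticsearch.helpers import scan

    query = {"query": {"match": {"record_type": "review"}}}
    return scan(es_client, index=index, query=query)


def store_hits(conn, hits):
    """Copy Elasticsearch hits, wholesale, into a local sqlite3 table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (id TEXT PRIMARY KEY, record TEXT)"
    )
    for hit in hits:
        conn.execute(
            "INSERT OR REPLACE INTO reviews VALUES (?, ?)",
            (hit["_id"], json.dumps(hit["_source"])),
        )
    conn.commit()


# To run against a live index:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   store_hits(sqlite3.connect("reviews.db"), fetch_reviews(es))
```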

This bit of code returns every record labeled as "review" in the ProQuest data. There are about 280,000 of these records. It then inserts those records, wholesale, into a sqlite3 database, which can serve as the basis for subsequent work, including natural language processing (NLP), named entity recognition (NER), and machine learning (ML).
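
As a small illustration of that downstream step (the table layout here is an assumption: one JSON record per row, with a "text" field), term frequencies can be tallied straight out of the database:

```python
# Read stored records back out of sqlite3 and tally token counts,
# a stand-in for fuller NLP. Assumes a "reviews" table whose "record"
# column holds one JSON object per row with a "text" field.
import json
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id TEXT, record TEXT)")
conn.execute(
    "INSERT INTO reviews VALUES (?, ?)",
    ("r1", json.dumps({"text": "a fine novel, finely reviewed"})),
)

counts = Counter()
for (record,) in conn.execute("SELECT record FROM reviews"):
    counts.update(json.loads(record)["text"].lower().split())

# counts now maps each token to its frequency across stored records
```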

ElasticSearch is a good fit for this kind of use case because (1) it's powerful and fast with large data; (2) queries and results are expressed as JSON, which I already know how to use; (3) there are good libraries for accessing ElasticSearch in Python; and (4) it's very well documented, with a big user community and lots of online tutorials, lessons, etc.

Analytical Methods

Last fall, I taught a graduate seminar called Digital Humanities Approaches to Textual Objects. This course was a chance for me to revisit important scholarship on computational text analysis methods, including discussions of which methods work best in the humanities. We also read work on how computational methods can best work alongside traditional humanities scholarship, as well as on the crucial limitations of computational methods. Below is a brief summary of what I regard as the most exciting computational text analysis measures. (There are also workshops for most of these measures on my Fall 2018 syllabus.) I'll say more about each method in subsequent posts, but this list is meant to provide an initial sense of the merits and drawbacks of each.

| Measure or Method | Pros | Cons | For More Info |
| --- | --- | --- | --- |
| Term Collocations | Direct measure of word associations | Need full OCR text to run | https://www.aclweb.org/anthology/J90-1003 |
| Regression Analysis | Can be used to discover features that predict a category | Need categorical or continuous labels to evaluate predictions | https://towardsdatascience.com/5-types-of-regression-and-their-properties-c5e1fa12d55e |
| Term Keyness | Can help identify significant differences between corpora or parts of a corpus | Potentially vulnerable to "p-hacking" if misused | http://www.thegrammarlab.com/?p=193 |
| Cosine Similarity | Direct measure of similarity, not distorted by texts of varying sizes | Reductionist (which isn't always bad) | https://www.machinelearningplus.com/nlp/cosine-similarity/ |
| Sentiment Analysis | Can provide a snapshot of a text's most likely emotional valence | Mistrusted by many DH folks (lots of baggage) | https://nlp.stanford.edu/projects/socialsent/ |
| Word2Vec/CBOW | Can be used to learn and traverse a word association network | Deep learning algorithm (black box problem) | https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c |
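
To give one concrete taste of these measures, here's a toy sketch of cosine similarity on bag-of-words count vectors, written in plain Python; the sentences are invented examples, and a real project would lean on a library such as scikit-learn:

```python
# Cosine similarity between two texts, treating each as a vector of
# word counts; identical texts score 1.0, texts with no shared
# vocabulary score 0.0, regardless of how long either text is.
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


review_a = "the novel was praised by reviewers"
review_b = "reviewers praised the new novel highly"
timetable = "railroad timetables for the summer season"

# The two review sentences share most of their vocabulary, so they
# score far higher with each other than either does with the timetable.
```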

It occurs to me, suddenly, that what I've written might look a little like a recipe for a complicated dish, without any pictures of delicious food to convince anyone that all the effort will lead to some reward.

Let me get this straight. You're interested in late 19th and early 20th century American periodicals, and how they participated in establishing, maintaining, and negotiating cultural hierarchies.

You want to use computer programs to analyze this subject, but you actually have too much data?

Are you sure this is a good idea? If so, why?

In my next post, I'll address some of the potential rewards of mixing literary studies, book history, and computational/statistical methods.