Computational Methods in Authorship Studies

By Matt Lavin

September 06, 2017

What follows is a copy of my remarks from a talk at Duquesne University in February 2017. I wanted to share these now because I'm starting to think once again about the topic I discussed and I wanted to revisit past work before moving forward. Where necessary, I've added images from the accompanying slideshow.

Good afternoon, and thank you for having me (and especially thanks to Linda Kinnahan, Elaine Parsons, and Mary Parish for doing the work it takes to put this event together). I would like to begin today not with the digital humanities, but by situating myself in relation to two overlapping though often wholly disparate scholarly concepts: studies of a single author (in my case "Cather studies"), and the study of authorship or authorship history. Scholarship focused on a single author is perhaps less common in an English department than it was thirty years ago, but it is by no means a thing of the past. Its position or credibility, I believe, has greatly depreciated. Any number of factors might separate one single-author scholar from another, yet being too narrowly focused, too invested in authorial intent, or too biographical are all potential points of criticism.

Authorship history, in contrast, is perhaps best described as a subfield of the history of the book, also called book history or book studies. Book history generally focuses on the inception, production, publication, circulation, and reception of any written and printed material, including but not limited to books. Descriptive bibliography, which underpins the discipline, is concerned with the precise description of books as physical objects. These approaches differ from, say, biographical criticism or literary history in both topical focus and methodology, although all of the subfields I've mentioned have points of intersection. We move from the study of one author to the study of authorship when we take on topics like copyright, literary celebrity, or the origin of authorship as a profession, as do Martha Woodmansee, Loren Glass, and Leon Jackson respectively. Single-author case studies are common in this subfield; type "authorship" with Twain, Whitman, or Wharton, and you will see many results. The shared element in these studies is emphasis on the institution behind an author rather than the author as an individual.

In contrast, you would have a harder time finding a computational analysis of the history of authorship. As Matthew Kirschenbaum and Sarah Werner argued in the 2014 "State of the Discipline" essay for Book History, the distant reading or big data trend "is not one that has spoken to book historians" (410). A search for that approach would likely lead you to authorship studies or authorship attribution, which differs significantly from historical approaches to authorship. Those interested in large-scale computation have focused on formalistic questions like the meaning of genre or the verbal architecture of a plot, and book historians, by and large, have focused on other areas of digital humanities, such as scholarly editing, media archeology, or digital project development. The analytics-driven subset I am talking about differs from these other clusters of practitioners in significant ways. However, I will save most of these kinds of distinctions for later in this presentation.

In this talk, I will focus on some computational methods that seem particularly important to authorship studies and book history, and I will use Cather as a central frame. My aspiration is to do justice to Cather and yet gesture toward topics that might apply to many authors in many different contexts. I believe that many of these methods will also have implications for scholarship focused on a single author because large-scale analytics can intervene in some of the classic problems that arise when scholars like me try to navigate the fuzzy periphery where studying a single author becomes the study of authorship, and vice versa.

I will elaborate and (I hope) clarify what I mean with an example. In 2012, I defended a dissertation called "Collaborative Momentum: The Author and the Middle Man in U.S. Literature and Culture, 1890-1940," which traced the role of agents, editors, publishers, and other literary go-betweens in constructing modern authorial credibility. This dissertation, like so many others, was a collection of case studies, each focused on one particular author or text. My second chapter focused on Willa Cather's role in ghostwriting S.S. McClure's My Autobiography. In fall of 2012 or so, I excerpted and revised this chapter as an article for submission and was rejected several times, in part because my work was described as too narrowly focused on a single author. I revised the article substantially, and it appeared in Auto|Biography Studies last year.

As I've mentioned, the single-author case study is a well-accepted unit of analysis in literary studies and book history. Loren Glass's Authors, Inc. has chapters like "Trademark Twain" and "Legitimating London." Meredith McGill's American Literature and the Culture of Reprinting has "Suspended Animation: Hawthorne and the Relocation of Narrative Authority." (Alliteration and puns are encouraged but not required.) The question for me is where a case study begins to lose its emphasis on representing a wider scope than the discrete subject at hand. It's easy to begin with a question like "How did Cather's collaboration with McClure exemplify and/or differ from emergent norms for ghostwriters?" and instead become preoccupied with the not-totally-unrelated question, "Did Cather respect and care about S.S. McClure?"

I'm not the first to face this tension. Numerous scholars have written about Cather's role in the making of this book. Sharon O'Brien's biography of Cather emphasizes that the author's coming of age required outgrowing her role as "the deferential daughter to two fathers," Henry James, her aesthetic father, and McClure, her professional father (288), in order to move from McClure's editor to a novelist of national significance. Robert Thacker, responding to O'Brien, argues that Cather's "gift of sympathy" was the major force behind her participation—"That is, she wrote it as a favor and out of gratitude for all he had done for her since their first meeting in 1903" (126). O'Brien and Thacker have largely biographical emphases, though both make connections to Cather's emerging literary aesthetic.

Deborah Lindsey Williams and Emmy Stark Zitter offer close readings of My Autobiography aimed at exposing Cather's subjectivity in its narrative voice. Williams identifies textual signs of Cather deploying the ghost as a kind of "open secret" paralleling her own queerness, and Zitter interprets traces of a writer empowering herself by taking on and eventually turning away from the voice of a male narrator.

My goal with "Reciprocity and the Real Author" was to focus on Cather as a case study of "collaboration against the backdrop of ghostwriting as an emerging profession and against the strong historical connections between ghostwriting and autobiography." I felt the need to address McClure's motives for planning an autobiography and Cather's motives for serving as ghostwriter, as well as the degree to which My Autobiography was initially produced as a serial narrative for McClure's Magazine. I argued that McClure and Cather's collaborative work consistently straddled distinctions between economic and symbolic capital; that neither a strictly economic nor a strictly biographical analysis would demonstrate the complex ways in which economic capital and credibility interacted with each other during a period when ghostwriting itself was transforming but not yet fully transformed into a market-dominant set of norms. For me, the experience of working on McClure's autobiography demonstrated the concrete ways that the study of one author and the study of authorship blur and mix but simultaneously urge a scholar in different directions.

Sometime during my revisions for this article, Duquesne's Patrick Juola came to the University of Nebraska, where I was a postdoctoral fellow, and gave a talk about authorship attribution. Though it carries the keyword authorship, this field is significantly different from what I have described so far as authorship studies. Authorship attribution is perhaps best described as the statistical (and now, almost always computational) analysis of text to determine its most probable author. It can be regarded as a subcategory of stylometry (quantitative analysis of style) or digital humanities or humanities computing, but many of its best practitioners are in computer science departments rather than English or history departments.

Using authorship attribution methods to think about Cather as a ghostwriter seemed immediately promising. O'Brien and Thacker take opposing positions about Cather's motivations but share the perspective that her voice in McClure's autobiography makes its way into her later fiction. Williams and Zitter argue that Cather was more successful making her narrator sound like McClure. The autobiography is most commonly associated with My Antonia (1918) and The Professor's House (1925) because both have strongly autobiographical structure and content. All four see continuity between the autobiography and Cather's later fiction, where a generation of scholars before them regarded the autobiography as hack work, or a rush job, or something base that showed little sign of being written by Cather at all. Authorship attribution cannot settle this disagreement, but it can provide new information to evaluate how extensively Cather changed her writing style to take on McClure's identity for the project, and perhaps contribute to a larger conversation about how well any ghostwriter can fake another person's voice.

Here I feel the need to articulate some self-awareness. I am a white man speaking about the intersection of a woman's personal relationships with her emergent professional and authorial identity. To some, I might seem like a smart-aleck entering a humanistic debate by saying: "Actually, I've studied this issue with a computer program, and you're all wrong." Instead, what I'd like to do is use authorship attribution as one way (among many) to think about the ways of knowing that historicism, hermeneutics, and quantitative analysis can enable respectively and in tandem.

To do this, I constructed a study to compare Cather's writing against McClure's writing and see which was a better fit for My Autobiography. Cather wrote lots of things (nonfiction and fiction), so it was easy to find comparison texts by her. A few years after the autobiography appeared in print, McClure wrote and published a work of nonfiction called Obstacles to Peace. Using head-to-head tests, I compared Obstacles to Peace to various works by Cather, in each case using My Autobiography as the point of comparison. Beyond this, I was curious to see which of Cather's writings would register as most similar to the autobiography. I also wanted to run analyses on the magazine version of the autobiography and the book version, which McClure is said to have revised for publication without Cather's assistance. For the magazine, I analyzed each installment separately to see if Cather was more visible in any particular segment, which is a way of investigating the possibility that Cather wrote some installments but not others. I used a method called "Burrows's Delta" (more on this in a moment) and hypothesized that the test would easily identify Cather as the author of the autobiography.

In authorship attribution tests, especially in a "closed set" problem like this one (two candidates, one disputed text), the author's identity can be successfully determined with impressive accuracy, sometimes as high as 90-95%. Most authorship attribution methods look closely at how frequently a text uses high-frequency function words, words that are so common to the English language that it's almost impossible for someone to modulate deliberately how often they appear. These are words like "the," "of," "as," and "in." They are not associated with any particular context or topic (unlike, say, "fish" or "tree" or "prayer" or "shovel") and they can be found in any text of sufficient length in the English language. Burrows's Delta is a little different in that it constructs a word list based on the most frequent words in common among the candidates and test text, but this almost always produces a comparison set of function words because they are the most frequently used words in English. My sense of how authorship attribution would perform in the face of ghostwriting was that even a text that very successfully replicated the qualitative or affective experience of another person's manner of speaking would read, to a computer, as an obvious fake.
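For readers who want to see the mechanics, the Delta procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for this study: the function names (tokenize, relative_freqs, burrows_delta), the n_words parameter, the simple regular-expression tokenizer, and the use of the population standard deviation are all choices made for the example, and every text is assumed to be non-empty.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word within one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocab]


def burrows_delta(texts, disputed, n_words=100):
    """Return {label: delta} for every candidate text measured against the disputed text.

    texts maps a label to raw text and must include the disputed text itself.
    A lower delta means a closer stylistic match.
    """
    tokens = {label: tokenize(raw) for label, raw in texts.items()}

    # Build the word list from the most frequent words across the whole comparison set.
    pooled = Counter()
    for toks in tokens.values():
        pooled.update(toks)
    vocab = [word for word, _ in pooled.most_common(n_words)]

    freqs = {label: relative_freqs(toks, vocab) for label, toks in tokens.items()}

    # Convert each word's frequency to a z-score across all texts in the set.
    z = {label: [] for label in freqs}
    for i in range(len(vocab)):
        column = [freqs[label][i] for label in freqs]
        mu, sigma = mean(column), pstdev(column)
        for label in freqs:
            z[label].append((freqs[label][i] - mu) / sigma if sigma else 0.0)

    # Delta is the mean absolute difference between z-scores.
    return {
        label: mean([abs(a - b) for a, b in zip(z[label], z[disputed])])
        for label in z
        if label != disputed
    }
```

Called with a dictionary holding one Cather text, Obstacles to Peace, and My Autobiography as the disputed text, a function like this would return one Delta score per candidate. A head-to-head margin could then be taken as the difference between the two scores, with the lower Delta indicating the closer match.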

My findings were consistent with the initial conjecture. On the surface, the results of this study are quite clear, yet they may also suggest some deeper implications. I ran several versions of Burrows's Delta, with various small changes to the settings (changing the number of function words analyzed, and changing sample sizes, two different ways of controlling for text length) and found that my tests predicted Cather to be the more probable author of the autobiography. For the book and serial versions as a whole, Cather won 24 head-to-heads and McClure won zero. In one test for installment 8 and two tests for installments 2, 4, and 6, McClure surpassed Cather's "Three American Singers." These are McClure's only victories in the entire set, and can be viewed as statistical outliers. (I think it would be more suspicious if he didn't win any head-to-heads.) Further:

  • Cather texts of various genres, dates, and lengths win over Obstacles to Peace
  • Results hold true for serial and book versions, using culling=0 and culling=100
  • Mean margins of victory for book version are slightly smaller than magazine version, but not significant enough for me to link them to any specific edits.
  • Any changes McClure made to the book version of the autobiography were not extensive enough to swing the results of these tests to him.
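The culling setting mentioned in the list above can be illustrated with a short sketch. It assumes, as the figure captions in this post describe, that culling=100 keeps only words present in every compared text and culling=0 ignores nothing. The function name cull_vocab and its arguments are hypothetical, chosen for the example rather than taken from any particular tool.

```python
import math


def cull_vocab(vocab, token_sets, culling=100):
    """Keep only the words found in at least `culling` percent of the texts.

    vocab is an ordered word list; token_sets holds one set of word types per text.
    """
    needed = math.ceil(len(token_sets) * culling / 100)
    return [word for word in vocab if sum(word in s for s in token_sets) >= needed]
```

With three texts, culling=100 requires a word to appear in all three, while culling=0 leaves the word list untouched.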

The autobiography's authorial signal matching Cather's is consistent with biographical and archival sources, both of which suggest that McClure supplied subject matter for his autobiography, whereas Cather was the only person known to have put pen to paper for the serial version.

The matter of Cather's probable authorship is rather uncontroversial, which is in itself a very substantive finding, as it provides something of an initial affirmation for scholars like me who were sure she wrote the thing. Perhaps more interesting, however, were my results for individual installments and individual head-to-heads. Since Cather was the more likely candidate in so many head-to-head comparisons, these images depict the margin of difference between Cather and McClure, a greater margin suggesting (simplistically) more similarity between the autobiography and a Cather text. The images below show that Cather's margins are highest for the first and second installments and lowest at the very end of the autobiography.

Figure 1: Authorial signal margins by installment for My Autobiography serialization, comparing most frequent words, but only when words are found in all compared texts. (The higher a line is along the Y axis, the more clearly that novel matches My Autobiography's style)

Figure 2: Authorial signal margins by installment for My Autobiography serialization, comparing most frequent words, regardless of whether they are found in all compared texts. (The higher a line is along the Y axis, the more clearly that novel matches My Autobiography's style)

In terms of Cather texts most similar (by this measure) to My Autobiography, I have used histograms to show the top six margins of victory in all my various head-to-heads. I'm presenting four versions of this top six list:

Figure 3: Top Six Margins. Serial Version. Function Words Not Present in All Three Texts Ignored (cull=100). Margins of Difference Scaled to Delta.

Figure 4: Top Six Margins. Serial Version. No Function Words Ignored (cull=0). Margins of Difference Scaled to Delta.

Figure 5: Top Six Margins. Book Version. Culling.

Figure 6: Top Six Margins. Book Version. No Culling.

Taking these results as a whole, we can see that Cather's "Training for the Ballet," My Antonia, The Professor's House, and Obscure Destinies are found in the top six margins of victory for all four versions of the test. Lucy Gayheart and "Plays of Real Life" both rank in the top six margins of victory in two out of four tests. O Pioneers!, Shadows on the Rock, and Not Under Forty each appear in one top six list, and other works like Alexander's Bridge, Death Comes for the Archbishop, My Mortal Enemy, and A Lost Lady do not ever appear in the top six margins of difference.

Although it can be notoriously difficult to assign causality to the result of an authorship attribution test (or, in fact, to assign causality to any correlation), I find it suggestive that My Antonia and The Professor's House have strongly autobiographical structures. The strength of their margins over Obstacles to Peace could relate to their use of the first person but, if that were the case, I would expect to see My Mortal Enemy (also first person) in the top tier.

A logical next step here that I have yet to perform would be to run Burrows's Delta on a setting that deliberately ignores personal pronouns like I, me, him, he, her, she, and they. Additionally, I would like to cross-validate the results of the Burrows's Delta test using two well-known authorship attribution techniques: logistic regression of high-frequency function words and chi-squared testing. Each of these methods has pros and cons but, for my purposes, I'm most concerned with whether the results of these methods will be consistent with the Burrows's Delta result, especially in terms of the higher margins of difference for particular texts by Cather. Additional next steps would include, foremost, expanding the McClure corpus to break the study's dependence on Obstacles to Peace. McClure authored very few texts, but some of his personal letters could be transcribed and analyzed using these methods. Cather's letters should be added to the comparison to create analytical parity.
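Two of the next steps described above, setting aside personal pronouns and cross-checking with a chi-squared measure, can be sketched briefly. This is an illustration only, since these tests have not been run. The names drop_pronouns and chi_squared are hypothetical, and the statistic shown is one common way of comparing word counts between two texts (observed counts against the counts expected if both texts drew on a single shared distribution). It assumes both token lists are non-empty.

```python
from collections import Counter

# The pronouns named in the paragraph above; a fuller list would be a design decision.
PRONOUNS = {"i", "me", "him", "he", "her", "she", "they"}


def drop_pronouns(vocab):
    """Remove personal pronouns from a word list while preserving its order."""
    return [word for word in vocab if word not in PRONOUNS]


def chi_squared(tokens_a, tokens_b, vocab):
    """Chi-squared statistic for two token lists over a shared word list.

    A lower value means the two texts use the listed words more similarly.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))
    chi2 = 0.0
    for word in vocab:
        joint = counts_a[word] + counts_b[word]
        if joint == 0:
            continue  # word absent from both texts, nothing to compare
        expected_a = joint * share_a
        expected_b = joint * (1 - share_a)
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2
```

If the candidate with the lower chi-squared value against My Autobiography matched the candidate with the lower Delta, that agreement would count as a modest form of cross-validation.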

Perhaps more significant than any immediate next steps, however, would be some consideration of how a case study like this one can be built upon in a way that informs the humanistic study of authorship or cultures of letters. Immediate research questions related to what I've done so far might include a broader inquiry into how well ghostwriters throughout history imitated their subjects' authorial tendencies. If we can establish norms for ghostwriters, perhaps we can use anomalous results to isolate cases where an autobiographical subject has more of a co-authorship role. Going further, I would like to use these methods to think critically about anonymity in nineteenth-century periodicals, pseudonyms in pulp fiction, and various models of authorial collaboration.

This point marks the end of my remarks from the talk, but there's a bit more to what I want to share. Everything below this block of italics I wrote this fall.

This piece, like so much of what I do, is unfinished work.

Part of my task for that February day almost nine months ago was to cue up an interactive workshop, so the rest of the text pertains to that subject. I went on to try and show, by example, some of the computational methods that I think are best suited for building on a case study like mine in a way that puts pressure on the history of anonymity or ghostwriting or the humanistic study of authorship or a culture of letters.

What I basically let hang was a more theoretical or generalized discussion of this same topic. The extent to which I moved away from such musing is very clear to me now as I re-read, but it was less clear at the time. I suppose I thought the question was provocative even without an initial response.

What I think I alluded to consistently in the piece is the idea that authorship studies, like computational analysis, lives and dies by the finding of facts. Book historians and bibliographers, map-makers and chronologists and scholarly editors must believe to one extent or another in the primacy and relative autonomy of external phenomena. This is the basis for historical observation, as well as quantitative analysis. Observations can and must be complicated, situated, contextualized, and questioned. I am a lover of irreverence, and I instinctively mistrust an authority figure pressing Truth down upon me like a cover stamp. Nevertheless, the question of Cather's role in McClure's autobiography has a basis in the empirical, and some positions with regard to that question are simply more accurate than others. The book historian/bibliographer and the computational humanist share a love for observation and a need to complicate our observations, and this I think is the initial basis for computational book history/bibliography, which includes computational authorship studies.

One problem with the fact people in literary studies is that so many of their observations have been superficial distortions of observable phenomena. In the author-centered literary studies of old, the Author always seems to work alone. The Author has read everything. The Author was always on course to write her magnum opus. The Author did not lie in letters. The Author didn't get ideas from cheap paperbacks by Edward Bulwer-Lytton or Zane Grey. The Author won every fight and rerouted the river to clean the Augean Stables. Some fear a return to author-centered scholarship precisely because this specter looms in the distance. I believe that computational authorship studies can help us turn away from myths and bring us closer to discussions of norms and exceptions.

Others fear something a colleague of mine called minutiae. I think what he meant to say is that single-author scholars often get so good at looking at a tree that they forget there's a forest. This worry has less to do with the study of authorship than the study of a single author, but both are capable of descending into minutiae, as I probably did numerous times in my dissertation. In my February talk, I put this perhaps more delicately. I spoke of "where a case study begins to lose its emphasis on representing a wider scope than the discrete subject at hand." Either way, I believe computation can help with this tension.

So far I'm imagining three "next steps" toward a more developed community of computational book history/bibliography. The first step, I think, is learning to write computational research questions. That's what the interactive workshop after my February talk was about. We need to think in terms of questions that computing can answer, or at least help us address.

Second, to write computational research questions (or computationally compatible research questions), we must reckon with big questions of empiricism and quantification. I doubt we will resolve our ambivalence about empirical phenomena any time soon, but I believe we must confront it openly.

Third and finally, we must build a culture of open data. As book historians, we can talk about copies printed, list prices, title variations across editions, people named in various acknowledgments, or any number of areas of inquiry that require better data. One day we might even have data showing whether ghostwritten texts typically resemble the ghostwriter's style more than the autobiographical subject's. Publishing web-based resources is not enough; we must enable computational access to the data we make.

We must teach book historians to make data, to interrogate data, to make use of other scholars' data, and to share data. These practices need to become prerequisites for computational humanities, and they could be an effective rallying point for computational book history/bibliography.