Terms and Conditions for a Culture of Open Data

By Matt Lavin

December 27, 2017

This post is a response to Andrew Piper's recent "An Open Letter to the MLA" (12/21/2017), as well as a response to some of the reaction it generated on Twitter. I want to express solidarity with Andrew and to acknowledge that many members of our community have raised important issues that complicate his initial statement. Nevertheless, with the MLA's annual convention soon to come, I believe that now is an especially timely moment to come together and advance our public conversations about data-driven research in the humanities.

Before all of that, I'd like to say a little about myself to contextualize my response. I'm not an especially well-known digital humanities personality, but I've been engaged with DH as a scholar and a teacher since 2012, which was the year I defended my dissertation. I worked as a postdoctoral scholar at the Center for Digital Research in the Humanities at the University of Nebraska-Lincoln, and I spent two years after that as Associate Program Coordinator for an institution-wide Mellon grant focusing on digital humanities and integrative learning at St. Lawrence University in Canton, NY. For the past two-and-a-half years, I've been Director of the Digital Media in the English Department at the University of Pittsburgh.

I should also say that, especially in terms of my scholarship, I have spent the last few years focusing on methodological training rather than public visibility. I've been learning to code and working in 'alt-ac' jobs that usually involve some kind of hands-on digital making. My pre-DH scholarship focused on spaces where authorship studies overlaps with literary studies and, like Katherine Bode, Andrew Piper, and Ted Underwood (to name a few better-known examples), I believe that computational methods (including but not limited to large-scale data mining and machine learning approaches) have tremendous potential to assist scholars in revisiting core book history and sociology of literature research questions. About a year ago, I started a small website called humanitiesdata.com in an attempt to make finding and sharing datasets of interest to the humanities a bit easier. This background informs my perspective on the idea of open data. I recognize that others have stories of their own that inform their points of view.

Below, I have attempted to sketch out what open data means to me. I see open data as a set of philosophies that often include stances on access, timeliness, and reproducibility.

Access relates to the idea that some data ought to be open to the public, in formats that facilitate various uses. I would argue that the MLA International Bibliography is a good candidate for this status but, historically, I believe access to the bibliography has been restricted to members. Further, now that the bibliography is no longer in print but accessed instead as a searchable database, access is provided through subscriptions by three very prominent vendors: Cengage, EBSCO, and ProQuest. These companies collect subscription fees, mostly from academic libraries in exchange for access. Presumably the MLA is paid a percentage. If the MLA International Bibliography were to become truly open, the profit potential of these data would probably be affected.

Yes. The conflict between data-providers and researchers often takes this form: pic.twitter.com/7o0WRzDoF5

— Ted Underwood (@Ted_Underwood) December 22, 2017

Timeliness relates to the idea that data can't really be open if it comes only after long delays. This principle is especially relevant to issues of government transparency, but it also applies to datasets that accompany scholarly publications. A work of scholarship associated with a dataset likely contains the most rigorously constructed interpretation of results for that dataset that will ever exist, so publishing data and code alongside an article that depends upon both can be seen as the best way to be sure that an article's most engaged respondents will have the best possible tools in hand to scrutinize the work.

This brings me to reproducibility. The best way to scrutinize computational work and build on it is to recreate the source code and data and rerun a scholar's experiments. Reproducibility is not the most common way of thinking about scholarship in literary studies, at our level, because it is often taken as a prerequisite that a scholar's quotations from source texts should be accurate and correctly cited. Yet editors at many scholarly journals spend countless hours double-checking citations, and any reader of a periodical can conceivably do the same. As far as I know, only in a data-driven context do the humanities accept work where the objects being analyzed might not be available for this kind of post-publication examination. Preventing reproducibility hurts scholarship by opening the door to fraud and putting up barriers to follow-up work.

Some may argue that this examination and follow-up work isn't happening anyway. To this I would say that openness invites open discussion and collaboration. It doesn't guarantee that these things will happen, but closed data practices do all but guarantee that they will be difficult or impossible.

In short, I believe digital humanities should consider access to timely, multi-format, and reproducible data as a core set of values. A culture of open data would strengthen our scholarly ecosystem, as well as the work it produces. How we deliver this is a subject for debate and discussion, but this is the first point on which I would like to express solidarity with Andrew Piper. Andrew has called upon scholarly organizations like the MLA to reevaluate the practice of partnering with for-profit companies to provide access to its data and to "stop signing license agreements that limit access and the public circulation of their data." He has argued that libraries should stop signing license agreements that limit such access. I believe that this is a question that the MLA should discuss and debate as a group and in the open. Should this question not, at least, be put to its members?

The MLA constitution states, "The executive director shall see that all actions of the Executive Council and the Delegate Assembly are reported in an appropriate publication of the association within a reasonable period of time. On petition of one percent of the association, any such action shall be subjected to a referendum mail ballot of the entire membership." I am by no means an expert on MLA procedures, but a carefully crafted petition for a full membership vote at the 2019 convention seems like an appropriate way to raise Andrew's question for public debate.

I have seen numerous Tweets raising the point that the MLA and its vendors are already engaged in efforts to develop an API (application programming interface), which will provide exactly the kind of access Andrew Piper and others are asking for.

I have been biting my tongue, but can’t continue to do so. What those of you passing around that open letter about the MLA’s data sharing practices haven’t been told is that the org has informed the author repeatedly that it’s developing a new platform with an open API +

— Kathleen Fitzpatrick (@kfitz) December 22, 2017

that is designed support exactly the kind of research the author wants to do. That sort of development takes time, and the org has asked for (and obviously not received) his patience.

— Kathleen Fitzpatrick (@kfitz) December 22, 2017

This point should be acknowledged as a valid critique but also, I believe, as an insufficient response to Andrew's concerns.

This point inevitably crosses into questions some have raised about whether Andrew was sufficiently patient with established processes for accessing and sharing MLA data. Andrew, by scraping the current interface and sharing data, may have violated a vendor's terms and conditions. However, having read those terms and conditions, I find that they do seem broad and difficult to adhere to. Further, I have to express sympathy for someone like Andrew, as the emotional reality of working with large datasets is often a mix of confusion and frustration. Although it does seem that Andrew made some mistakes, I think mistakes like these are understandable. As more citizens of our scholarly community learn to code, these types of run-ins are going to become increasingly common. Meanwhile, as Kathleen Fitzpatrick, Paige Morgan, and others expressed on Twitter, we have to find ways to acknowledge and respect the hard work that other practitioners are doing behind the scenes to provide access to these kinds of services.

I'm sorry you're having to deal with this. (It's the latest in a series of examples I've seen wherein people in research-focused roles fail to understand institutional processes/how institutional infrastructure works.

— Paige Morgan (@paigecmorgan) December 22, 2017

I also want to address the idea of providing "non-consumptive access," which Amanda French raised in a Tweet.

I do know for a FACT that the MLA is committed to providing “non consumptive” access to researchers who want to analyze its data in the aggregate, but, um, they are building that and it doesn’t exist yet.

— Amanda French (@amandafrench) December 22, 2017

With literary texts, non-consumptive alternatives to raw text files make sense as a way to provide data without providing a version of that data that can be converted back into an e-book that one might substitute for buying the book legally. In the context of the MLA International Bibliography, it's hard to imagine what a non-consumptive format would look like. I'm obviously open to being corrected on this, but any CSV of the database could be re-engineered into a ProQuest-type interface. To me, the best solution is to make the data available through an API and as downloadable files, on the condition that the user will not use the dataset to create a front-end interface. Many datasets are shared with similar terms and conditions, and violators are subjected to the same kind of takedown notices that Andrew received.
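To make the point about CSV files concrete, here is a minimal sketch of how little code it takes to put a search function on top of tabular bibliographic records. Everything in it is an assumption made for illustration: the sample rows are invented, the column names (author, title, year, subject) are not the MLA's actual schema, and `load_records` and `search` are hypothetical helper names.

```python
import csv
import io

# Invented sample records. The column names are assumptions for this
# illustration and do not reflect the MLA International Bibliography's schema.
SAMPLE_CSV = """author,title,year,subject
"Doe, Jane",Reading Networks,2015,book history
"Roe, Richard",Serial Fiction and Its Readers,2012,periodical studies
"Poe, Alex",Ledgers and Libraries,2016,book history
"""


def load_records(csv_text):
    """Parse CSV text into a list of dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def search(records, query):
    """Return records in which any field contains the query (case-insensitive)."""
    needle = query.lower()
    return [
        record
        for record in records
        if any(needle in (value or "").lower() for value in record.values())
    ]


records = load_records(SAMPLE_CSV)
for hit in search(records, "book history"):
    print(f'{hit["author"]}. {hit["title"]} ({hit["year"]})')
```

A real interface would need indexing, fielded queries, and a front end, but the core of it is not much more than this, which is why I think the protection has to come from terms of use rather than from the file format.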

In specific response to MLA practices, I think some crucial questions should be addressed in as public a form as possible.

  • Since we have heard repeatedly that an API is being developed, what is a reasonable timeline for users to expect access? How can we get periodic updates about the status of the API? How can code-savvy scholars help develop it?

  • What should the MLA's policies toward its data be in the meantime?

  • Why is hosting a self-made dataset online incompatible with the MLA's intended data policies? What are the current policies with regard to accessing the data for computational purposes?

Answers to some or all of these questions may be online somewhere. I looked on the MLA's website and didn't find them. I would have hoped to see them articulated or linked in the FAQ section of the MLA International Bibliography.

Finally, I would raise the question of how we can best engage with these questions non-adversarially, or at least less adversarially. Past mistakes aside, I believe that most of us want the same thing: a shared asset that benefits MLA members and facilitates a better understanding of our field(s) of study. I can't speak to what has gone on behind the scenes, and I wish to express only respect for anyone who feels that my call for more openness is an attack of any kind. So many people have put so much effort into making the MLA International Bibliography what it is today, and Andrew's eagerness to use the database for his scholarship can be regarded as a testament to that fact.