Diagnosis: Neologophobia

By Matt Lavin

June 01, 2017

A Provocation

With the spring semester over, I've been coming back to a project I haven't worked on in almost a year. It originated with my collaboration with Alex Gladwin and Dan Look on H.P. Lovecraft's role in revising C.M. Eddy's "The Loved Dead." Sometime during that collaboration, Dan mentioned to us that Lovecraft had said in a letter that he didn't tend to use words in his writing that had come into the English language more recently.

At the time, I had read Ted Underwood's article on "The Emergence of Literary Diction," and I was immediately interested in whether a computational approach could take a closer look at Lovecraft's use (or avoidance) of neologisms and archaisms. At Keystone DH in June 2016, I presented a solo conference paper on some of the work I'd done to interrogate this subject, focusing especially on how to measure a single author's use of neologisms. The summer, however, was short, and I found myself scrambling to finish my fall syllabi before I'd had the chance to adapt the conference presentation into an article.

Before I say more about what I did in this presentation and what I am doing for the article, let me back up and say that, some time between the initial conversation with Alex and Dan and the conference paper, I asked Dan to help me find the original letter he was talking about. He provided two letter citations, both found in H.P. Lovecraft Selected Letters, but the first of them, Lovecraft to Maurice W. Moe, January 1, 1915, really got me excited:

The books which are used were not modern reprints, but musty old volumes written with "long s'ses." By some freak of childish perversity, I began to use long s's myself, and to date everything two centuries back. I would sign myself "H. Lovecraft, Gent., 1698", etc. Latin came quite naturally to me, and in other studies my mother, aunts, and grandfather helped me greatly. ... When I was ten I set to work to delete every modern word from my vocabulary, and to this end I adopted an old Walker's dictionary (1804) which was for some time my sole authority. All the Queen Anne authors combined to form my literary diet. (qtd. in Derleth and Wandrei, 5-6)

I consider this letter a provocation of sorts. The literary biographer wonders simply how accurate Lovecraft's narrative of self-fashioning could be. Is it possible to read so much from an historic period that one's prose begins to look and feel like a throwback? Is it possible to use an historic dictionary to purge one's writing of neologisms? And if so, why would you want to do that? In this blog post, I will unfortunately answer none of these questions, as most of them are at the heart of the article I'm working on. Instead, I'm going to focus on one particular computational method that I've been working on to describe an author's relationship with their historical moment: a linear regression algorithm to predict a text's year of origin.

Defending an Over-the-Top Blog Title

What is neologophobia? If you ask the Google machine, it might reply, "Did you mean: monologophobia?" Monologophobia, I just learned, is "a fear of using a word more than once in a single sentence or paragraph." However, if you look up the Greek origins of the terms philia and phobia, I believe you will find them to be antonyms. -Philia and -phile refer to abnormal attraction or a tendency toward, and -phobia and -phobe refer to abnormal repulsion (fear, hate) or a tendency against. It is with this definition in mind that I am using the term neologophobia: an abnormal aversion or tendency away from neologisms.

Stylochronometry: Because Big Words Are Fun

I don't want to do an extensive literature review here, but the study of a text's date signature is not new. Some have been interested in questions like "When did Shakespeare most likely write [name your play]?" or "Which came first, novel A or novel B?" The idea that a text contains information about its approximate date in history is a bit more recent, or maybe it's just become more prominent with the rise of machine learning methods.

I should mention that I emailed with Ted Underwood about a year ago about date prediction, and he suggested that I look into machine learning, especially linear regression, because a text's publication date is continuous data: the year 1785 is more closely related to 1786 than to 1800, whereas in a classification task like sentiment tagging, "happy," "sad," "afraid," and "angry" are all equally distinct from one another. This page has a better summary.
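
To make that distinction concrete, here is a tiny, hypothetical example of how the choice of target changes which scikit-learn estimator you reach for; the toy term counts, years, and genre labels below are made up purely for illustration:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up term counts for two imaginary texts
term_counts = [{"eldritch": 3, "telegraph": 1}, {"rocket": 4, "telegraph": 2}]
years = [1925, 1950]            # continuous target: use a regression model
genres = ["horror", "scifi"]    # categorical target: use a classification model

X = DictVectorizer().fit_transform(term_counts)
year_model = LinearRegression().fit(X, years)      # predictions fall along a continuum
genre_model = LogisticRegression().fit(X, genres)  # predictions are discrete labels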

I was very new to machine learning at the time, and I was grateful for the nudge in the right direction. I did a bit of classification for Keystone DH 2016, but I put it aside with everything else when school started last fall. My next big nudge was meeting David Bamman when he came to Pittsburgh to give a talk. I missed his talk (for a dissertation defense) but got to talk to him later that day. I later found his article, "Estimating the Date of First Publication in a Large-Scale Digital Library", which is the best work I've seen on this topic.

Bamman et al. ultimately conclude that "the best method for estimating the date of first publication for books in a large digital library is to leverage the depth of the collection, identifying duplicates and assigning the first date of publication for a book to be the earliest date attested among its near-duplicates" (6). However, this assertion pertains specifically to the information retrieval or quality assurance aspect of their work. They add:

Even though our estimates of the true first date of publication are better served with deduplication-based methods, learning a model to predict this date from the content of the book gives us the potential for deeper insight into the books in our collection by providing a mechanism for measuring apparent time, as distinct from both the observed publication date or the narrative time. (8)

In other words, an added bonus of employing machine learning methods rests in the way the model relates to its outliers. In a linear regression model in particular, you're essentially fitting a line (or, with many features, a hyperplane) designed to predict an output variable from a set of inputs. Given the input of a text's term frequencies, a well-trained model can produce a reasonably accurate prediction of that text's likely date of inception.

However, there are always outliers: texts whose predicted dates are way off from their actual dates. If you're in QA, you might use this outlier status to flag a text for a possible metadata error (Bamman et al. 7). Once a date error has been ruled out, however, we must consider the notion that a text whose predicted date differs sharply from its actual date might do so because its textual features defy the norms of its historical period.

Dataset and Some Code

Around May of 2016, I was putting a lot of energy into how I might create a dataset to compare Lovecraft's "date signal" to other authors of interest. Then I came upon Ted Underwood's "The Lifecycles of Genres" and realized that he'd shared publicly a dataset that I could use for my work and easily build upon. As a result, the methods I've been playing with are tested on this corpus of general metadata, genre tags (science fiction, crime/mystery, gothic/horror and non-genre) and term frequency tables for more than 950 novels. Here's the repo. I was also able to use the same codebase for a conference paper on genre analysis at the "How to Do Things with Millions of Words" conference in November 2016.

Ted also shared his code. It focuses on using a logistic regression to make genre predictions, so it needed some adapting. I focused my edits on two fronts: designing my own tests and adding features that would make the code run faster. The first way I thought to speed it up was to take the block of code that imports term frequency tables and store the resulting data so that you don't need to rerun that step every time you run a regression. Reading and parsing that many individual text files is an infamously slow, resource-intensive process.

You can get some major memory benefits using a datastore (MySQL, PostgreSQL, Redis, etc.) but, after a lot of playtime, I opted to load the data as a series of Python pickles, which are surprisingly efficient. Here's what some of that code looks like:

import csv
import pickle

# Build a lookup of the 10,000 most common terms (skipping the header row)
dict_of_10k = {}
with open("lexicon/new10k.csv", "r") as myfile:
    for m_line in csv.reader(myfile):
        if m_line[0] != "word":
            dict_of_10k[m_line[0]] = m_line[1]

# Collect document ids from the metadata file (skipping the header row)
docids = []
with open("meta/finalmeta.csv", "r") as myfile:
    for m_line in csv.reader(myfile):
        if m_line[0] != "docid":
            docids.append(m_line[0])

# For each document, read its term frequency table into a dictionary,
# keeping both a full version and a version limited to the top 10k terms
full_feature_dicts = []
feature_dicts = []
excluded_ids = []
for _id in docids:
    try:
        full_fdict = {}
        fdict = {}
        with open("newdata/%s.fic.tsv" % str(_id)) as f:
            s = f.read()
            # escape a stray quotation-mark token so the rows split cleanly
            s = s.replace("\n\"\t", "\n\\\"\t")
            rows = s.split("\n")
            cells = [tuple(i.split("\t")) for i in rows]
            for y in cells:
                try:
                    full_fdict[y[0]] = int(y[1])
                except (IndexError, ValueError):
                    pass
                # keep the term in the smaller dict only if it's a top 10k term
                if y[0] in dict_of_10k:
                    try:
                        fdict[y[0]] = int(y[1])
                    except (IndexError, ValueError):
                        pass
        full_feature_dicts.append(full_fdict)
        feature_dicts.append(fdict)
    except OSError:
        # the term frequency file is missing or unreadable; note it and move on
        print(_id)
        excluded_ids.append(_id)

# Save everything to disk so later scripts can skip the slow parsing step
pickle.dump(feature_dicts, open("pickled_data/feature_dicts_10k.p", "wb"))
pickle.dump(full_feature_dicts, open("pickled_data/full_feature_dicts.p", "wb"))
pickle.dump(excluded_ids, open("pickled_data/excluded_ids.p", "wb"))

I use similar scripts to load metadata and genre tags, but the term frequency feature data is where you make all your gains in terms of efficiency. This part of the code takes several minutes to run but, once your data is saved as a set of pickles, you can load those files instead of reloading and reprocessing the text files every time you run a regression.

C10H15N + OD = Method? (That Title is the Worst Joke I've Ever Thought Of)

So far, in one form or another, I've played with characterizing a text's date by looking at its Germanic-Latinate word origin ratio (using dictionary.com word origin data from Underwood and Sellers' work and a scraped dataset from the OED Online); by calculating a "Walker ratio," which measures how many of a text's words also appear in an OCRed version of Walker's dictionary; and by using various machine learning methods, including linear regression.
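
Here, roughly, is what that Walker ratio calculation looks like in code. The word list file name is a hypothetical stand-in (I'm assuming a plain-text list with one headword per line, extracted from the OCRed dictionary), so treat this as a sketch of the idea rather than my production code:

import re

# Load the dictionary headwords into a set for fast membership checks.
# "walker_wordlist.txt" is a hypothetical file name.
with open("walker_wordlist.txt") as f:
    walker_words = set(line.strip().lower() for line in f if line.strip())

def walker_ratio(text):
    """Share of a text's word tokens that also appear in Walker's dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    in_walker = sum(1 for t in tokens if t in walker_words)
    return in_walker / len(tokens)

In this version of the code, though, I'm focusing on the linear regression approach.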

Whenever you run a supervised learning model, it's typical to partition your dataset into a training set and a test set. You train the model on one partition (term counts and dates) and test it on the other partition, feeding it only term counts and generating predicted dates. You can then compare the predicted dates to the labeled dates for your test set to see how accurate your model is. My code achieves this task by generating 500 random numbers to indicate which texts will be in the training set.

from random import shuffle
import pickle
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

#load 10k feature dicts and metadata
metadata = pickle.load( open( "pickled_data/metadata.p", "rb" ) )
feature_dicts = pickle.load( open( "pickled_data/feature_dicts_10k.p", "rb" ) )

def predict_years(metadata, feature_dicts):

    all_years = [int(i[8]) for i in metadata]
    all_ids = [i[0] for i in metadata]
    myrange = list(range(0, len(metadata)))
    shuffle(myrange)

    randoms = myrange[:500]
    randoms.sort()

    #define train_years, test_years
    train_years = []
    test_years = []

    #define train_dicts, test_dicts
    train_dicts = []
    test_dicts = []

    train_ids = []
    test_ids = []

    for num in myrange:
        if num in randoms:
            train_years.append(all_years[num])
            train_dicts.append(feature_dicts[num])
            train_ids.append(all_ids[num])
        else:
            test_years.append(all_years[num])
            test_dicts.append(feature_dicts[num])
            test_ids.append(all_ids[num])

    #print(len(train_years) == len(train_dicts))
    #print(len(test_years) == len(test_dicts))
    train_ids_str = ", ".join(train_ids)
    test_ids_str = ", ".join(test_ids)

    #use scikit learn Pipeline functionality to vectorize from dictionaries, run tfidf, and perform linear regression
    text_clf = Pipeline([('vect', DictVectorizer()), ('tfidf', TfidfTransformer()),('clf', LinearRegression()),])
    text_clf = text_clf.fit(train_dicts, train_years)
    predicted = text_clf.predict(test_dicts)

    result_rows = []
    margin = []
    for i,j in enumerate(predicted):
        m = abs(j - test_years[i])
        margin.append(m)
        row = [test_ids[i], j, test_years[i], m]
        result_rows.append(row)

    mean = np.mean(margin)
    main_row = [test_ids_str, train_ids_str, mean]
    return (main_row, result_rows)

Once you've run the regression, it's probably a good idea to store your results for later use. One way to do that is with csv or Microsoft Excel output, as these are widely used and easy-to-access formats. Another way to go is to make a simple SQLite database. The sqlite3 module is built into Python and is very easy to use. It's also another area of performance optimization, as an SQLite storage engine will increase retrieval speed when you go back to your results for analysis or visualization. Third, you retain the interoperability benefits you would get from a format like .csv, as it's very easy to load SQLite data in R and other programming languages. Here's a function that takes linear regression results and stores them in two SQLite tables:

import sqlite3

def store_results(main_row, result_rows):
    conn = sqlite3.connect('regression_scores.db')
    c = conn.cursor()
    make_main = """CREATE TABLE IF NOT EXISTS main (id INTEGER PRIMARY KEY, test_ids TEXT, train_ids TEXT, mean_margin REAL)"""
    c.execute(make_main)
    make_results = """CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, main_id INTEGER, doc_id TEXT, predicted REAL, actual REAL, margin REAL, FOREIGN KEY(main_id) REFERENCES main(id))"""
    c.execute(make_results)

    insert_main = """INSERT INTO main (id, test_ids, train_ids, mean_margin) VALUES (null, ?, ?, ?)"""
    c.execute(insert_main, main_row)
    conn.commit()

    #get id for row you just inserted
    main_id = c.execute("""SELECT id FROM main ORDER BY id DESC""").fetchone()[0]
    insert_result = """INSERT INTO results (id, main_id, doc_id, predicted, actual, margin) VALUES (null, ?, ?, ?, ?, ?)"""
    for result_row in result_rows:
        new_row = [main_id]
        new_row.extend(result_row)
        c.execute(insert_result, new_row)
    conn.commit()
    conn.close()

Shuffle Up and Deal

In many cases, it's best to add a second layer of randomness. Instead of simply partitioning your data once and checking your performance, you can run the model and collect results, then shuffle the memberships of test and train and re-run. If you go through this process enough times, you start to get a sense of which random arrangements of the training set might be outliers when compared with all the other potential arrangements. Since my regression function shuffles train and test automatically each time it is run, the code to execute this shuffle-and-rerun method can be as simple as this:

#run the regression and store its results 300 times
for z in range(300):
    #re-run the regression function
    result_tuple = predict_years(metadata, feature_dicts)
    #store its results in the database
    store_results(result_tuple[0], result_tuple[1])
    #print a progress report to the terminal every tenth iteration
    if z % 10 == 0:
        print(z)

Normal Distribution?

After running the regression 100, 200, or even 600 times (which is easier to cope with since we've improved the overall speed of the program), we start to get a sense of what the models collectively tend to predict each document's date to be. Using our sqlite database, we can quickly get the average prediction for each document ID:

"""SELECT avg(predicted) as a, actual, doc_id FROM results GROUP BY doc_id;"""

This query gives us the mean prediction, the labeled date, and the document ID for each document in our results table. From here, we can calculate the margin between each mean prediction and its corresponding metadata date. If we bundle all of these predictions and margins together, we can get a sense of their overall distribution.

What we see here is something resembling a normal distribution. The peak in the middle is higher than you would expect of a normal distribution (more kurtotic, in stat-speak), and there are some extreme outliers if you look at either tail. Since we're grabbing values from our database, the code to generate this plot is as simple as:

import sqlite3
import matplotlib.pylab as pylab
import seaborn as sns

# Pull the per-document mean predictions and labeled dates from the database
conn = sqlite3.connect('regression_scores.db')
c = conn.cursor()
query = """ SELECT avg(predicted) as m, actual, doc_id FROM results GROUP BY doc_id; """
results_two_sided = c.execute(query).fetchall()
# Margin between each document's mean predicted date and its labeled date
results_two_sided_margins = [(i[0] - i[1]) for i in results_two_sided]
# Plot the distribution of margins
pylab.rcParams['figure.figsize'] = (11, 6)
sns.distplot(results_two_sided_margins)
pylab.show()

We can now compute how far from the mean our outliers are. Anything more than 2.5 standard deviations from the mean should be immediately suspect. These would be good targets for additional scrutiny. (I've read that the MAD score, or "median absolute deviation," is a better way to detect fishy outliers, but I'm sticking with standard deviation for the moment out of expediency.)
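
Sticking with standard deviations for now, here's a short sketch that reuses the results_two_sided rows queried above and flags any document whose mean margin falls more than 2.5 standard deviations from the mean of all margins:

import numpy as np

# results_two_sided rows are (mean predicted date, actual date, doc_id)
margins = np.array([row[0] - row[1] for row in results_two_sided])
mean_margin, std_margin = margins.mean(), margins.std()

# Flag documents whose margin sits more than 2.5 standard deviations from the mean
suspects = [(row[2], row[0], row[1], (row[0] - row[1] - mean_margin) / std_margin)
            for row in results_two_sided
            if abs((row[0] - row[1]) - mean_margin) > 2.5 * std_margin]

for doc_id, predicted, actual, z in sorted(suspects, key=lambda s: abs(s[3]), reverse=True):
    print(doc_id, round(predicted, 1), actual, round(z, 2))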

Visualizing Date Predictions

At Keystone DH, I presented a static scatter plot of predicted dates vs. labeled dates based upon a far less successful date prediction model (based on neologism ratios instead of term frequencies).

People were interested in the visualization, but almost everyone I talked to wanted a way to look more closely at which texts were outliers. Scatter plots in matplotlib (which is how I made the plot) tend to be all or nothing in terms of labels, so I started thinking about how I could make an interactive web app to show the same kind of data. Here's what I came up with:

Neologophobia Web Application (Note: this web app has been archived.)

To make this interactive scatter plot, I've hijacked the Mapbox visualization codebase and replaced the base map with graph lines. Mapbox.js handles the panning, zooming, and marker clustering. If you click on any cluster, the map will automatically zoom in and unpack the data points. Clicking on any data point will open a document popup, and clicking the link on a popup will bring you to a landing page for that document.
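
I won't walk through the JavaScript here, but the data-preparation side of that trick amounts to serializing each document as a GeoJSON point whose coordinates are its labeled and predicted dates, rescaled into a longitude/latitude-style range so Mapbox.js can plot and cluster them. Here's a rough sketch of that step; the scaling function, the sample rows, and the output file name are hypothetical stand-ins rather than the app's actual code:

import json

def year_to_coord(year, min_year=1750, max_year=2000, span=120.0):
    # Rescale a year into a longitude/latitude-like range centered on zero
    return (year - min_year) / (max_year - min_year) * span - span / 2

features = []
for doc_id, predicted, actual in [("doc123", 1893.4, 1891), ("doc456", 1848.0, 1888)]:
    features.append({
        "type": "Feature",
        "geometry": {"type": "Point",
                     "coordinates": [year_to_coord(actual), year_to_coord(predicted)]},
        "properties": {"doc_id": doc_id, "predicted": predicted, "actual": actual},
    })

with open("predictions.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)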

Outliers

Visually, it's easy to spot totally bizarre results. Any plotted point that's on its own (not clustered with others) is a clear outlier. Others can be identified by zooming in one step further. The closer a point is to the red-orange line, the less distance there is between the predicted date and the labeled date. For example, if you go to the web app and click on this point ...

... you will see that the document is Ludwig Geissler's Looking Beyond. The predicted date is 1893, and the metadata lists it as a text published in 1971. This is one of the largest margins of error in the entire set. However, if we look up Ludwig Geissler's Looking Beyond, we can easily discover that this work was in fact published in 1891. (See this SF Encyclopedia entry.) Our model has successfully located a metadata error.

Another example is Weird Tidbits, New York, 1888, with a mean predicted date of 1848. Here the model might just be wrong, but it's worth noting that it seems to have been advertised as a five-volume collection of pieces "from various sources," so any number of these might be older than others (See "Books Wanted," Publisher's Weekly, July 22, 1905. 138.). Further inspection is warranted here, of course.

My favorite example is Madame de Maintenon, published in 1806, predicted at 1918, but with predictions all over the map (1998, 1796, etc.) if you look at the results data. Closer inspection of this document reveals that it is in French. The model assumes English language texts, so it's no wonder that this text is an outlier!

Next steps

The next step is to cull or relabel erroneous texts, retrain the model, and see how it performs. Once its date prediction accuracy seems stable, I'll start looking at the next set of outliers, whose status as such should have more literary or cultural significance.
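
The culling step itself can be a quick one-off script. Here's a minimal sketch, assuming a hand-curated set of flagged doc_ids (the ids below are placeholders) and reusing the predict_years and store_results functions from above:

# Hypothetical set of doc_ids flagged during outlier inspection
# (metadata errors, non-English texts, and so on)
flagged_ids = {"doc123", "doc456"}

# Drop flagged documents, keeping metadata and feature dicts aligned
kept = [(row, fdict) for row, fdict in zip(metadata, feature_dicts)
        if row[0] not in flagged_ids]
clean_metadata = [row for row, _ in kept]
clean_feature_dicts = [fdict for _, fdict in kept]

# Re-run the shuffle-train-test loop on the cleaned corpus
for z in range(300):
    main_row, result_rows = predict_years(clean_metadata, clean_feature_dicts)
    store_results(main_row, result_rows)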