The Data Humanist: Posts about Book History, Cultural Analytics, and Humanities Data

The Alice Problem

By Matt Lavin

December 15, 2016

The Conceit

Not very long ago, one of my colleagues hired a programmer for his institution's digital humanities initiative. In advertising the position, he devised a set of three challenges to test candidates for baseline competency with programming as it might apply to the humanities. All three of the tasks were of the same sort and, honestly, I can only remember one of them.

It asked the candidate programmer to design a simple block of code, in any programming language, to return a list of the most frequent adjectives preceding the name Alice in the book Alice’s Adventures in Wonderland (1865).

Programmer competency tasks are fairly common but, in the constellation of what would-be programmers are asked to do, this one seemed different.

This task is not remarkable in its difficulty. In general, I tend to get sucked into coding when I'm presented with something challenging. I suspect this is the case for almost anyone who codes regularly. Programming can be very difficult, but it's often difficult in exactly the right way: even partial successes are immediately observable and utterly distinct from an outright failure. In video game studies, gamers motivated like I am are called "achievers." Making something work, especially when it was hard, feels good, much like beating a boss, or earning a badge, or unlocking a secret level.

A Solution

But I can’t say that the Alice Problem really appeals to my achiever side, because it isn't very hard. A simplified solution might go something like this:

Read the full text of Alice’s Adventures in Wonderland into memory and store it as a string.
Go through the text and tokenize such that every word is treated separately from every other word.
Use one of any number of Part-of-Speech (POS) taggers to generate a best guess for the part of speech for each token.
Go through the text word by word. If the word is "Alice," look at the part of speech value to the left (one common approach to this type of question is called Key Words in Context or KWIC).
If the word to the left is an adjective, store its value. If the word has appeared before, add to a count of the total number of occurrences for that adjective.
Sort the results by the number of occurrences.
Return or print the result.

Some Code

Using Python, my typical programming language of choice, this task is especially easy to tackle because the pieces needed to accomplish the tasks are all associated with one very popular library, the Natural Language Toolkit (nltk). This script does approximately what my list above describes.1

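import nltk
from collections import Counter
# Read the whole novel into memory as a single string
with open('alice.txt') as a:
    alice_text = a.read()
# Lowercase, split into tokens, and tag each token with a part of speech
alice_tokens = nltk.word_tokenize(alice_text.lower())
alice_pos = nltk.pos_tag(alice_tokens)
# Drop punctuation and any other non-alphabetic tokens
alice_pos = [i for i in alice_pos if i[0].isalpha()]
adjs = []
for i, j in enumerate(alice_pos):
    # i > 0 keeps the index i-1 from wrapping around to the last token
    if j[0] == 'alice' and i > 0:
        before = i-1
        # 'JJ' is the Penn Treebank tag for an adjective
        if alice_pos[before][1] == 'JJ':
            adjs.append(alice_pos[before][0])
# Tally the adjectives and print them from most to least frequent
print(Counter(adjs).most_common())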

The output, using the Natural Language Toolkit's default Part-of-Speech tagger, is a list of term counts that looks like this:

[('thought', 26), ('poor', 11), ('cried', 7), ('little', 3), ('exclaimed', 3), ('shouted', 1), ('red', 1), ('inquired', 1), ('foolish', 1), ('interrupted', 1), ('pleaded', 1), ('replied', 1), ('noticed', 1), ('miss', 1), ('anxious', 1), ('different', 1), ('upon', 1)]

Some Immediate Caveats

Before anything else, I should acknowledge issues with the accuracy of this tagger. It has incorrectly labelled "thought," "cried," and multiple other verbs as adjectives. We might hypothesize that adjectives such as "poor" and "little" will remain our top results with these mislabeled verbs excluded, but a tagger with this many false positives could be missing any number of adjectives. On the other hand, these verbs are all likely to be associated with dialogue attribution (as in "cried Alice" and "thought Alice"), so the errors could be concentrated around this particular sentence structure.
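One rough way to check the first half of that hypothesis is to filter the counts shown above against a hand-built list of dialogue verbs. The sketch below is only illustrative: the counts are copied from the tagger output above, while the verb list and the helper name `drop_known_verbs` are assumptions made for this example, not anything the tagger provides.

```python
# Counts copied from the tagger output shown above
tagged_as_adjectives = [
    ('thought', 26), ('poor', 11), ('cried', 7), ('little', 3),
    ('exclaimed', 3), ('shouted', 1), ('red', 1), ('inquired', 1),
    ('foolish', 1), ('interrupted', 1), ('pleaded', 1), ('replied', 1),
    ('noticed', 1), ('miss', 1), ('anxious', 1), ('different', 1),
    ('upon', 1),
]

# Assumption: a hand-picked list of the verbs mislabeled as adjectives
DIALOGUE_VERBS = {
    'thought', 'cried', 'exclaimed', 'shouted', 'inquired',
    'interrupted', 'pleaded', 'replied', 'noticed',
}

def drop_known_verbs(counts, verbs=DIALOGUE_VERBS):
    """Return the (term, count) pairs whose term is not in the verb list."""
    return [(term, n) for term, n in counts if term not in verbs]

filtered = drop_known_verbs(tagged_as_adjectives)
print(filtered)
```

With the listed verbs removed, "poor" and "little" stay at the top, although leftovers such as "miss" and "upon" suggest the tagger's errors are not limited to verbs. A filter like this also cannot recover adjectives the tagger failed to label in the first place.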

I also want to signal my awareness of numerous general ways to improve the code overall. An employer hiring an imaginary programmer for a digital humanities initiative might be specifically interested in whether the candidates used object-oriented programming, particular libraries, and any number of stylistic approaches.

These caveats aside, the necessary steps I've outlined would be quite familiar to any programmer or computationally inclined digital humanist, so I can see why a task like this might serve as a reasonable measure of a programmer's ability to write code that engages with a humanities computing problem.

Simplicity is Complicated

The problem is simple. The problem is deceptively complicated. Welcome to humanities computing.

One solution or another is easy to code, but the underlying question is complex, especially if one chooses to emphasize how this problem relates to reading Alice's Adventures in Wonderland. There is a question beneath this problem that has little to do with adjectives directly to the left of the name Alice. Rather, adjectives and adjacency are boundaries for the applicant to work with while sketching a computational perimeter around a more nuanced question:

"How does Carroll describe Alice, and what are the implications of this description?"

Or, perhaps more ambitiously:

"How does Carroll's description of Alice compare with other characters in other texts?"

Or, perhaps going even further:

"How does descriptive language in Alice's Adventures in Wonderland compare broadly to other narratives with imagined landscapes? And how might descriptive patterns help a reader situate Alice in relation to its historical and formal context?"

These questions should have implications for the various humanities scholarship that has considered Alice's place in the history of children's literature. For example, Beatrice Turner's "'Which is to be master?': Language as Power in Alice in Wonderland and Through the Looking-Glass" argues that "most of [Alice's] exchanges with the inhabitants of Wonderland and the Looking-glass world" are marked by a "power imbalance ... that is worked out at the level of language" (246).2 I am not immediately persuaded by Turner's argument, but it represents for me a productive site of engagement with the text because it invites additional interpretation.

Most immediately, Turner's work reminds me that questions about descriptive norms in Alice's Adventures in Wonderland and other children's fiction are affected by the way people (men, women, children, adults) interact, and those interactions legitimize or undermine a person's right to observe, react, and speak.

I think Turner's sharpest point is that Alice is a little girl, and the characters around her, despite being erratic and nonsensical and even infuriating at times, are basically cast as adults.

"The texts grant linguistic control to those who inhabit Wonderland and the Looking-glass world," she argues, "and, in doing so, define them as adults. They use this control in a very adult way, too: they exercise the adult’s right to tell the child what she is" (249). Looking at adjectives to the immediate left of the name Alice is merely one immediate and quantifiable way to gesture at broader issues like this one. The fact that Alice is described as little, poor, foolish, anxious, and different begins to substantiate Turner's initial observation about how Carroll frames Alice.

"Through the Looking Glass" as a Strained Metaphor for Something Far Less Interesting than Interdimensional Travel

One issue with "adjectives to the left of Alice" is that it's a relatively narrow way to think about how Alice is described. To begin to address the broader question,

"How does Carroll describe Alice, and what are the implications of this description?"

I would prefer a computational method with more interpretive reach. Instead of designing a tool to look specifically for adjectives next to Alice, for example, we could design an instrument to ask what kinds of term pairs tend to co-occur more generally in the novel. To set up this test, I could use the NLTK's collocations function.

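import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
# Keep only alphabetic tokens (alice_tokens was already lowercased above)
at = [i for i in alice_tokens if i.isalpha()]
finder = BigramCollocationFinder.from_words(at)
# Drop bigrams that occur fewer than three times
finder.apply_freq_filter(3)
# Drop bigrams containing a stopword or a word shorter than three letters
ignored_words = nltk.corpus.stopwords.words('english')
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
# Rank the surviving bigrams by pointwise mutual information (PMI)
for i in finder.nbest(bigram_measures.pmi, 10000):
    if "alice" in i:
        print(i)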

This code uses NLTK's collocations module to look for common bigrams in Alice's Adventures in Wonderland. By bigrams, I mean pairs of adjacent words. For the purposes of this measure, I've removed all punctuation from the text and made all words lowercase so that "afraid" and "Afraid", for example, would read as the same term. I've set the code to ignore any pair that doesn't appear at least three times, and I've removed from consideration any pairs that use words shorter than three letters or excessively common "function words" like the, as, and in. (For a full list of these stopwords, see this GitHub Gist.) The above code snippet will only output term pairs if one of the terms is "Alice." The resulting list of collocated terms does not focus on any particular part of speech, but several adjectives are easy to pick out.

('alice', 'ventured')
('alice', 'indignantly')
('exclaimed', 'alice')
('poor', 'alice')
('thought', 'alice')
('cried', 'alice')
('together', 'alice')
('alice', 'replied')
('alice', 'waited')
('said', 'alice')
('alice', 'hastily')
('alice', 'felt')
('alice', 'looked')
('alice', 'asked')
('alice', 'thought')
('alice', 'could')
('alice', 'began')
('alice', 'rather')
('caterpillar', 'alice')
('alice', 'heard')
('alice', 'quite')
('alice', 'must')
('alice', 'went')
('alice', 'said')
('little', 'alice')
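The ranking in the list above comes from pointwise mutual information (PMI), which scores a pair by how much more often its two words appear side by side than their individual frequencies would predict. As a rough illustration of the filtering and scoring steps, here is a pure-Python sketch. It is not NLTK's implementation: the function name `pmi_bigrams`, its parameters, and the toy sentence are assumptions made for this example, and NLTK's own scores may differ in detail.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_freq=1, stopwords=frozenset()):
    """Score adjacent word pairs by PMI after basic filtering (a sketch)."""
    # Keep alphabetic tokens only and lowercase them
    words = [t.lower() for t in tokens if t.isalpha()]
    total = len(words)
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    scores = {}
    for (w1, w2), n in pair_counts.items():
        # Skip rare pairs
        if n < min_freq:
            continue
        # Skip pairs containing a very short word or a stopword
        if any(len(w) < 3 or w in stopwords for w in (w1, w2)):
            continue
        # PMI: log2 of observed pair count over the count expected by chance
        scores[(w1, w2)] = math.log2(n * total / (word_counts[w1] * word_counts[w2]))
    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy input, invented for illustration
toy_tokens = "Poor Alice said , Alice ! Poor Alice said the cat".split()
toy_scores = pmi_bigrams(toy_tokens, min_freq=2, stopwords={"the"})
for pair, score in toy_scores:
    print(pair, round(score, 3))
```

Because PMI rewards pairs whose words rarely occur apart, it can favor infrequent words, which is one reason a minimum-frequency filter is usually applied before ranking.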

In the context of this collocations analysis, the only two adjectives that rank among the top associations with "Alice" are "poor" and "little." One renders the protagonist (albeit ironically) as an object of pity, and the other reinforces her status as a child and a figure of reduced physical and social stature.

Perhaps more striking is the fact that so many of these collocations situate Alice in dialogue. Terms like "asked", "began", "said", "exclaimed", and "ventured" are specific dialogue tags (i.e., verbs). Terms like "thought", "looked", "felt", "heard", and "indignantly" indirectly situate her in dialogue as well. These terms convey an Alice who is often listening, reacting internally, or conveying confusion or frustration without necessarily taking specific actions to resist.

A Continuously Widening Spiral of Comparison

Still, we might ask whether these word associations are the symptoms of a character constrained by her age and gender, or simply the kinds of words that any main character is likely to be associated with in a children's novel of this time period. After all, the rise of dialogue in 19th-century fiction could easily account for at least some of these associations.

One way to address these concerns is to compare Alice's Adventures in Wonderland to other texts (or authors, or genres, etc.). Take, for example, a quick analysis of just a few female protagonists from novels about girls who visit other worlds. How do descriptions of these female main characters differ from how Alice is described?

A question like this is one that Turner (and many others), using situationally specific methods, cannot answer. Queries with large-scale scope are notoriously difficult to approach without computation.

An extensive search for novels about girls who visit other worlds would no doubt reveal many examples, but a few very well known archetypes come to mind: Dorothy in The Wonderful Wizard of Oz (1900), Wendy in Peter and Wendy (1911), and Lucy and Susan in The Lion, the Witch, and the Wardrobe (1950).

If we replicate the two computations I performed on Alice's Adventures in Wonderland using these three texts, we can begin to form a basis for comparison.

It's important to note that these are all male authors. Here I am not asking a question like, "Did women authors of the 19th and 20th centuries write female characters with more signs of individual agency than men?" although I think a question like this one would be fascinating to explore. Instead, I'm merely investigating which terms are most often adjacent to five characters' names in four very well known novels about girls who visit fantasy realms.

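import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
def collocation_analysis(text, character):
    # Read the novel and lowercase it so "Alice" and "alice" match
    with open(text) as a:
        my_text = a.read()
    my_tokens = nltk.word_tokenize(my_text.lower())
    bigram_measures = BigramAssocMeasures()
    # Keep only alphabetic tokens
    at = [i for i in my_tokens if i.isalpha()]
    finder = BigramCollocationFinder.from_words(at)
    # Drop rare bigrams, stopwords, and words shorter than three letters
    finder.apply_freq_filter(3)
    ignored_words = nltk.corpus.stopwords.words('english')
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
    pairs = []
    # Keep only the PMI-ranked bigrams that include the character's name
    for i in finder.nbest(bigram_measures.pmi, 10000):
        if character in i:
            pairs.append(i)
    return pairs

# (file, character) pairs; apart from alice.txt these file names are placeholders, not the paths in the original notebook
comparison_texts = [('alice.txt', 'alice'), ('oz.txt', 'dorothy'), ('peter_and_wendy.txt', 'wendy'), ('narnia.txt', 'lucy'), ('narnia.txt', 'susan')]
output = []
for i, j in comparison_texts:
    pairs = collocation_analysis(i, j)
    output.append(pairs)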

Granted, these texts are separated by many years, and Lucy and Susan might be prone to appearing as outliers because they are two of four main characters in The Lion, the Witch, and the Wardrobe. Still, I think the commonalities among these books are more important than the differences.

The above code will output a list of term pairs for each text just like the one for Alice shown above. However, this output doesn't necessarily help us compare the various characters. In the following table, I've made a row for each term and a column for each character. An X in a cell means that a given term and a given character can be considered a computed association:

	term	lucy	susan	dorothy	wendy	alice
0	saw	-----	-----	XXXXX	XXXXX	-----
1	picked	-----	-----	XXXXX	-----	-----
2	story	-----	-----	-----	XXXXX	-----
3	queen	XXXXX	XXXXX	-----	-----	-----
4	exclaimed	-----	-----	XXXXX	-----	XXXXX
5	went	-----	-----	XXXXX	-----	XXXXX
6	little	-----	-----	-----	-----	XXXXX
7	returned	-----	-----	XXXXX	-----	-----
8	asked	XXXXX	XXXXX	XXXXX	-----	XXXXX
9	could	XXXXX	-----	XXXXX	XXXXX	XXXXX
10	presently	-----	XXXXX	-----	-----	-----
11	together	-----	-----	-----	-----	XXXXX
12	heard	-----	-----	-----	-----	XXXXX
13	thought	XXXXX	-----	XXXXX	-----	XXXXX
14	answered	-----	-----	XXXXX	-----	-----
15	began	-----	-----	-----	-----	XXXXX
16	lady	-----	-----	-----	XXXXX	-----
17	walked	-----	-----	XXXXX	-----	-----
18	waited	-----	-----	-----	-----	XXXXX
19	wendy	-----	-----	-----	XXXXX	-----
20	whispered	-----	XXXXX	-----	-----	-----
21	put	-----	-----	XXXXX	-----	-----
22	hastily	-----	-----	-----	-----	XXXXX
23	mother	-----	-----	-----	XXXXX	-----
24	said	XXXXX	XXXXX	XXXXX	XXXXX	XXXXX
25	poor	-----	-----	-----	-----	XXXXX
26	peter	-----	XXXXX	-----	-----	-----
27	last	XXXXX	-----	-----	-----	-----
28	ventured	-----	-----	-----	-----	XXXXX
29	sat	-----	-----	XXXXX	-----	-----
30	felt	-----	-----	-----	-----	XXXXX
31	replied	-----	-----	XXXXX	-----	XXXXX
32	must	-----	-----	-----	-----	XXXXX
33	indignantly	-----	-----	-----	-----	XXXXX
34	caterpillar	-----	-----	-----	-----	XXXXX
35	suddenly	XXXXX	-----	-----	-----	-----
36	came	-----	-----	-----	XXXXX	-----
37	looked	XXXXX	-----	XXXXX	-----	XXXXX
38	would	-----	-----	XXXXX	XXXXX	-----
39	rather	-----	-----	-----	-----	XXXXX
40	quite	-----	-----	-----	-----	XXXXX
41	stood	-----	-----	XXXXX	-----	-----
42	see	-----	-----	-----	XXXXX	-----
43	cried	XXXXX	-----	XXXXX	XXXXX	XXXXX
44	well	XXXXX	-----	-----	-----	-----
45	moment	XXXXX	-----	-----	XXXXX	-----
46	says	-----	-----	-----	XXXXX	-----
47	knew	-----	-----	-----	XXXXX	-----
48	found	XXXXX	-----	XXXXX	-----	-----
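A table of this kind can be assembled from the per-character pair lists with a few lines of Python. The sketch below is a guess at the general approach rather than the code behind the table above; `association_table`, `format_table`, and the sample pairs are illustrative names and data.

```python
def association_table(pairs_by_character):
    """Map each term to the set of characters it is paired with."""
    table = {}
    for character, pairs in pairs_by_character.items():
        for first, second in pairs:
            # The term is whichever half of the bigram is not the name
            term = second if first == character else first
            table.setdefault(term, set()).add(character)
    return table

def format_table(table, characters):
    """Render tab-separated rows: one row per term, one column per character."""
    lines = ['\t'.join(['term'] + characters)]
    for term in sorted(table):
        cells = ['XXXXX' if c in table[term] else '-----' for c in characters]
        lines.append('\t'.join([term] + cells))
    return '\n'.join(lines)

# Invented sample in the shape returned by collocation_analysis
sample = {
    'alice': [('poor', 'alice'), ('said', 'alice'), ('alice', 'said')],
    'dorothy': [('said', 'dorothy'), ('dorothy', 'picked')],
}
table = association_table(sample)
print(format_table(table, ['dorothy', 'alice']))
```

Note that this collapses word order: a term counts as associated with a character whether it comes before or after the name.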

Based on this table, we can notice some immediately fascinating aspects of the comparison. It makes it much easier to find terms associated with only one or two characters, as well as associations present for all but one or two characters. For example:

Lucy: uniquely associated with last and suddenly. No unique absences.

Susan: uniquely associated with presently, whispered, and Peter. Uniquely not associated with cried and could.

Wendy: uniquely associated with story, lady, mother, see, says, and knew. Uniquely not associated with asked.

Dorothy: uniquely associated with returned, answered, walked, and put. No unique absences.

Susan and Wendy: neither associated with thought or looked

Alice and Dorothy: associated with replied, exclaimed, went

Dorothy and Wendy: associated with saw, would

Lucy and Wendy: associated with moment

Alice: uniquely associated with little, poor, together, heard, began, waited, hastily, ventured, must, indignantly, caterpillar, rather, quite
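Observations like these can be read off the table by eye, but they can also be computed with set comparisons. The following is a minimal sketch, assuming the table has been stored as a mapping from each term to the set of characters marked with an X; the function names are hypothetical, and only a handful of rows from the table above are included.

```python
CHARACTERS = {'lucy', 'susan', 'dorothy', 'wendy', 'alice'}

# A few rows copied from the table above: term -> associated characters
rows = {
    'asked': {'lucy', 'susan', 'dorothy', 'alice'},
    'could': {'lucy', 'dorothy', 'wendy', 'alice'},
    'cried': {'lucy', 'dorothy', 'wendy', 'alice'},
    'presently': {'susan'},
    'whispered': {'susan'},
    'poor': {'alice'},
    'little': {'alice'},
    'said': {'lucy', 'susan', 'dorothy', 'wendy', 'alice'},
    'moment': {'lucy', 'wendy'},
}

def unique_associations(rows, character):
    """Terms associated with this character and nobody else."""
    return sorted(t for t, chars in rows.items() if chars == {character})

def unique_absences(rows, character, everyone=CHARACTERS):
    """Terms associated with every character except this one."""
    others = everyone - {character}
    return sorted(t for t, chars in rows.items() if chars == others)

print(unique_associations(rows, 'susan'))
print(unique_absences(rows, 'susan'))
print(unique_absences(rows, 'wendy'))
```

Run against the full table, the same two functions would produce a complete version of the partial lists given above.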

These findings warrant more discussion than I'm planning to give them here today. For now, a few highlights. First, this comparative approach only strengthens the association between Alice, little, and poor established by the part-of-speech tagging approach. Further, the pairing of waited and hastily, typically oppositional terms, reminds me of a character constantly trying to move along but getting held up by others. Similarly, the term began conjures images of a character attempting to speak but getting consistently interrupted. Adding to the sense of both situational and verbal irony are the terms rather and quite, both of which seem likely to be tongue-in-cheek qualifiers (e.g. "Alice quite jumped").

In this example, we knew the main characters' names, so the comparison is a bit easier. Scaling this question beyond our example would require making a dictionary of main character names or developing a method that detected character names automatically. This is just one example of a continuously widening spiral of comparison. Additional layers are always possible, which is one of the main reasons it's so appealing to sit and tinker.

In the past, I have described digital humanities using the metaphor of an investigator who uses advanced tools and approaches to look through a keyhole, only to find that inside the locked room there is only a mirror. This is to say that digital humanities methods, too often, become lenses through which the only result is a more nuanced understanding of digital humanities. I am confident that meditating on the advantages and limits of our methods has purchase, yet here I believe light is also shed on Alice's Adventures in Wonderland.

I must concede, of course, that it is possible that these alternative close reading strategies mostly confirm what others have noticed using more traditional literary analysis methods. If this is so, then code is still a way to get to where we want to go, a way to establish the habits of mind that lovers of close reading are always trying to instill in their students and colleagues. The real challenge is taking the time to share one's results with others. This is inevitably a decision to stop, which eventually requires stopping. And so I have chosen to conclude with a partial list of how much more I want to do.

Ideas for Follow-up

Compare adjectives overall to lots of novels

On many novels: return the names after "little" (and other adjectives) ranked by how often those names are referred to as little. Are those names gendered?

Explore character-based sentiment tagging, in response to (if possible) Michael Mendelson's "The phenomenology of deep surprise in Alice's Adventures in Wonderland"

Use spaCy or Parsey McParseface for better POS tagging

Explore how to perform similar analysis on novels without knowing main characters' names

Attempt to run using pronoun disambiguation

Re-run multi-novel analysis using collocations two and three words away from characters' names
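The last idea, looking two or three words away from a name, could be prototyped without much new machinery. The sketch below counts every word within a fixed window of a character's name. `window_counts` and its parameters are hypothetical, the example sentence is invented, and unlike the collocation code earlier in this post it uses raw counts rather than PMI.

```python
from collections import Counter

def window_counts(tokens, name, window=3):
    """Count words appearing within `window` positions of `name`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != name:
            continue
        # Words to the left and right of this occurrence of the name
        start = max(0, i - window)
        neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(w for w in neighbors if w != name)
    return counts

# Invented example sentence
tokens = 'the poor little alice began to cry'.split()
nearby = window_counts(tokens, 'alice', window=2)
print(nearby.most_common())
```

A wider window would catch phrases like "poor little Alice" that strict adjacency misses, at the cost of pulling in more unrelated words.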

Notes

1. Code excerpts can be viewed in or downloaded/run from the .ipynb file in this folder in my Jupyter Notebooks GitHub repository.

2. Turner, Beatrice. "'Which is to be master?': Language as Power in Alice in Wonderland and Through the Looking-Glass," Children's Literature Association Quarterly, 35.3 (Fall 2010): 243-254.
