Analyzing Lots of Book Reviews and Extracting Information

By Matt Lavin

July 15, 2019

The What and the Why

For the past four weeks (minus a week for Willa Cather camp), I've been doing some experiments with a large corpus of book reviews. I have about 300,000 pieces of content categorized as reviews, but the corpus is a mix of single-work reviews and multi-work reviews. It also has a false-positive rate as high as ten percent; i.e., about one in ten items marked as a review is actually mislabeled non-review content. There are far too many reviews to curate by hand, so I'm refining a method to classify texts by their content and, hopefully, extract key information from them. I'd like to make gains in the following areas:

 

1. Categorize reviews using unsupervised methods

Last year, I did some experiments parsing single-work reviews, multi-work reviews, and non-review content using supervised learning (regression and deep learning) based on term frequencies. The results were poor because single-work reviews and multi-work reviews use very similar language. This method might work well for flagging non-review content, but I'd like to use it in tandem with a method that doesn't require a large training set.

2. Use these insights to profile APS reviews

I'm especially interested in isolating a subset of reviews that focus on a single work, so that it's easier to say that the language in a given review most likely applies to a particular author or work.

3. Extract information from reviews

If I can classify single-work reviews with sufficient confidence, I should be able to use Named Entity Recognition (NER) methods to extract information such as price, publisher, author, title, and genre. Some of these will be easier than others, but my hope is that isolating a reviewed author will make it easier to text-mine the title, and that things like genre can then be inferred by aligning the extracted work with external sources such as HTRC, Worldcat, or Project Gutenberg.

4. Identify likely reviews that are not tagged as reviews

As I mentioned, I have about 300,000 items tagged as reviews. I also have 11.2 million records not tagged as reviews, but I've browsed through the corpus enough to wonder how many reviews are hiding in plain sight, untagged as reviews. If I can refine my method sufficiently, I'd like to use it to help me discover these likely reviews and flag them for follow-up.

5. Improve performance for large-scale application

None of these goals are very realistic if I can't make my code run relatively fast. I don't need supercomputing speeds, but my goal is a method that can run on 300,000 reviews in one afternoon. Anything slower and I'm risking a mid-code power outage caused by one of Pittsburgh's many torrential downpours.

NER with Spacy

Spacy has a very powerful Named Entity Recognition engine. It uses deep learning to make predictions for several types of entities, including people, places, organizations, events, dates, money, quantities, and even works of art. (See https://spacy.io/api/annotation#named-entities for a full list.) The only problem is that Spacy, despite being a very fast text processing library compared to the alternatives, is still slow enough to make my code take hours if not days to run.
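To give a sense of what that looks like, here's a minimal sketch using spaCy 2.x and the same 'en' model loaded elsewhere in this post; the sample sentence is invented, and the exact entities you get back will vary with the model version.

import spacy

nlp = spacy.load('en')
doc = nlp("Houghton Mifflin has published a new novel by Willa Cather, priced at $1.50.")

#every predicted entity carries a text span and a label
for ent in doc.ents:
    print(ent.text, ent.label_)

#typical labels for a sentence like this: ORG, PERSON, MONEY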

To save time by avoiding repetition, I wanted to process text in Spacy and save the results to disk. To use Spacy, you must load a language model and instantiate a Python object (a.k.a. a document) for each text you want to parse. Spacy saves time by creating and updating a vocab, which you need in order to make sense of the document instances. If you want to export data outside of Python, you need to save both the documents and the vocab.
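Here's a minimal sketch of that save step, assuming spaCy 2.x; review_text is a placeholder standing in for one review's full text.

import spacy

nlp = spacy.load('en')
review_text = "The text of one review would go here."  #placeholder

#one Doc per review; the Doc serializes to a bytestring
doc = nlp(review_text)
doc_bytes = doc.to_bytes()

#the shared vocab has to be saved too, or the bytestrings can't be decoded later
vocab_bytes = nlp.vocab.to_bytes()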

Spacy has methods for serializing documents. Option 1 is document.to_disk(), which saves your content (in binary) to disk. Option 2 is document.to_bytes(), which you can store however you wish (database, redis store, Python pickle). I opted for document.to_bytes() based on the idea that I might use redis. After benchmarking the speeds of various solutions, however, I found that retrieving the document.to_bytes() data, regardless of how it's stored, takes almost no time at all. Instead, the document.from_bytes() method, which converts the bytestring back to a Spacy object, is what slows down the process (but it's still faster than re-instantiating a Spacy object from the ground up). After all this, here's the process I landed on:

  1. Loop over a range of years from 1880 to 1925
  2. For each year, retrieve book reviews from the database
  3. Instantiate spacy objects
  4. Save three pickles to the hard disk for that year: a list of file ids, a list of spacy bytestrings, and the spacy vocab instance for those bytestrings (sketched in code just below)
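Here's a sketch of that save step. It reuses the database, column, and folder names that appear in the loading code below, but the exact query and the per-year model reload are assumptions on my part rather than a copy of my production script.

import os
import pickle
import spacy
import sqlite3

conn = sqlite3.connect('aps_reviews_datastore.db')
c = conn.cursor()

for year in range(1880, 1926):
    #reloading the model each year keeps that year's vocab self-contained
    nlp = spacy.load('en')

    rows = c.execute("SELECT RecordID, FullText FROM reviews LEFT JOIN extracted_parsed ON reviews.RecordID=extracted_parsed.aps_id WHERE extracted_parsed.parsed_year=?", [year,]).fetchall()

    _ids = [row[0] for row in rows]
    #serialize each review's spacy doc to a bytestring
    doc_bytes = [nlp(row[1]).to_bytes() for row in rows]

    out_dir = "/media/backup/aps_spacy/%s" % str(year)
    os.makedirs(out_dir, exist_ok=True)

    #three pickles per year: file ids, doc bytestrings, and the vocab for those bytestrings
    with open(os.path.join(out_dir, "_ids.p"), "wb") as handle:
        pickle.dump(_ids, handle)
    with open(os.path.join(out_dir, "spacydocs.p"), "wb") as handle:
        pickle.dump(doc_bytes, handle)
    with open(os.path.join(out_dir, "spacyvocab.p"), "wb") as handle:
        pickle.dump(nlp.vocab.to_bytes(), handle)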

When I return to the pickles, I use the same process in reverse to convert them back to spacy instances. I loop over the same range of years, retrieve bytestrings by folder, and use document.from_bytes() to create a list of spacy instances. I can then use zip to merge them with my file ids and make a dictionary with ids as keys and spacy instances as values. The code looks something like this:

from datetime import datetime
import gc
import pickle
import spacy
from spacy.tokens import Doc
import sqlite3

#connect to datastore with reviews data in it
conn = sqlite3.connect('aps_reviews_datastore.db')
c = conn.cursor()

def load_spacy_by_year(year):
    """this function loads pickles for a given year of data"""
    with open("/media/backup/aps_spacy/%s/spacydocs.p" % str(year), "rb") as handle:
        doc_bytes = pickle.load(handle)
    with open("/media/backup/aps_spacy/%s/spacyvocab.p" % str(year), "rb") as handle2:    
        vocab_bytes = pickle.load(handle2)
    with open("/media/backup/aps_spacy/%s/_ids.p" % str(year), "rb") as handle3:    
        _ids = pickle.load(handle3)
    return _ids, doc_bytes, vocab_bytes

def spacy_from_pickle(doc_byte_string, nlp):
    """convert a stored bytestring back into a spacy Doc using the year's vocab"""
    return Doc(nlp.vocab).from_bytes(doc_byte_string)

for r in range(1880, 1926):
    print(r, datetime.now().time())
    _ids, doc_bytes, vocab_bytes = load_spacy_by_year(r)
    spacy_store = dict(zip(_ids, doc_bytes))
    
    #this could probably be outside the loop; it only has to happen once
    nlp = spacy.load('en')
    
    #each year has its own vocab_bytes 
    nlp.vocab.from_bytes(vocab_bytes)

    #get reviews data from sqlite
    rows = c.execute("SELECT RecordID, RecordTitle, FullText, extracted_parsed.single_work as single_work, extracted_parsed.parsed_year as year FROM reviews LEFT JOIN extracted_parsed ON reviews.RecordID=extracted_parsed.aps_id WHERE single_work ='possibly_single' AND extracted_parsed.parsed_year=?", [r,]).fetchall()

    for z in rows:
        try:
            doc_byte_string = spacy_store[z[0]]
            newdoc = spacy_from_pickle(doc_byte_string, nlp)
        except KeyError:
            #just in case an id is not where it should be, instantiate the spacy doc from scratch from the full text (z[2])
            newdoc = nlp(z[2])

Once I've finished this operation, I can use plain text methods (such as regular expressions) in combination with Named Entity Recognition approaches. My next step was to construct a function to analyze a review and create a profile based on its contents. I'll discuss that function in my next post.
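As a taste of what that combination might look like (this is just an illustrative sketch, not the profiling function I'll describe next time), a regular expression can grab candidate prices while the entity labels supply candidate people and publishers:

import re

def quick_profile(doc):
    """pull a few candidate fields from a parsed review (illustrative only)"""
    return {
        #prices in period reviews often look like "$1.50"
        "prices": re.findall(r"\$\d+(?:\.\d{2})?", doc.text),
        #NER candidates for the reviewed author and the publisher
        "people": [ent.text for ent in doc.ents if ent.label_ == "PERSON"],
        "orgs": [ent.text for ent in doc.ents if ent.label_ == "ORG"],
    }

#e.g. quick_profile(newdoc) for any doc reconstructed in the loop above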