Extracting Information from Book Reviews by Creating Review Profiles

By Matt Lavin

September 03, 2019

Background

In a previous post, I discussed the fact that I've been working with about 300,000 pieces of content from ProQuest's "American Periodicals Series" categorized as reviews. However, items with this label are actually a blend of mislabeled non-review content, single-work reviews, and multi-work reviews. I have too much data to hand-correct, and I'm also wondering if there are actual reviews in the rest of the data that aren't labeled as reviews. In short, I have a strong incentive to see if I can (1) classify items as likely to be reviews and (2) extract key information from them, like the reviewed author, the reviewed title, the book's publisher, the price, the likely genre, etc.

String Matching vs. NER

With string matching, the basic idea would be to start with a big list of author and title names, and then match or fuzzy match each one against each review. This method is fairly speedy, but it doesn't do much to separate non-review content from reviews, or multi-work reviews from single-work reviews. In theory, multiple mentions of the same author increase the probability that a review is focused on that author, but I don't yet have data to back up this hypothesis. I also predicted that there would be some very confusing edge cases, such as a review where author X wrote a biography of author Y and the string matching approach picks the biographical subject over the actual author. Lastly, you have to really trust your source list. If it's a list derived from titles in the HTRC, as one of mine is, you're only mining for titles included in the HTRC and, in theory, letting that corpus have undue influence on the reviews you find.
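For illustration, a bare-bones version of that matching step might look something like the sketch below; the title list, sample review text, and threshold are invented for the example, and a real pass would use a much larger list and a faster matcher.

# Toy sketch of fuzzy-matching a title list against review text (standard library only).
from difflib import SequenceMatcher

known_titles = ["The Rise of Silas Lapham", "The Awakening"]  # hypothetical source list
review_text = "A new edition of The Rise of Silas Lapham has lately appeared ..."

def fuzzy_title_hits(text, titles, threshold=0.85):
    """Slide a title-sized window across the text and keep titles that clear the threshold."""
    words = text.split()
    hits = []
    for title in titles:
        size = len(title.split())
        windows = (" ".join(words[i:i + size])
                   for i in range(max(1, len(words) - size + 1)))
        best = max(SequenceMatcher(None, title.lower(), w.lower()).ratio() for w in windows)
        if best >= threshold:
            hits.append((title, round(best, 3)))
    return hits

print(fuzzy_title_hits(review_text, known_titles))  # [('The Rise of Silas Lapham', 1.0)]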

Enter NER. spaCy has a powerful named entity recognition component that uses a neural network approach to identify things like people, places, organizations, events, dates, money, quantities, and even works of art. (See https://spacy.io/api/annotation#named-entities for a full list.) Using NER, although computationally more expensive than string matching, can increase the likelihood of getting certain kinds of information.
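For a sense of what that looks like in practice, here is a minimal sketch of pulling entities from a document with spaCy and pandas; the model name (en_core_web_sm) and the snippet of text are stand-ins for illustration, not necessarily what I ran.

# Minimal sketch: extract named entities from a review with spaCy.
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")  # any English pipeline with an NER component
sample = ("BENJAMIN'S TREATISE ON THE LAW OF SALE OF PERSONAL PROPERTY ... "
          "Boston and New York: Houghton, Mifflin & Company. 1892.")
doc = nlp(sample)

# Collect (label, text) pairs in document order, like the table shown below
entities = pd.DataFrame([(ent.label_, ent.text) for ent in doc.ents],
                        columns=["entity", "text"])
print(entities.head(15))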

This all seemed fairly promising, but my first run at the data produced something of a mish-mash. Consider the following output, which represents the first fifteen entities recognized in a random item from the American Periodicals Series:

  entity text
0 PERSON References
1 ORG the American Decisions
2 LAW the French Code and Civil Law
3 ORDINAL Sixth
4 NORP American
5 LANGUAGE English
6 NORP American
7 ORG EDMUND AI
8 PERSON SAMUEL C. BENNETT
9 GPE Boston
10 GPE NeW York
11 ORG Htoughton, MSifdln & Company
12 ORG Riverside
13 GPE Cambridge
14 DATE 1892

After reading this list, the first question that might come to mind is, "What's a NORP?" The answer, of course, is RTFM. According to spaCy's documentation on NER (linked above), NORP stands for "Nationalities or religious or political groups." Classic acronym!

Question two might be, "Okay, so is this, you know, a book review?"

Let me hold off on answering that potential question because I think the answer will seem artificially obvious once you know it. The idea here is to investigate whether details of a book review can be gleaned from Named Entity Recognition, so let's look at more of the NER data in groups and see if it gives us some clues. For organizations we have "the American Decisions," "EDMUND AI," (sounds like an AI company, I guess), "Htoughton, MSifdln & Company," "Riverside," "Air," "Conditional Sales," "Conditions," "Warranty," "Delivery," and "Stoppage."

For geographies, we have "Boston," "NeW York," "Cambridge," "States," "Transitu," "Sales," and "Benjamin." (Side note: Stoppage in Transitu is a real thing, defined as "the right of a seller of goods to stop them on their way to the buyer and resume possession of them.")

Meanwhile, for people, we have "References," "SAMUEL C. BENNETT," "Arthur Beilhy," "Pearson- Gee", "Fenwick Boyd," "Benjamin," "Benjamin," "Mutual Assent," "Benjamin," "Travis," "Tiedeman," and "Sales." Lastly, the only two dates seem to be 1892 and 1053.

Before looking at the whole review, my best guess looked something like this:

  • A book review
  • About a book published by Houghton Mifflin
  • Printed by Riverside Press
  • Published in 1892
  • No price info
  • Nonfiction, probably about sales or business
  • No clue about the title
  • An author named SAMUEL C. BENNETT, or possibly Benjamin

I should stress that this was an educated guess. It's informed by knowledge, such as the fact that Henry Houghton was a Boston publisher, and Riverside Press was his printing operation. Houghton partnered with George Mifflin in 1880 to form Houghton, Mifflin, and Company. Similarly, I knew that the APS dataset starts in the 1700s, which rules out 1053 as a likely date of publication. SAMUEL C. BENNETT is in all caps, which seemed to suggest a headline. I considered the possibility that I was looking at a review of a work of fiction, and that Benjamin might be the main character but, if forced to put money on my guess, I would have bet on nonfiction because of the apparent lexical register of the other named entities.

In short, despite some ambiguity, there seem to be better clues here than we might get by matching a big list of titles. After looking at the review, the following details became apparent:

Title: Benjamin's Treatise on the Law of Sale of Personal Property, with References to the American Decisions and to the French Code and Civil Law
Authors: Judah Philip Benjamin, with American notes by Edmund H. and Samuel C. Bennett
Publication Place: Boston and New York
Publisher: Houghton Mifflin & Company
Year of Publication: 1892

I would say my initial intuitions were on track, but they produced fairly poor results. I got the publisher right, as well as the year. The OCR misinterpreted Edmund H. as "EDMUND AI," so there was little chance of considering him as an author/editor. Yet Benjamin, the most common PERSON entity, is the author, and Samuel C. Bennett is one of the editors of the American edition. What's more, most of this information is in one heading at the top of the book review. Here is the uncorrected OCR for that portion of the review:

BENJAMNlWs TREATIsE ON THE LAW OF SALE OF PERSONAL PROPERTY; with References to the American Decisions and to the French Code and Civil Law. Sixth American edition, the latest English edition. With American notes by EDMUND AI. and SAMUEL C. BENNETT. Boston and NeW York: Htoughton, MSifdln & Company. The Riverside press, Cambridge. 1892. pp. 1053.

Note that 1053 is the number of pages, not the date, and it's preceded by "pp."
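That "pp." prefix is exactly the kind of cue a simple regex can exploit. As an illustrative pattern (not necessarily the one my profile function uses), something like the following pulls page counts out of a headline without confusing them with years:

import re

headline = ("... Htoughton, MSifdln & Company. The Riverside press, "
            "Cambridge. 1892. pp. 1053.")
# capture digits that immediately follow 'pp.' or 'pp ' so they aren't read as dates
page_counts = re.findall(r"\bpp[\. ]\s*(\d+)", headline)
print(page_counts)  # ['1053']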

This close look at one review, of course, is just one example of what might happen when you use NER methods to lift information from a book review. It demonstrates some of the potential liabilities of an NER-based approach, but it doesn't say anything about how often certain types of confusion might arise. I looked at a few hundred of these as I thought about how to use NER on this task, so my planning was informed by more than this one example.

Over time, I came up with a few ideas for how to proceed. First, the title, author, publisher, etc. aren't independent variables. There's theoretically a book in some library catalog (hopefully many catalogs) that matches the description of the reviewed book. If you get one detail right, it should increase the probability of getting other details right. The rub is that you don't initially know which details are the most reliable. What you'd want, ideally, is a statistical method to rate the confidence of any given piece of information and use that score as a Bayesian prior to make future predictions. If this is sounding more and more like a machine learning task, you're thinking along the same lines I was.

Second, and related to the idea of converting this to a machine learning task, it seemed like there might be useful ways to express NER results as binary or continuous variables. For example, instead of a list of all names, would it be useful to know how many different named entities are in the document? Or how many times the top name is used? Or how many different years between 1880 and 1925 were found?
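As a rough illustration of what I mean, here is a sketch of turning a spaCy doc's entities into a handful of numeric features; the feature names and the 1880-1925 window are examples, not a final design.

# Illustrative sketch: convert raw NER output into simple numeric variables.
from collections import Counter

def ner_features(doc):
    """Given a spaCy Doc, return a few example counts (names are illustrative)."""
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
    # distinct four-digit years that fall inside a plausible publication window
    years = {d for d in dates if d.isdigit() and 1880 <= int(d) <= 1925}
    top_person = Counter(people).most_common(1)
    return {
        "distinct_entities": len({ent.text for ent in doc.ents}),
        "top_person_count": top_person[0][1] if top_person else 0,
        "year_count": len(years),
    }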

Enter a Profile-based Approach

Somewhere in the midst of all this, I started writing a Python function called "profile_review." It's big and clunky at the moment, but it sorts and filters named entities and produces what I hope is a cleaner summary of the NER data. As I worked on it, I began to wonder if I could add a few simple string-matching outputs to the profile as well; some isolated regex might catch something the NER had missed.

I won't share the entire function at this time. It's not ready for public consumption. Instead, I'll include a list of the fields I'm currently constructing, and a summary of how I derive them.

aps_id: the internal id from the American Periodicals Series, for matching purposes
book_count: the number of times the word book appears in the review
books_count: the number of times the word books appears in the review
year_count: the number of different years extracted from NER (after removing pre-1880 and post-1925 years)
formats: a string match for '2o', '4o', '8vo', '12mo', '16mo', 'folio', 'octavo', 'duodecimo', 'quarto', 'sixteenmo', which are often in book reviews
titles_people_orgs: a numeric expression of the number of different people, work titles, and organizations derived from NER
dollar_count_headline: the count of monetary amounts described under dollar_count below, but limited to the headline
formats_headline: a string match for '2o', '4o', '8vo', '12mo', '16mo', 'folio', 'octavo', 'duodecimo', 'quarto', 'sixteenmo' in the headline specifically
dollar_count: the number of different monetary amounts below $50, derived using NER and converted to numeric expressions (e.g. "fifty" becomes 50)
pp: a count of strings matching 'pp.' or 'pp[space]' but not '[letter]pp' or 'pp[letter]'
doc_length: the total number of words found in the document

Some of these fields are speculative. I don't know if formats_headline will differ much from formats, but I'm interested in finding out.
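To give a concrete sense of the shape of such a function, here is a much-simplified, hypothetical sketch; the field names follow the list above, but the derivations are approximations rather than my actual profile_review code, and the dollar and price fields are omitted for brevity.

import re

FORMAT_TERMS = ['2o', '4o', '8vo', '12mo', '16mo', 'folio',
                'octavo', 'duodecimo', 'quarto', 'sixteenmo']

def profile_review_sketch(aps_id, text, headline, doc):
    """Build a toy review profile from raw text, a headline, and a spaCy Doc."""
    words = text.split()
    years = {ent.text for ent in doc.ents
             if ent.label_ == "DATE" and ent.text.isdigit()
             and 1880 <= int(ent.text) <= 1925}
    titles_people_orgs = {ent.text for ent in doc.ents
                          if ent.label_ in ("PERSON", "WORK_OF_ART", "ORG")}
    return {
        "aps_id": aps_id,
        "book_count": sum(1 for w in words if w.lower().strip(".,;:") == "book"),
        "books_count": sum(1 for w in words if w.lower().strip(".,;:") == "books"),
        "year_count": len(years),
        "formats": {t: text.count(t) for t in FORMAT_TERMS if t in text},
        "titles_people_orgs": len(titles_people_orgs),
        "formats_headline": {t: headline.count(t) for t in FORMAT_TERMS if t in headline},
        "pp": len(re.findall(r"(?<![A-Za-z])pp[\. ](?![A-Za-z])", text)),
        "doc_length": len(words),
    }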

I'll conclude this post by showing you the profile function's output for Benjamin's Treatise on the Law of Sale of Personal Property. Here's what that looks like:

 

{'aps_id': '125714499',
 'book_count': 0,
 'books_count': 0,
 'year_count': 2,
 'formats': {},
 'titles_people_orgs': 20,
 'prices': {},
 'dollar_count_headline': 0,
 'formats_headline': {},
 'dollar_count': 0,
 'pp': {},
 'doc_length': 940}

The next step with this task is to see if the information I'm deriving helps predict review/not review, single/multi, and book review details. I have strong intuitions that at least some of these will be helpful.