A recent response of mine, posted sans context, to a mailing list question about “mass downloading from LexisNexis”:

For *nix systems, I’ve found csplit to work well. With LexisNexis, you can do a bulk download of up to 500 articles in a single query and choose a given format, such as plain text. Some example output, from June to July 2015 U.S. newspaper articles containing terms related to police shootings, can be found in this repo:


Note: if you’re on OS X, the included version of csplit is junk, so you’ll need to install GNU coreutils; after that, you can invoke the GNU version as gcsplit.

To try it out, you can use this sample output text file from LexisNexis:

curl "$the_url" -o /tmp/stories.txt
cd /tmp
gcsplit -f story stories.txt '/[0-9]* of [0-9]* DOCUMENTS/' '{*}'

This creates files named “storyNN”, numbered from 0 to 161, in the /tmp directory, one for each instance of “X of 161 DOCUMENTS”.

You probably want zero-padded filenames with a proper extension:

gcsplit -f story -b '%03d.txt' stories.txt \
        '/[0-9]* of [0-9]* DOCUMENTS/' '{*}'
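If you don’t have a LexisNexis export handy, you can sanity-check the pattern on a fabricated file (plain csplit shown here; substitute gcsplit on OS X):

```shell
# Fabricate a tiny stories.txt with three "X of 3 DOCUMENTS" headers.
printf '%s\n' '1 of 3 DOCUMENTS' 'first story' \
              '2 of 3 DOCUMENTS' 'second story' \
              '3 of 3 DOCUMENTS' 'third story' > stories.txt

# Split on the header pattern; '%03d.txt' gives zero-padded suffixes
# and -q suppresses the per-file byte counts csplit normally prints.
csplit -q -f story -b '%03d.txt' stories.txt \
       '/[0-9]* of [0-9]* DOCUMENTS/' '{*}'

ls story*.txt
# story000.txt holds the (empty) preamble before the first header;
# story001.txt through story003.txt hold one story apiece.
```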

This kind of human-powered, faux-automated scraping is a technique I’m hoping to teach in class this fall. It works well with services like LexisNexis that forbid scraping but otherwise provide an easy way to bulk download documents. IIRC, LexisNexis will return a max of 500 documents for a broad search, and won’t tell you exactly how many results your query matched in total. So if you want to conduct a search with a lot of OR-type operators, e.g. "police AND (shooting OR killing)", and find yourself bumping into that 500-result ceiling, do two separate searches:

  1. "police AND killing"
  2. "police AND shooting"

Assuming both happen to fall under 500 results, run csplit on both files, then use diff (or whatever; I haven’t thought through the simplest non-programming way to do it) to find the unique stories across the two sets, since the two separate queries will obviously overlap. Then grep to your heart’s delight.
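One shell-only sketch for weeding out the overlap: hash each split file’s body while skipping its first line (the “X of N DOCUMENTS” header, which differs between the two result sets even for the same story), and flag any file whose body hash has already been seen. The killing/ and shooting/ directory names are hypothetical, standing in for wherever you ran the two csplit passes.

```shell
# Hypothetical layout: the two csplit runs landed in killing/ and shooting/.
# Print each file whose story body (everything after the first line, i.e.
# the "X of N DOCUMENTS" header) duplicates one seen earlier.
dupes() {
  for f in "$1"/story* "$2"/story*; do
    [ -f "$f" ] || continue
    # cksum is POSIX, so no coreutils needed; swap in md5sum if you prefer
    printf '%s %s\n' "$(tail -n +2 "$f" | cksum | cut -d' ' -f1)" "$f"
  done | awk 'seen[$1]++ { print $2 }'
}

dupes killing shooting
```

Once you trust the output, the duplicate filenames can be piped to `xargs rm`, leaving one copy of each story to grep.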

It’s more cumbersome than having a scraper, but in this case a scraper isn’t possible, and this gets you about 95% of the way there for most use cases while still falling within LexisNexis’s terms of service.