Why I've Stopped Teaching Web Scraping

To come: lots of negative thoughts about teaching web scraping to beginning programmers. This is sparked by a recent NICAR-L question. Here’s my partial response, to be later elaborated when I have time:

On teaching web scraping: I don’t think it’s worth the pain. Web scraping ends up being not just about parsing HTML, which is complicated enough on its own (the syntax and the DOM), but about understanding the modern web stack. I’ve taught the web inspector because it’s a good way to see all the pieces that the web is made up of, but students generally don’t get the big picture unless they’ve done actual web development before.

The other part of this is that there is an ever-growing, incredible amount of data available through APIs and data dumps. With Twitter, Spotify, Instagram, and/or Facebook, you can do everything from activity analysis and monitoring of government accounts to data aggregation, bot making, and network analysis. And it’s data that virtually all of your students will be familiar with as end users, so it’s very easy to stoke their curiosity about the data’s technical underpinnings and about how to use math to do relevant analyses.

If you don’t like using such proprietary services, then the U.S. government, as well as Socrata and the Sunlight Foundation, has more than enough datasets for fun, for curiosity, and for truly deep investigative work. You don’t even have to get into the work of accessing APIs (which admittedly can be a pain in the ass if you’re dealing with the OAuth2 flow, never mind the part where you frantically check your students’ repos to see if they’ve published their passwords to GitHub). There are plenty of static text dumps (CSV, JSON, TSV, etc.) to work with. I find that some of the hardest yet most important concepts for students to get are the basic data structures (lists/arrays and hashes/dicts) and why we serialize/deserialize them. Scraping from websites involves the same fundamental concepts, except with an endless pile of shit to get there.
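
To make the serialize/deserialize point concrete, here is a minimal sketch in Python of what working from a static text dump can look like. The CSV sample, its column names, and the helper functions (parse_csv, total_by_agency) are all made up for illustration; they stand in for whatever file a student might actually download.

```python
import csv
import io
import json

# Hypothetical sample standing in for a downloaded CSV data dump.
CSV_TEXT = """agency,year,requests
Interior,2014,120
Energy,2014,85
Interior,2015,140
"""


def parse_csv(text):
    """Deserialize CSV text into a list of dicts, one dict per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def total_by_agency(rows):
    """Sum the 'requests' column for each agency."""
    totals = {}
    for row in rows:
        # CSV fields always arrive as strings, so convert before adding.
        totals[row["agency"]] = totals.get(row["agency"], 0) + int(row["requests"])
    return totals


rows = parse_csv(CSV_TEXT)      # text -> list of dicts (deserialize)
totals = total_by_agency(rows)  # plain dict/list work, no HTML involved
as_json = json.dumps(rows)      # list of dicts -> text (serialize)

print(totals)
print(json.loads(as_json) == rows)
```

The whole exercise is a list, a dict, and a round trip between text and data structures. A scraping assignment would need the same steps, only after the HTML has been fetched and picked apart.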

Dan Nguyen's Blog | Thoughts, Data and Computational Journalism