Dan Nguyen's Blog | Thoughts, Data and Computational Journalism

July 9, 2015

Why I've Stopped Teaching Web Scraping

To come: lots of negative thoughts about teaching web scraping to beginning programmers. This is sparked by a recent NICAR-L question. Here’s my partial response, to be later elaborated when I have time:

In terms of teaching web-scraping: I don’t think this is worth the pain. Web scraping ends up not just being about scraping HTML, which is very complicated itself (the syntax and the DOM), but about understanding the modern web stack. I’ve taught the web inspector because it’s a good way to see all the pieces that the web is made up of, but students don’t generally get the big picture unless they’ve done actual web development before.

The other part of this is that there is an ever-growing, incredible amount of data that is available through APIs and data dumps. With Twitter, Spotify, Instagram, and/or Facebook, you can do everything from activity analysis, monitoring of government accounts, and data aggregation to bot making and network analysis…and it’s data that virtually all of your students will be familiar with as end users, so it’s very easy to stoke their curiosity about the data’s technical underpinnings, and how to use math to do relevant analyses.

If you don’t like using such proprietary services, then the U.S. government, as well as Socrata and the Sunlight Foundation, has more than enough datasets for fun and curiosity and truly deep investigative work. You don’t even have to get into the work of accessing APIs (which admittedly can be a pain in the ass if you’re dealing with the OAuth2 flow, never mind the part where you frantically check your students’ repos to see if they’ve published their passwords to GitHub)…There are plenty of static text dumps (in CSV/JSON/TSV, etc.) to work with…I find that some of the hardest yet most important concepts for students to get are the basic data structures (lists/arrays and hashes/dicts) and why we serialize/deserialize them. Scraping from websites involves the same fundamental concepts, except with an endless pile of shit to get there.
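For a sense of what "serialize/deserialize" means in practice, here is a minimal sketch using only Python's standard json and csv modules. The account names and follower counts are made-up sample data, not from any real API or dataset.

```python
import csv
import io
import json

# Hypothetical sample data: a list of dicts, the shape most data dumps take.
accounts = [
    {"name": "Example Agency", "followers": 1200},
    {"name": "Sample Bureau", "followers": 450},
]

# Serialize: turn the in-memory list of dicts into JSON text.
json_text = json.dumps(accounts)

# Deserialize: turn that text back into Python lists and dicts.
restored = json.loads(json_text)

# Write the same records as CSV text into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "followers"])
writer.writeheader()
writer.writerows(accounts)
csv_text = buffer.getvalue()

# Reading CSV back gives one dict per row, but every value is a string,
# so numbers have to be converted before doing math on them.
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_followers = sum(int(row["followers"]) for row in rows)
```

JSON round-trips the numbers as numbers, while CSV hands everything back as text. That difference is a small example of why the choice of format matters once students start analyzing the data.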

July 7, 2015

My favorite fun, informative reads about game development

Being a game developer is what got me interested in computer engineering. I’ve veered a long way from that goal but I still love reading about it, because programming for games involves so many clever hacks, especially on the user-facing side.

Here are a few links and reads I’ve enjoyed:

Naughty Dog circa Crash Bandicoot seems like the Xerox PARC of video game development, because of how many great developers have connections to it. Andy Gavin and Jason Rubin have written a lengthy series of blog posts covering the entire making of Crash Bandicoot.
Dave Baggett battled his “hardest bug ever” when developing Crash Bandicoot (HN comments). His Quora answer to How did game developers pack entire games into so little memory twenty five years ago? is another must-read. In fact, Baggett’s entire Quora answer feed is worth reading (yes, I never thought I’d say “entire”, “Quora feed”, and “worth reading” in a single sentence).
Patrick Wyatt has a series on the “Tough times on the road to Starcraft”. Some great highlights: Avoiding game crashes related to linked lists and this path-finding hack.
Fabien Sanglard has a wonderful series of code reviews for classic games, ranging from Prince of Persia to DOOM 3 (one surprise for me: all the assets are in human-readable text, which John Carmack later said was a mistake)
Casey Muratori also has a series of wonderful posts about writing code for The Witness, including how to grow grass. His post on creating a visual debugger for The Witness to map an island’s walkable surfaces gave me a new appreciation for the power of visualization in computational work. Check out his Handmade Hero series, in which he creates a game and engine from scratch.
Valve’s entire list of publications is worth reading. Some of my favorites: how the Valve engineering and writing team came up with a spreadsheet-like system to build the dynamic dialogue of Left 4 Dead 2. And also, how to efficiently create zombie wounds and other neat rendering tricks.
David Galindo’s 7-part series, How much do indie PC devs make, anyways?, is not as focused on the programming part, but deserves a mention for his thoroughness and the fact that he’s made a living off of a game as silly-on-its-face as Cook, Serve, Delicious!
Mary Rose Cook’s live demo of how-to-code Space Invaders from scratch (source code here) is a terrific watch and reminds me of how much fun game dev can be.

The 14 Deadly Sins of Graphic-Adventure Design
In honor of the recently passed Satoru Iwata, Nintendo’s chief executive, programmer, and gamer: this Q&A in which Iwata describes working with the Pokemon source code (via reddit/TIL)
In honor of the recent King’s Quest remake: The Unmaking and Remaking of Sierra On-Line, which has a detailed description of the original KQ’s technical challenges and innovations.
The Most Officialest SkiFree Home Page!
Lucas Pope’s dev log for Return of the Obra Dinn
Lucas Pope’s dev log for Papers, Please

More to come…

July 6, 2015

Moving from WordPress to Jekyll

After reading the Stack Exchange engineering team’s excellent writeup on how they moved their WordPress blogs to Jekyll, I’ve decided to quit procrastinating and start my own Jekyll-powered blog – blog.danwin.com. I’ve used WordPress for my blog at danwin.com for the past 5 years and I’ll probably leave that as is, as I don’t have the Stack Exchange team’s talent or patience for doing the content migration.

Don’t get me wrong; WordPress is a fantastic piece of software, considering how long it has lived and the millions of voices it has hosted on the Internet. But it’s not for me. While my current blog gets a decent amount of Google search traffic (people love reading about how infinite scroll might be bad), over the years, I couldn’t bring myself to keep posting to it. It’s not that I didn’t have ideas – I have a Dropbox folder full of 75%-finished posts that I could paste into the WordPress text editor. And I still post daily to Twitter and Hacker News, and while I was in New York, I posted to Tumblr too.

I just got tired of the WordPress posting process. The logging into my abysmally slow cheap Dreamhost instance. Then, the 5 to 25 second wait for the New Post screen to load up. And then, the process of turning my Markdown drafts into HTML, then pasting into the WordPress rich text editor. Then the hand-fixing of HTML. Then hitting “Publish”, and waiting for my cheapo Dreamhost shared server to take 30 seconds to complete the action. And then I manually run the cache-busting plugin. When I inevitably have to fix typos or add new paragraphs to the post, I have to repeat all of the steps above, sometimes starting from the re-editing of the original Markdown text file, all the while my cheapo Dreamhost server, which I pay $99 a year for, is taking 15 to 30 seconds to load each page.

It’s funny how a few minutes of friction are more than enough to stop the creative process. So moving to a whole new blogging platform, as momentous as it seems, is worth it to me because the publishing process is reduced to mere seconds.

After setting up a new Jekyll project on my computer, this is my publishing process:

Open my text editor (Sublime Text 3).
Write plain text.
Hit Cmd-S to save my changes.
Hit Cmd-Tab to switch to my command prompt.
Type jekyll build to build out the entire blog into a subfolder.
Type s3_website push to push that subfolder online.
Wait a few seconds for the changes to appear at blog.danwin.com.
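The two typed commands in the steps above could be wrapped in a small script. This is only a sketch, not something from my actual setup: it assumes jekyll and s3_website are installed and on the PATH, and the publish function and its runner parameter are hypothetical names made up for illustration.

```python
import subprocess

# The two commands from the steps above, in the order they are typed.
PUBLISH_COMMANDS = [
    ["jekyll", "build"],
    ["s3_website", "push"],
]


def publish(runner=subprocess.run):
    """Run each publishing command in order, stopping at the first failure.

    `runner` is a hypothetical hook so a fake can be swapped in for testing;
    by default it is subprocess.run.
    """
    for command in PUBLISH_COMMANDS:
        # check=True makes subprocess.run raise CalledProcessError when the
        # command exits non-zero, so a failed build never gets pushed.
        runner(command, check=True)
    # Reached only if every command succeeded.
    return [" ".join(command) for command in PUBLISH_COMMANDS]
```

Calling publish() from the blog's project folder would run the build and then the push, which turns the last few steps into a single command.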

My first brush with Markdown and Jekyll was building out the Bastards Book of Photography using the Octopress framework.

Since then, virtually everything I’ve done has been with a static site generator, particularly the wonderful Middleman project – check out this writeup by Vox Media’s product team on how they use Middleman: Take a peek at the code that powered The Verge 50.

Obvious benefits

Fast performance via S3 hosting: How To: Hosting on Amazon S3 with CloudFront
Unhackable: check out this HN discussion, Is a static site hosted on AWS S3 ‘hackable’?
Cheap hosting: How I served 100k users without breaking the server- or a dollar bill.

The less-heralded benefits of static-site generation

Use and practice your operating system skills
Find-and-replace across text files
Backup however you want
Learn how to hack
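As one illustration of the find-and-replace point: because every post is just a text file, a few lines of Python can update all of them at once. This is a minimal sketch using only the standard library; the replace_in_posts function name and the default "*.md" pattern are assumptions for the example.

```python
from pathlib import Path


def replace_in_posts(folder, old, new, pattern="*.md"):
    """Replace `old` with `new` in every matching file under `folder`.

    Returns the list of files that were actually changed.
    """
    changed = []
    # rglob walks the folder and all of its subfolders.
    for path in sorted(Path(folder).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

For a Jekyll site, the folder would typically be _posts. Since the files live on disk, it is easy to back them up or commit them to version control before running a bulk change like this.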

Things I’ve made

These sites use either Jekyll or Middleman:

Other cool things

A list of other organizations and people using Jekyll or Middleman.

How I Jekyll

You can see the repo here.

I’ve set up an Amazon S3 bucket named blog.danwin.com and I use the s3_website tool to handle the pushing and syncing of files.

So far, no custom plugins.