Dan Nguyen's Blog | Thoughts, Data and Computational Journalism

About Dan: I'm a computational data journalist and programmer currently living in Chicago. Previously, I was a visiting professor at the Computational Journalism Lab at Stanford University. I advocate the use of programming and computational thinking to expand the scale and depth of storytelling and accountability journalism. I have also worked as a news application developer for the investigative newsroom ProPublica, where I built ProPublica's first and, even to this day, some of its most popular data projects.

You can follow me on Twitter at @dancow and on GitHub at dannguyen. My old WordPress blog and homepage are at: http://danwin.com.


Why I've Stopped Teaching Web Scraping

To come: lots of negative thoughts about teaching web scraping to beginning programmers. This is sparked by a recent NICAR-L question. Here’s my partial response, to be later elaborated when I have time:

In terms of teaching web-scraping: I don’t think this is worth the pain. Web scraping ends up not just being about scraping HTML, which is very complicated itself (the syntax and the DOM), but about understanding the modern web stack. I’ve taught the web inspector because it’s a good way to see all the pieces that the web is made up of, but students don’t generally get the big picture unless they’ve done actual web development before.

The other part of this is that there is an ever-growing, incredible amount of data that is available through APIs and data dumps. With Twitter, Spotify, Instagram, and/or Facebook, you can do everything from activity analysis, monitoring of government accounts, data aggregation, to bot making and network analysis…and it’s data that virtually all of your students will be familiar with as end-users and so it’s very easy to stoke their curiosity about the data’s technical underpinnings, and how to use math to do relevant analyses.

If you don’t like using such proprietary services, then the U.S. government, as well as Socrata and the Sunlight Foundation, has more than enough datasets for fun and curiosity and truly deep investigative work. You don’t even have to get into the work of accessing APIs (which admittedly can be a pain in the ass if you’re dealing with the OAuth2 flow, never mind the part where you frantically check your students’ repos to see if they’ve published their passwords to GitHub)…There are plenty of static text dumps (in CSV/JSON/TSV, etc.) to work with…I find that one of the hardest yet most important concepts for students to get is the basic data structures (lists/arrays and hashes/dicts) and why we serialize/deserialize them. Scraping from websites involves the same fundamental concepts, except with an endless pile of shit to get there.
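
To make the serialize/deserialize idea concrete, here is a minimal Python sketch using only the standard library. The records are invented for illustration and do not come from any real dataset; the point is only to show a list of dicts going out to text (JSON, then CSV) and coming back.

```python
import csv
import io
import json

# Hypothetical records, made up for this example: a list of dicts.
records = [
    {"agency": "Parks", "inspections": 12},
    {"agency": "Transit", "inspections": 7},
]

# Serialize: turn the in-memory list of dicts into a JSON string.
as_json = json.dumps(records)

# Deserialize: turn that string back into a list of dicts.
from_json = json.loads(as_json)

# Serialize the same data as CSV text, written to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["agency", "inspections"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

# Deserialize the CSV text. CSV carries no type information,
# so every value comes back as a string and numbers must be converted.
from_csv = list(csv.DictReader(io.StringIO(as_csv)))
total = sum(int(row["inspections"]) for row in from_csv)
```

JSON round-trips the integers intact, while CSV hands everything back as strings. That difference is one reason the "why" of serialization is worth teaching before any scraping.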

My favorite fun, informative reads about game development

Being a game developer is what got me interested in computer engineering. I’ve veered a long way from that goal but I still love reading about it, because programming for games involves so many clever hacks, especially on the user-facing side.

Here are a few links and reads I’ve enjoyed:

More to come…

Moving from WordPress to Jekyll

After reading the Stack Exchange engineering team’s excellent writeup on how they moved their WordPress blogs to Jekyll, I’ve decided to quit procrastinating and start my own Jekyll-powered blog – blog.danwin.com. I’ve used WordPress for my blog at danwin.com for the past 5 years and I’ll probably leave that as is, as I don’t have the Stack Exchange team’s talent or patience for doing the content migration.

Don’t get me wrong; WordPress is a fantastic piece of software, considering how long it has lived and the millions of voices it has hosted on the Internet. But it’s not for me. While my current blog gets a decent amount of Google search traffic (people love reading about how infinite scroll might be bad), over the years, I couldn’t bring myself to keep posting to it. It’s not that I didn’t have ideas – I have a Dropbox folder full of 75%-finished posts that I could paste into the WordPress text editor. And I still post daily to Twitter and Hacker News, and I posted to Tumblr while I was in New York.

I just got tired of the WordPress posting process. The logging into my abysmally slow cheap Dreamhost instance. Then, the 5 to 25 second wait for the New Post screen to load up. And then, the process of turning my Markdown drafts into HTML, then pasting into the WordPress rich text editor. Then the hand-fixing of HTML. Then hitting “Publish”, and waiting for my cheapo Dreamhost shared server to take 30 seconds to complete the action. And then I manually run the cache-busting plugin. When I inevitably have to fix typos or add new paragraphs to the post, I have to repeat all of the steps above, sometimes starting from the re-editing of the original Markdown textfile, all the while my cheapo Dreamhost server, which I pay $99 a year for, is taking 15 to 30 seconds to load each page.

It’s funny how a few minutes of friction are more than enough to stop the creative process. So moving to a whole new blogging platform, as momentous as it seems, is worth it to me because the publishing process is reduced to mere seconds.

After setting up a new Jekyll project on my computer, this is my publishing process:

  1. Open my text editor (Sublime Text 3).
  2. Write plain text.
  3. Hit Cmd-S to save my changes.
  4. Hit Cmd-Tab to switch to my command prompt.
  5. Type jekyll build to build out the entire blog into a subfolder.
  6. Type s3_website push to push that subfolder online.
  7. Wait a few seconds for the changes to appear at blog.danwin.com.
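
Steps 5 and 6 above could be wrapped into a single command. The following is a rough Python sketch, not the script actually used for this blog: the names PUBLISH_COMMANDS and publish are hypothetical, and it assumes the jekyll and s3_website commands are both on the PATH and that it is run from the root of the Jekyll project.

```python
import subprocess

# The two commands from steps 5 and 6, in the order they are typed.
PUBLISH_COMMANDS = [
    ["jekyll", "build"],
    ["s3_website", "push"],
]


def publish(run=subprocess.run):
    """Run each publishing command in order.

    The `run` parameter (hypothetical, for illustration) defaults to
    subprocess.run; with check=True, a failed build raises
    CalledProcessError, so the push never happens after a broken build.
    """
    for command in PUBLISH_COMMANDS:
        run(command, check=True)
```

Calling publish() from the project root would then build and push in one go. Passing in a stand-in for run makes it possible to check the command sequence without touching the live site.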

My first brush with Markdown and Jekyll was building out the Bastards Book of Photography using the Octopress framework.

Since then, virtually everything I’ve done has been with a static site generator, particularly the wonderful Middleman project – check out this writeup by Vox Media’s product team on how they use Middleman: Take a peek at the code that powered The Verge 50.

Obvious benefits

The less-heralded benefits of static-site generation

  • Use and practice your operating system skills
  • Find-and-replace across text files
  • Backup however you want
  • Learn how to hack

Things I’ve made

These sites use either Jekyll or Middleman:

Other cool things

A list of other organizations and people using Jekyll or Middleman.

How I Jekyll

You can see the repo here.

I’ve set up an Amazon S3 bucket named blog.danwin.com and I use s3_website to handle the pushing and syncing of files.

So far, no custom plugins.
