tl;dr: This post contains an interactive CartoDB choropleth map of the latest Census population estimates data (and a top 20 list of the fastest-shrinking cities), as well as a walkthrough of how I used Python 3.x and geopandas to wrangle the Census data and shapefiles.

Last week, the U.S. Census Bureau released its 2015 population estimates. The Census’s press release focuses on just a few of the top cities ranked by growth, but you can download the raw data to analyze it yourself.

For example, embedded below is an interactive CartoDB choropleth map I’ve created (see it fullscreen) showing the population change of incorporated places (i.e. cities and towns, but not counties) from 2010 to 2015, for places that had an estimated population of at least 30,000 in 2015:

See this map fullscreen on CartoDB: https://dansn.cartodb.com/viz/f61abdf4-2075-11e6-b3a5-0e3ff518bd15/public_map

The top 20 cities with at least 100,000 population, ranked by biggest percentage loss from 2010 to 2015

Since so many of the other Census data news releases focus on the fastest-growing cities, here’s the list of the fastest-shrinking cities:

State Place 2015 estimated pop. % change 2010-2015
Michigan Detroit city 677116 -4.8
Illinois Rockford city 148278 -3.0
Alabama Montgomery city 200602 -2.6
Ohio Toledo city 279789 -2.5
Ohio Cleveland city 388072 -2.0
Louisiana Shreveport city 197204 -1.9
Mississippi Jackson city 170674 -1.8
Connecticut Waterbury city 108802 -1.4
Georgia Macon-Bibb County 153515 -1.2
New York Buffalo city 258071 -1.2
Ohio Dayton city 140599 -1.2
Missouri St. Louis city 315685 -1.1
Connecticut Hartford city 124006 -1.1
New York Syracuse city 144142 -0.8
Ohio Akron city 197542 -0.7
Virginia Hampton city 136454 -0.7
Alabama Mobile city 194288 -0.5
New York Rochester city 209802 -0.4
Kansas Topeka city 127265 -0.4
Louisiana Baton Rouge city 228590 -0.4

The rest of this post explains where I got the data and how I wrangled it:

Getting the population estimates data

The Census press release links to its data site, and if you poke around, you’ll eventually land on this page, which packages the data in easy-to-download CSVs:

Population Estimates: Incorporated Places and Minor Civil Divisions Datasets: Subcounty Resident Population Estimates: April 1, 2010 to July 1, 2015

Direct link: https://www.census.gov/popest/data/cities/totals/2015/files/SUB-EST2015_ALL.csv

Getting the boundary shapefiles for incorporated places

The Census also maintains shapefiles for the many levels of geography that it aggregates data across, including incorporated places. The landing page for that data is here:

Cartographic Boundary Shapefiles - Places (Incorporated Places and Census Designated Places)

While there are a lot of incorporated places in the U.S., the shapefile data for them is no larger than the shapefile data needed to show all of America’s zip codes. Unfortunately, unlike the zip code data, the Census doesn’t seem to provide a nationwide bulk download of incorporated places in addition to the individual state files, which means you’ll have to download all 50+ files and somehow concatenate them. I don’t know how to do that manually, but I do know how to do it programmatically.

Geospatial data wrangling with Python 3.x and geopandas

Getting geopandas (the equivalent of R’s rgdal) to run on Python 3.x and my OS X machine was a long enough journey that I wrote a separate IPython notebook about it.
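
Once it’s installed, a quick sanity check to confirm that geopandas imports cleanly under your Python 3.x interpreter (nothing specific to this project):

import sys
import geopandas as gp
# confirm we're running Python 3.x and that geopandas (and its GDAL/Fiona stack) imports
print(sys.version)
print(gp.__version__)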

Assuming you have geopandas installed, here are the steps to bulk download and combine the Census shapefiles into a single GeoJSON file – which you can download here if you’re impatient: stash.compjour.org/samples/census/cbf_places_2015_500k.json

First, a little warmup: downloading the CSV of population estimates and saving it locally. Yes, it’s a lot of boilerplate for downloading a single file, but it’ll make the next chunk of code a little more palatable.

from os import makedirs
from os.path import join
import requests
DATA_DIR = 'data'
DEST_FNAME = join(DATA_DIR, 'SUB-EST2015_ALL.csv')
SRC_URL = 'https://www.census.gov/popest/data/cities/totals/2015/files/SUB-EST2015_ALL.csv'

# make sure the local data directory exists, then download the CSV and save it
makedirs(DATA_DIR, exist_ok=True)
print("Downloading", SRC_URL)
resp = requests.get(SRC_URL)
print("Saving to", DEST_FNAME)
with open(DEST_FNAME, 'w') as wf:
    wf.write(resp.text)
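
Just to make sure the file saved correctly, here’s a quick sanity check (a sketch that assumes you have plain pandas installed; the columns it peeks at are the ones used later in this post):

from os.path import join
import pandas as pd

df = pd.read_csv(join('data', 'SUB-EST2015_ALL.csv'))
print(len(df), "rows")
# peek at a few of the columns we'll use later on
print(df[['SUMLEV', 'STNAME', 'NAME', 'POPESTIMATE2010', 'POPESTIMATE2015']].head())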

Batch scraping the Census shapefiles

This next chunk of code assumes you’re running it in a new session, hence the re-imports and re-initialization of the file directories. It uses a regex to pick out the 50+ zip file URLs in the webform’s HTML code, then downloads and saves each zip file (renamed to the corresponding state FIPS code).

Here’s what the HTML for each zipfile URL looks like – I’ve added the newlines for easier reading, but that value= "..." weirdness is in the Census site’s source code:

<option value= "http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_01_place_500k.zip">
   Alabama
</option>

from os import makedirs
from os.path import exists, join
import re
import requests

SHPFILE_DIR = join('data', 'cbf_places_2015_500k')
makedirs(SHPFILE_DIR, exist_ok=True)
URLS_PATTERN = r'(?<=value= ").+?cb_2015_.+?place_500k.zip(?=")'
SOURCE_URL = 'https://www.census.gov/geo/maps-data/data/cbf/cbf_place.html'
resp = requests.get(SOURCE_URL)

# perform lazy person's html parsing:
for url in re.findall(URLS_PATTERN, resp.text):
    # the local filename is just the two-digit state FIPS code + '.zip'
    statecode = re.search(r'(?<=2015_)\d{2}(?=_place)', url).group()
    fname = join(SHPFILE_DIR, statecode + '.zip')
    if not exists(fname):
        print("Downloading", url)
        zresp = requests.get(url)
        with open(fname, 'wb') as w:
            print("\tWriting", fname)
            w.write(zresp.content)
    else:
        print("Already have", fname)

Combining all shapefiles into one GeoJSON

This script leverages geopandas and the regular pandas concat() function to combine the Census shapefiles. Notice that in the second-to-last line we have to cast the concatenated dataframe back to a GeoDataFrame in order to save it properly as GeoJSON:

from glob import glob
from os.path import join
import geopandas as gp
import pandas as pd

SHPFILE_DIR = join('data', 'cbf_places_2015_500k')
DEST_FNAME = join('data', 'cbf_places_2015_500k.json')
dfs = []
# read each state's zipped shapefile into its own GeoDataFrame
for fname in glob(join(SHPFILE_DIR, '*.zip')):
    d = gp.read_file('zip://' + fname)
    print("Opened", fname, "--", len(d), 'records.')
    dfs.append(d)
# pd.concat() returns a plain DataFrame, so cast it back to a GeoDataFrame
alldf = gp.GeoDataFrame(pd.concat(dfs, ignore_index=True))
alldf.to_file(DEST_FNAME, driver='GeoJSON')
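
As a quick check that the concatenation worked, you can read the combined GeoJSON back in and look at the record count and the columns (a sketch; the column names come straight from the Census shapefiles, e.g. STATEFP and PLACEFP, which are what we join on later):

from os.path import join
import geopandas as gp

places = gp.read_file(join('data', 'cbf_places_2015_500k.json'))
print(len(places), "place boundaries")
print(places.columns.tolist())   # expect STATEFP, PLACEFP, NAME, GEOID, geometry, etc.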

Again, here’s the resulting GeoJSON file if you want to see how it turned out:

http://stash.compjour.org/samples/census/cbf_places_2015_500k.json

PostGIS query

While we could keep wrangling the data with geopandas, I prefer to upload the two datasets (the population estimates CSV and the combined GeoJSON) straight to CartoDB and do the rest in PostGIS.

Covering PostGIS and CartoDB is a topic too broad for this tutorial, but I’ve written a separate one for the Stanford Journalism curriculum that you can follow: Using PostGIS, SQL, and CartoDB to identify schools at risk from Oklahoma’s earthquakes

The join between the two Census tables uses the corresponding state and place FIPS codes. We also narrow the result set by restricting the summary level to sumlev = '162', because '162' represents an “Incorporated place” according to the Census documentation for the population estimates file.

And finally, to keep small places with proportionately huge booms from skewing the map, we restrict the selection to places that have an estimated population of at least 30,000 in 2015:

(pardon the long SELECT statement; we can’t use SELECT * because the special CartoDB fields, such as cartodb_id and the_geom_webmercator, must come from only one of the two tables)

SELECT 
  pop.stname AS state_name,
  places.name AS place_name,
  places.statefp, places.placefp,
  places.the_geom, places.affgeoid,
  places.aland, places.awater, places.geoid, places.lsad,
  places.the_geom_webmercator, places.cartodb_id,
  census2010pop, estimatesbase2010, popestimate2010,
  popestimate2011, popestimate2012, popestimate2013,
  popestimate2014, popestimate2015,
  (popestimate2015 - popestimate2010) AS delta_2010_2015,
  (popestimate2015 - popestimate2014) AS delta_2014_2015,
  round(100.0 * (popestimate2015 - popestimate2010) / popestimate2010, 1)  
    AS delta_2010_2015_pct,
  round(100.0 * (popestimate2015 - popestimate2014) / popestimate2014, 1)  
    AS delta_2014_2015_pct
FROM census_places_popest_2015 AS pop
INNER JOIN cbf_places_2015_500k AS places
  ON pop.state = places.statefp
     AND pop.place = places.placefp
WHERE pop.sumlev = '162' 
  AND pop.popestimate2015 >= 30000
  AND pop.popestimate2010 > 0
  AND pop.popestimate2014 > 0
ORDER BY popestimate2015 DESC
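
If you’d rather stay in Python, here’s a rough pandas/geopandas equivalent of that join (just a sketch, not what I used for the map; it assumes the column names from the two files above, and zero-pads the CSV’s numeric STATE and PLACE codes so they match the shapefiles’ two-character STATEFP and five-character PLACEFP strings):

from os.path import join
import geopandas as gp
import pandas as pd

pop = pd.read_csv(join('data', 'SUB-EST2015_ALL.csv'))
# incorporated places only, with at least 30,000 estimated residents in 2015
pop = pop[(pop.SUMLEV == 162) & (pop.POPESTIMATE2015 >= 30000) & (pop.POPESTIMATE2010 > 0)].copy()
# zero-pad the numeric FIPS codes so they match the shapefile's string columns
pop['STATEFP'] = pop['STATE'].astype(str).str.zfill(2)
pop['PLACEFP'] = pop['PLACE'].astype(str).str.zfill(5)

places = gp.read_file(join('data', 'cbf_places_2015_500k.json'))
joined = places.merge(pop, on=['STATEFP', 'PLACEFP'])
joined['delta_2010_2015_pct'] = (100.0 * (joined.POPESTIMATE2015 - joined.POPESTIMATE2010)
                                 / joined.POPESTIMATE2010).round(1)
print(len(joined), "places joined")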

The resulting joined dataset is here: census_2015_places_and_popest_deltas

And here’s the link to the map on CartoDB: https://dansn.cartodb.com/viz/f61abdf4-2075-11e6-b3a5-0e3ff518bd15/public_map

Calculating the top 20 population losses

This doesn’t require geopandas, just regular pandas with its DataFrame filtering and sorting functionality:

from os.path import join
import pandas as pd
df = pd.read_csv(join('data', 'SUB-EST2015_ALL.csv'))
# keep only the incorporated-place records (summary level 162)
df = df[df.SUMLEV == 162]
df['delta_2010_2015'] = df['POPESTIMATE2015'] - df['POPESTIMATE2010']
df['delta_2010_2015_pct'] = (df['delta_2010_2015'] * 100 / df['POPESTIMATE2010']).round(1)

Filter the list for POPESTIMATE2015 >= 100000, sort by percentage change, and print it as a tab-delimited string:

xf = df[df.POPESTIMATE2015 >= 100000].sort_values('delta_2010_2015_pct').head(20)
# print out in tab-delimited format:
print(xf.to_csv(sep='\t', index=False,
                columns=['STNAME', 'NAME', 'POPESTIMATE2015', 'delta_2010_2015_pct']))
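
As a quick spot check, the first row of xf should match the top of the table earlier in this post:

# the biggest percentage loss should be Detroit (-4.8% from 2010 to 2015)
print(xf.iloc[0][['STNAME', 'NAME', 'POPESTIMATE2015', 'delta_2010_2015_pct']])
# expect: Michigan, Detroit city, 677116, -4.8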