Mapping the U.S. Census population estimates for incorporated places with CartoDB and geopandas
tl;dr: This post contains an interactive CartoDB choropleth map of the latest Census population estimates data (and a top 20 list of fastest-shrinking cities), as well as a walkthrough of how I used Python 3.x and geopandas to wrangle Census data and shapefiles.
Last week, the U.S. Census Bureau released its 2015 population estimates. The Census’s press release focuses on just a few of the top cities ranked by growth, but you can download the raw data to analyze yourself.
For example, embedded below is an interactive CartoDB choropleth map I’ve created (see it fullscreen) showing the population change of incorporated places (i.e. cities and towns, but not counties) from 2010 to 2015, for places that had an estimated population of at least 30,000 in 2015:
See this map fullscreen on CartoDB: https://dansn.cartodb.com/viz/f61abdf4-2075-11e6-b3a5-0e3ff518bd15/public_map
The top 20 cities >= 100,000 pop, ranked by biggest losses from 2010 to 2015
Since so many of the other Census data news releases focus on the fastest-growing cities, here's the list of the fastest-shrinking cities:
State | Place | 2015 estimated pop. | % change 2010-2015 |
---|---|---|---|
Michigan | Detroit city | 677116 | -4.8 |
Illinois | Rockford city | 148278 | -3.0 |
Alabama | Montgomery city | 200602 | -2.6 |
Ohio | Toledo city | 279789 | -2.5 |
Ohio | Cleveland city | 388072 | -2.0 |
Louisiana | Shreveport city | 197204 | -1.9 |
Mississippi | Jackson city | 170674 | -1.8 |
Connecticut | Waterbury city | 108802 | -1.4 |
Georgia | Macon-Bibb County | 153515 | -1.2 |
New York | Buffalo city | 258071 | -1.2 |
Ohio | Dayton city | 140599 | -1.2 |
Missouri | St. Louis city | 315685 | -1.1 |
Connecticut | Hartford city | 124006 | -1.1 |
New York | Syracuse city | 144142 | -0.8 |
Ohio | Akron city | 197542 | -0.7 |
Virginia | Hampton city | 136454 | -0.7 |
Alabama | Mobile city | 194288 | -0.5 |
New York | Rochester city | 209802 | -0.4 |
Kansas | Topeka city | 127265 | -0.4 |
Louisiana | Baton Rouge city | 228590 | -0.4 |
The rest of this post explains where I got the data and how I wrangled it:
Getting the population estimates data
The Census press release links to its data site, and if you poke around, you’ll eventually land on this page, which packages the data in easy-to-download CSVs:
Direct link: https://www.census.gov/popest/data/cities/totals/2015/files/SUB-EST2015_ALL.csv
Getting the boundary shapefiles for incorporated places
The Census also maintains shapefiles for the many levels of geography that it aggregates data across, including incorporated places. The landing page for that data is here:
Cartographic Boundary Shapefiles - Places (Incorporated Places and Census Designated Places)
While there are a lot of incorporated places in the U.S., the shapefile data for them is no larger than the shapefile data needed to show all of America’s ZIP codes. Unfortunately, unlike the ZIP code data, the Census doesn’t seem to provide a nationwide bulk download of incorporated places in addition to the individual state files, which means you’ll have to download all 50+ files individually and somehow concatenate them. I don’t know how to do that manually, but I do know how to do it programmatically.
Geospatial data wrangling with Python 3.x and geopandas
Getting geopandas (roughly the equivalent of R’s rgdal) to run on Python 3.x on my OS X machine was a long enough journey that I wrote a separate IPython notebook about it.
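Once it’s installed, a quick way to confirm that geopandas and its underlying fiona/GDAL bindings import cleanly is to print their versions (this check is my own addition, not part of the original workflow):

# Sanity check: geopandas and fiona should both import without errors.
import geopandas as gp
import fiona
print("geopandas:", gp.__version__)
print("fiona:", fiona.__version__)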
Assuming you have geopandas installed, here are the steps to bulk download and combine the Census shapefiles into a single GeoJSON file – which you can download here if you’re impatient: stash.compjour.org/samples/census/cbf_places_2015_500k.json
First, a little warmup: downloading the CSV of population estimates and saving it locally. Yes, it’s a lot of boilerplate for downloading a single file, but it’ll make the next chunk of code a little more palatable.
from os import makedirs
from os.path import join
import requests

DATA_DIR = 'data'
DEST_FNAME = join(DATA_DIR, 'SUB-EST2015_ALL.csv')
SRC_URL = 'https://www.census.gov/popest/data/cities/totals/2015/files/SUB-EST2015_ALL.csv'
makedirs(DATA_DIR, exist_ok=True)
with open(DEST_FNAME, 'w') as wf:
    print("Downloading", SRC_URL)
    resp = requests.get(SRC_URL)
    print("Saving to", DEST_FNAME)
    wf.write(resp.text)
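As a quick sanity check (my addition, not part of the original walkthrough), you can load the downloaded CSV with pandas and peek at its shape and first few rows:

import pandas as pd
# If you downloaded the CSV by hand instead, you may need encoding='latin-1'.
popest = pd.read_csv('data/SUB-EST2015_ALL.csv')
print(popest.shape)
print(popest.head())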
Batch scraping the Census shapefiles
This next chunk of code assumes you’re running it in a new session, hence the reinitialization of the file directories. It uses a regex to pick out the 50+ zip file URLs in the webform’s HTML and then downloads and saves each zip file (renamed to the corresponding state FIPS code).
Here’s what the HTML for each zipfile URL looks like – I’ve added the newline characters for easier reading, but that value= "..." weirdness really is in the Census site’s source code:
<option value= "http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_01_place_500k.zip">
Alabama
</option>
from os import makedirs
from os.path import exists, join
import re
import requests

SHPFILE_DIR = join('data', 'cbf_places_2015_500k')
makedirs(SHPFILE_DIR, exist_ok=True)
URLS_PATTERN = r'(?<=value= ").+?cb_2015_.+?place_500k.zip(?=")'
SOURCE_URL = 'https://www.census.gov/geo/maps-data/data/cbf/cbf_place.html'
resp = requests.get(SOURCE_URL)
# perform lazy person's html parsing:
for url in re.findall(URLS_PATTERN, resp.text):
    # filename is just the two-digit state FIPS code + '.zip'
    statecode = re.search(r'(?<=2015_)\d{2}(?=_place)', url).group()
    fname = join(SHPFILE_DIR, statecode + '.zip')
    if not exists(fname):
        print("Downloading", url)
        zresp = requests.get(url)
        with open(fname, 'wb') as w:
            print("\tWriting", fname)
            w.write(zresp.content)
    else:
        print("Already have", fname)
Combining all shapefiles into one GeoJSON
This script leverages geopandas and the (normal) pandas function concat() to combine the Census shapefiles. Notice in the second-to-last line that we have to cast the concatenated dataframe as a GeoDataFrame in order to save it properly as GeoJSON:
from glob import glob
from os.path import join
import geopandas as gp
import pandas as pd

SHPFILE_DIR = join('data', 'cbf_places_2015_500k')
DEST_FNAME = join('data', 'cbf_places_2015_500k.json')
dfs = []
for fname in glob(join(SHPFILE_DIR, '*.zip')):
    d = gp.read_file('zip://' + fname)
    print("Opened", fname, "--", len(d), "records.")
    dfs.append(d)
# concat() returns a plain DataFrame, so cast it back to a GeoDataFrame before saving:
alldf = gp.GeoDataFrame(pd.concat(dfs))
alldf.to_file(DEST_FNAME, driver='GeoJSON')
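If you want to confirm the combined file is readable (a check I’ve added here, not part of the original steps), load it back with geopandas and count the features:

import geopandas as gp
combined = gp.read_file('data/cbf_places_2015_500k.json')
print(len(combined), "place records in the combined GeoJSON")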
Again, here’s the GeoJSON file if you want to see what it turned out as:
http://stash.compjour.org/samples/census/cbf_places_2015_500k.json
PostGIS query
While we can do data-wrangling with geopandas, I prefer to upload the datasets straight to CartoDB and do the rest in PostGIS. Here are the datasets hosted on CartoDB:
- The population estimates data, as provided by the U.S. Census: census_places_popest_2015
- The combined places shapefile that we created with geopandas: cbf_places_2015_500k
Covering PostGIS and CartoDB is a topic too broad for this tutorial, but I’ve written a separate one for the Stanford Journalism curriculum that you can follow: Using PostGIS, SQL, and CartoDB to identify schools at risk from Oklahoma’s earthquakes
The join between the two Census tables uses the corresponding state and place FIPS codes. We also narrow the resultset by restricting the summary level to sumlev = '162', because '162' represents an “Incorporated place,” according to the Census documentation for the population estimates file.
And finally, to keep small places with proportionately huge booms from skewing the map, we restrict the selection to places that have an estimated population of at least 30,000 in 2015:
(Pardon the verbose SELECT statement; we can’t use SELECT * because both tables carry CartoDB’s special fields, such as the_geom and cartodb_id, and the joined result can include only one set of them.)
SELECT
pop.stname AS state_name,
places.name AS place_name,
places.statefp, places.placefp,
places.the_geom, places.affgeoid,
places.aland, places.awater, places.geoid, places.lsad,
places.the_geom_webmercator, places.cartodb_id,
census2010pop, estimatesbase2010, popestimate2010,
popestimate2011, popestimate2012, popestimate2013,
popestimate2014, popestimate2015,
(popestimate2015 - popestimate2010) AS delta_2010_2015,
(popestimate2015 - popestimate2014) AS delta_2014_2015,
round(100.0 * (popestimate2015 - popestimate2010) / popestimate2010, 1)
AS delta_2010_2015_pct,
round(100.0 * (popestimate2015 - popestimate2014) / popestimate2014, 1)
AS delta_2014_2015_pct
FROM census_places_popest_2015 AS pop
INNER JOIN cbf_places_2015_500k AS places
ON pop.state = places.statefp
AND pop.place = places.placefp
WHERE pop.sumlev = '162'
AND pop.popestimate2015 >= 30000
AND pop.popestimate2010 > 0
AND pop.popestimate2014 > 0
ORDER BY popestimate2015 DESC
The resulting joined dataset is here: census_2015_places_and_popest_deltas
And here’s the link to the map on CartoDB: https://dansn.cartodb.com/viz/f61abdf4-2075-11e6-b3a5-0e3ff518bd15/public_map
Calculating the top 20 population losses
This doesn’t require geopandas – just regular pandas with its regular DataFrames and querying/sorting functionality:
from os.path import join
import pandas as pd
df = pd.read_csv(join('data', 'SUB-EST2015_ALL.csv'))
df = df[df.SUMLEV == 162]
df['delta_2010_2015'] = df['POPESTIMATE2015'] - df['POPESTIMATE2010']
df['delta_2010_2015_pct'] = (df['delta_2010_2015'] * 100 / df['POPESTIMATE2010']).round(1)
Filter the list for POPESTIMATE2015 >= 100000 and format it as a tab-delimited string:
xf = df[df.POPESTIMATE2015 >= 100000].sort_values('delta_2010_2015_pct').head(20)
# print out in tab-delimited format:
print(xf.to_csv(sep='\t', index=False,
                columns=['STNAME', 'NAME', 'POPESTIMATE2015', 'delta_2010_2015_pct']))