EDA Before CDA

One Paragraph Summary

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of intraocular trauma.


This morning, I woke up to find that Neil Kodner had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I’ve been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I’ve wondered on and off whether zip codes were originally assigned in order of the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states seem to have some of the highest.

While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I’ve posted the code I used to explore the zip code data in the exact order of the plots I produced. I’ll let the resulting pictures tell the rest of the story.

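library(ggplot2)

# Load the zip code data (uses the zip, latitude and longitude columns).
zipcodes <- read.csv("zipcodes.csv")

# Zip code against each coordinate on its own.
ggplot(zipcodes, aes(x = zip, y = latitude)) +
  geom_point()
ggsave("latitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = zip, y = longitude)) +
  geom_point()
ggsave("longitude_vs_zip.png", height = 7, width = 10)

# Both coordinates at once, colored by zip code, in both axis orders.
ggplot(zipcodes, aes(x = latitude, y = longitude, color = zip)) +
  geom_point()
ggsave("latitude_vs_longitude_color.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("longitude_vs_latitude_color.png", height = 7, width = 10)

# Keep only points with negative longitude, then replot.
ggplot(subset(zipcodes, longitude < 0), aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("usa_color.png", height = 7, width = 10)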


(Latitude, Zipcode) Scatterplot

Latitude vs zip

(Longitude, Zipcode) Scatterplot

Longitude vs zip

(Latitude, Longitude) Heatmap

Latitude vs longitude color

(Longitude, Latitude) Heatmap

Longitude vs latitude color

(Longitude, Latitude) Heatmap without Non-States

Usa color
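
One possible follow-up, sketched here under stated assumptions rather than taken from the original analysis: the first digit of a zip code identifies a broad national region (0 in the Northeast through 9 on the West Coast), so coloring by that digit instead of the raw number should turn the smooth gradient into discrete bands. The helper name `zip_first_digit` and the output filename are hypothetical, and the code assumes `read.csv` parsed the `zip` column as a number, so leading zeros are already dropped.

```r
# Hypothetical helper: leading digit of a five-digit zip code stored as a
# number, so 501 (written "00501") maps to 0 and 98122 maps to 9.
zip_first_digit <- function(zip) {
  as.integer(zip) %/% 10000L
}

# Assumes the same zipcodes.csv as above, with zip, latitude and longitude
# columns. The plot is skipped if the file or ggplot2 is unavailable.
if (file.exists("zipcodes.csv") && requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  zipcodes <- read.csv("zipcodes.csv")

  # Treat the first digit as a category so each region gets its own color.
  zipcodes$region <- factor(zip_first_digit(zipcodes$zip))

  p <- ggplot(subset(zipcodes, longitude < 0),
              aes(x = longitude, y = latitude, color = region)) +
    geom_point(size = 0.5)
  ggsave("usa_first_digit.png", plot = p, height = 7, width = 10)
}
```

Converting the digit to a factor is the design choice that matters here: ggplot2 then uses a discrete palette, which makes regional boundaries easier to see than a continuous color scale does.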

6 responses to “EDA Before CDA”

  1. Andy W

    Related to zip codes, you may be interested in Robert Kosara’s zip “scribble” maps, http://eagereyes.org/zipscribble-maps/united-states. Also easily done in R with ggplot2! Here I make color breaks according to state, although for your question you would not want to do that.

    base_g <- ggplot(subset(zipcodes, longitude < 0), aes(x = longitude, y = latitude, color = state)) + geom_path(alpha = 0.7, size = 0.1) + coord_cartesian(xlim = c(-125, -60), ylim = c(25,50)) + theme(legend.position = "none")
    ggsave("usa_scribble.pdf", height = 7, width = 10)

    Of course one could make it look nicer (probably making some geographers roll over in their graves by not using "coord_fixed" or projecting the data into a more appropriate projection), but that goes against the grain of your point about quick EDA visualizations.

  2. Paul Hurley

    I love that you show all the visualisations, even swapping lat/long when the US is on its side, just the kind of thing I would do…

  3. Tal Galili

    Great post John, thank you.

  4. Jason
  5. Harlan

    Yep. Here are some more things you should know about ZIP codes. For starters, they’re not intrinsically geographical! The Census uses a similar alternative called ZCTA codes that are better for data analysis, although still not great. Here are a couple of relevant links: