
Charting Business Growth and Going Back to Geometry

An excerpt of the codebase. (Jonah Smith/NYT Institute)

When we arrived in Tucson, one of our immediate fascinations was the “Modern Streetcar,” as the locals call it. Soon, Polo Rocha was hard at work on a story about business rejuvenation tied to the streetcar. Considering other facets of the story, we learned that the City of Tucson releases the records of business licenses issued. Weekly business license datasets are available on the Tucson local government webpage going back to the beginning of 2011.

Using the data, I started a project to analyze business growth around the trolley line versus the wider city. I first wrote a script to download all 233 files from the web page.

Unlike many data sets provided by city governments, the files were available in plain text format, which is quite helpful for developers because plain text does not require special software to read. Unfortunately, the files used fixed widths to denote the columns, which makes them a bit more difficult to separate than, say, comma- or tab-separated columns.

To extract the data into tab-delimited files, I wrote a script to separate the columns, following a helpful guide provided by the city. And I made an interesting discovery: the dates of the files differed from the start dates given in the data. This let me compare the time between application approval and license granting. If this time had been shorter near the train station than in other parts of the city, we hypothesized, it might be evidence of the city prioritizing business development downtown.
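As a rough illustration of that step, here is a minimal Python sketch of slicing fixed-width lines into tab-delimited rows. The column names and character offsets in `COLUMNS` are hypothetical; the real layout comes from the city's guide and is not reproduced here.

```python
import csv
import io

# Hypothetical layout: (column name, start offset, end offset).
# The actual offsets would be taken from the city's guide.
COLUMNS = [
    ("license_no", 0, 8),
    ("business_name", 8, 38),
    ("address", 38, 68),
    ("start_date", 68, 78),
]


def parse_fixed_width(line):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, start, end in COLUMNS}


def to_tsv(lines):
    """Convert an iterable of fixed-width lines into tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([name for name, _, _ in COLUMNS])  # header row
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_fixed_width(line)
        writer.writerow([row[name] for name, _, _ in COLUMNS])
    return out.getvalue()
```

Keeping the layout in one table means a change in the city's format would only require editing `COLUMNS`, not the parsing logic.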

The result of all this parsing was a single clean file of about 15,000 entries.

Perhaps the most challenging step in my analysis was to locate new businesses in reasonable proximity to the new tram line. To do so, we needed to collect the longitude and latitude coordinates of every trolley station and every business in the dataset — only 15,000 sets of coordinates.

The coordinates of the trolley stations were surprisingly difficult to collect. The local public transportation agency publishes General Transit Feed Specification data, which includes coordinate information for its stops, but not all trolley stops are marked. I ended up finding the remaining stations on my own.
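For readers unfamiliar with the format, a GTFS feed stores its stops in a CSV file called `stops.txt`, with standard columns such as `stop_id`, `stop_name`, `stop_lat` and `stop_lon`. A minimal sketch of reading it in Python might look like the following; the function name and the sample rows are made up for illustration and are not the agency's data.

```python
import csv
import io


def load_stops(stops_file):
    """Read a GTFS stops.txt file object into {stop_id: (name, lat, lon)}."""
    stops = {}
    for row in csv.DictReader(stops_file):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops


# Made-up sample rows in the stops.txt layout.
SAMPLE_STOPS = io.StringIO(
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "1,Example Stop A,32.2200,-110.9700\n"
    "2,Example Stop B,32.2300,-110.9500\n"
)
example_stops = load_stops(SAMPLE_STOPS)
```

Stops missing from the feed would then have to be added to the dictionary by hand, as described above.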

Finding the coordinates of the businesses was easier but a lot more time consuming. Google offers an API for turning addresses into coordinates, a process called geocoding. After some legwork with the API limits over two days, all of the businesses with recognizable addresses also had latitude and longitude data and could then be mapped and analyzed.
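Working within a rate limit mostly comes down to spacing out requests and never looking up the same address twice. The sketch below shows that pattern in Python. It is not the script used for this project: `geocode_fn` is a hypothetical stand-in for whatever function actually calls the geocoding service, and the `per_second` rate is an arbitrary placeholder rather than a documented quota.

```python
import time


def geocode_all(addresses, geocode_fn, per_second=5.0, sleep=time.sleep):
    """Geocode each unique address once, pausing between requests.

    geocode_fn is a hypothetical callable: address -> (lat, lon), or None
    when the address is not recognized.
    """
    results = {}
    delay = 1.0 / per_second
    for address in addresses:
        if address in results:
            continue  # already looked up; don't spend quota twice
        results[address] = geocode_fn(address)
        sleep(delay)  # stay under the assumed request rate
    return results
```

Addresses that come back as `None` can be set aside, which matches the note above that only businesses with recognizable addresses ended up with coordinates.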

The next challenge was to compute distances between the stops and businesses in order to find the closest stop to any given business. As it turns out, calculating these distances is somewhat challenging. You may remember the distance formula from high school geometry, but that is for computing distances on a flat surface. The Earth, on the other hand, is round. And to further complicate matters, the Earth is not a perfect sphere.

“Distance over the Earth” math usually works better for long distances because they are less susceptible to rounding errors. (When computers do calculations, they inevitably introduce a little error. When the numbers are small to begin with, that error can dwarf the actual value.) The distances are quite small in this case, since we are only dealing with the City of Tucson. Though I implemented a geodesic approximation to analyze the distances, the results were clearly not accurate. When I plotted the business locations determined to be close to the streetcar line, the points were literally all over the map.
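For reference, one common "distance over the Earth" calculation is the haversine formula, which treats the Earth as a perfect sphere. The Python sketch below is an illustration only. The post does not say which geodesic approximation was implemented, and the haversine form is generally regarded as well behaved at short range, so this is not necessarily the calculation that produced the results described above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```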

Rethinking the problem, we decided that the geographic area of Tucson was small enough that it could be approximated by a flat surface. Here’s the idea: if you took a basketball and looked at some tiny square on its surface, it would be hard to tell that the square is curved rather than flat. As that square gets smaller, it looks less and less curved. In our case the sphere is the Earth, and Tucson is that square. (If the Earth were the size of a basketball, the square representing Tucson would be about 1/8th the surface area of a single pebble on its surface.)

Our other insight was that the exact distribution around stations was irrelevant; we just wanted to know about businesses that are “close” to the stations. We decided to use the handy high school distance formula after all, and to select “close” businesses by trial and error.
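The flat-surface idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the kilometers-per-degree constant, the cosine scaling of longitude (the post does not say whether the original script applied it), the function names, and the 0.5 km cutoff, which stands in for whatever value trial and error produced.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def flat_distance_km(lat1, lon1, lat2, lon2):
    """High school distance formula on a locally flat patch of the Earth."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    # A degree of longitude shrinks with latitude, so scale it by cos(lat).
    dx = (lon2 - lon1) * math.cos(mean_lat) * KM_PER_DEGREE
    dy = (lat2 - lat1) * KM_PER_DEGREE
    return math.hypot(dx, dy)


def nearest_stop(business, stops):
    """Return (stop_name, distance_km) for the stop closest to a business.

    business is a (lat, lon) pair; stops maps stop names to (lat, lon).
    """
    name, coords = min(
        stops.items(),
        key=lambda item: flat_distance_km(*business, *item[1]),
    )
    return name, flat_distance_km(*business, *coords)


def is_close(business, stops, cutoff_km=0.5):
    """True if the nearest stop lies within the (assumed) cutoff distance."""
    return nearest_stop(business, stops)[1] <= cutoff_km
```

Because only a yes-or-no answer about closeness is needed, small errors from ignoring the Earth's curvature matter far less than the choice of cutoff.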

In the end, the data were not as interesting as we might have hoped. Still, we think they provide an interesting perspective on business growth in Tucson, and we think this strategy for analysis could help re-assess the streetcar’s impact a few years down the line.

These charts show the trend in licenses, by category, issued to businesses within walking distance of the Sun Link streetcar line. Licensing trends within walking distance roughly reflected those of the rest of the city. (Jonah Smith/NYT Institute)