Even as citizens generate more data than ever before, most cities haven’t taken full advantage of that information flow to improve services and become more efficient. “Historically, cities have been moving in analog, trying to measure things with imperfect data in information-poor environments,” says Harvard Business School Assistant Professor Michael Luca.
That may be about to change. Thanks to the Internet, mobile apps, and a wide range of useful programs online, residents add to the pool of information with every keystroke they make on their computer or smartphone. In addition, cities are expanding their own data-gathering and crunching capabilities through advancements like sensor networks and sophisticated modeling software.
What if cities could make use of all that data to better give residents what they need—for example, using Google Street View to guide economic development, or Yelp restaurant reviews to target hygiene inspections?
“There is all sorts of data that is coming in now,” says Luca, “and if you use it carefully you could revamp the way that every policy is evaluated and every operation is done.”
In a new working paper, “Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life,” Luca and three collaborators argue that cities have never been better positioned to take advantage of the vast amounts of data being generated in the world. The key is figuring out how to use it. In the paper, Luca, Edward L. Glaeser and Scott Duke Kominers (PhDBE 2011) of Harvard University, and PhD student Nikhil Naik of the Massachusetts Institute of Technology Media Lab cite three trends that leave cities particularly well placed to exploit big data.
First, the open data movement has led cities to digitize more of their own information, putting everything from tax records to public health inspection scores online. “They take a dataset that used to be in an obscure database or on paper, and now it’s available for the public to innovate on,” says Luca.
Second, citizens are generating what Luca calls “digital exhaust,” data produced online in the course of daily activities, which cities could capture for clues to residents’ behavior. “Yelp is used to help people search for restaurants, not to tell cities where to go to inspect, but it could be used for that purpose,” says Luca. Similarly, Google searches in different geographies could give policymakers key insight into what their citizens care about.
Finally, private corporations are more willing than ever to share their own internal data with government, offering insight, for example, into the health behavior of workers in different neighborhoods.
“There is so much data now, it’s exhilarating—and frightening,” says Luca. “Cities need to think carefully about what data to use, how to use it, and when not to use it.”
Taming the data flow
To get a handle on all this data and better predict the outcomes of policies, Luca believes cities need to develop algorithms that combine their own data with online information.
In past work, for example, colleague Nikhil Naik used machine learning to analyze images. Using these techniques, he took some 3,600 images of New York City blocks, obtained from the Google Street View Image API, and “taught” the computer to recognize various features, including streets, sidewalks, buildings, and trees.
In their study, the Luca team linked Naik’s images with household income levels for some 2,400 blocks, provided by the city online. “The incomes act as labels for the images, and then the machine learns the association between how the features relate to those incomes,” says Naik.
After a certain amount of “training,” the computer uses the algorithm it generates to predict income levels from images for which data is unavailable. When the researchers crunched those numbers, they determined that the image analysis gave a much more accurate prediction of income than other measures. In statistical terms, the images accounted for 77 percent of the variation in income. By contrast, other measures such as race and education accounted for only 25 percent of the variation.
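The mechanics can be illustrated with a short sketch. Everything below is illustrative: the synthetic feature matrix stands in for image-derived features like those Naik extracted, and a simple Ridge regression stands in for whatever model the team actually used.

```python
# Minimal sketch: predicting block-level income from street-view image
# features. The data and model here are assumptions for illustration;
# the paper's actual pipeline differs.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real data: one feature vector per city block (e.g., the
# prevalence of streets, sidewalks, buildings, and trees in its images),
# plus each block's household income as the label.
X = rng.normal(size=(2400, 64))  # ~2,400 blocks, 64 image features
y = X[:, :8].sum(axis=1) + rng.normal(scale=0.5, size=2400)  # synthetic income

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training": the model learns how image features relate to income.
model = Ridge(alpha=1.0).fit(X_train, y_train)

# "Prediction": estimate income for blocks the model has never seen, then
# report R^2 -- the share of income variation the images explain (77
# percent in the study, versus 25 percent for race and education).
print("R^2 on held-out blocks:", r2_score(y_test, model.predict(X_test)))
```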
Even more interesting, the Luca team was able to take the algorithm developed in New York and apply it to street view images in Boston, finding that it accurately predicted income there as well, capturing 86 percent of the variation. By using this kind of algorithm, Luca says, cities could track the effects of economic development initiatives block by block, in real time, without having to wait for annual income surveys.
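That portability test amounts to fitting on one city and scoring on the other without refitting. A hedged sketch, assuming hypothetical feature matrices and income labels built the same way for both cities:

```python
# Sketch of the cross-city transfer test described above. X_ny/y_ny and
# X_boston/y_boston are hypothetical arrays of image features and income
# labels, constructed identically for each city.
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def transfer_r2(X_ny, y_ny, X_boston, y_boston):
    """Train on New York blocks, then score on Boston blocks as-is."""
    model = Ridge(alpha=1.0).fit(X_ny, y_ny)
    return r2_score(y_boston, model.predict(X_boston))
```

The same trained model, in other words, can be pointed at any city with comparable imagery.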
For instance, if city regulators wanted to see the impact of licensing three new businesses on a particular block, they could monitor changes in Google Street View imagery, cross-referencing that data with online reviews in the neighborhood and housing estimates from real-estate sites such as Zillow.
“You could then train an algorithm to estimate the quality of life in the neighborhood on an ongoing basis and see how that quality changes over time,” says Luca. “This can be especially valuable for cities that have detailed survey data coming in every few years, but are interested in policy changes that happen on a much more frequent basis.”
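One way to picture such a tool is a model trained on the years when detailed survey data exist, then fed fresh signals in between. The inputs and the linear model below are illustrative assumptions, not the researchers’ actual specification:

```python
# Hedged sketch of an ongoing, block-level quality-of-life estimate that
# blends several always-on signals. Signals and model are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_quality_model(street_view_feats, review_sentiment,
                      housing_estimates, survey_scores):
    """Fit on the periods when detailed survey data exist.

    Each argument is a per-block array from the survey years; the model
    learns how cheap, continuously updated signals track the expensive
    survey measure.
    """
    X = np.column_stack([street_view_feats, review_sentiment,
                         housing_estimates])
    return LinearRegression().fit(X, survey_scores)

# Between surveys, feed the fitted model this month's signals to get an
# up-to-date estimate for each block:
#   quality_now = model.predict(np.column_stack([feats, sentiment, prices]))
```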
Yelp used to target dirty restaurants
In his own work, Luca has applied similar techniques to textual analysis of Yelp reviews to help cities determine which restaurants to inspect. “Right now, cities do random inspections,” he says. If you theoretically divide restaurants in half by hygiene score, with the top half being clean and the bottom half dirty, then “they have a 50 percent chance of finding a dirty place.”
In an effort to improve that percentage, Luca trained the computer to analyze Yelp reviews, comparing specific combinations of words with the number of violations restaurants received to teach the computer to recognize factors that might indicate a dirty eatery. When he then applied the algorithm to a second group of restaurants, the chance of finding a “dirty” restaurant increased to 80 percent.
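A minimal version of this idea can be sketched with off-the-shelf text tools; the vectorizer and classifier here are stand-ins, not Luca’s actual pipeline:

```python
# Sketch: flagging likely-dirty restaurants from review text, so
# inspectors can work down a risk-ranked list instead of drawing at random.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins: concatenated Yelp reviews per restaurant, labeled by
# whether past inspections found violations.
reviews = [
    "amazing pasta, spotless open kitchen, friendly staff",
    "saw a roach near the counter, tables were sticky and gross",
    "clean dining room, fresh ingredients, would return",
    "smelled like old grease, dirty bathroom, never again",
]
had_violations = [0, 1, 0, 1]

# Word and word-pair frequencies become features; the classifier learns
# which combinations of words move with violation counts.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
).fit(reviews, had_violations)

# Rank unseen restaurants by predicted risk and inspect from the top.
new_reviews = ["grimy floor and a weird smell", "immaculate sushi bar"]
print(model.predict_proba(new_reviews)[:, 1])  # higher = inspect sooner
```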
In an upcoming paper, Luca, Glaeser, Kominers, and Harvard doctoral candidate Andrew Hillis further honed the algorithm to develop a tool for the City of Boston. After city officials identified the health violations of most concern, the researchers and the city ran an open tournament to crowdsource the best Boston-specific scoring algorithm. More than 700 people entered the competition.
The research team tested algorithms from 23 finalists; the tournament was won by a statistician from London. “Using a Boston-specific algorithm, we found that you could cut the number of inspectors by 40 percent and find the same number of violations,” says Luca. In other words, cities could be much more efficient with existing resources—producing the same results with less money, or improving their performance without additional expenditures. The researchers are currently working with the city to further develop and test the algorithm, with the goal of implementing a version for actual use in the field.
In principle, many areas of city operations could be made more efficient through this type of approach, says Luca. Using Google searches keyed to specific geographies, government analysts could determine what kinds of jobs people were searching for, and use that information over time to forecast unemployment and decide what kinds of training programs to develop.
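As a toy illustration, a “nowcasting” model of this kind might regress the unemployment rate on a job-search volume index. The numbers below are invented; in practice the search data might come from a source like Google Trends for each metro area:

```python
# Hedged sketch: relating job-search activity to unemployment so a city
# can estimate the current rate before official figures arrive.
import numpy as np
from sklearn.linear_model import LinearRegression

# Monthly search-volume index for job-seeking terms in one city, paired
# with that month's measured unemployment rate (all values invented).
search_volume = np.array([40, 45, 55, 70, 85, 90, 80, 65]).reshape(-1, 1)
unemployment = np.array([4.1, 4.3, 4.8, 5.6, 6.4, 6.7, 6.1, 5.2])

model = LinearRegression().fit(search_volume, unemployment)

# Nowcast: estimate this month's rate from search activity alone.
print(model.predict(np.array([[75]])))
```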
And in the same way that the New York street view data was applied to Boston, policymakers could apply algorithms created in locations with rich data to those with poor data to better guide policy. Naik is currently working on a way to apply the street view technology in rural areas, such as villages in Indonesia, where it is costly and difficult to perform accurate surveys.
Results may vary
Of course, the application of such data comes with significant caveats. An algorithm trained on the features found in New York City may not be readily transferable to small Indonesian villages, which display radically different types of features.
It will take hard work and experimentation to produce algorithms that can make accurate predictions in diverse environments.
“People have this notion that big data can solve every problem,” warns Luca. “We fundamentally think it’s about pairing the right sets of questions and tools with the right datasets.”
With some creativity and ingenuity, however, big data could provide cities with street-by-street information to aid officials in improving urban quality of life.
“You could know street by street how people are doing,” says Luca. “Then operations could step in and say, ‘How could we do things better?’”