For years, the federal government, states, and some cities have enthusiastically made vast troves of data open to the public. Acres of paper records on demographics, public health, traffic patterns, energy consumption, family incomes and many other topics have been digitized and posted on the Web.
This abundance of data can be a gold mine for discovery and insights, but finding the nuggets can be arduous, requiring special skills.
A project that came out of the MIT Media Lab on April 4 seeks to ease that challenge and to make the value of government data available to a wider audience. The project, called Data USA, bills itself as “the most comprehensive visualization of U.S. public data.” It is free, and its software code is open source, meaning that developers can build custom applications by adding other data.
Cesar A. Hidalgo, an assistant professor of media arts and sciences at the MIT Media Lab who led the development of Data USA, said the website was devised to “transform data into stories.” Those stories are typically presented as graphics, charts and written summaries. The Media Lab worked with the consulting and auditing firm Deloitte, which provided funding and expertise on how people use government data sets in business and for research. “The goal was to organize and visualize data in a way that a lot of people think about it,” said Patricia Buckley, director of economic policy and analysis at Deloitte and a former senior economist at the Commerce Department.
Type “New York” into the Data USA search box, and a drop-down menu presents choices – the city, the metropolitan area, the state and other options. Select the city, and the page displays an aerial shot of Manhattan with three basic statistics: population (8.49 million), median household income ($52,996) and median age (35.8).
Lower on the page are six icons for related subject categories, including economy, demographics and education. If you click on demographics, one of the data stories appears, based largely on data from the American Community Survey of the U.S. Census Bureau.
Using colorful graphics and short sentences, it shows the median age of foreign-born residents of New York (44.7) and of residents born in the United States (28.6); the most common countries of origin for immigrants (the Dominican Republic, China and Mexico); and the percentage of residents who are U.S. citizens (82.8 percent, compared with a national average of 93 percent).
Data USA features a selection of data results on its home page. They include the gender wage gap in Connecticut; the racial breakdown of poverty in Flint, Michigan; the wages of physicians and surgeons across the United States; and the institutions that award the most computer science degrees.
Hal R. Varian, chief economist of Google, who has no connection to Data USA, called the site “very informative and aesthetically pleasing.” The fact that the government is making so much data publicly available, he added, is fueling creative work like Data USA.
Data USA embodies an approach to data analysis that will most likely become increasingly common, said Kris Hammond, a computer science professor at Northwestern University. The site makes assumptions about its users and programs those assumptions into its software, he said.
“It is driven by the idea that we can actually figure out what a user is going to want to know when they are looking at a data set,” Hammond said. Data scientists, he said, often bristle when such limitations are put into the tools they use. But they are the data world’s power users, and power users are a limited market, said Hammond, who is also chief scientist at Narrative Science, a startup that makes software to interpret data.