We created Project Tycho at the University of Pittsburgh Graduate School of Public Health to promote open access to public health data. Supported by the Bill & Melinda Gates Foundation, our vision for Project Tycho is to help launch a movement for open access to disaggregated (spatially and temporally granular) public health data from around the world. Starting “at home”, we digitized the entire history of weekly US surveillance data from 1888 to 2011 into a computable and easily accessible format. Reports from the US National Notifiable Diseases Surveillance System (NNDSS) have been published every week in the Morbidity and Mortality Weekly Report (MMWR) and its various precursors dating back to 1888, but previously this rich data source was available only in printed hard copy or in non-downloadable formats. By providing open access to this entire dataset, we hope to advance the use of public health data for research and decision support, and to help set a new global norm for data sharing in public health.
We digitized and made accessible data from a total of 6,300 weekly reports (containing 35,000 tables) published between 1888 and 2011. Weekly reports were obtained from various online libraries, including PubMed Central, the HathiTrust Digital Library and the MMWR web pages, and from bound library copies. For data entry we explored state-of-the-art optical character recognition methods, but these did not produce satisfactory results because of heterogeneities in report formats and print quality, so for this first stage of our work we resorted to manual data entry. Data entry was performed by Digital Divide Data, a social enterprise that provides IT educational opportunities to disadvantaged students through work-study programs in Laos and Cambodia. Each weekly report contained multiple notifiable disease tables, all of which were double-entered; this took about 200 million keystrokes and generated 35,000 separate spreadsheets. All of these data were evaluated and merged into the Project Tycho database using custom algorithms that we devised.
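To illustrate the general idea behind double entry, here is a minimal sketch in Python. It is not the custom algorithm used for Project Tycho, which is not described on this page; it only shows one common way to use two independently keyed copies of a table, by accepting cells where both copies agree and flagging the rest for review. The function name `reconcile_double_entry` and the sample values are hypothetical.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]


def reconcile_double_entry(
    first: Table, second: Table
) -> Tuple[List[List[Optional[str]]], List[Tuple[int, int]]]:
    """Compare two independently keyed copies of the same table.

    Returns the merged table (None where the copies disagree) and the
    (row, column) positions of every disagreement.
    """
    if len(first) != len(second):
        raise ValueError("copies have different numbers of rows")

    merged: List[List[Optional[str]]] = []
    conflicts: List[Tuple[int, int]] = []
    for r, (row_a, row_b) in enumerate(zip(first, second)):
        if len(row_a) != len(row_b):
            raise ValueError(f"row {r} has different lengths in the two copies")
        merged_row: List[Optional[str]] = []
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            a, b = a.strip(), b.strip()
            if a == b:
                merged_row.append(a)      # both typists agree: accept the value
            else:
                merged_row.append(None)   # disagreement: leave blank for review
                conflicts.append((r, c))
        merged.append(merged_row)
    return merged, conflicts


# Made-up example values, for illustration only.
entry_a = [["State A", "12", "3"], ["State B", "7", "0"]]
entry_b = [["State A", "12", "3"], ["State B", "1", "0"]]
merged, conflicts = reconcile_double_entry(entry_a, entry_b)
print(conflicts)  # [(1, 1)]
```

In this sketch a flagged cell would be checked against the scanned original rather than resolved automatically, since two copies alone cannot say which value is correct.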
The current version of the Project Tycho database contains all the consistent and standardized data that are available in the entire US data set. Reporting formats varied considerably between 1888 and 2011, so processing continues for the few especially problematic parts of the data that could not yet be included. Since 1888, a total of 56 diseases and 70 subcategories have been reported at one time or another from 52 state-level jurisdictions (including Washington DC and New York City), 6 overseas territories, 2,828 counties, and about 3,000 cities or towns. The combined data represent about 90 million reported cases and 4 million reported deaths. Every data point can be traced with a single click to a PDF of the original historical reference document.
A manuscript that provides all the details about Project Tycho has been submitted for publication, and it will be made available through this website when it is published.