Coulda Woulda Shoulda Part Two: Data

(Feel free to skim if the first part of this post seems boring or condescending. The precise relation between anecdote and data is somewhat a side issue to this post; the meat is something a little meatier.)

Sniffy people who find themselves in arguments sometimes sniff, “data is not the plural of anecdote.” By this, they mean to discredit an anecdote supporting a claim they oppose, by suggesting that the anecdote is some kind of non-representative outlier.

They may be right about that, but they are mistaken when they say that data is not the plural of anecdote. All data consists, in fact, of a plurality of anecdotes. It must be admitted that not just any collection of anecdotes constitutes data. Data is the plural of anecdote in the special case where all the anecdotes in question share a data model.

Consider this anecdote:

From June of 2003 to March of 2007, Gertrude Toynbee resided at 1234 West Peacock Street in Cincinnati, Ohio.

That anecdote contains:

  • Year and Month of start of residence
  • Year and Month of end of residence
  • Name of resident
  • Street Address of residence
  • City of residence
  • State of residence

You can say the data model for this anecdote consists of those specific bits of information. A database is a system for studying the relations between anecdotes, when each anecdote contains the exact same kind of specific information.
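
For the programmers in the audience, here is a minimal sketch of that data model as a Python record type. The class name `Residence`, the field names, and the `months` helper are my own inventions for illustration; the only thing taken from the anecdote is the six kinds of information it contains.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Residence:
    """One anecdote: who lived where, from when to when."""
    resident_name: str
    street_address: str
    city: str
    state: str
    start: Tuple[int, int]  # (year, month) of start of residence
    end: Tuple[int, int]    # (year, month) of end of residence

    def months(self) -> int:
        """Length of the residence in months, counting both the first and last month."""
        (start_year, start_month), (end_year, end_month) = self.start, self.end
        return (end_year - start_year) * 12 + (end_month - start_month) + 1


# The Gertrude Toynbee anecdote, expressed in the model.
gertrude = Residence(
    resident_name="Gertrude Toynbee",
    street_address="1234 West Peacock Street",
    city="Cincinnati",
    state="Ohio",
    start=(2003, 6),
    end=(2007, 3),
)
```

Every anecdote that fits this shape can sit in the same table as Gertrude's, which is the whole point.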

One anecdote isn’t all that interesting, but suppose you had a database containing this information for all residences, all residents, and all terms of residence. There are endless interesting questions that could be easily queried of such a database. How many people moved from New York City to Philadelphia in 2011? How many congressional representatives should the state of Nevada have? What percentage of arrivals in Florida in 2013 came from each other state? What was the total population of Wink, Texas for all months and years?

You could also answer questions ranging from the intrusive to the bizarre. List all places and times of residence for Gertrude Toynbee! List all years and months in which a person with the initials “G.T.” lived in the 1200 block of West Peacock Street!
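
To make the "easily queried" claim concrete, here is a rough sketch using Python's built-in `sqlite3` module. The table layout, the year-times-100-plus-month date encoding, the function names, and every row except Gertrude's Cincinnati residence are assumptions invented for the example, not real records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dates are stored as year * 100 + month (e.g. 200306), an assumed encoding
# that keeps month-level comparisons simple.
conn.execute(
    "CREATE TABLE residence ("
    "name TEXT, street TEXT, city TEXT, state TEXT, "
    "start_ym INTEGER, end_ym INTEGER)"
)
rows = [
    ("Gertrude Toynbee", "1234 West Peacock Street", "Cincinnati", "Ohio", 200306, 200703),
    # The two rows below are invented purely so the queries have something to find.
    ("Gertrude Toynbee", "9 Example Lane", "Wink", "Texas", 200704, 201012),
    ("Sample Person", "10 Example Lane", "Wink", "Texas", 200901, 200912),
]
conn.executemany("INSERT INTO residence VALUES (?, ?, ?, ?, ?, ?)", rows)


def population(city, state, ym):
    """Count distinct residents of a city during one year-month."""
    cur = conn.execute(
        "SELECT COUNT(DISTINCT name) FROM residence "
        "WHERE city = ? AND state = ? AND start_ym <= ? AND end_ym >= ?",
        (city, state, ym, ym),
    )
    return cur.fetchone()[0]


def history(name):
    """List all places and times of residence for one person, oldest first."""
    cur = conn.execute(
        "SELECT city, state, start_ym, end_ym FROM residence "
        "WHERE name = ? ORDER BY start_ym",
        (name,),
    )
    return cur.fetchall()
```

Calling `population("Wink", "Texas", 200906)` answers the Wink question for one month, and `history("Gertrude Toynbee")` is the intrusive one.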

Now suppose you had a collection of anecdotes that included not only the above information, but also the birthdate, sex and marital status of each resident at the time of residency. Surely you could make some fascinating sociological studies by querying all that data. But this new information comes with a penalty: you are getting your anecdotes from a variety of reporters and they don’t all report the same way. Some don’t include the name of the resident. Some don’t include the dates of residence. Some don’t include the city of residence. And some conform to the original data model and don’t report age, sex or marital status.

Now you have just as much information as you had before, maybe more. But you actually have less data, at least for some questions. You can’t determine the exact relation between marital status and living in Gary, Indiana because some of your Gary, Indiana information doesn’t contain marital status, and vice versa. Data professionals can compensate somewhat for data which is incomplete in this way; they can give you an answer and then give a pretty clear idea how likely the answer is to be correct. Another approach is to take the incompatible data sets and put them in different databases. Now you have good confidence in your answers, but the only answers you can get will be local to that specific data set. There is no one database that can answer all the questions you might have with certainty.
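
Here is a small sketch of how more information can turn into less data. The four reports are invented, `None` stands for a field the reporter left out, and the helper name `usable` is hypothetical.

```python
# Invented reports from different reporters; None marks a field not reported.
reports = [
    {"city": "Gary", "state": "Indiana", "marital_status": "married"},
    {"city": "Gary", "state": "Indiana", "marital_status": None},
    {"city": None, "state": "Indiana", "marital_status": "single"},
    {"city": "Gary", "state": "Indiana", "marital_status": "single"},
]


def usable(records, required_fields):
    """Keep only the records that report every field a question needs."""
    return [
        r for r in records
        if all(r.get(name) is not None for name in required_fields)
    ]


# A question relating marital status to city needs both fields present.
complete = usable(reports, ["city", "marital_status"])
coverage = len(complete) / len(reports)
```

Four anecdotes came in, but only the two in `complete` count as data for the marital-status-in-Gary question. The other two are information without being data.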

That’s the situation we face with Xi Jinping’s Deadly Inscrutable Chinese Virus. We have plenty of information, but almost no data. Available data sets are small and local and can’t be used to answer larger questions. Reporting is wildly inconsistent; imagine if reporting parties didn’t even agree on the definition of a street address or marital status. Publicly available anecdotes paint a very confusing, contradictory picture.

The result is that there are fundamental questions we can’t answer in a general, meaningful way. Public policy is, therefore, largely a matter of guesswork. And some makers of public policy are guessing in their own favor at the expense of the public.

The Centers for Disease Control could have anticipated and mitigated this situation. It was a given that, sooner or later, whether by zoonotic serendipity or enemy action, a new pathogen would arrive on our shores, infectious and deadly enough to cause trouble and pain on an historical scale. Some of the more likely candidates would be in the categories of respiratory syndromes and hemorrhagic fevers. There is no reason the CDC couldn’t have wargamed these scenarios in advance. They certainly had the budget.

Suppose someone at the CDC had said, “In the event of an outbreak, we will want data. Standardized, uniform reporting is the key to useable data. CDC is the only institution that can standardize reporting on a national scale. Let’s build a reporting database in advance, on the scenario that there will be, oh I don’t know, a respiratory syndrome.”

When the outbreak occurred, that database could have gone online with a data model tailored to respiratory syndromes. All medical systems would be obliged to report to this database any case where the novel pathogen was suspected. The data model could have been something like:

  • Age and sex of patient
  • Time/Date of symptoms
  • Time/Date of hospital admission
  • Time/Date of Discharge/Decease
  • Health status at discharge
  • Virus/antibody test results
  • Medical Interventions ordered, what and when
  • Best guess as to time/date/nature of exposure
  • Recent travel history
  • Preexisting conditions / pertinent medical history
  • Genetic markers
  • Socioeconomic status
  • Home life factors (Population Density, Air quality, etc. (ranges))
  • Occupational Factors (Public-facing, Instructor/Lecturer, Office, Outdoor work, Commute time and method)
  • etc. etc.
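
A sliver of that list could be sketched in code like this. To be clear, this is not any real CDC schema: the class name `CaseReport`, the field names, and the sanity checks in `problems` are all assumptions made up for illustration, and most of the bullets above are left out to keep it short.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class CaseReport:
    """A hypothetical standardized report for one suspected case."""
    age: int
    sex: str
    symptom_onset: Optional[datetime] = None
    admitted: Optional[datetime] = None
    discharged_or_deceased: Optional[datetime] = None
    status_at_discharge: Optional[str] = None
    test_results: List[str] = field(default_factory=list)
    preexisting_conditions: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of obvious inconsistencies; empty means none found."""
        issues = []
        if not 0 <= self.age <= 130:
            issues.append("age out of range")
        if (self.admitted is not None
                and self.discharged_or_deceased is not None
                and self.admitted > self.discharged_or_deceased):
            issues.append("admitted after discharge")
        if self.discharged_or_deceased is not None and not self.status_at_discharge:
            issues.append("missing status at discharge")
        return issues


# Two made-up reports: one consistent, one with its dates reversed.
good = CaseReport(
    age=61, sex="F",
    admitted=datetime(2020, 3, 10, 14, 0),
    discharged_or_deceased=datetime(2020, 3, 18, 9, 30),
    status_at_discharge="recovered",
)
bad = CaseReport(
    age=45, sex="M",
    admitted=datetime(2020, 3, 20, 8, 0),
    discharged_or_deceased=datetime(2020, 3, 12, 8, 0),
    status_at_discharge="recovered",
)
```

The point of a shared model like this is that every reporter fills in the same fields the same way, and the database can bounce a report such as `bad` at the door instead of quietly absorbing it.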

Now obviously some of this stuff is a little fudgy and not all the data you’d get would be accurate. But it would be tremendously powerful to have this tool, and CDC could have foreseen the need for it.

Here’s why they didn’t: it would have meant breaking the glass.

Early on in this outbreak, there was some public contemplation of the idea of creating a central reporting database under federal auspices. The main objection to the idea was that such a database would compromise individual medical privacy. That’s true; it almost certainly would. And such compromises would almost certainly violate the Health Insurance Portability and Accountability Act.

Nevertheless, it would almost certainly have been the right thing to do. Better to violate the privacy of a few, than to give free rein to havoc among the whole public.

It seems so obvious, but it is not reasonable to expect the CDC to do the right thing if it involves breaking the law, unless they are indemnified for breaking the law in a time of crisis. That indemnification wasn’t in the cards. Apparently nobody in a position to do anything about it ever envisioned a scenario where indemnification would be a vital tool and incentive to public service.