

  • March 28, 2012

    Big data is not (yet) big enough - part one

    The promise of timely and reliable evidence in sufficient quantities to statistically ‘power' health care decisions depends on the quality of the data, on interoperable ontological mapping of concepts (so that “apples” really are “apples”), and on the accessibility of the data on which to base discoveries and conclusions. All three aspects of health care Big Data have been improving rapidly, making that promise real. There is no doubt that detecting, identifying, and performing analytical, descriptive-statistics procedures must inform and define the best matches, optimal choices, improvement actions, and predictive models.

    The key issue to be addressed in health and health care, though, is the same one faced by other industries and domains: avoid being overwhelmed by petabyte-scale data, and find the best, ethically appropriate ways to obtain and apply actionable knowledge from it. If only the “operational” perspective is applied (see graphic above), the risk is remaining at a superficial level, whipsawed by ephemeral “macro” features that may be more apparent than of deep physicochemical, clinical, or operational relevance. And if only “analysis” is pursued, the risk is obtaining a very detailed numerical characterization of the dataset without an adequate understanding of (a) how to interpret those figures in a statistically valid way or (b) the practical deployment details and the social and ethical defensibility of the interpretation and contemplated application of the knowledge.

    “At the petabyte scale, information is not a matter of simple three or four axes and drilling-down, but of dimensionally-agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. Petabyte scale forces us to view data mathematically first and establish a ‘context'—a ‘meaning' or codeset for it—later. Google conquered the advertising world with nothing more than applied mathematics. Google didn't pretend to know anything about the culture and conventions of advertising—it just assumed that better data, with better analytic tools, and an algorithm to derive ‘meaning' (PageRank) would win the day. And Google was right.” --Chris Anderson, ‘The End of Theory,' Wired (2008)

    These two risks are tied to the core challenge of extracting actionable knowledge from the data: providing a description good enough to inform the definition of improvement actions. Where Big Data is concerned, the health research communities currently seem preoccupied mostly with the “identification/description” pole, and there are already quite a few quantitative methods for processing safety and comparative-effectiveness performance data.

    On the other hand, research, analyses, and real-world practical applications are beginning to appear that make productive use of the petabytes of health data now accumulating. These are clearly innovative and beneficial, to the public and to private organizations and individuals. Among the possibilities:

    • Predict future risks and adverse health events/statuses, with enough lead time to enable interventions to prevent the event/status from materializing
    • Automatically discover multi-variable patterns that implicitly constitute new ‘concepts,' including features that denote similarities that form a basis for personalized medicine decision-making
    • Quickly find a large anonymized set of other persons whose array of health-related attributes (and medical history, and outcomes subsequent to treatment) closely resembles the array of facts that have materialized so far for me or for a family member, so that the likely consequences of options I am (we are) now considering can be objectively evaluated
    • Optimize health plan benefits design and transparent pricing of policies for fairer and more accountable (and cheaper, or better value) coverages for more people
    • Optimize procedures and processes, to deliver superior clinical outcomes, operational efficiency, and financial performance
    • Simulate the outcomes of alternative contemplated changes in staffing or processes, before implementing them
    • Discover new safe and effective diagnostics and therapeutics (and label-expansion covering new uses of existing ones), by mining outcomes data for products (and combinations of concomitant uses of two or more products) that have a similar mechanism-of-action and that are already available in the market
    • Detect and forecast public health trends and outbreaks
    • Measure the effectiveness of procedures and health policies in communities or diagnosis-oriented groupings of people
    • Automatically find patients who meet inclusion-exclusion criteria to be eligible to participate in clinical trials, so that research can reach reliable conclusions and decisions more quickly and respond to unmet health needs and reduce one bottleneck in new-product development
    • Quantitatively design clinical trial inclusion-exclusion criteria to enable more safe and effective therapeutic products to successfully complete the clinical development and regulatory approval cycles, more quickly and with lower risk and cost
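    The similar-patients idea in the list above can be sketched in a few lines. This is only a toy nearest-neighbor lookup over made-up, anonymized attribute vectors; the attributes, values, and identifiers are invented for illustration and do not reflect any real schema:

```python
import math

# Toy anonymized records: (patient_id, attribute vector).
# Attributes here are purely illustrative, e.g.
# (age_decile, bmi, systolic_bp / 10, n_comorbidities).
cohort = [
    ("p001", (6, 31.0, 14.2, 3)),
    ("p002", (4, 24.5, 11.8, 0)),
    ("p003", (6, 29.8, 14.0, 2)),
    ("p004", (8, 27.1, 15.5, 4)),
]

def distance(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, records, k=2):
    """Return the k records whose attributes are closest to the query."""
    return sorted(records, key=lambda r: distance(query, r[1]))[:k]

# A new patient's facts so far; find the closest historical cases.
me = (6, 30.5, 14.1, 3)
print([pid for pid, _ in most_similar(me, cohort)])  # → ['p001', 'p003']
```

    A production system would of course weight and normalize attributes, handle missing values, and operate over millions of records, but the underlying "find cases like this one" operation is the same.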

    But while noting these benefits and emerging Big Data-based applications, we should also look more closely at, and ask hard questions about, the data, the factors that contribute to its creation, and how those factors affect the answers that come from analyzing it. Are Big Data repositories of health care-related tweets really a sound and reliably generalizable measure of consumer sentiment, if the decision to tweet (about a doctor, a provider institution, a disease, a treatment, a side-effect, or an outcome of a treatment) isn't random and is geographically and socio-economically biased? Who is online, how do they differ from the offline folks, and how do those differences skew the inferences that can be drawn from the data?
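    The sampling concern can be made concrete with a toy simulation (all numbers are invented for illustration): suppose 30% of a population is dissatisfied, but dissatisfied people are three times as likely to tweet. The "sentiment" observed in the tweet corpus then substantially overstates dissatisfaction:

```python
import random

random.seed(42)

# Invented population: 30% dissatisfied (1), 70% satisfied (0).
population = [1] * 30_000 + [0] * 70_000

def tweets(person):
    """Tweeting is not random: dissatisfied people tweet 3x as often."""
    p = 0.15 if person == 1 else 0.05
    return random.random() < p

sample = [person for person in population if tweets(person)]

true_rate = sum(population) / len(population)
observed = sum(sample) / len(sample)
print(f"true dissatisfaction: {true_rate:.2f}, observed in tweets: {observed:.2f}")
```

    With these made-up rates the tweeted sample shows roughly 56% dissatisfaction against a true rate of 30% — no amount of additional tweets corrects the gap, because the bias is in who tweets, not in how many tweets are collected.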

    Are scholarly refereed journal articles really the best “gold standard” of evidence anymore, if the studies published are mostly ones with positive results, involving only a few hundred or a few tens of thousands of subjects, when the Big Data alternative is relatively well-controlled, accuracy-validated, and statistically unbiased observational cohorts of cases and controls involving many millions rather than hundreds or a few thousands?
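    The scale argument is easy to quantify. For an adverse event with true rate p, the probability of observing at least one event among n subjects is 1 - (1 - p)^n, so a several-hundred-subject trial will very likely see none while a million-record cohort will almost certainly see many. The rate below is chosen purely for illustration:

```python
def p_at_least_one(p, n):
    """Probability of observing at least one event of rate p among n subjects."""
    return 1 - (1 - p) ** n

p = 1 / 10_000  # illustrative rare adverse-event rate
for n in (500, 5_000, 1_000_000):
    print(f"n={n:>9,}: P(>=1 event) = {p_at_least_one(p, n):.3f}")
```

    At n = 500 the probability of seeing even one such event is under 5%, while at n = 1,000,000 it is essentially certain — which is the quantitative case for large observational cohorts in rare-event safety questions.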

    Are health data warehouses that are fed only by a set of relatively financially well-off, metropolitan health care institutions adequately and without bias projectable (see Campbell, pp. 230, 290) to represent (and contribute to decision-making for) the non-participating rural institutions, or the public charity-care institutions whose financial performance is marginal?

    Are Watson's Jeopardy-style, page-rank-based, rapidly-rung-in answers adequate for low-probability-event, safety-oriented questions, or for questions involving new discoveries and innovations where there is as yet no large corpus of pre-existing evidence “needles” for the Big Data Watson-type algorithms to identify in the “haystack”?

    Today there is scant discussion of questions and risks like these. Perhaps the shortage of attention so far to limitations of “Big Data” research is just a predictable, breathless, early “best thing since sliced bread” phase of the technology innovation cycle. But that doesn't mean we should relax our scientific standards just because petabyte-scale data conveniently now exist.

    Part two of this discussion is available here.

    Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows and is responsible for innovations in decision support and very-large-scale datamining. McNair joined Cerner in 1986, first as VP of Cerner's Knowledge Systems engineering department; then as VP of Regulatory Affairs; then as General Manager for Cerner's Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner's clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert®, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.
