December 19 2011
The Sixth Estate in the cloud, intellectual property and the limits of collaboration
“Many IP [intellectual property] scholars have emphasized the important benefits of openness with respect to digital media ... In a nutshell, the dominant idea is that, in the digital age, the best IP policy is a minimalist IP policy. This is wrong. Robust IP protection is in no way inconsistent with the promotion of a flourishing environment for digital media [and science]. Quite the contrary: IP rights are essential to this goal. As compared to the top-down, one-size-fits-all approach of IP minimalists, traditional strong IP protection encourages and facilitates a wide variety of approaches, without mandating or coercing any single approach.” Robert Merges, 2011
There is a book (Justifying Intellectual Property) published last June by Robert Merges of Harvard and Cal-Berkeley law schools that you will want to read. It is exciting, beautifully written, and has strong relevance to discoveries and inventions that arise from observational research involving re-use of health data, including ‘panel' or population-level data such as are stored in multi-institution EHR-derived data warehouses and HIE repositories. In the book, Merges defends the importance of strong intellectual property protections in the present populist, pervasively-interconnected, social-network-obsessed information age that is so enthralled with unfettered collaboration for collaboration's sake.
In that regard, there are some health informatics leaders who uncritically advocate “sending the [research] query to the [geographically-distributed] data.” From technology and access points of view—and compared to the alternative of centralizing everything—broadcasting database queries to the data wherever they may be located seems like a sensible thing to do. On the surface it sounds good. And it is sensible if you are sure of the provenance and quality of the data that's out there, and if the only thing you care about is getting an answer to your EHR or HIE database query. That may be the case if what you're doing is purely a matter of public health, say.
But what if your query—or a data mining series of queries, an ensemble of queries combined with statistical and mathematical analysis of the queries' results—produces a new discovery? What if it yields a valuable invention: intellectual property that reflects your skill and insights in designing the queries and analysis, as well as the value that inheres in the data resources you analyzed, which are owned by others and have been invested in and painstakingly curated and paid for by others over the years? What if your query may, either deliberately or inadvertently, reveal the “secret sauce” of the health care processes of the contributor providers whose activity gave rise to the data that your query is interrogating and analyzing?
In those situations, I believe you would agree to “send your query to the data” only if (A) all of the stakeholders' intellectual property rights are respected, or if (B) no IP or commercial uses or derivative works are anticipated to result from the query. Not just you as a stakeholder and the HIE operator as a stakeholder, but every stakeholder. A corollary is that providers and consumers are stakeholders, too, who would reasonably only consent to their data being warehoused and re-used and subjected to your queries if (A) their IP rights are respected, or (B) no IP emanates from the re-uses of their data assets.
Why did reading Robert Merges' book make me think about this? Because what is emerging with the rapid development of HIEs and other multi-contributor health data warehouses is, I believe, something we could call ‘The Sixth Estate,' and Merges's writing deals specifically with intellectual property in the Sixth Estate.
The Six Estates
- The First Estate is clergy—organized religion as a political authority.
- The Second Estate is nobility—aristocracy whose political power and wealth are inherited from generation to generation. This includes individuals and private organizations who command prodigious wealth, such that they can influence elections or cause the Third Estate to be governed in whatever ways they wish.
- The Third Estate consists of the common citizens in a democracy, plus elected and appointed officials in their government. This includes government health services agencies and health research councils (AHRQ, IOM, MRC, and the like). Indirectly, it also includes commercial health services providers and insurers and other health-related firms whose policies and coverage rules have pervasive scope (statewide or national), to which consumers often have few alternatives and little or no recourse, due to geographic market concentration. Through their lobbying and other political activity, non-media corporations are part of and exert their power through the Third Estate.
- The Fourth Estate is, essentially, official corporate and public-sector “mainstream” media and NGOs and the mainstream scientific establishment, whose traditional roles include serving as a check-and-balance on the Third Estate. For example, many hospitals and managed care organizations and health-related professional societies engage in standards-development, evaluations of best-practice and evidence-based medicine, and so forth. Through these activities, they critique—and lobby forcefully to revise—the policies and practices that are sanctioned by Third Estate organizations.
- The Fifth Estate is comprised of trustworthy critics of the Fourth Estate—ad hoc, mostly-noncommercial participants in networked social media, the blogosphere, and semantic web mash-ups—perhaps now including HIEs or other networked consortia.
- The Sixth Estate is yet another check-and-balance—performing evaluations of and critiquing all of the lower Estates. But it must also be recognized as a “meta-Estate” insofar as it can produce analytics and new discoveries from the corpus of data that arises from activity of the lower Estates.
In any capitalistic democracy, each 'Estate' consumes information regarding what is done/emitted/produced by the Estate immediately below it and produces something else from it—usually ongoing analysis or critical commentary that assesses the various consequences or results of the other Estate's policies and activities, and informs or advises or recommends things that should be done to change or improve them.
Each Estate holds and exercises some form of political power in society. Each Estate is associated with some type of property; each has some amount of financial and other resources. This is so, too, with organizations and individual participants in the Sixth Estate, such as formal federations of health data repositories and informal, loosely-affiliated or transient contributors to specific projects involving aggregation or mining of health data. The conventional meme about the Estates only focuses on the top-down flow of power, critiquing roles of each Estate on the one below it in the hierarchy. However, there are bottom-up and non-hierarchical interactions among these Estates, too.
What Robert Merges' book addresses is the role of government and the Law (the Third Estate) and its mechanisms to protect the intellectual property rights of members of all of the other Estates. This is increasingly important with peer-to-peer (P2P) and cloud-based, geographically dispersed data assets and the IP that resides in them or is derived from them.
In P2P information retrieval, a network of peer servers provides a (re-)search service collaboratively. When there is true parity among the participating peers, in terms of equal amounts of consuming and supplying resources and in terms of comparable exposure to risks and costs and rewards from the activity, everything is fair. Traditionally, most IT systems have employed a centralized architecture. Database transactions and queries were controlled by code that executed the database operations on data that were stored and managed in one central location. But increasingly, cloud- and grid-based and federated IT systems employ a geographically dispersed architecture in which database transactions and queries are executed in a distributed fashion, where the data are stored and managed by different organizations and the eventual uses to which the data are put are, in general, not known to those organizations.
Furthermore, with Map-Reduce or other mechanisms that segment and parallelize the distribution of queries, it is likely that the intentions of the entities who are executing the queries could not readily be ascertained by examining the logs of subsets of the grid- or cloud-based services that executed various ensemble segments' portions of the overall query.
So long as no confidential or identifiable data are revealed, that situation is not, in and of itself, a problem. It only becomes problematic when the entity who executes the query (or ensemble of queries, on an ensemble of data from different sources) later turns those results into valuable intellectual property and then unjustly declines to disclose that fact to the stakeholders who own the data assets that enabled the discovery or invention of the property, and declines to enter into a contract to share the license fees or royalty revenues or other cashflow and assets and property rights with those stakeholders.
Given that situation, in the future we will need capabilities that assign agents to services that execute queries upon data and ensembles of data, rather than assigning functions or tasks. And we will need capabilities to “log” or track who did what, which data contributed to which discoveries and patents and contracts, and so on. An essential feature of this approach is query-nonrepudiation and financial negotiation between orders and the resources on which they seek to be executed. Approved orders (queries) are given a budget (of disk I/Os; of CPU seconds; of memory; of network bandwidth; of other resources), and they operate as temporary cost-centers. The orders (queries) ask for processing bids from the cloud-based data services resources that are able to provision [portions of] what the orderer wants; the orders (queries) then accept bids from the least expensive, adequately timely, adequately extensive resources that meet their requirements. The resources that execute the orders (queries) operate as virtual revenue-centers, seeking to stay busy while maximizing the fees they earn from the orders they process. Management of each data service prioritizes the orderstream by varying the budgets that are granted to the approved orders. Orders that offer high budgets and favorable terms with regard to downstream intellectual property rights participation can outbid other orders. Orders that request access to rare, specialized data or that entail use-cases that have high potential commercial value must agree to prices and terms that are commensurate with the scarce supply of the resources and the high financial value of the uses to which the data are to be put.
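To make the bidding step concrete, here is a minimal sketch in Python of how an order might pick among bids. It is only an illustration under stated assumptions: the `Order` and `Bid` classes, their field names, the warehouse names, and the numbers are all hypothetical, and a real system would also have to handle IP terms, logging, and non-repudiation, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bid:
    resource: str     # hypothetical name of the data service making the bid
    price: float      # fee quoted for executing the order
    hours: float      # quoted turnaround time
    coverage: float   # fraction of the requested data the resource can supply


@dataclass(frozen=True)
class Order:
    requester: str
    budget: float        # the most the order is allowed to spend
    max_hours: float     # threshold for "adequately timely"
    min_coverage: float  # threshold for "adequately extensive"


def select_bid(order: Order, bids: list[Bid]) -> Optional[Bid]:
    """Return the least expensive bid that meets the order's requirements."""
    eligible = [
        b for b in bids
        if b.price <= order.budget
        and b.hours <= order.max_hours
        and b.coverage >= order.min_coverage
    ]
    # min() with default=None returns None when no bid qualifies.
    return min(eligible, key=lambda b: b.price, default=None)


order = Order(requester="study-team", budget=100.0, max_hours=24.0, min_coverage=0.8)
bids = [
    Bid("warehouse-a", price=40.0, hours=48.0, coverage=0.90),  # cheapest, but too slow
    Bid("warehouse-b", price=70.0, hours=12.0, coverage=0.85),
    Bid("warehouse-c", price=55.0, hours=20.0, coverage=0.60),  # too little coverage
    Bid("warehouse-d", price=90.0, hours=6.0, coverage=0.95),
]

winner = select_bid(order, bids)
print(winner.resource if winner else "no acceptable bid")  # prints "warehouse-b"
```

In this toy case the cheapest bid loses because it is too slow, which is the point of the "least expensive, adequately timely, adequately extensive" rule: price only decides among bids that already satisfy the order's requirements.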
Or, among the terms, maybe you will only agree to provide me with data if I accept your “Non-assertion of Patent” (NAP) clause, much as Microsoft has done with its OEMs who license Windows. I have to agree in advance not to sue you (or other contributors to a data ensemble) for patent infringement after you provide the data asset to me. No one has any reliable way of knowing the full scope of what others intend to do with the asset, in terms of generating new inventions, or reducing inventions to practice, or validating the practicality or usefulness of the inventions. But, to prevent any dispute in the future, I must pre-agree to your NAP clause.
Basically, the advocates of loosely-coupled P2P data mining applications operating on different platforms only talk about “research” or “public health” or other kinds of interoperating without any IP agreements in place. That sort of advocacy perpetuates the myth of science and medicine as priestly professions, pretending that they are not businesses and pretending that there is no intellectual property or any need for identifying and protecting stakeholders' rights in such property.
Yes, there are problems whose solution only comes about through open collaboration, or whose solution comes far quicker with collaboration than by restricted, proprietary efforts. Tim Gowers, a Cambridge University mathematician, Fields Medal winner, and founder in 2009 of the Polymath Project, comes to mind. But Gowers' project was not about applied mathematics that has immediate, commercial intellectual property value, like theorems pertaining to data compression or cryptography or NP-hard computability. It was instead a proof of a theorem in abstract mathematics, the density Hales-Jewett theorem.
By contrast, almost every discovery related to health has substantial commercial value, in addition to whatever public health or social and political significance it may have. What intellectual property rights might a health institution or a consumer have, in an invention that was based, in part, on the re-use of their data? Possibly none, or possibly some. Consider that the collection of data amounts to an asset that arose as a byproduct of other enterprise, such that the enterprise did not contemplate in advance the variety of subsequent potential purposes to which the byproducts might be put. The data are much like the “watch” of Richard Dawkins' “blind watchmaker.” The watchmaker does not need to intend explicitly or in advance that it is a watch that she/he is making, let alone this specific watch, for there to be a watch that is produced by the watchmaker's actions, or for the watchmaker to have a valid property interest in the work product that is protected under the law. The data are a work product in their own right, just as the watch is, and valuable as such in their original, intended use-cases. But now we have observational researchers analyzing an ensemble of such “watches,” and discovering and inventing new things for different use-cases based on that analysis.
Let's say a musician happens to improvise some music in live performance, extemporizing a new song. The musician consents for it to be recorded by the other party, and that other party does record it. Subsequently, a portion of what was recorded is put online, goes “viral” and is incorporated into several derivative works by other artists. But before the performance, the musician did not know whether she/he would improvise or create anything new at all, and certainly did not contemplate the possible derivative uses of that new thing, if some new tune did in fact get created. The new thing, the improvised tune, is a kind of byproduct of the musician's performance.
Most traditional, tangible “byproducts” (blood meal and gelatin from slaughterhouses; whey/casein from cheese manufacture; bran from milling; glycerol from biodiesel refining; molasses from sugar refining; asphalt from oil refining; gypsum from flue scrubbing) can be regarded as mere commodities: raw materials that can be used in the manufacture of other goods and processes. There is no expectation on the part of the manufacturer that the sale of the byproduct asset may yield a large and infinitely long stream of revenues for the party to whom it's sold. By contrast, data are not used-up in the course of their re-use; the data can be re-used arbitrarily many times.
Furthermore, unlike traditional, tangible byproducts which are fungible and do not reveal anything about the people or processes that produced them, data byproducts (A) are not all equal and (B) do reveal things about how they were produced. Consider when news breaks about relatively unknown people and the media turn to social networking sites such as Facebook to find photographs of the subject and other information about them. Posting photos on social media does not make them public domain, and copyright holders may have a claim for infringement for unauthorized use of the photos or certain sensitive, valuable information that they have posted. The same is true of [de-identified] EHR data that are uploaded into an HIE or a repository for observational research. And unlike traditional byproducts that are single-use commodities, used-up as they are used, data are infinitely re-usable. While they are “fresh enough,” their value is inexhaustible. Some data have a value that is “perishable” or time-dependent. But while they are sufficiently fresh to meet the requirements of a given use-case, the data can be used again and again.
With these things in mind, I suggest that population-level health data are not a commodified “byproduct” per se; instead, they are a non-commodity “coproduct” of health services delivery.
Licensing health data can be done in many ways. Some advocate GPL (GNU General Public License) “copyleft” licensing, as is used with Linux and a lot of open-source software. But the GPL has historically applied to executable procedural code. That means that, so long as you merely execute the code on your own premises (including cloud-based servers that you own or rent, running the code as software-as-a-service) and you do not redistribute the protected asset or any part of it, you are not in the position of granting copyleft GPL rights to anyone else.
However, for HIEs and other interoperable data warehouses, what we are talking about is instead declarative corpuses of data and epidemiologic or index or other derivative works that structure them and govern access to them. It is possible that the only thing you do is respond to queries that interrogate the existence or number of cases that meet a set of inclusion-exclusion selection criteria (as with clinical-trial planning case-finding apps). In such a situation, the information you return does not admit of any significant opportunities for intellectual-property-bearing re-use. In other words, some of the queries are simply establishing whether a hypothesis is supportable, or finding out whether the number of cases that exists in the responding datasets is sufficient to power a planned study, or finding out which investigational centers may be able to perform the planned study. To date, that is predominantly what i2b2 and other services do. Such queries are like nothing more than a prior-art search on a corpus of issued or published patents.
More commonly, though, you respond to queries with detailed row-and-column information from the database tables: the services that respond to the queries are thereby distributing protected assets or subassemblies of them. If the owners of any of the data subassemblies or ensembles have asserted copyleft protection, then those services are obliged to pass the copyleft rights through to the requestor for whom the query was performed.
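The difference between the two kinds of response can be sketched in a few lines of Python. This is a toy illustration, not a description of how i2b2 or any real HIE works: the `Record` class, the contributor names, the license labels, and the sample rows are all hypothetical, and whether a given license actually attaches to data is a legal question that code cannot settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str     # hypothetical contributor identifier
    license: str    # license label asserted by the contributor (illustrative only)
    age: int
    diagnosis: str


def count_cases(records, predicate):
    """Case-finding query: returns only a number, never row-level data."""
    return sum(1 for r in records if predicate(r))


def fetch_rows(records, predicate):
    """Row-level query: returns the matching rows together with the
    license labels that travel with them."""
    rows = [r for r in records if predicate(r)]
    licenses = sorted({r.license for r in rows})
    return {"rows": rows, "licenses": licenses}


records = [
    Record("hospital-a", "copyleft", 61, "T2DM"),
    Record("hospital-b", "proprietary", 47, "T2DM"),
    Record("clinic-c", "proprietary", 58, "T2DM"),
    Record("hospital-a", "copyleft", 70, "CHF"),
]


def older_diabetics(r):
    # Inclusion criteria: type 2 diabetes, age 50 or over.
    return r.diagnosis == "T2DM" and r.age >= 50


print(count_cases(records, older_diabetics))              # prints 2
print(fetch_rows(records, older_diabetics)["licenses"])   # prints ['copyleft', 'proprietary']
```

The count query hands back nothing that belongs to any one contributor, whereas the row-level response carries contributors' data and therefore has to carry their terms along with it.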
Merges' first objection is to ‘digital determinism' (DD), the utopian idea that IP policy should just acquiesce to whatever users do, such as illegally share media via peer-to-peer services like Napster, Grokster or Kazaa. His second objection is to the utopian notion of ‘collective creativity' (CC): the idea of collaboration as an inherently virtuous means that justifies any and all ends. The fact that intermediaries may not any longer be necessary in online culture is used to assert that middle-men (agents and regulatory agencies and fair-use auditing and other compliance/forensic services) are never necessary or useful. Mash-ups and small contributions by many parties can be assembled into a new “distributed creation” that is unlike traditional media and processes and the social and legal constructs that govern those. It is like Cluetrain, NowIsGone, and other iconoclast manifestos that think legitimacy will come by shouting long and loud about how yesterday—and all of its laws and policies and procedures—don't matter anymore.
Like Merges, I believe that utopian practices will be unsustainable. Newness does not equal fairness or goodness. Enablement and collaborative empowerment only work so long as people treat each other fairly, and ensuring that they do so is precisely the reason why we have patent and copyright laws. The utopian critiques of existing intellectual property constructs can be useful, but they often fail to understand the very real impact of enabling individuals and organizations to re-use information in new ways. The rise of mass media created the so-called ‘Fourth Estate', and the use of the Internet and related technologies is today a global ‘Fifth Estate.' And now the use and creation of derivative works, knowledge, and “meta” data assets enable the acceleration of valuable discoveries, in ways that entail yet another new source of power that in turn demands renewed kinds of accountability: a ‘Sixth Estate.'
The development and sustainability of the ‘Sixth Estate' do not require new policy initiatives or new laws and regulations. Instead, the emergence of the Sixth Estate requires an understanding of—and respect for—existing Third Estate laws and policies.
Aufderheide P, Jaszi P. Reclaiming Fair Use: How to Put Balance Back in Copyright. Univ Chicago, 2011.
Austin G. Importing Kazaa, exporting Grokster. Santa Clara Comp & High-Tech Journal 2006; 22:577-86.
Bin R, et al., eds. Biotech Innovations and Fundamental Rights. Springer, 2011.
Coates K. Competition Law and Regulation of Technology Markets. Oxford Univ, 2011.
Cohen J, Loren L. Copyright in a Global Information Economy. 3e. Wolters Kluwer, 2010.
Cooper S. Watching the Watchdog: Bloggers as the Fifth Estate. Marquette, 2006.
Dawkins R. The Blind Watchmaker. W.W. Norton, 1996.
Giburd B, et al. K-TTP: A new privacy model for large-scale distributed environments. Proc ACM KDD-2004, Seattle WA.
Kutsche R-D, et al., eds. Engineering Federated Information Systems. Infix Verlag/IOS, 2001.
Lamoureux E, et al. Intellectual Property Law and Interactive Media. Peter Lang, 2009.
Landy G, Mastrobattista A. The IT / Digital Legal Companion: A Comprehensive Business Guide to Software, IT, Internet, Media and IP Law. Syngress, 2008.
Lessig L. Code: And Other Laws of Cyberspace, Version 2.0. Basic, 2006.
Lessig L. Free Culture: The Nature and Future of Creativity. Penguin, 2005.
Lessig L. The Future of Ideas: The Fate of the Commons in a Connected World. Vintage, 2002.
Lindberg V. Intellectual Property and Open Source: A Practical Guide to Protecting Code. O'Reilly, 2008.
Merges R. Justifying Intellectual Property. Harvard Univ, 2011.
Merges R, et al. Intellectual Property in the New Technological Age. 5e. Aspen, 2009.
Mitakis C. The e-rated industry: Fair-use sheep, or infringing goat? Vanderbilt J Ent Law & Practice 2004; 291-6.
Nielsen M. Reinventing Discovery: The New Era of Networked Science. Princeton Univ, 2011.
Phelps M, Kline D. Burning the Ships: Transforming Your Company's Culture Through Intellectual Property Strategy. Wiley, 2010.
Reagle J. Good Faith Collaboration: The Culture of Wikipedia. MIT, 2010.
Seni G, Elder J. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan Claypool, 2010.
Strowel A, ed. Peer-to-Peer File Sharing and Secondary Liability in Copyright Law. Elgar, 2009.
Trade-Related Aspects of Intellectual Property Rights (TRIPS) Agreement. Annex 1C, 1994.
World Intellectual Property Organization (WIPO) Copyright Treaty, Article 6bis(5).
Timothy Gowers's blog
U.S. Code Title 17 Copyright
U.S. Code Title 35 Patents
Oxford Internet Institute
Cambridge University Centre for Research in the Arts, Social Sciences & Humanities (CRASSH)
U.K. Central Office of Information (COI) NDS Midata Programme
Fair-Use Project at Stanford University
Open Invention Network
U.S. Copyright Office. Exemption to prohibition on circumvention of copyright protection systems for access control technologies. 29-SEP-2011.
Digital Millennium Copyright Act (DMCA) page at Electronic Frontier Foundation (EFF)
Copyleft page at Wikipedia
i2b2 page about i.p. and restriction to non-commercial uses
Recombinant Data Corp
Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows and is responsible for innovations in decision support and very-large-scale datamining. McNair joined Cerner in 1986, first as VP of Cerner's Knowledge Systems engineering department; then as VP of Regulatory Affairs; then as General Manager for Cerner's Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner's clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.