December 19, 2011
The Sixth Estate in the cloud, intellectual property and the limits of collaboration
“Many IP [intellectual property] scholars have emphasized the important benefits of openness with respect to digital media ... In a nutshell, the dominant idea is that, in the digital age, the best IP policy is a minimalist IP policy. This is wrong. Robust IP protection is in no way inconsistent with the promotion of a flourishing environment for digital media [and science]. Quite the contrary: IP rights are essential to this goal. As compared to the top-down, one-size-fits-all approach of IP minimalists, traditional strong IP protection encourages and facilitates a wide variety of approaches, without mandating or coercing any single approach.” Robert Merges, 2011
Robert Merges of the UC Berkeley law school has written a book (Justifying
Intellectual Property, published last June by Harvard University Press) that you will
want to read. It is exciting, beautifully written, and has strong relevance to
discoveries and inventions that arise from observational research involving re-use
of health data, including ‘panel’ or population-level data such as are stored
in multi-institution EHR-derived data warehouses and HIE repositories. In the
book, Merges defends the importance of strong intellectual property protections
in the present populist, pervasively-interconnected, social-network-obsessed information
age that is so enthralled with unfettered collaboration for collaboration’s
sake.
In that regard, there are some health informatics leaders who
uncritically advocate
“sending the [research] query to the [geographically-distributed] data.” From
technology and access points of view—and compared to the alternative of
centralizing everything—broadcasting database queries to the data wherever they
may be located seems like a sensible thing to do. On the surface it sounds
good. And it is sensible if you are sure of the provenance and quality of the data that are out there, and if the
only thing you care about is getting an answer to your EHR or HIE database
query. That may be the case if what you’re doing is purely a matter of public
health, say.
But what if your query—or a data mining series of queries, an ensemble
of queries combined with statistical and mathematical analysis of the queries’
results—produces a new discovery? What if it yields a valuable invention:
intellectual property that reflects your skill and insights in designing the
queries and analysis, as well as the value that inheres in the data resources
you analyzed, which are owned by others and have been invested in and
painstakingly curated and paid for by others over the years? What if your query
may, either deliberately or inadvertently, reveal the “secret sauce” of the
health care processes of the contributor providers whose activity gave rise to
the data that your query is interrogating and analyzing?
In those situations, I believe you would agree to “send your query to
the data” only if (A) all of the stakeholders’ intellectual property rights are
respected, or if (B) no IP or commercial uses or derivative works are
anticipated to result from the query. Not just you as a stakeholder and the HIE
operator as a stakeholder, but every stakeholder. A corollary is that providers
and consumers are stakeholders, too, who would reasonably only consent to their
data being warehoused and re-used and subjected to your queries if (A) their IP
rights are respected, or (B) no IP emanates from the re-uses of their data
assets.
Why did reading Robert Merges’ book make me think about this? Because
the rapid development of HIEs and other multi-contributor health data
warehouses is, I believe, giving rise to what we could call ‘The
Sixth Estate,’ and Merges’ writing deals specifically with intellectual
property in the Sixth Estate.
The Six Estates
- The First Estate is clergy—organized religion as
a political authority.
- The Second Estate is nobility—aristocracy whose
political power and wealth are inherited from generation to generation. This
includes individuals and private organizations who command prodigious wealth,
such that they can influence elections or cause the Third Estate to be governed
in whatever ways they wish.
- The Third Estate consists of the common citizens
in a democracy, plus elected and appointed officials in their government. This
includes government health services agencies and health research councils
(AHRQ, IOM, MRC, and the like). Indirectly, it also includes commercial health
services providers and insurers and other health-related firms whose policies
and coverage rules have pervasive scope (statewide or national), to which
consumers often have few alternatives and little or no recourse, due to
geographic market concentration. Through their lobbying and other political
activity, non-media corporations are part of and exert their power through the
Third Estate.
- The Fourth Estate is, essentially, official
corporate and public-sector “mainstream” media and NGOs and the mainstream
scientific establishment, whose traditional roles include serving as a
check-and-balance on the Third Estate. For example, many hospitals and managed
care organizations and health-related professional societies engage in
standards-development, evaluations of best-practice and evidence-based
medicine, and so forth. Through these activities, they critique—and lobby
forcefully to revise—the policies and practices that are sanctioned by Third
Estate organizations.
- The Fifth Estate consists of trustworthy
critics of the Fourth Estate—ad hoc, mostly-noncommercial participants in
networked social media, the blogosphere, and semantic web mash-ups—perhaps now
including HIEs or other networked consortia.
- The Sixth Estate is yet another
check-and-balance—performing evaluations of and critiquing all of the lower
Estates. But it must also be recognized as a “meta-Estate” insofar as it can
produce analytics and new discoveries from the corpus of data that arises from
activity of the lower Estates.
In any capitalistic democracy, each 'Estate' consumes information
regarding what is done/emitted/produced by the Estate immediately below it and
produces something else from it—usually ongoing analysis or critical commentary
that assesses the various consequences or results of the other Estate’s
policies and activities, and informs or advises or recommends things that
should be done to change or improve them.
Each Estate holds and exercises some form of political power in society.
Each Estate is associated with some type of property; each has some amount of financial
and other resources. This is so, too, with organizations and individual
participants in the Sixth Estate, such as formal federations of health data
repositories and informal, loosely-affiliated or transient contributors to
specific projects involving aggregation or mining of health data. The
conventional meme about the Estates focuses only on the top-down flow of power:
the critiquing role of each Estate toward the one below it in the hierarchy. However,
there are bottom-up and non-hierarchical interactions among these Estates, too.
What Robert Merges' book addresses is the role of government and the
Law (the Third Estate) and its mechanisms to protect the intellectual property
rights of members of all of the other Estates. This is increasingly important with peer-to-peer (P2P) and
cloud-based, geographically dispersed data assets and the IP that resides in them or
is derived from them.
In P2P information retrieval, a network of peer servers provides a (re-)search
service collaboratively. When there is true parity among the participating
peers, in terms of equal amounts of consuming and supplying resources and in
terms of comparable exposure to risks and costs and rewards from the activity,
everything is fair. Traditionally, most IT systems have employed a centralized
architecture. Database transactions and queries were controlled by code that
executed the database operations on data that were stored and managed in one
central location. But increasingly, cloud- and grid-based and federated IT
systems employ a geographically dispersed architecture in which database
transactions and queries are executed in a distributed fashion, where the data
are stored and managed by different organizations and the eventual uses to
which the data are put are, in general, not known to those organizations.
Furthermore, with Map-Reduce or other mechanisms that segment and
parallelize the distribution of queries, it is likely that the intentions of
the entities who are executing the queries could not readily be ascertained by
examining the logs of any one subset of the grid- or cloud-based services, since each
subset executed only its own segment of the overall ensemble of queries.
So long as no confidential or identifiable data are revealed, that
situation is not, in and of itself, a problem. It only becomes problematic when
the entity who executes the query (or ensemble of queries, on an ensemble of
data from different sources) later turns those results into valuable
intellectual property and then unjustly declines to disclose that fact to the
stakeholders who own the data assets that enabled the discovery or invention of
the property, and declines to enter into a contract to share the license fees
or royalty revenues or other cashflow and assets and property rights with those
stakeholders.

Given that situation, in the future we will need capabilities that
assign agents to services that execute queries upon data and ensembles of data,
rather than merely assigning functions or tasks. And we will need capabilities to
“log” or track who did what, which data contributed to which discoveries and
patents and contracts, and so on. An essential feature of this approach is
query-nonrepudiation and financial negotiation between orders and the resources
on which they seek to be executed. Approved orders (queries) are given a budget
(of disk I/Os; of CPU seconds; of memory; of network bandwidth; of other resources),
and they operate as temporary cost-centers. The orders (queries) ask for
processing bids from the cloud-based data services resources that are able to
provision [portions of] what the orderer wants; the orders (queries) then
accept bids from the least expensive, adequately timely, adequately extensive
resources that meet their requirements. The resources that execute the orders
(queries) operate as virtual revenue-centers, seeking to stay busy while
maximizing the fees they earn from the orders they process. Management of each
data service prioritizes the orderstream by varying the budgets that are
granted to the approved orders. Orders that offer high budgets and favorable
terms with regard to downstream intellectual property rights participation can
outbid other orders. Orders that request access to rare, specialized data or
that entail use-cases that have high potential commercial value must agree to
prices and terms that are commensurate with the scarce supply of the resources
and the high financial value of the uses to which the data are to be put.
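The bid-acceptance step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a design for a real exchange: the names Order, Bid and select_bid, and every field on them, are hypothetical, and "adequately timely" and "adequately extensive" are reduced to simple thresholds.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Order:
    """An approved query acting as a temporary cost-center (hypothetical fields)."""
    order_id: str
    budget: float         # maximum fee the order may spend
    max_latency_s: float  # threshold for "adequately timely"
    min_coverage: float   # threshold for "adequately extensive" (fraction, 0..1)
    ip_share: float       # share of downstream IP revenue offered to data owners


@dataclass(frozen=True)
class Bid:
    """A processing bid from a data service acting as a revenue-center."""
    resource_id: str
    price: float
    latency_s: float
    coverage: float
    min_ip_share: float   # IP participation the data owner requires


def select_bid(order: Order, bids: Iterable[Bid]) -> Optional[Bid]:
    """Return the least expensive bid that meets every requirement, or None."""
    acceptable = [
        b for b in bids
        if b.price <= order.budget
        and b.latency_s <= order.max_latency_s
        and b.coverage >= order.min_coverage
        and order.ip_share >= b.min_ip_share
    ]
    return min(acceptable, key=lambda b: b.price, default=None)


order = Order("q1", budget=100.0, max_latency_s=60.0, min_coverage=0.8, ip_share=0.10)
bids = [
    Bid("hie_a", price=40.0, latency_s=30.0, coverage=0.90, min_ip_share=0.05),
    Bid("hie_b", price=25.0, latency_s=120.0, coverage=0.95, min_ip_share=0.00),  # too slow
    Bid("hie_c", price=30.0, latency_s=45.0, coverage=0.85, min_ip_share=0.20),   # wants more IP share
    Bid("hie_d", price=55.0, latency_s=10.0, coverage=1.00, min_ip_share=0.10),   # acceptable, pricier
]
winner = select_bid(order, bids)
print(winner.resource_id if winner else "no acceptable bid")
```

In this sketch the IP terms are a hard constraint rather than something traded off against price. A data service that wanted to let favorable IP terms outbid a higher budget would need a scoring rule instead, and the weights in such a rule would themselves be a matter for negotiation.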
Or, among the terms, maybe you will only agree to provide me with data
if I accept your “Non-assertion of Patent” (NAP) clause, much like Microsoft
has done with its OEMs who license Windows. I have to agree in advance not to
sue you (or other contributors to a data ensemble) for patent infringement
after you provide the data asset to me. No one has any reliable way of knowing
the full scope of what others intend to do with the asset, in terms of generating
new inventions, or reducing inventions to practice, or validating the practicality
or usefulness of the inventions. But, to prevent any dispute in the future, I
must pre-agree to your NAP clause.
Basically, the advocates of loosely-coupled P2P data mining
applications operating on different platforms only talk about “research” or
“public health” or other kinds of interoperating without any IP agreements in
place. That sort of advocacy perpetuates the myth of science and medicine as
priestly professions, pretending that they are not businesses and pretending
that there is no intellectual property or any need for identifying and
protecting stakeholders’ rights in such property.
Yes, there are problems whose solution only comes about through open
collaboration, or whose solution comes far quicker with collaboration than by
restricted, proprietary efforts. Tim Gowers, a Cambridge University
mathematician, Fields Medal winner, and founder in 2009 of the Polymath
Project, comes to mind. But Gowers’ project was not about applied
mathematics that has immediate, commercial intellectual property value, like
theorems pertaining to data compression or cryptography or NP-hard computability.
It was instead a proof of a theorem in abstract mathematics, the density
Hales-Jewett theorem.
By contrast, almost every discovery related to health has substantial
commercial value, in addition to whatever public health or social and political
significance it may have. What intellectual property rights might a health
institution or a consumer have, in an invention that was based, in part, on the
re-use of their data? Possibly none, or possibly some. Consider that the collection of data amounts to an asset that
arose as a byproduct of other enterprise, such that the enterprise did not
contemplate in advance the variety of subsequent potential purposes to which
the byproducts might be put. The data are much like the “watch” of Richard
Dawkins’ “blind watchmaker.” The watchmaker does not need to intend explicitly
or in advance that it is a watch that she/he is making, let alone this specific watch, for there to be a
watch that is produced by the watchmaker's actions, or for the watchmaker to
have a valid property interest in the work product that is protected under the
law. The data are a work product in their own right, just as the watch is, and
valuable as such in their original, intended use-cases. But now we have
observational researchers analyzing an ensemble of such “watches,” and
discovering and inventing new things for different use-cases based on that
analysis.
Let’s say a musician happens to improvise some music in live
performance, extemporizing a new song. The musician consents for it to be
recorded by the other party, and that other party does record it. Subsequently,
a portion of what was recorded is put online, goes “viral” and is incorporated
into several derivative works by other artists. But before the performance, the
musician did not know whether she/he would improvise or create anything new at
all, and certainly did not contemplate the possible derivative uses of that new
thing, if some new tune did in fact get created. The new thing, the
improvised tune, is a kind of byproduct of the musician’s performance.
Most traditional, tangible “byproducts” (blood meal and gelatin from
slaughterhouses; whey/casein from cheese manufacture; bran from milling;
glycerol from biodiesel refining; molasses from sugar refining; asphalt from
oil refining; gypsum from flue scrubbing) can be regarded as mere commodities—raw
materials that can be used in the manufacture of other goods and processes. There
is no expectation on the part of the manufacturer that the sale of the byproduct
asset may yield a large and infinitely long stream of revenues for the party to
whom it’s sold. By contrast, data are not used-up in the course of their
re-use; the data can be re-used arbitrarily many times.
Furthermore, unlike traditional, tangible byproducts which are fungible
and do not reveal anything about the people or processes that produced them,
data byproducts (A) are not all equal and (B) do reveal things about how they
were produced. Consider when news breaks about relatively unknown people and
the media turn to social networking sites such as Facebook to find photographs
of the subject and other information about them. Posting photos on social media
does not make them public domain, and copyright holders may have a claim for infringement
for unauthorized use of the photos or certain sensitive, valuable information
that they have posted. The same is true of [de-identified] EHR data that are
uploaded into an HIE or a repository for observational research. And unlike
traditional byproducts that are single-use commodities, used-up as they are
used, data are infinitely re-usable. While they are “fresh enough,” their value
is inexhaustible. Some data have a value that is “perishable” or time-dependent.
But while they are sufficiently fresh to meet the requirements of a given
use-case, the data can be used again and again.
With these things in mind, I suggest that population-level health data
are not a commodified “byproduct” per se; instead, they are a non-commodity
“coproduct” of health services delivery.
Licensing health data can be done in many ways. Some advocate GPL (GNU
General Public License) “copyleft” licensing, as is used with Linux and a lot of
open-source software. But GPL licenses have historically applied to executable
procedural code. That means that, so long as you merely execute the code on
your own premises (including cloud-based servers that you own or rent, running
the code as software-as-a-service) and you do not redistribute the protected
asset or any part of it, then you are not in the position of granting copyleft
GPL rights to anyone else.
However, for HIEs and other interoperable data warehouses what we are
talking about is instead declarative corpuses of data and epidemiologic or
index or other derivative works that structure it and govern access to it. It
is possible that the only thing you do is respond to queries that interrogate
the existence or number of cases that meet a set of inclusion-exclusion
selection criteria (as with clinical-trial planning case-finding apps). In such
a situation, the information you return does not admit of any significant
opportunities for intellectual-property-bearing re-use. In other words, some of
the queries are simply establishing whether a hypothesis is supportable,
or finding out whether the number of cases that exists in the
responding datasets is sufficient to power a planned study, or finding out
which investigational centers may be able to perform the planned study. To
date, that is predominantly what i2b2 and other services do. Such queries are
like nothing more than a prior-art search on a corpus of issued or published
patents.
More commonly, though, you respond to queries with detailed
row-and-column information from the database tables: the services that respond to the queries are thereby
distributing protected assets or subassemblies of them. If the owners of any of
the data subassemblies or ensembles have asserted copyleft protection, then
those services are bound to pass through the copyleft rights to the
requestor for whom the query was performed.
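The distinction drawn in the last two paragraphs, between count-only answers and row-level answers, can be written down as a small rule. The sketch below is a simplification of the argument in this post and not a statement of what any license actually requires. The names LicenseKind, Source and response_obligations are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class LicenseKind(Enum):
    PERMISSIVE = "permissive"
    COPYLEFT = "copyleft"


@dataclass(frozen=True)
class Source:
    """A contributing dataset and the terms its owner has asserted (hypothetical)."""
    name: str
    license_kind: LicenseKind


def response_obligations(sources: Iterable[Source], returns_rows: bool) -> Dict[str, object]:
    """Summarize what a query response carries with it under the post's reasoning.

    Assumption: a count-only or existence-only answer distributes no protected
    asset, while a row-and-column answer distributes part of every source it
    draws on, so any copyleft terms on those sources pass through to the requestor.
    """
    if not returns_rows:
        return {"distributes_data": False,
                "copyleft_passes_through": False,
                "copyleft_sources": []}
    copyleft = sorted(s.name for s in sources if s.license_kind is LicenseKind.COPYLEFT)
    return {"distributes_data": True,
            "copyleft_passes_through": bool(copyleft),
            "copyleft_sources": copyleft}


sources = [Source("hospital_a", LicenseKind.COPYLEFT),
           Source("clinic_b", LicenseKind.PERMISSIVE)]
print(response_obligations(sources, returns_rows=False))  # case-finding count
print(response_obligations(sources, returns_rows=True))   # row-level extract
```

Whether a given response really counts as distribution is a legal question that a boolean flag cannot settle. The flag only marks where that question has to be asked.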

Merges’ first objection is to ‘digital determinism’ (DD), the utopian
idea that IP policy should just acquiesce to whatever users do, such as
illegally share media via peer-to-peer services like Napster, Grokster or Kazaa.
His second objection is to the utopian notion of ‘collective creativity’ (CC): the idea of collaboration as an inherently
virtuous means that justifies any and all ends. The fact that intermediaries
may not any longer be necessary in online culture is used to assert that
middle-men (agents and regulatory agencies and fair-use auditing and other
compliance/forensic services) are never necessary or useful. Mash-ups and small
contributions by many parties can be assembled into a new “distributed
creation” that is unlike traditional media and processes and the social and
legal constructs that govern those. It
is like Cluetrain, NowIsGone, and other iconoclast manifestos whose authors think
legitimacy will come by shouting long and loud about how yesterday (and all of
its laws and policies and procedures) doesn’t matter anymore.
Like Merges, I believe that utopian practices will be unsustainable. Newness
does not equal fairness or goodness. Enablement and collaborative empowerment
only work so long as people treat each other fairly, and ensuring that they do
so is precisely the reason why we have patent and copyright laws. The utopian critiques
of existing intellectual property constructs can be useful, but they often fail
to understand the very real impact of enabling individuals and organizations to
re-use information in new ways. The rise of mass media created the so-called
‘Fourth Estate’, and the use of the Internet and related technologies is today
a global ‘Fifth Estate.’ And now the use and creation of derivative works,
knowledge, and “meta” data assets enable the acceleration of valuable
discoveries—in ways that entail yet another new source of power, that in turn
demands renewed kinds of accountability, a ‘Sixth Estate.’
The development and sustainability of the ‘Sixth Estate’ do not require
new policy initiatives or new laws and regulations. Instead, the emergence of
the Sixth Estate requires an understanding of—and respect for—existing Third
Estate laws and policies.
Additional Resources
Aufderheide P, Jaszi P. Reclaiming Fair Use: How to Put Balance Back in
Copyright. Univ Chicago, 2011.
Austin G. Importing Kazaa, exporting Grokster. Santa Clara Comp &
High-Tech Journal 2006; 22:577-86.
Bin R, et al., eds. Biotech Innovations and Fundamental Rights.
Springer, 2011.
http://www.amazon.com/Biotech-Innovations-Fundamental-Rights-Roberto/dp/884702031X/
Coates K. Competition Law and Regulation of Technology Markets. Oxford
Univ, 2011.
Cohen J, Loren L. Copyright in a Global Information Economy. 3e.
Wolters Kluwer, 2010.
http://www.amazon.com/Copyright-Global-Information-Economy-3e/dp/0735591962/
Cooper S. Watching the Watchdog: Bloggers as the Fifth Estate.
Marquette, 2006.
Dawkins R. The Blind Watchmaker. W.W. Norton, 1996.
Giburd B, et al. K-TTP: A new privacy model for large-scale distributed
environments. Proc ACM KDD-2004, Seattle WA.
Kutsche R-D, et al., eds. Engineering Federated Information Systems.
Infix Verlag/IOS, 2001.
Lamoureux E, et al. Intellectual Property Law and Interactive Media.
Peter Lang, 2009.
http://www.amazon.com/Intellectual-Property-Interactive-Digital-Formations/dp/0820486353/
Landy G, Mastrobattista A. The IT / Digital Legal Companion: A
Comprehensive Business Guide to Software, IT, Internet, Media and IP Law.
Syngress, 2008.
Lessig L. Code: And Other Laws of Cyberspace, Version 2.0. Basic, 2006.
Lessig L. Free Culture: The Nature and Future of Creativity. Penguin,
2005.
Lessig L. The Future of Ideas: The Fate of the Commons in a Connected
World. Vintage, 2002.
Lindberg V. Intellectual Property and Open Source: A Practical Guide to
Protecting Code. O’Reilly, 2008.
http://www.amazon.com/Intellectual-Property-Open-Source-Protecting/dp/0596517963/
Merges R. Justifying Intellectual Property. Harvard Univ, 2011.
http://www.amazon.com/Justifying-Intellectual-Property-Robert-Merges/dp/0674049489/
Merges R, et al. Intellectual Property in the New Technological Age.
5e. Aspen, 2009.
Mitakis C. The e-rated industry: Fair-use sheep, or infringing goat?
Vanderbilt J Ent Law & Practice 2004; 291-6.
Nielsen M. Reinventing Discovery: The New Era of Networked Science.
Princeton Univ, 2011.
http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/0691148902/
http://michaelnielsen.org/polymath1/index.php?title=Main_Page
Phelps M, Kline D. Burning the Ships: Transforming Your Company's
Culture Through Intellectual Property Strategy. Wiley, 2010.
Reagle J. Good Faith Collaboration: The Culture of Wikipedia. MIT,
2010.
Seni G, Elder J. Ensemble Methods in Data Mining: Improving Accuracy
Through Combining Predictions. Morgan Claypool, 2010.
Strowel A, ed. Peer-to-Peer File Sharing and Secondary Liability in
Copyright Law. Elgar, 2009.
Trade-Related Aspects of Intellectual Property Rights (TRIPS)
Agreement. Annex 1C, 1994.
World Intellectual Property Organization (WIPO) Copyright Treaty,
Article 6bis(5).
Timothy Gowers’s blog
http://gowers.wordpress.com/
U.S. Code Title 17 Copyright
http://www.law.cornell.edu/uscode/html/uscode17/usc_sup_01_17_10_1.html
U.S. Code Title 35 Patents
http://www.law.cornell.edu/uscode/html/uscode35/usc_sup_01_35.html
Oxford Internet Institute
http://www.oii.ox.ac.uk/events/?id=485
Cambridge University Centre for Research in the Arts, Social Sciences
& Humanities (CRASSH)
http://www.crassh.cam.ac.uk/events/1533/
U.K. Central Office of Information (COI) NDS Midata Programme
http://nds.coi.gov.uk/content/Detail.aspx?ReleaseID=421869&NewsAreaID=2
Fair-Use Project at Stanford University
http://www.law.stanford.edu/program/centers/fup/
Open Invention Network
http://openinventionnetwork.com/
U.S. Copyright Office. Exemption to prohibition on circumvention of
copyright protection systems for access control technologies. 29-SEP-2011.
http://www.copyright.gov/1201/
Digital Millennium Copyright Act (DMCA) page at Electronic Frontier
Foundation (EFF)
https://www.eff.org/issues/dmca-rulemaking
Copyleft page at Wikipedia
http://en.wikipedia.org/wiki/Copyleft
i2b2 page about i.p. and restriction to non-commercial uses
https://www.i2b2.org/work/policy.html
Recombinant Data Corp
http://www.recomdata.com/
Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows, responsible for innovations in decision support and very-large-scale data mining. McNair joined Cerner in 1986, first as VP of Cerner’s Knowledge Systems engineering department; then as VP of Regulatory Affairs; then as General Manager for Cerner’s Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner’s clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.