Navigating the perfect [data] storm

Bioscience has recently undergone a series of knowledge-based and technological revolutions. A critical consequence has been increasing recognition of the need to invest in infrastructure. Good access to data (and samples) from multiple studies is axiomatic to the value of this infrastructure. Access must be streamlined, secure, and based upon transparent and 'fair' decision making. It must be clear who has created and who has used which data. Ethico-legal policies and guidelines, which reflect dominant local cultural and societal norms, must take account of the increasingly global nature of bioscience research. A robust data infrastructure must also be attentive to the translational aims and social impact of its knowledge generation. In order to maintain the trust of its constituency, the general public as well as professional, political and commercial stakeholders, it must develop mechanisms to take account of all of these perspectives. These considerations form the basis of an emerging data economy. Building on extant achievements and pursuing the ideas outlined here could revolutionise the way we use and manage large-scale data. They have critical implications for biomedical and public health research communities and will be of central relevance for healthcare managers and policy makers, governments and industry. However, if the major challenges are to be met, we must continue to invest, both nationally and internationally, in developing the cooperative infrastructures that provide a complementary foil to the competitive resourcing mechanisms that drive hypothesis-driven science.


INTRODUCTION
Scientific advance involves the asking and answering of questions within the constraints of contemporaneous knowledge and technology. Until recently, most definitive 'answers' in health science reflected factors with relatively large effects (e.g. the health impact of smoking cigarettes). But the study of the etiological architecture of common chronic diseases demands that we explore much weaker effects (1,2), including interactions (3,4). This poses obvious challenges for statistical power (5). Moreover, weak relationships are easily created or concealed by confounding or reverse causality (6). Provision of an effective platform for tomorrow's biomedical science therefore demands high-quality data on an unprecedented scale. Furthermore, many research questions necessitate co-analysis of multiple studies, placing a premium on data harmonization (7) and streamlined access. Though the shape of this emergent data economy (or, more accurately, economies) is as yet unclear, its evolution is rapidly gathering momentum.
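The power problem can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. The sketch below uses illustrative risk figures chosen for this example (they are not drawn from the text) and shows why weak effects demand data on a far larger scale than strong ones.

```python
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate participants needed per group to detect a difference
    between two proportions (two-sided test, normal approximation)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = z.inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2


# A strong effect (10% vs 20% risk) needs roughly two hundred participants
# per group; a weak effect (10% vs 11%) needs roughly fifteen thousand.
strong = n_per_group(0.10, 0.20)
weak = n_per_group(0.10, 0.11)
```

Because the required sample size scales with the inverse square of the effect, halving an effect roughly quadruples the data needed, which is precisely why co-analysis of multiple studies becomes unavoidable.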
We are facing a 'perfect [data] storm' on four main fronts, each of which needs resolution to enable the development of an effective platform for biomedical science. First, there is a need to create political, legal and ethical frameworks for data governance that incorporate privacy issues and protect research participants' personal information, whilst also being attentive to the ethical dimensions of the scientific enterprise, such as intellectual property rights and recognition of the investment of scientists (8,9). Second, there is a need to establish effective mechanisms for recognising the substantive contributions of everybody involved in building, maintaining and operating data infrastructures (8,10), not just the research leaders who obtain funding (11). Third, there is a need to optimise the exploitation of an increasing deluge of large, complex data sets (12-14) and to identify the social dimensions of optimising data curation (15) and data sharing (16,17). Fourth, the management and use of data and the generation of knowledge need to be taken forward with social impact and translational aims in mind, particularly by engaging the insights of all relevant stakeholders (16).

Protecting participants
The maintenance of public and scientific trust in the systems of scientific governance is fundamental to successful data sharing. Technological advances must work within the existing political, legal and cultural environment to have legitimacy and be socially accepted. One of the challenges is to build governance structures that allow the free movement of data to encourage scientific advance while at the same time ensuring that individual participants' data are protected from harm. The ethical, epistemological, social and practical barriers to data sharing within the research community need to be studied, practical solutions need to be researched, and changes implemented. Innovative solutions need to be developed to address the confidentiality and anonymisation challenges that arise from the rich, complex and potentially identifiable data that can be amassed through biomedical infrastructure. Changing and diverse societal attitudes towards privacy, as well as new ways of engaging research participants through social media and IT solutions, need to be incorporated into the development of new governance frameworks.
While research is increasingly global, with large international, multidisciplinary collaborations and studies that span national borders, our current regulatory mechanisms for research are nationally based. Working to create frameworks at an international level (cf. the P3G generic consent materials for population biobanks) for adaptation within local settings can help to address this issue. Likewise, frameworks developed at the local level can inform international policies. Considerable work has already been done to link national endeavours with international platforms and to co-ordinate efforts through organisations such as P3G and ISBER, to prospectively harmonize these efforts and meet the future challenges of e-governance. However, the accelerating pace of genome science will demand an internationally coherent approach if we are to have any chance of addressing future challenges. And, as we argue below, this necessitates the active consideration of the viewpoints of a range of stakeholders, whether scientific, professional, public or participant (16). Development of such a global vision for the ethical, legal and social implications (ELSI) of genomics is underway (32) and must underpin protection both of the participants, whose data are the basis of scientific knowledge, and of the scientists and others who produce that knowledge.

Identifying scientific contributions
Transparent identification of data and their origins is central to the acknowledgment of all contributions in the ideal data management infrastructure. This requires all actants (33) (material or human entities) to be unambiguously, computationally and securely identifiable. In practice, this would mean assigning digital identifiers (IDs), and sometimes version numbers, to everything, not least: biobanks/cohorts; the institutions that host resources; research participants; datasets and databases; the scientists that generate the 'dataverse'; the individuals/organisations that use the stored and shared information; and the journals that publish their findings. Currently, only some of these carry IDs, but optimal data sharing and usage will not be achieved until such IDs become ubiquitous, properly designed, and widely recognised and used.
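One hallmark of a properly designed ID is that it is machine-verifiable. ORCID iDs (see Box 1), for example, end in an ISO 7064 MOD 11-2 check character, so software can catch a mistyped identifier before it enters a database. A minimal sketch of that checksum:

```python
def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits
    of an ORCID iD (hyphens removed)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def orcid_is_valid(orcid: str) -> bool:
    """Validate a full 16-character ORCID iD such as '0000-0002-1825-0097'."""
    digits = orcid.replace("-", "")
    return len(digits) == 16 and orcid_check_digit(digits[:15]) == digits[-1]
```

Altering any single digit of a valid iD breaks the checksum, which is the kind of property that ubiquitous IDs for actants of all kinds would need in order to support reliable, automated tracing of contributions.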
There is potential to leverage online digital IDs to establish a globally distributed and seamlessly automated system for facilitating data access, bringing benefits of speed, transparency and equity (8). The scheme (34) (Box 1) would greatly improve current processes for granting access to potentially sensitive datasets. Several small-scale projects under the auspices of GEN2PHEN and BioSHaRE-EU are currently piloting controlled access to summary-level, aggregate datasets, aiming to roll out this approach for use with more sensitive data. Such a system would, for example, have circumvented the data-release controversies that followed Homer et al. (35). It would also ameliorate the current hindrance of scientific progress by the delays and complications involved in gaining access to this class of data, a situation at odds with the obligation to maximize the knowledge generated by publicly funded research (8).
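The "speed pass" idea described in Box 1 can be caricatured as a simple decision rule: a verified institutional ID plus low- or moderate-risk data yields automatic access, higher-risk data still go to full review, and recorded misuse forfeits access rights. The names and risk tiers below are purely hypothetical illustrations, not part of any proposed standard:

```python
from enum import Enum


class Decision(Enum):
    GRANT = "automatic access"            # light-touch "speed pass"
    REVIEW = "refer to oversight committee"
    DENY = "access refused"


def speed_pass(id_verified: bool, disclosure_risk: str,
               misuse_on_record: bool) -> Decision:
    """Hypothetical sketch of a speed-pass decision for one data request.
    disclosure_risk is 'low', 'moderate' or 'high'."""
    if misuse_on_record:
        return Decision.DENY    # proscribed misuse leads to loss of access rights
    if not id_verified:
        return Decision.REVIEW  # no verifiable researcher ID: manual handling
    if disclosure_risk in ("low", "moderate"):
        return Decision.GRANT   # Box 1: light-touch oversight suffices
    return Decision.REVIEW      # high-risk data still need full review
```

The point of the sketch is that, once researcher and resource IDs are trustworthy, most access decisions for lower-risk data reduce to an automatable policy check, reserving committee time for the genuinely sensitive cases.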

ANALYSING DATA THAT CANNOT BE ACCESSED
Social and ethico-legal imperatives driving expectations of security, privacy and transparency have already engendered important changes in how we use, share and analyse data. For example, when data are physically very large, Kahn (12) argues that streamlined analysis may benefit from moving "computation to the data, rather than the data to the computation". But this idea can be taken an important step further, enabling the secure joint analysis of data from several studies, even when some of those studies are unable to share raw data. This is crucial because conventional approaches to joint analysis cannot optimise the efficiency and flexibility of the statistical analysis whilst simultaneously ensuring that all relevant ethico-legal and governance stipulations are met in full (Box 2). DataSHIELD provides a novel solution to this challenge (36).
Under DataSHIELD (Figure 1 and Box 2), full joint analysis is achieved via simultaneous parallelized analysis of the individual-level data at each study. The approach is iterative and, at each iteration, the separate parallel analyses are linked by exchanging summary statistics with the analysis centre. These statistics carry no sensitive information and are non-identifying; in these regards they are equivalent to the study-level results that are shared freely under SLMA (study-level meta-analysis; see Box 2). Furthermore, although the analysis is mathematically equivalent to ILMA (individual-level meta-analysis; see Box 2), the participant-level data never leave their study of origin and remain invisible to the analysing statistician. Given appropriate ethico-legal consent, therefore, use of DataSHIELD might arguably be permissible even if the collaborating studies are prohibited from physically sharing data (29).

Box 1. Unique IDs for researchers and bio-resources.

Researcher IDs

• ORCID
− The Open Researcher and Contributor ID initiative (53,54) is constructing a global registry of unique, permanent and institutionally verifiable IDs for authors of scholarly publications.
− ORCID will enable reliable disambiguation of one author from another, plus new knowledge-discovery capabilities mediated via searching across these unambiguous IDs.

• ORCID (extended)
− Extension of the ORCID concept into the online world of databases and data sharing could meet the goal of appropriately recognising (and ultimately rewarding) the intellectual and other inputs of researchers to the construction, maintenance and use of all aspects of the global data and information infrastructure.
− Unambiguous identification of individual researchers and science contributors could provide the foundation of a rapid, IT-mediated access mechanism (34) for data of low or moderate disclosure risk that cannot be posted freely on the web but would ideally be available over the web with light-touch oversight control, a data access approach termed "speed pass". Given broad-based acceptance by the research community, a willingness of institutions to be responsible for their own bona fide scientists, and recognition that proscribed misuse of data or samples might lead to loss of access rights, such a system would, for example, provide an ideal response to the limited risk of identification posed by the methods described by Homer et al. (35).

Bioresource IDs

• BRIF
− The Bioresource Research Impact Factor (55,56) has been proposed as an indicator of the use of all bio-resources (biobanks, cohorts, reference collections and databases).
− Each bio-resource should have its own internationally unique, persistent and recognised ID, as a necessary step towards automatically tracing its use.
− Standardisation of citation using this ID is required.
− It would facilitate tracking of the contribution of individuals to bio-resources, and of bio-resources to bioscience and to the bioscience infrastructure as a whole.
− An international working group (18) is currently addressing the various dimensions of such a tool.

Box 2. Two conventional approaches to jointly analysing (meta-analysing) multiple studies.

• Two conventional approaches to joint analysis
(i) Study-level meta-analysis (SLMA): investigators at each study analyse their own data; they return results to a central analysis centre (AC); the results are meta-analysed at the AC.
(ii) Individual-level meta-analysis (ILMA): individual-level data (de-identified) are physically transferred from each study to the AC; the data from all studies are analysed together.

• Choice of approach from the perspective of the science and statistical analysis
− SLMA works if the analysis can be completely pre-planned and if it is straightforward to specify and obtain the study-level results that are required.
− ILMA is greatly to be preferred if any exploratory analysis is required. Every unplanned analysis under SLMA causes serious delay, as each group of study investigators must re-analyse their own data and return the new results to the AC.

• Choice of approach from the ethico-legal perspective
− Individual-level data cannot physically be transferred if governance materials (consent forms, information leaflets, conditions applied by an ethics committee) prohibit it.
− Even when the transfer of individual-level data is permitted, it is likely to require a lengthy access process involving scientific oversight and ethics committees. The pace of progress in contemporary bioscience is such that research groups fear losing out to competitors.

• How should we move forward?
− An approach is needed that allows timely meta-analysis of individual-level data but avoids the need for data to be physically transferred, or even visible, outside of the original study in which they were collected. DataSHIELD (36) is such an approach.
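The flavour of this summary-statistics exchange can be conveyed with a toy example (this is a conceptual sketch, not DataSHIELD's actual implementation, which runs on R and OPAL). For a linear regression, each study transmits only the aggregate matrices XᵀX and Xᵀy, never individual records, yet the analysis centre recovers exactly the coefficients that a pooled individual-level analysis would give; generalised linear models simply need one such exchange per iteration of the fitting algorithm.

```python
import numpy as np


def study_summary(X: np.ndarray, y: np.ndarray):
    """Run inside each study: return only non-disclosive aggregates."""
    return X.T @ X, X.T @ y


def pooled_coefficients(summaries):
    """Run at the analysis centre: sum the aggregates and solve the
    pooled least-squares normal equations."""
    XtX = sum(s[0] for s in summaries)
    Xty = sum(s[1] for s in summaries)
    return np.linalg.solve(XtX, Xty)


# Three simulated 'studies' whose raw data never leave their site.
rng = np.random.default_rng(0)
studies = []
for n in (120, 200, 80):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.1, size=n)
    studies.append((X, y))

beta = pooled_coefficients([study_summary(X, y) for X, y in studies])
```

The coefficients obtained this way match a regression on the physically pooled data to machine precision, which is the sense in which such an analysis is mathematically equivalent to ILMA while sharing only SLMA-like summary quantities.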

MAINTAINING INTEGRITY
Scientific, technical and ethico-legal mechanisms can only facilitate data sharing where there can be an assurance of their integrity and of the outcome of the data sharing. In turn, it is the ultimate production of publicly and politically acceptable translational outcomes of scientific knowledge that provides 'definitive' evidence of scientific integrity. In the context of data sharing, trust is fundamental to maintaining integrity. Trust functions at a number of levels: between participants and scientists in the collection of data; between scientists (from all active disciplines) in the production of knowledge; and between stakeholders (scientific, political, commercial and public) in the application of the outputs of the knowledge generated. The acceptability of science and its products is effectively driven by trust (37,38). Development of trust requires active engagement, and such engagement must be fit for purpose and tailored to its members. Engaging the public, participants, scientists and other stakeholders serves at least two purposes: maintaining public and participant trust in the science and the scientific process; and contributing public and stakeholder views and perspectives to the development of that science. Arguably, therefore, a key function for engagement is to ensure attention to the translational aims and social impact of scientific knowledge. Engagement is, therefore, a tool for translation, i.e. 'T1' in translational science terminology (16).
Translation of scientific knowledge into societal impact (health and health-service improvement) requires the development of tools and mechanisms for the strategic engagement of stakeholders. Hybrid forums, that is, discussions incorporating transdisciplinary, multisector representation (39), based on existing international collaborations, have the potential to transcend disciplinary and science/professional boundaries and barriers, thereby fostering communication and trust. International transdisciplinary groups of natural, social and humanities scientists are already established (e.g. the P3G Observatory (19), BBMRI (20)). Funded appropriately, these groups could form the basis of extended forums, including political, policy, commercial and professional stakeholders, for integrated strategic discussion about developments in the science, its translation and its social impact. Further, issues of trust, central to the effective production of scientific knowledge, must be acknowledged. Given the increased specialization, collaboration and teamwork within science, scientific activity today necessitates that scientists assess and trust the integrity of their colleagues, whether that activity is data collection, data processing, experimentation, interpretation of results or peer review. The trustworthiness of other members of the scientific community is a central foundation of scientific knowledge generation: what sociologists and historians of science describe as the epistemic role of trust (40-43). While trust can enhance and aid cooperation, interaction and sharing, a lack of trust can hamper not only the production of knowledge but also its effective exchange and sharing. Thus, enhancing effective data sharing in the biomedical sciences will need to take into consideration the social and practical processes which impact upon trust between scientists. This requires social science research to identify, describe and reflect upon those barriers and their impact on knowledge production.
Engaging the public and research participants arguably requires different methods (44,45). Involving members of the public in conventional governance and organisational meetings is the most common mode of engagement. But this engagement can be tokenistic, may include only the 'usual suspects' or those with known views, and may not therefore result in the incorporation of the valuable insights that may be derived from public perspectives (44,46). Public meetings and consultations specifically addressing issues in genomics may predominantly attract those motivated by extreme views (47). Resistance to the disempowering characteristics of conventional engagement in these settings can lead to aetiolated outcomes (48). Ironically, these approaches risk undermining rather than maintaining public trust. A strategy for genuinely engaging the public must be multifaceted: it must comprise individuals as well as communities; be purposive as well as using evolutionary mechanisms for engagement; take advantage of new Web 2.0 social media; and target existing forums with broad appeal. However, engaging motivated individuals and groups is not the same as gaining a genuine insight into public perceptions. If we really want to understand public views of specific issues in genomics and biobanking, for example privacy, we need to undertake well-conducted, appropriately designed research to do so (cf. 49,50). In other words, understanding public views and perceptions requires robust, theoretically informed and adequately resourced social science research. Only in this way can we properly inform the development of socially appropriate and acceptable scientific knowledge generation.

"WHAT'S PAST IS PROLOGUE"
(The Tempest, 2.1) Bioscience has recently undergone a series of knowledge-based and technological revolutions. A critical consequence has been increasing recognition of the need to invest in infrastructure. Good access to data (and samples) from multiple studies is axiomatic to the value of this infrastructure. Access must be streamlined, secure, and based upon transparent and "fair" decision making (8). It must be clear who has created and who has used which data (10). Ethico-legal policies and guidelines, which already reflect local cultural and societal norms, must take account of the increasingly global nature of bioscience research (32). A robust data infrastructure must also be attentive to the translational aims and social impact of its knowledge generation (16). In order to maintain the trust of its constituency, the general public as well as professional, political and commercial stakeholders, it must develop mechanisms to take account of all of these perspectives.
But this is no tabula rasa. Despite its obvious benefits, and regardless of the approach used, shared data analysis must conform to long-standing principles: for example, an analysis is simply not valid unless the studies to be combined are harmonized (7); likewise, harmonized data sets will be useless if data from one study cannot be shared beyond national borders because data governance requirements and policies do not allow it. Building on extant achievements and pursuing the ideas outlined here could revolutionise the way we use and manage large-scale data. They have critical implications for biomedical and public health research communities and will be of central relevance for healthcare managers and policy makers, governments and industry. However, if the major challenges are to be met, we must continue to invest, both nationally and internationally, in developing the cooperative infrastructures that provide a complementary foil to the competitive resourcing mechanisms that drive hypothesis-driven science.

Figure 1. Schematic overview of the IT architecture for DataSHIELD as applied to six studies. The Analysis Computer (dark shading) runs R (51). The Data Computers (light diagonal stripes) run OPAL (52) and R. Each Data Computer is linked to the Analysis Computer over the internet via firewalls.