Dealing with data

July 13, 2012 Theo Bloom Data Editorial policy PLOS Biology Publishing Research

PLoS Biology aspires to drive openness in data use and re-use, not just open access to the literature. We are pleased to announce a partnership with Dryad that helps our authors make their data re-useable and accessible by providing a home for data linked to publications.

Image: Wikimedia data (source: Wikimedia, attributed to Victorgrigas)

It is not news to say that the current handling of research data within the life sciences is rather chaotic. Most funding agencies and research institutions want to ensure that the data generated by the researchers they fund is made easily available both to the research community, for replication and re-use, and to all other interested parties, so as to facilitate the advancement of science and technology. These organisations look to journals that publish original research to help make data available. We might imagine that, in an idealised future world, scientists will collect all types of data in an organised and structured way that links methodology, date and time of sampling, and all other relevant details to the data (such that all electronic data automatically has complete and well-structured metadata). We might also hope that researchers will want to share their research outputs with all interested parties, subject only to adequate attribution and certain limitations on specific types of data (for example, data about human patients whose privacy should be protected, and data on endangered species whose locations need similar safeguards). But we are not very close to this utopia right now. (As an aside to those purists who believe data ‘are’ plural, I refer you to a post on The Guardian’s blog that can be summarised in this quote: “It’s like agenda, a Latin plural that is now almost universally used as a singular. Technically the singular is datum/agendum, but we feel it sounds increasingly hyper-correct, old-fashioned and pompous to say ‘the data are’.”)

Even when data is shared – which is not always – where it should be housed remains a matter for discussion. In many cases, the most useful place for publicly available data is in well-structured, well-curated and long-term-sustainable databases. Such databases are often specific to a field, so that they can meet that field’s particular requirements for database structure and curation. Well-known examples include GenBank for gene sequences, the Protein Data Bank for protein structure and ArrayExpress for microarray data. But some kinds of data are insufficiently common to have driven the development of their own specific databases, or a particular field may not yet have agreed on the standards and formats required for database development and adoption. Some institutions and funders are therefore stepping in to provide suitable long-term and open access homes for such ‘orphan’ data (for example via OpenAIREplus in the European Union).

The publishers of research output – journals – are uniquely well-placed to help researchers ensure that all data underlying a study are made available alongside any published articles. Journals can also help to meet the needs that researchers and funders have in making data available in useful formats, while ensuring that appropriate credit accrues to the people who generate the data. But many people have noted that ‘supplementary files’ provided with an article are not the best way to share data either. Current publisher practices (including PLoS’s own) tend to exacerbate problems around data re-use by requiring standardised figure formats, providing ‘flat’ file types such as PDFs, and discouraging the provision of raw data while simultaneously encouraging the ‘dumping’ of various combinations of data and text into ill-structured ‘supplementary files’.

An ideal that most funders of research and some authors aspire to is to allow both re-use and total replicability: anyone who wants to can both re-use the data for any future purpose and (re)analyse it to get exactly the same results as the authors. A key barrier to achieving this ideal is that a significant number of ‘data generators’ do not want to share, for one reason or another. For those who believe that – having spent a long time collecting data – they own it and don’t want to lose their competitive edge by sharing it, we publishers can work with the funders of research to enforce sharing (just as we do for deposition of sequencing data in GenBank, protein structure data in PDB, and so on). But I believe the majority of researchers would share data if they felt it brought them the same kinds of benefits as sharing their ideas through publication does. Credit in the form of publications is used to assess researchers for funding, promotion and tenure, and researchers need similar incentives and credits to encourage them to share data. Some editors and journals are trying to fit data into the format of a published article, by launching data journals or contemplating data article types, so as to use the established systems for citing research articles as a way to accrue credit for data sharing. But it seems we need to come up with better ways to give appropriate credit to those who generate and share data – methods that allow small and large data packets to be cited, and that allow citation of specific ‘bits’ of a dataset.

A step towards resolving the data problem: PLoS Biology partners with Dryad

In efforts to address some of the issues highlighted here with the sharing of data, PLoS has begun working with several partners on a number of fronts to improve systems for citing and crediting data sharing. One partnership we are pleased to announce formally today is with Dryad (www.datadryad.org), an open access repository of data underlying peer-reviewed articles that is being developed by the National Evolutionary Synthesis Center and the University of North Carolina Metadata Research Center, in coordination with a large group of Journals and Societies. From PLoS Biology’s perspective, Dryad provides a good answer for authors who don’t know where or how to store their data: it takes data ‘packages’ associated with published articles and makes them freely available. Dryad provides a unique identifier (DOI) for each data package, and allows authors to upload subsequent versions of their data (clearly indicated), as well as providing download statistics for each data package. By having a close partnership with Dryad, PLoS Biology can offer authors a seamless tying together of an article with its underlying data; we can also provide confidential access for editors and reviewers to data associated with articles under review (see Depositing data to Dryad guidelines). PLoS Biology is the first of the PLoS journals to have these close links to Dryad in place, but we plan to roll out this partnership to the other PLoS journals in the near future, and indeed all authors can submit data directly to Dryad: at the time of writing there are 36 articles in the PLoS corpus that have associated data in Dryad.

Please ‘watch this space’ as we announce further plans and partnerships in the data arena, and please do let us know what you think are the most pressing issues, by emailing us or starting a discussion here.

COI declaration: Theo Bloom is a member of the advisory boards of OpenAIREplus and Dryad.