PLOS Data Policy: Update

May 30, 2014, by Theo Bloom and Jennifer Lin

Two months after the implementation of the PLOS journals’ data policy, what have we learned from our authors, reviewers, editors, correspondents, and commenters in the blogosphere?

More than 16,000 manuscripts have been submitted with a data availability statement

In order to optimise the re-use of data by readers and by data miners, authors of all new manuscripts submitted since March 3, 2014 have included a statement about where the data underlying their description of research can be found. At the time of writing, more than 16,000 sets of authors have included information about data availability with their submission. We have had fewer than 10 enquiries per week to [email protected] from authors who need advice about ‘edge cases’ of data handling and availability – fewer than 1% of authors – and these cases have helped us to further update our FAQ, contributing to a decline in such enquiries over time. We would like to say a huge thank you to all the authors and editors who have worked with us in this period to iron out wrinkles in our submission processes and helped us make it as easy as possible to capture information about data availability. Special mention is due to the pioneering subset of authors who have already published articles in PLOS journals with a Data Availability Statement, whether providing all data within the manuscript and its supplementary files or providing links to a domain-specific repository, a non-specific repository or more diverse resources (see Image).

Some groups of authors still have concerns about data sharing

From our period of increasing public consultation about the PLOS data policy, we knew that at launch we would encounter authors with specific issues in two main areas: firstly, big datasets that are too large for hosting in most repositories (although some, such as the journal GigaScience, target precisely this domain), and secondly, patient confidentiality and the associated need to have oversight committees for instances in which access should be restricted to appropriate individuals. Since the launch we have heard these issues raised again, but we have also heard three main arguments, used by bloggers and others to justify not sharing the data underlying research articles, that we find hard to agree with. They can be summarised as follows.

It’s mine, I collected it. Most funders and institutions have moved away from this idea, but it persists in the minds of many researchers.

It’s complicated and unique: no-one else could understand it properly. This “data as a unique snowflake” argument supposes that no-one other than those who collected the data can understand it enough to re-use it. Taken to an extreme, this argument would tend to suggest that there is no point in peer-reviewing or publishing research at all. We would rather work with those (e.g. BioSharing) who are working to develop standards and approaches to describing data so that it can indeed be used by others.

I’d like to share, but my lab is little and/or under-funded and/or in a lower-income country, and once I share the big guys can jump on the data and do cool things with it before I can do them myself. We of course have sympathy with this perspective, and PLOS journals work specifically with authors in less-developed countries to help them publish their work. However, as noted by panellist Joe DeRisi in a discussion of data sharing at UCSF earlier this month, it would be perverse to suggest that we delay, for example, progress in malaria research in order to allow researchers in the most-affected countries to contribute optimally.

There is one additional argument that has been made, and we acknowledge this one reflects a genuine concern, namely that it takes work to make data sharing-ready. Previously we required all PLOS authors to share data “on request”, but some bloggers have noted that no-one ever requested much of their data, or that when it was requested they in fact refused, whereas now all data should be made ready for sharing, whether ultimately needed or not. We agree that this does require work; however, it takes less work to prepare the data at the same time as the publication than was previously required when trying to dig into archives to find material some time later (humorously summarised in a video cartoon). And we would note that funders increasingly require, as a condition of a grant, that a data-management plan be included, and that this is driving researchers and their institutions to have good systems in place that will meet the criteria of the PLOS policy.

We focused on where, when and how to share, but many are still concerned about what to share

The new PLOS data policy refers to sharing the data underlying a publication, just as our previous policy did. The new part of the policy is to ask for sharing ‘up front’, at the time a manuscript is submitted, rather than subsequently and on request. But an awful lot of the responses to the policy have focussed on the issue of which datasets need to be included. Something that was not quite as apparent from our prior consultation as it is now is that researchers in many fields don’t know which data to archive and share and which should be considered ‘disposable’ moments en route to data worth preserving. Although funders such as NIH and community organisations such as MIBBI try to outline the requirements either generally or specifically, it seems it will be a Sisyphean task to provide detailed guidance for every type of experiment in every domain within science. We are therefore currently considering the extent to which we at PLOS can or should aim to provide this type of guidance, and would welcome your input on this issue.

There can be real difficulties about ‘limited sharing’, whether during peer review or after publication. Most repositories, whether subject-specific or general, institutional or international, are set up to allow full open access. But there are two main circumstances in which more limited sharing is appropriate. The first is during peer review, when editors and reviewers need access to the data but the authors may not want it to be public; we are aware of only a few databases (e.g. Dryad) that routinely provide this facility. The second circumstance is when datasets contain sensitive information – whether about patients or, for example, endangered species’ locations – such that it may be appropriate to share only a subset of the information more widely, and/or to share only with appropriately screened individuals. Several databases and repositories have plans to allow more limited access (e.g. Dataverse, figshare), which should help address concerns in this area, but this is a work in progress. For now, it remains a major challenge for clinical studies to both provide controlled access to the data and preserve patient confidentiality.

There is plenty still to do

The 2014 PLOS data policy deliberately set out to take just one step towards improved integration between the published literature and the data underlying it, by asking authors to say where their data can be found (that somewhere being not on their own hard drives). We know that much more will be needed before we are dealing with data satisfactorily. For a small minority of commenters, we did not go far enough. For many more, we leave too many open questions, and we agree there are many, most of which are not unique to PLOS and need further community discussion. Among the most pressing, from our perspective:

When should an author choose Supplementary Files vs. a repository vs. figures and tables?
Should software/code be treated any differently from ‘data’? How should materials-sharing differ?
What does peer review of data mean, and should reviewers and editors be paying more attention to data than they did previously, now that they can do so?
And getting at the reason why we encourage data sharing: how much data, metadata, and explanation is necessary for replication?
A crucial issue that is much wider than PLOS is how to cite data and give academic credit for data reuse, to encourage researchers to make data sharing part of their everyday routine.
And for long-term preservation, we must ask: who funds the costs of data sharing? What file formats should be acceptable, and what will happen in the future with data in obsolete file formats? Is there likely to be universal agreement on how long researchers should store data, given the different current requirements of institutions and funders?

As we continue to work on these issues and others, we would once again welcome your feedback and input, here on the blog, via individual journals, or at [email protected].

Theo Bloom and Jennifer Lin, for the PLOS Data Policy group.