By Jonas Waldenström
This post was intended as a comment on a post by Terry McGlynn at Small Pond Science, but once I started writing, it soon swelled and I turned it into a blog post instead. I suggest you start by reading the original post here. The short version is: the leading scientific publisher PLOS has taken the open access movement one step further, from only making studies freely available to also making it mandatory to include the original data used to draw the conclusions of a paper. It seems like a good thing: too many studies can’t be replicated, and much data is lost when people leave science or is deposited in ways that don’t stand the test of time. However, it is also problematic, as good quality data is painstakingly hard to gather and could be viewed as a currency of its own.
I have published quite frequently with PLOS journals, and in particular with PLOS ONE. In fact, over the last couple of years I have authored or co-authored 15 papers in PLOS ONE and two in PLOS Pathogens. My experiences so far have been very positive: good review processes, beautiful final prints, and, because of the absence of paywalls, a very good spread among peers. I regularly check the altmetrics of the articles and it is exciting to see how many times they are viewed, downloaded, and cited. I have been very pro-PLOS, even in times when many ecologist peers didn’t consider PLOS ONE a venue for publication. However, all the good things about PLOS considered, the new policy launched a little while ago has made me a bit more reluctant to submit future work to the journal.
So what has changed? Is it a revolution, or ‘same old, same old’? The answer is that no one knows for sure (at least I don’t). The short version of the policy was published as an editorial in PLOS Biology, and although it states that the new data policy will give ‘more bang for the buck’ and ‘foster scientific progress’, it isn’t overly clear what this means in practice for the researcher about to submit a paper:
PLOS defines the “minimal dataset” to consist of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. Core descriptive data, methods, and study results should be included within the main paper, regardless of data deposition. PLOS does not accept references to “data not shown”. Authors who have datasets too large for sharing via repositories or uploaded files should contact the relevant journal for advice.
In many cases it wouldn’t make a huge difference. There are already options to upload supporting data as appendixes, and repositories like Figshare, GenBank and Dryad are already out there. As an example, in one paper published in PLOS Pathogens we had 20 supplementary files, including details on statistical analyses and plenty of extra tables and figures. And for another publication in Molecular Ecology, the full alignments of the genes analyzed were uploaded (per the journal instructions) to Dryad so that others can replicate the analysis if need be. But for much of my current work on long-term pathogen dynamics in waterfowl it wouldn’t feel good to upload all the raw data. The question is really what a minimal dataset is. And, importantly, which data you are allowed to leave out of it.
A FAQ from PLOS has been published where this is addressed, but as of yet it remains to be seen how this will work in practice:
The policy applies to the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods and any additional data required to replicate the reported study findings in their entirety. You need not submit your entire dataset, or all raw data collected during an investigation, but you must provide the portion that is relevant to the specific study.
So why am I a bit reluctant? Let me give you some background. The study system I run was started 12 years ago by professor Björn Olsen, and I took over the running of it 5-6 years ago. Over the years we have published quite a few papers on avian influenza in this migratory Mallard population, but it is only now, when the time series is long enough, that we can do more advanced studies on the effects of immunity on disease dynamics, long-term subtype dynamics, and influenza A virus evolution. Big stuff, based on the same large datasets but analyzed in different ways. If publishing one paper now means we have to submit 12 years of original data, i.e. the ringing data and disease data of 22,000 mallards, then it comes with a potential cost. I see the dataset as a work in progress, a living entity that is accumulating new data as we go along and where analyses are planned both for right now and for the distant future. In the cow analogy of Terry McGlynn, the dataset is a herd with a balanced age structure: some cattle destined for the pot already today, some fattening for slaughter, and yet others to grow into the breeding stock.
The unique longevity of the time series has led our research into many fruitful collaborations. For a few years now we have worked with capture-mark-recapture researchers in France on epidemiological models, to name just one example. I have also turned down invitations to collaborate, although much more rarely. In those instances it has been because we had planned to do these analyses ourselves, or because the time wasn’t right to do so. And just to make this clear: the cases of not sending data were not refusals to send background data for replicating a paper; rather, they were requests to do new stuff with it. If you post your raw data in close to its entirety, such situations could become cumbersome, and you run the risk of seeing your data analyzed by someone else.
The problem is much smaller for genetic data. After all, it is conventional in all fields to submit your sequences to GenBank along with your submission, and you know that it works. I have seen ‘our’ sequences in many phylogenetic trees without having been asked about the usage in advance. But it is one thing to have your data used as a brick in a new construction, and another to have someone take over your house and to have to give away the key. Many people say that the risks of getting scooped on your own data are exaggerated, and this is likely true. Scientists are usually decent people, after all. But knowing there is such a risk, albeit small, can influence where you choose to publish, or make you delay publication of smaller papers until all the big papers from a dataset have been published (which may become a problem for graduate student theses).
It is essential that we very soon get to know what a minimal dataset is. For example, would it be OK to submit the raw data on Mallard infection histories without the unique ring number, exchanging the individual identifier for an arbitrary number? Or to exclude data such as actual date, morphology or indexes of condition? That way, an analysis of the prevalence of a disease in a population (which is a very simple exercise) doesn’t immediately give another researcher the possibility to investigate the effect of infection on a bird’s condition, the effect of immunity, or migration dynamics, to name a few options. Would PLOS allow that? I don’t know.
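To make the idea concrete, here is a minimal sketch in Python of how such a stripped-down file could be produced. Everything in it is an assumption for illustration: the function name, the field names (`ring_number`, `date`, `body_mass_g`, `infected`) and the three example rows are made up, and it says nothing about our actual data format or about what PLOS would accept.

```python
import random


def make_minimal_dataset(records, id_field="ring_number",
                         drop_fields=("date", "body_mass_g"), seed=None):
    """Return copies of the records with the ring number replaced by an
    arbitrary bird ID and the listed fields left out.

    All field names are hypothetical placeholders.
    """
    # Shuffle the unique ring numbers so the new IDs carry no
    # information about the original numbering.
    rings = sorted({rec[id_field] for rec in records})
    random.Random(seed).shuffle(rings)
    arbitrary_id = {ring: i for i, ring in enumerate(rings, start=1)}

    minimal = []
    for rec in records:
        # Keep everything except the identifier and the excluded fields.
        row = {key: value for key, value in rec.items()
               if key != id_field and key not in drop_fields}
        # The same bird always gets the same arbitrary ID, so its
        # infection history can still be followed across captures.
        row["bird_id"] = arbitrary_id[rec[id_field]]
        minimal.append(row)
    return minimal


# Made-up example rows, not real data.
records = [
    {"ring_number": "A001", "date": "2010-09-14", "body_mass_g": 1050, "infected": True},
    {"ring_number": "A002", "date": "2010-09-14", "body_mass_g": 980, "infected": False},
    {"ring_number": "A001", "date": "2010-10-02", "body_mass_g": 1100, "infected": False},
]
minimal = make_minimal_dataset(records, seed=1)
```

The rows keep their original order, so a simple prevalence calculation still works on the output, while the ring number, date and condition measures needed for the other analyses are gone.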
An issue raised by Terry McGlynn was the difference between small labs and resource-intensive research labs. In a small lab, the research takes longer due to limited resources and a smaller staff, and a good dataset is extremely precious and could also act as a currency, enabling co-authorships through collaborations. I would like to end with an additional story: a story of long-term data collected by volunteers at Ottenby Bird Observatory.
Ottenby Bird Observatory was founded in 1946 and has run without interruption since then. Each year volunteers help out with the trapping and banding of birds, mostly passerines, but also a good number of waders, ducks and birds of prey. The number of birds caught each year is between 15,000 and 20,000, and in total more than 1 million birds have been banded. This dataset, together with all morphometrics collected in connection with the banding and all band recoveries, is a unique and extremely valuable data series. The problem is that few pay for it. The observatory receives a small fund from the Swedish Environmental Board, but not enough to cover the costs. Additional money comes from tourists, from subsidies from the Swedish Ornithological Society and, although the sums are not substantial, from researchers who pay for the service of collecting data or for getting data from the trapping series.
The observatory really provides a service. To date at least 278 peer-reviewed articles have been published with data emanating from Ottenby, including two papers in Science and >10 in PLOS journals. The unbroken trapping series has proven to be one of the few datasets where the time scale is sufficiently long to investigate effects of climate change on biological phenologies, measured as the timing of migration of common passerine birds. Researchers who want to use the data put in a request to the observatory, and a sort of contract is settled between the parties. Usually the money is little, but even small sums are essential for a volunteer-based operation. What happens when all data becomes immediately available to everyone without restrictions? A question to ponder, really.
In many ways, PLOS has revolutionized scholarly publishing, and the open access movement has made research results quickly available to the masses. I sincerely hope that the new data policy does not inadvertently work the opposite way, by making researchers less inclined to submit their studies to PLOS journals. It is still too early to tell, but I think many, like me, really wonder what the ‘minimal dataset’ means in practice.