By Jonas Waldenström
This post was intended as a comment on a post by Terry McGlynn at Small Pond Science, but once I started writing it soon swelled, and I turned it into a blog post instead. I suggest you start by reading the original post here. The short version is: the leading scientific publisher PLOS has taken the open access movement one step further, from only making the studies freely available to also making it mandatory to include the original data used to draw the conclusions of a paper. It seems like a good thing – too many studies can’t be replicated, and much data is lost when people leave science, or is deposited in ways that don’t stand the test of time. However, it is also problematic, as good quality data is painstakingly hard to gather and could be viewed as a currency of its own.
I have published quite frequently with PLOS journals, and in particular with PLOS ONE. In fact, over the last couple of years I have authored/co-authored 15 papers in PLOS ONE and two in PLOS Pathogens. My experiences so far have been very positive: good review processes, beautiful final prints, and, because of the absence of paywalls, a very good spread among peers. I regularly check the altmetrics of the articles and it is exciting to see how many times they are viewed, downloaded, and cited. I have been very pro-PLOS, even in times when many ecologist peers didn’t consider PLOS ONE as a venue for publication. However, all the good things with PLOS considered, the new policy launched a little while ago has made me a bit more reluctant to submit future work to the journal.
So what has changed? Is it a revolution, or ‘same old, same old’? The answer is that no one knows for sure (at least I don’t). The short version of the policy was published as an editorial in PLOS Biology, and although it states that the new data policy will give ‘more bang for the buck’ and ‘foster scientific progress’, it wasn’t overly clear what it means in practice for the researcher about to submit a paper:
PLOS defines the “minimal dataset” to consist of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. Core descriptive data, methods, and study results should be included within the main paper, regardless of data deposition. PLOS does not accept references to “data not shown”. Authors who have datasets too large for sharing via repositories or uploaded files should contact the relevant journal for advice.
In many cases it wouldn’t make a huge difference. There are already options to upload supporting data as appendixes, and repositories like Figshare, GenBank and Dryad are already out there. As an example, in one paper published in PLOS Pathogens we had 20 supplementary files, including details on statistical analyses and plenty of extra tables and figures. And for another publication in Molecular Ecology, the full alignments of the genes analyzed were uploaded (per the journal instructions) to Dryad to make it easier for others to replicate the analysis if need be. But for much of my current work on long-term pathogen dynamics in waterfowl it wouldn’t feel good to upload all the raw data. The question is really what a minimal dataset is. And, importantly, what data you don’t have to include in the dataset.
A FAQ from PLOS has been published where this is addressed, but as of yet it remains to be seen how this is done in practice:
The policy applies to the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods and any additional data required to replicate the reported study findings in their entirety. You need not submit your entire dataset, or all raw data collected during an investigation, but you must provide the portion that is relevant to the specific study.
So why am I a bit reluctant? Let me give you some background. The study system I run was started 12 years ago by professor Björn Olsen, and I took over the running of it 5-6 years ago. Over the years we have published quite a few papers on avian influenza in this migratory Mallard population, but it is only now, when the time series is long enough, that we can do more advanced studies on the effects of immunity on disease dynamics, long-term subtype dynamics, and influenza A virus evolution. Big stuff, based on the same large datasets but analyzed in different ways. If publishing one paper now means we have to submit 12 years of original data, i.e. the ringing data and disease data of 22,000 mallards, then it comes with a potential cost. I see the dataset as a work in progress, a living entity that is accumulating new data as we go along and where analyses are planned both for right now and for the distant future. In the cow analogy of Terry McGlynn, the dataset is a herd with a balanced age structure: some cattle destined for the pot already today, some fattening for slaughter, and yet others to grow into the breeding stock.
The unique longevity of the time series has led our research into many fruitful collaborations. For a few years we have worked with capture-mark-recapture researchers in France on epidemiological models, just to name one aspect. I have also turned down invitations to collaborate, although much more rarely. In those instances it has been because we had planned to do these analyses ourselves, or because the time wasn’t right to do so. And just to make this clear: the cases of not sending data were not refusals to send background data for replicating a paper; rather, they were requests to do new stuff with it. If you post your raw data in close to its entirety, such situations could be cumbersome, and you run the risk of seeing your data being analyzed by someone else.
The problem is much smaller for genetic data. After all, it is conventional in all fields to submit your sequences to GenBank along with your submission, and you know that it works. I have seen ‘our’ sequences in many phylogenetic trees without having been asked about the usage in advance. But it is one thing to have your data used as a brick in a new construction, and another to have someone take over your house while you have to give away the key. Many people say that the risks of getting scooped on your own data are exaggerated, and this is likely true. Scientists are usually decent people, after all. But knowing there is such a risk, albeit small, can affect where you choose to publish, or lead you to delay publication of smaller papers until all the big papers from a dataset have been published (which may become a problem for graduate student theses).
It is essential that we very soon get to know what a minimal dataset is. For example, would it be OK to submit the raw data on Mallard infection histories without the unique ring number, exchanging the individual identifier for an arbitrary number? Or to exclude data such as actual date, morphology or indexes of condition? That way, an analysis of the prevalence of a disease in a population (which is a very simple exercise) doesn’t immediately make it possible for another researcher to investigate the effect of infection on a bird’s condition, the effect of immunity, or migration dynamics, to name a few options. Would PLOS allow that? I don’t know.
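To make the idea concrete, here is a minimal sketch in Python of what such a stripped-down table could look like. Everything in it is an assumption made for illustration: the field names (`ring_number`, `date`, `condition_index`, `infected`), the example records, and the helper `deidentify` are hypothetical and do not describe the actual Mallard database, nor anything PLOS has said it would accept.

```python
from itertools import count

# Hypothetical example records; the field names and values are illustrative only.
records = [
    {"ring_number": "90A12345", "date": "2010-09-01", "infected": True, "condition_index": 1.2},
    {"ring_number": "90A67890", "date": "2010-09-01", "infected": False, "condition_index": 0.9},
    {"ring_number": "90A12345", "date": "2010-10-15", "infected": False, "condition_index": 1.1},
]


def deidentify(rows, id_field="ring_number", drop_fields=("date", "condition_index")):
    """Swap each ring number for an arbitrary integer and drop the listed fields.

    Returns the shareable rows plus the private key (ring number -> arbitrary id),
    which stays with the data owner.
    """
    next_id = count(1)
    key = {}
    shared = []
    for row in rows:
        ring = row[id_field]
        if ring not in key:
            # First time this bird is seen: hand out the next arbitrary number.
            key[ring] = next(next_id)
        kept = {k: v for k, v in row.items() if k != id_field and k not in drop_fields}
        kept["bird_id"] = key[ring]
        shared.append(kept)
    return shared, key


shared, key = deidentify(records)
```

Repeated captures of the same bird keep the same arbitrary `bird_id`, so infection histories can still be followed, while the ring number, date and condition index never leave the lab. Note that numbers handed out in order of first capture still reveal that order; shuffling them would be a simple refinement.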
An issue raised by Terry McGlynn was the difference between a small lab and resource-intense research labs. In a small lab, the research takes longer due to limited resources and smaller staff, and a good dataset is extremely precious and could also act as a currency, enabling co-authorships through collaborations. I would like to end with an additional story. A story of long-term data collected by volunteers at Ottenby Bird Observatory.
Ottenby Bird Observatory was founded in 1946 and has run without interruption since then. Each year volunteers help out with the trapping and banding of birds, mostly passerines, but also a chunk of waders, ducks and birds of prey. The number of birds caught each year is between 15,000 and 20,000, and in total more than 1 million birds have been banded. This dataset, together with all morphometrics collected in connection with the banding and all band recoveries, is a unique and extremely valuable data series. The problem is that few pay for it. The observatory receives a small fund from the Swedish Environmental Board, but not enough to cover the costs. Additional money comes from tourists and subsidies from the Swedish Ornithological Society. And, although not substantial, from researchers who pay for the service of collecting data or getting data from the trapping series.
The observatory really provides a service. To date at least 278 peer-reviewed articles have been published with data emanating from Ottenby, including two papers in Science and >10 in PLOS journals. The unbroken trapping series has proven to be one of few datasets where the time scale is sufficiently long to investigate effects of climate change on biological phenologies, measured as timing of migration of common passerine birds. Researchers who want to use the data put in a request to the observatory, and a sort of contract is settled between the parties. Usually the money is little, but even small sums are essential for a volunteer-based operation. What happens when all data becomes immediately available for everyone without restrictions? A question to ponder, really.
In many ways, PLOS has revolutionized scholarly publishing, and the open access movement has made research results quickly available to the masses. I sincerely hope that the new data policy does not inadvertently work the opposite way, by making researchers less inclined to submit their studies to PLOS journals. It is still too early to tell, but I think many, like me, really wonder what the ‘minimal dataset’ means in practice.
So the problem is to establish a system where the collecting of data is appropriately paid for?
Or do you just want to use it as an “advantage” over other researchers wrt. publishing future
papers based on the dataset?
As an alternative “reward” wrt. funding, career, so to speak.
I don’t really know how it goes with that publishing/payment/funding/career thing.
The latter, however, is worse for research as a whole, the system, the purpose of publicly
funded research. Why do we have publishing of results in the first place?
I think we can have papers with published data and papers without – but it should
be declared, and should be easily filterable and searchable so I can read the latter first or only.
As for the influenza sequences, it’s easier for me to just download all the GenBank influenza
sequences in bulk and analyse them, without having to read the papers.
This way I miss some accompanying details and information but on the other site I get
I’d vote for just paying for data-collection, even without paper.
The idea of science and the reality of the scientist are two slightly different things. An academic career is very competitive these days, not only in terms of getting a position, but also in getting grants to do science. I do not say that this is an easy balance, because it is not. But there must be a system that tries to maximize the gain for all involved. That is why I stress the need to know what a ‘minimal dataset’ is in practice.
Of course open access to studies, sequences and other types of data is positive for the system. More analyses can be done, more hypotheses tested, etc. No one is really against that. But we have to acknowledge the pains someone took to collect the data, and that more than one analysis could have been planned for the dataset. Few agencies want to fund long-term studies, and when they do it is often with rather meager sums. Thus, the data come at great effort. And in the case of Ottenby, without any support from funding agencies whatsoever.
The dividing line here is not whether data should be posted at all – it is how much of it needs to be shown and for what purpose. In the example of prevalence data, for instance: is it enough to provide the data tabulated per month, or does it need to be all 22,000 datapoints together with banding data? It is a huge difference. And without knowing what should go in, and what can stay out, we are in a bit of a limbo at the moment.
For genetic analyses I am all for giving the accession numbers, and it would be great if more papers contained links to the pruned alignments. We are sequencing viruses in the US now, and once they are finished they will be on GenBank after 45 days, regardless of whether we have published a study or not (actually it will be not, since it takes a long time to analyze the data, write the paper and pass it through peer review). Anyone could then use them, but hopefully, if they form a large part of a new analysis, the researcher would contact us and make it a collaboration. And I would like to stress that the accompanying details are often very important.
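As a rough illustration of the "tabulated per month" end of that scale, the sketch below collapses individual test results into monthly counts. The input format (an ISO date string plus a positive/negative flag), the example samples and the function name `monthly_prevalence` are assumptions made for the example, not a description of the real dataset.

```python
from collections import defaultdict

# Hypothetical samples: (sampling date as 'YYYY-MM-DD', tested positive?).
samples = [
    ("2010-09-01", True),
    ("2010-09-15", False),
    ("2010-10-02", True),
    ("2010-10-03", True),
]


def monthly_prevalence(rows):
    """Collapse per-sample results into one summary row per calendar month."""
    counts = defaultdict(lambda: [0, 0])  # month -> [positives, total sampled]
    for iso_date, infected in rows:
        month = iso_date[:7]  # keep only 'YYYY-MM'
        counts[month][1] += 1
        if infected:
            counts[month][0] += 1
    return {
        month: {"n": total, "positive": pos, "prevalence": pos / total}
        for month, (pos, total) in sorted(counts.items())
    }


table = monthly_prevalence(samples)
```

A table like this is enough to redraw a prevalence curve, but it carries no individual identities, exact dates or morphometrics, so it cannot be reused for analyses of condition, immunity or migration.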
Some very good points raised here. I have one dataset that will go into two papers, but if I had submitted it all with publication of the first, I would probably have a hard time with publication of the second. I don’t have any long-term data, but you make a good point with Ottenby and the relative value of these data to the facility.
There are a number of people in my department working on a field ecology and epidemiology project that has spanned tens of thousands of organisms over decades (and is planned for decades in the future). The people responsible for the lion’s share of the work are not in a position to make these data sets available upon publication, simply because they are stored in stacks of binders dated by field season. When they find something of interest going on in the population…THEN they hire hordes of undergraduates to aggregate it.
All of this speaks to a central need that I believe the researcher’s home institution should fill…long-term data management and retention. At Penn State, the library is beginning to roll out services to aid in these types of endeavors. While it’s very much in its birthing throes, I applaud the effort.
On a related note, I’ve heard rumblings that support the idea that a well curated data set is currency in and of itself. Specifically, in the “Big Data” world there is support for journals whose sole purpose is to serve as a repository for curated data that would be accompanied by a brief, peer-reviewed write-up that details collection methods, potential biases, and a summary of the existing literature related to these data.
My advisor has a massive data set that is highly prized by digital epidemiologists. While our group has published on these data, there is enormous potential there that isn’t being capitalized on, simply because it is only available to the group that collected it. If there were an incentive, a scholarly/academic incentive, there would be a lot more sharing going on. Sure, we might “get scooped” but that should be the least of our worries.
PLOS updated their policy text a few days ago, and it is now not as sharply formulated. I think that is good. My guess is that they were somewhat overwhelmed by the big discussions on the web. In one corner we have those who treat open access as if it were a religious belief. I have read comments like “being a scientist is a privilege”. Those people tend to forget it is also a job, and one with a hard and winding road to a faculty position. Very few think that open data is bad; rather, most of us are on a sliding scale where the needs and possibilities vary depending on our circumstances. The challenge is to find a balance between what is good for the system and what is good for the individuals who contribute the data. In most fields, the risk of being scooped is slim, but in very competitive fields it may be an issue.
I think what you describe from Penn State sounds excellent – because large data sets need to be maintained and stored in a safe environment. Unfortunately, different unis have different means to do so. At my university, there are no such intentions yet. Thus, I have had to bring in a database builder to structure a large SQL database, and I constantly work on getting it into even better shape with the funds I have. The idea that data sets could/should get formal peer recognition is good – it remains to be seen how that can come about, and especially how long it will take before it is implemented in evaluations of grant applications or tenure. My gut feeling is that such a process will take time. Already now, we have a hard time balancing merits from research, teaching and outreach (where research is valued highest).
Thanks for your comments – it is very nice to see that people actually read what I write.
“Sell” the data to the highest bidder! The university that wants to host it, that benefits from the publicity and reputation.
It could also be political: “see here, country xy provides the data here for the benefit of mankind”
hmm, maybe private companies could do it for advertising, public relations
I could send you my programs, then you run it on your data
and you can publish the results, if you want ?!
randomly manipulate, falsify, mix the data, so that the data is “incorrect”
but the expected statistics done on it are still essentially the same.
Thus you keep the original data for your own and others can’t publish at PLOS
because they only have “wrong” data ;:)
but results, science is not hindered/delayed
the risk with withholding data is, however, that it loses in originality
over time, when others develop/gather similar data and publish it first
I think the whole PLOS debate has been good for science. We are progressing more and more towards open data, but sometimes the steps tend to be too large in one go. Raising the scholarly value of data providers will take time, as all other aspects of academia evolve slowly (hiring committees, grant agencies etc). And there are many factors in the equation, too. We’ll see what happens.