To share, or not to share your data – some thoughts on the new data policy for the PLOS journals

By Jonas Waldenström

This post was intended as a comment on a post by Terry McGlynn at Small Pond Science, but once I started writing it soon swelled, and I transformed it into a blog post instead. I suggest you start by reading the original post here. The short version is: the leading scientific publisher PLOS has taken the open access movement one step further, from only making the studies freely available to also making it mandatory to include the original data used to draw the conclusions of a paper. It seems like a good thing – too many studies can’t be replicated, and much data is lost when people leave science or is deposited in ways that don’t stand the test of time. However, it is also problematic, as good quality data is painstakingly hard to gather and could be viewed as a currency of its own.

I have published quite frequently with PLOS journals, and in particular with PLOS ONE. In fact, over the last couple of years I have authored/co-authored 15 papers in PLOS ONE and two in PLOS Pathogens. My experiences so far have been very positive: good review processes, beautiful final prints, and, because of the absence of paywalls, a very good spread among peers. I regularly check the altmetrics of the articles and it is exciting to see how many times they are viewed, downloaded, and cited. I have been very pro-PLOS, even in times when many ecologist peers didn’t consider PLOS ONE as a venue for publication. However, all the good things with PLOS considered, the new policy launched a little while ago has made me a bit more reluctant to submit future work to the journal.

So what has changed? Is it a revolution, or ‘same old, same old’? The answer is that no one knows for sure (at least I don’t). The short version of the policy was published as an editorial in PLOS Biology, and although it states that the new data policy will give ‘more bang for the buck’ and ‘foster scientific progress’, it wasn’t overly clear what it means in practice for the researcher about to submit a paper:

 PLOS defines the “minimal dataset” to consist of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. Core descriptive data, methods, and study results should be included within the main paper, regardless of data deposition. PLOS does not accept references to “data not shown”. Authors who have datasets too large for sharing via repositories or uploaded files should contact the relevant journal for advice.

In many cases it wouldn’t make a huge difference. There are already options to upload supporting data as appendixes, and repositories like Figshare, GenBank and Dryad are already out there. As an example, in one paper published in PLOS Pathogens we had 20 supplementary files, including details on statistical analyses and plenty of extra tables and figures. And for another publication in Molecular Ecology, the full alignments of the genes analyzed were uploaded (per the journal instructions) to Dryad to make it easier for others to replicate the analysis if need be. But for much of my current work on long-term pathogen dynamics in waterfowl it wouldn’t feel good to upload all the raw data. The question is really what a minimal dataset is. And, importantly, what data you don’t have to include in the dataset.

A FAQ from PLOS has been published where this is addressed, but as of yet it remains to be seen how this is done in practice:

The policy applies to the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods and any additional data required to replicate the reported study findings in their entirety. You need not submit your entire dataset, or all raw data collected during an investigation, but you must provide the portion that is relevant to the specific study.

So why am I a bit reluctant? Let me give you some background. The study system I run was started 12 years ago by professor Björn Olsen, and I took over the running of it 5-6 years ago. Over the years we have published quite a few papers on avian influenza in this migratory Mallard population, but it is only now, when the time series is long enough, that we can do more advanced studies on the effects of immunity on disease dynamics, long-term subtype dynamics, and influenza A virus evolution. Big stuff, based on the same large datasets but analyzed in different ways. If publishing one paper now means we have to submit 12 years of original data, i.e. the ringing data and disease data of 22,000 mallards, then it comes with a potential cost. I see the dataset as a work in progress, a living entity that is accumulating new data as we go along and where analyses are planned both for now and for the distant future. In the cow analogy of Terry McGlynn, the dataset is a herd with a balanced age structure, some cattle destined for the pot already today, some fattening for slaughter, and yet others to grow into the breeding stock.

The unique longevity of the time series has led our research into many fruitful collaborations. For a few years we have worked with capture-mark-recapture researchers in France on building epidemiological models, just to name one aspect. I have also turned down invitations to collaborate, although much more rarely. In those instances it has been because we had planned to do these analyses ourselves, or because the time wasn’t right to do so. And just to make this clear: the cases of not sending data were not refusals to send background data for replicating a paper; rather, they were requests to do new stuff with it. If you post your raw data in close to its entirety, such situations could become cumbersome, and you run the risk of seeing your data being analyzed by someone else.

The problem is much smaller for genetic data. After all, it is conventional in all fields to submit your sequences to GenBank along with your submission, and you know that it works. I have seen ‘our’ sequences in many phylogenetic trees without having been asked about the usage in advance. But it is one thing to have your data used as a brick in a new construction, and another to have someone take over your house and to have to give away the key. Many people say that the risks of getting scooped on your data are exaggerated, and this is likely true. Scientists are usually decent people, after all. But knowing there is such a risk, albeit small, can influence the choice of where to publish, or lead you to delay publication of smaller papers until all the big papers from a dataset have been published (which may become a problem for graduate student theses).

It is essential that we very soon get to know what a minimal dataset is. For example, would it be OK to submit the raw data on Mallard infection histories without the unique ring number, exchanging the individual identifier for an arbitrary number? Or to exclude data such as actual date, morphology or indexes of condition? That way an analysis of the prevalence of a disease in a population (which is a very simple exercise) doesn’t immediately open the possibility for another researcher to investigate the effect of infection on a bird’s condition, the effect of immunity, or migration dynamics, to name a few options. Would PLOS allow that? I don’t know.
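To make the idea concrete, here is a minimal sketch in Python of what such a stripped-down dataset could look like. Everything in it is an assumption made for illustration: the field names (`ring_number`, `date`, `body_mass_g`, `infected`), the example records and the `pseudonymize` helper are hypothetical and not taken from our actual database, and the sketch says nothing about whether PLOS would accept a dataset prepared this way.

```python
import random

def pseudonymize(records, id_field="ring_number",
                 drop_fields=("date", "body_mass_g")):
    """Return (shareable records, private key table).

    Ring numbers are swapped for arbitrary IDs and the fields listed in
    drop_fields are left out. All field names here are hypothetical.
    """
    # Shuffle the unique ring numbers so the arbitrary IDs carry no
    # information about ringing order.
    rings = sorted({rec[id_field] for rec in records})
    random.SystemRandom().shuffle(rings)
    key = {ring: f"bird_{i:05d}" for i, ring in enumerate(rings, start=1)}

    shared = []
    for rec in records:
        # Keep every field except the identifier and the withheld ones.
        row = {k: v for k, v in rec.items()
               if k != id_field and k not in drop_fields}
        row["bird_id"] = key[rec[id_field]]
        shared.append(row)
    return shared, key

# Made-up example records: two captures of one bird, one of another.
records = [
    {"ring_number": "90A12345", "date": "2010-09-14", "body_mass_g": 1040, "infected": True},
    {"ring_number": "90A12345", "date": "2010-09-21", "body_mass_g": 1075, "infected": False},
    {"ring_number": "90A67890", "date": "2010-09-14", "body_mass_g": 980, "infected": False},
]
shared, key = pseudonymize(records)
```

In this sketch `shared` still links repeated captures of the same bird, so prevalence can be recomputed from it, while the `key` table that maps arbitrary IDs back to ring numbers stays with the data owner.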

An issue raised by Terry McGlynn was the difference between a small lab and resource-intense research labs. In a small lab, the research takes longer due to limited resources and a smaller staff, and a good dataset is extremely precious and could also act as a currency, enabling co-authorships through collaborations. I would like to end with an additional story. A story of long-term data collected by volunteers at Ottenby Bird Observatory.

Ottenby Bird Observatory was founded in 1946 and has run without interruption since then. Each year volunteers help out with the trapping and banding of birds, mostly passerines, but also a chunk of waders, ducks and birds of prey. The number of birds caught each year is between 15 and 20 thousand, and in total more than 1 million birds have been banded. This dataset, together with all morphometrics collected in connection with the banding and all band recoveries, is a unique and extremely valuable data series. The problem is that few pay for it. The observatory receives a small fund from the Swedish Environmental Board, but not enough to cover the costs. Additional money comes from tourists and subsidies from the Swedish Ornithological Society. And, although not substantial, from researchers that pay for the service of collecting data or getting data from the trapping series.

The observatory really provides a service. To date at least 278 peer-reviewed articles have been published with data emanating from Ottenby, including two papers in Science and >10 in PLOS journals. The unbroken trapping series has proven to be one of the few datasets where the time scale is sufficiently long to investigate effects of climate change on biological phenologies, measured as the timing of migration of common passerine birds. Researchers who want to use the data put in a request to the observatory, and a sort of contract is settled between the parties. Usually the money is little, but even small sums are essential for a volunteer-based operation. What happens when all data becomes immediately available for everyone without restrictions? A question to ponder, really.

In many ways, PLOS has revolutionized scholarly publishing, and the open access movement has made research results available fast for the masses. I sincerely hope that the new data policy does not inadvertently work the opposite way, by making researchers less inclined to submit their studies to PLOS journals. It is still too early to tell, but I think many, like me, really wonder what the ‘minimal dataset’ means in practice.

How bats in Peru change our view of flu (and it rhymes!)

By Jonas Waldenström

I am the real Bat Man and here to bite y'all

One of the major news items in the virology community last year was the publication in PNAS describing a completely new influenza A virus. In line with the taxonomy traditionally used for influenza viruses it got the name H17N10, illustrating that it possessed novel hemagglutinin (H17) and neuraminidase (N10) variants. However, it wasn’t the numbers that were the groundbreaking news; it was the fact that the virus was detected in a Central American bat, and not in a bird. A tropical bat is very far from the ‘normal’ diversity of influenza A viruses seen in wetland birds and waterfowl. Although bats and ducks both have wings, in evolutionary terms they separated a very, very long time ago in the age of dinosaurs. In fact, there are more differences than similarities between bats and gulls in ecology, physiology and aspects of cellular biology. Hence, the bat flu was a remarkable observation. A real shaker. In one sweep, the whole flu field needed to come to terms with the fact that not all influenza A viruses are bird viruses.

The initial findings also hinted that the first bat influenza virus was unlikely to be alone. An influenza-iceberg, of sorts, made up of fluffy, winged mammals. This week, a first follow-up was published in PLOS Pathogens. A crew of (mainly American) scientists analyzed samples from bats sampled in the Amazonian parts of Peru in 2010, collected as part of CDC’s tropical pathogen surveys. In total, samples from 114 individuals of 18 bat species were taken out of the freezers, and different sample types were screened with a molecular method designed to broadly pick up the RNA of any influenza A virus. They got one hit from a fecal sample in a single bat! A lucky shot at the Tivoli, given the low sample size. Prompted by this, the authors brought in the big machinery and sequenced the totality of the genetic material in the samples from this poor, long-dead bat and used bioinformatic tools to resolve the genome of the virus that had infected its intestines. As the pieces were added bit by bit, it became clear that it was indeed a completely new influenza A virus, very different from avian viruses, and similar to, but still distinctly different from, the earlier H17N10 bat virus. And the name? H18N11 of course!

Please take a close look at the figure below. It shows the phylogenetic relationships of each of the influenza A virus’ eight RNA segments – in black are all ‘non-bat viruses’ and in red the two new bat viruses H17N10 and H18N11. For all the segments coding for ‘internal’ proteins, i.e. those involved in the polymerase machinery or the structural properties of the virus, you see that the two bat viruses are always found in a neat little red outgroup. This signals a long evolution away from other known influenza A viruses. It is a little premature to say exactly how long, but the branch lengths indicate that this happened a long time ago.

Phylogenetic trees for the 8 different IAV segments, see http://www.plospathogens.org/article/info%3Adoi%2F10.1371%2Fjournal.ppat.1003657

Now look at the hemagglutinin and the neuraminidase trees (HA and NA, respectively). The same pattern is repeated for the NA, but not the HA. In fact, the two novel hemagglutinins are nested within avian hemagglutinins. How can we interpret this? At first this doesn’t make any sense, but one has to remember that influenza viruses don’t evolve in the same way you or I, trees, shrimps or ferns do. Influenza viruses can reassort, meaning that if two viruses of different origin infect the same cell, the different RNA segments can be put in new combinations in the resulting virions. Imagine two decks of cards being shuffled, one red and one black, and that each virion randomly consists of a draw of cards from the combined shuffled deck, sometimes red, sometimes black, and sometimes mixed. This is a rapid way in which new variants can arise, and a reason behind the genesis of pandemic flu in humans.
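For those who like to see the card trick spelled out, here is a small Python sketch of it. It rests on a deliberately crude assumption, namely that each of the eight segments is drawn independently and with equal probability from the ‘red’ or the ‘black’ parent virus, which real co-infections need not obey. The function and variable names are made up for the illustration.

```python
import random
from itertools import product

# The eight RNA segments of an influenza A virus.
SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]
PARENTS = ["red", "black"]

def reassort(rng=random):
    """Draw one progeny genotype.

    Assumption for illustration: every segment comes independently and
    with equal chance from either parent.
    """
    return {segment: rng.choice(PARENTS) for segment in SEGMENTS}

# Enumerate every possible way to combine segments from the two parents.
all_genotypes = list(product(PARENTS, repeat=len(SEGMENTS)))
n_total = len(all_genotypes)
# A genotype is a reassortant if it carries segments from both parents.
n_mixed = sum(1 for genotype in all_genotypes if len(set(genotype)) > 1)

progeny = reassort()
```

With two parents and eight segments there are 2^8 = 256 possible genotypes; two of them are simply the parental viruses, which leaves 254 mixed ones.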

Returning to the bats, it seems that bat and avian viruses have met in a not too distant evolutionary past, and that an HA variant has sailed into the bat influenza gene pool. It will be interesting to see how the picture changes when more bat viruses are sequenced. Has there been one reassortment event, followed by drift and a subsequent separation into H17 and H18? Or have there been many? Are there, perhaps, avian H17 and H18 to be found in South American birds? What about bats in North America, Europe, Africa and Asia?

One thing we can be sure of is that there are more viruses waiting to be detected and described. One sign of this comes from the current paper. The authors used the sequenced genomes to construct recombinant HA and NA molecules (using fancy virologist tricks) and used these to build assays (ELISAs) where bat sera could be screened for antibodies against the new HA and NA variants. Where the molecular screening yielded one positive bat, the serology approach found 55 of 110 bats showing signs of having been infected with flu earlier in life. This clearly indicates that influenza viruses are widespread in Peruvian bats, and likely in other parts of the world too. Moreover, they found cases of bats with antibodies to one of the recombinant HA or NA, but not to the other, suggesting that there are more combinations of HA/NA to be found.

Finally, and perhaps most interesting of all the results, the hemagglutinin of bat influenza viruses does not behave in the same way as avian hemagglutinins. When a virus is to infect a cell it needs the hemagglutinin protein to serve as a key, docking with a sialic acid receptor – the lock – on the cell. If the key and the lock don’t fit, infection will not occur. For instance, a major division between human flu and avian flu is the preferred type of linkage between the sialic acid and a galactose residue on the receptors. This little difference makes it hard for avian viruses to infect humans, and vice versa. But with bat viruses it seems sialic acid receptors are not used at all! Instead bat HA uses an unknown receptor for cell entry. Holy Moses!

More to follow shortly, I suppose. A major obstacle at present is the lack of a culturing method for bat influenza viruses. Neither cell lines nor eggs have worked so far. Without the means to grow the virus it is very tricky to study it. But there are many clever virologists out there, so a solution is likely not too far away.

But I still prefer feathers to fur, and will stick with ducks.

Links to articles:

Tong S, Zhu X, Li Y, Shi M, Zhang J, et al. (2013) New World Bats Harbor Diverse Influenza A Viruses. PLoS Pathog 9(10): e1003657. doi:10.1371/journal.ppat.1003657

Tong S, Li Y, Rivailler P, Conrardy C, Castillo DA, et al. (2012) A distinct lineage of influenza A virus from bats. Proc Natl Acad Sci USA 109: 4269–4274.

Disease is a property of the individual

Ecologists are obsessed with variation, in any form, the more bizarre, the better. We really love it! But why?

The textbook explanation is that variation among individuals, if heritable, works as a template for selection and thus drives evolution. Without variation, little can change. Evolutionarily important variation relates to genetic traits that make the organism better adapted to its environment, a better competitor, or more disease resistant, or to traits that make him/her more attractive to the other sex, thereby increasing the likelihood of siring offspring.

An additional explanation, and sometimes an equally important one, is that variation is fun: an animal may be short or long, have a peculiar nostril shape, vary in the curvature of antlers, or have striking plumage colors. Simply, humans like variation, and the diversity in itself therefore drives curiosity-driven researchers.

This said, when it comes to disease in animals most researchers tend to neglect variation. Disease is commonly treated as a constant; the animal is either infected with parasite X or is not. However, in reality what the researcher denotes as parasite X may actually be a plethora of different pathogen genotypes, all seemingly dressed in the same costume (the phenotype), or sometimes even consist of cryptic species. This is dangerous, as things that look the same in the microscope (or in a conserved gene used for molecular screening) may have fundamental differences in traits that are relevant for infection processes, such as pathogenicity, transmission and virulence. Simply, we may run the risk of not seeing patterns that are there, or jump to the wrong conclusion based on simplified assumptions.

Further, surprisingly often wildlife diseases are treated at the level of the population (this is especially common in veterinary medicine), and not at the level of the individual animal. For instance, prevalence, the proportion of individuals carrying a particular disease at a given time, is much more frequently used than estimates of incidence, which relates to the risk of acquiring infection. For the former you can adhere to a ‘hit and run’ sampling approach; for the latter you need to monitor individuals across time and take repeated samples.
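The difference between the two measures is easy to show with a toy example in Python. The three ducks and their sampling results below are entirely made up, and the incidence calculation is a simplified sketch under stated assumptions: a bird counts as at risk during any interval that starts with a negative sample, and a new infection is counted when the next sample from the same bird is positive.

```python
from collections import defaultdict

def prevalence(samples, day):
    """Proportion of the birds sampled on a given day that tested positive."""
    results = [positive for (_bird, d, positive) in samples if d == day]
    return sum(results) / len(results)

def incidence_rate(samples):
    """New infections per bird-day at risk (simplified sketch)."""
    visits_by_bird = defaultdict(list)
    for bird, day, positive in samples:
        visits_by_bird[bird].append((day, positive))

    new_infections = 0
    days_at_risk = 0
    for visits in visits_by_bird.values():
        visits.sort()
        # Walk through consecutive sample pairs from the same bird.
        for (day0, pos0), (day1, pos1) in zip(visits, visits[1:]):
            if not pos0:                 # bird was uninfected, hence at risk
                days_at_risk += day1 - day0
                if pos1:                 # negative followed by positive
                    new_infections += 1
    return new_infections / days_at_risk

# Hypothetical data: (bird, sampling day, tested positive?)
samples = [
    ("A", 0, False), ("A", 7, True),  ("A", 14, False),
    ("B", 0, False), ("B", 7, False), ("B", 14, True),
    ("C", 0, True),  ("C", 7, False), ("C", 14, False),
]

prevalence_day_7 = prevalence(samples, 7)
rate = incidence_rate(samples)
```

In this made-up dataset one bird in three is positive on every sampling day, so a single ‘hit and run’ visit would report the same prevalence whichever day it happened to fall on. Only the repeated samples reveal that it is a different bird each time, with two new infections over 28 bird-days at risk.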

For a long time, actually since 2002, we have studied influenza A virus in a migratory population of Mallards in SE Sweden. We also started at the level of population, describing temporal variation in influenza A virus prevalence in the duck population, and describing differences in prevalence among ages and sexes. And yes, we treated the virus as pathogen X, not at the level of subtype (which there are many of in flu). But with time we have moved to assessing what is happening at the individual level, and how differences among individuals in susceptibility drive disease dynamics, and how disease histories and immunity patterns in turn drive evolution in the virus.

These efforts are starting to pay off, and in a paper published this week (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0061201) we address the issue of individual variation among Mallards in influenza A virus infection risk. The question we asked is how individuals with the same background, in a shared environment with similar exposure to influenza, differ in disease histories and immune responses.

In our monitoring program we use a large duck trap to catch wild ducks. By providing grain we give the birds an incentive to visit the trap, and as an additional attraction we have a compartment with lure ducks that are supposed to get the wild ducks to enter. In this study, we used the lure ducks as a natural infection experiment. Ten immunologically naïve, juvenile Mallards from a farm were placed in the trap and were then followed throughout an autumn season, and then for the next spring, summer and autumn. Fecal samples were collected daily and blood samples approximately every second week. A lot of samples, and collected with a precision that allowed us to give very detailed infection histories for each individual.

It turned out that our study ducks varied tremendously in disease patterns, despite being of the same age, raised on the same farm, sharing the same little experimental enclosure and being exposed to the same environmental variation. All ducks became infected with flu within the first five days of being placed in the trap, but the number of infection days varied tremendously. And so did the number of retrieved virus subtypes; thus different individuals were infected with a varying number of virus variants, in this case equal to different infection events.

Furthermore, we got really nice long-term patterns. After the initial primary infections early on in the first autumn, and a number of secondary infections later the same autumn, we recorded only a single infection day the next spring and summer. It wasn’t until the second autumn, when migration of wild ducks started in earnest again, that new infections were seen in the lure ducks. And in this case, no infection was of a subtype the individual had experienced the year before, suggesting very strong and long-lasting homosubtypic immunity.

Individuals also varied profoundly in their immune responses. We measured the humoral immune response, manifested as anti-influenza antibodies (raised against the conserved nucleoprotein of the virus), across time. Have a look at the figure below; it really shows variation both on a temporal scale and at the individual scale, both in patterns and in height of response.

[Figure: anti-influenza antibody responses across time for each individual duck (journal.pone.0061201.g004)]

So what does it tell us? To start with, there is a large difference between individuals in resistance/susceptibility to influenza A virus infection in Mallards. This difference is not only manifested in different infection histories, but also as very variable immune responses. Second, these differences are very likely determined by genetic differences, meaning that there are heritable differences, and thus traits that could be selected for by natural selection. Not all ducks are equal – and this is important for our ability to model disease dynamics in this system. Is it really the mean that is important for assessing the transmission probabilities along migration? Perhaps it is the outliers that are driving the processes?

This study is a first step to address individual variation, and there are already a couple of follow-up publications in the peer-review tube, so we will have opportunities to get back to this topic.

That’s all for now. Live long and prosper – and don’t treat disease simply as a property of the population.

Jonas Waldenström