File format and reproducibility

Science is more about limiting doubt than creating certainty. To limit doubt, repeating the same experiment should lead to the same conclusions, again and again. This is called reproducibility, and it is a big deal in science: it is what makes scientific findings valid. The topic has received increasing attention in recent years.

An important part of research now takes place on a computer. And no need for anything as fancy as modelling and simulation: something as basic as data is digital. One of the keystones of reproducibility therefore lies in data storage and data access. Both are addressed by online databases fed and maintained by communities of researchers worldwide. However, only a small portion of scientific data ever makes it into one of these databases. I am guilty of that too. The reason, in my experience, is the need to comply with specific formatting and arrangement of the data. That’s a bit tedious and time-consuming, and I understand that it can constitute an obstacle. But this obstacle is easier to overcome once you are acquainted with it.

An excellent way to become familiar with data formatting and arrangement is to practise on small, basic, internal data. By internal data I mean the dozens of spreadsheets we all keep on our computers that serve no other purpose than to ease our own organisation. That’s already a noble destiny, and a reason to give them more attention and care. The best example in my workflow is my list of samples. For each project I create and maintain a spreadsheet that lists every sample ever retrieved, its status (to prepare / to analyse / analysed), where to find it, where to find the data collected from it, etc. I long saved this spreadsheet as an Excel file, but I now prefer an open file format: comma-delimited (CSV) or tab-delimited (TXT) values. This doesn’t prevent you from using Excel to update it; that is everyone’s own decision. The important thing is not the software but the format!
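To make this concrete, here is a minimal sketch of what an open format buys you: a sample list stored as plain CSV can be read by any language, without Excel. The sketch is in Python purely for illustration (in R, read.csv does the same job), and the column names and sample IDs are invented, not taken from my real files.

```python
import csv
import io

# Hypothetical sample list: column names and values are invented for illustration.
# In practice this text would live in a file such as samples.csv.
SAMPLE_LIST_CSV = """sample_id,status,storage,data_file
S001,analysed,freezer A,data/S001.csv
S002,to analyse,freezer A,
S003,to prepare,shelf 2,
"""


def read_sample_list(handle, delimiter=","):
    """Return the sample list as a list of dicts, one per row."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def count_by_status(samples):
    """Count how many samples are in each status."""
    counts = {}
    for row in samples:
        counts[row["status"]] = counts.get(row["status"], 0) + 1
    return counts


samples = read_sample_list(io.StringIO(SAMPLE_LIST_CSV))
status_counts = count_by_status(samples)
print(status_counts)
```

For a tab-delimited TXT file, the same function should work by passing delimiter="\t".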

The objective is reproducibility. If you read this blog, you already know that I am a dedicated R user. I base the entire processing of data, from transformation to statistical analyses, and (almost) all of their graphical representation on R. My internal data are also part of that workflow: I often check the data against the list of samples, for instance, using scripts where everything is transparent. Having all the files in open formats makes it easier to share them with your colleagues. Moreover, sharing the scripts as well allows them to replicate your workflow themselves, possibly spot mistakes, and reproduce your experiment. That’s a virtuous circle for the direct benefit of scientists, and the files are likely to remain usable as long as computers live.
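As a rough idea of what checking the data against the list of samples can look like, here is a small sketch. It is written in Python for illustration only (my own scripts are in R, where setdiff would play the same role), and the file contents, column names and sample IDs are all hypothetical.

```python
import csv
import io

# Hypothetical sample list and measurement table, invented for illustration.
SAMPLE_LIST_CSV = """sample_id,status
S001,analysed
S002,analysed
S003,to prepare
"""

MEASUREMENTS_CSV = """sample_id,value
S001,4.2
S004,5.1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def check_data_against_samples(sample_rows, data_rows):
    """Compare sample IDs found in the data with those in the sample list."""
    listed = {row["sample_id"] for row in sample_rows}
    analysed = {
        row["sample_id"] for row in sample_rows if row["status"] == "analysed"
    }
    measured = {row["sample_id"] for row in data_rows}
    return {
        # IDs that appear in the data but were never registered as samples
        "not_in_sample_list": sorted(measured - listed),
        # Samples marked as analysed for which no data can be found
        "analysed_without_data": sorted(analysed - measured),
    }


report = check_data_against_samples(
    read_rows(SAMPLE_LIST_CSV), read_rows(MEASUREMENTS_CSV)
)
print(report)
```

Because both inputs are plain text, a colleague can rerun exactly the same check on their own machine and see where the data and the sample list disagree.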

You too, give it a try! Just remember to start simple 😉

Posted by Benjamin
