Logging your work – The Devil in the Details

I often collaborate with biologists on different projects. Sometimes I do most of the bioinformatics work, but most of the time I try to help them do the work themselves. There are two main reasons why I prefer this route. First, self-preservation. There are too many interesting projects out there. If I were to do all of them myself, I would drown. Second, I enjoy teaching. It is fun to see somebody understand things and manage to do something new.

There is however one thing that I try to teach that I am beginning to think can only truly be learned after having botched things up. This is the importance of logging your work. I myself had to learn this the hard way. During my early PhD years I designed a tiling microarray chip for one specific bug. Six months later, I was asked to do the same thing again, but for a different bug. I thought that this would be a breeze. After all, I had done this before, hadn’t I? I went back to my notes, and to my horror discovered that I could not for the life of me reproduce the files that I had produced for the initial bug. I did know the programs that I had used, and also some of the settings, but no matter how I tweaked things, I could not get the same results. Since my previous design had not yet been put into production, I ended up redoing the entire thing for both bugs, and this time I meticulously wrote down the entire process. This time I knew that if we got good results and could publish on it, I would actually be able to tell how I designed the chip and why.

I often tell this story when I talk with biologists about logging computational work. I usually get a lot of nods from those listening, and I know that at least some of them will actually start logging their work. I am uncertain, though, how many actually stick with the habit. Many people seem to think that they can trust their own memories. Don’t get me wrong: there are probably people out there who are capable of remembering in great detail how they did an analysis. But I do not believe that this holds for the great majority of scientists. I believe that for most people, the only way of keeping track of what was done and why is to write it down.

An additional complication is that many people, when they first get their data, do not really see a reason to log what they do. Most people start by just exploring their data, making some graphs and tables to see what the data looks like. In my opinion, this exploratory data analysis phase is vital: it gives the researcher a feel for the data that, in my experience, can be essential to discovering errors in both the data and the analyses. However, I think that for many, this exploratory phase slides silently and without fanfare into a final production phase. Results that were initially produced in a “let’s just see what this looks like” fashion end up being used as figures and tables in the final paper without any real record of where they came from.

Creating a new habit can be difficult. Writing down what is done to the data and why can seem tedious and may feel like a waste of time. However, instead of just saying “log your work” in a stern voice, I thought I would lay out some of the more tangible benefits that a good log can provide. Your mileage may vary, but if there is no log, these benefits will certainly not be available.

Error detection. If you know what you did, it is easier to discover later what went wrong when something in the results does not add up. It can be very easy to write 2 instead of 3 in an option setting, and when working with sequences, ATGGC is very close to ATGCC.
Internal reuse of methods. Maybe you have a different data set to run on, or just want to change some small elements in the analysis. If the current procedure is already written down, reusing and changing it is a lot easier. For some people this spells writing a script for running an analysis, but even just a cut-and-paste sequence of commands can go a long way.
Writing the materials and methods section. If the results are good, you will want to publish on them. If there is a written log stating how the results were produced, writing the M&M section should be a walk in the park.
Defending the results in reply to a reviewer. A reviewer might ask questions about the analyses. If the log files detail not only how the results were produced, but also why various decisions were made, it is easier to respond to questions about the whys and the wherefores of the analysis.
Reproducibility. In theory, all science should be reproducible. If it is not reproducible for the one creating the results in the first place, nobody else can reproduce it either. If your work is reproducible by others it might not benefit you directly here and now, but may increase the citation rate of your paper. Many people dislike citing papers where they are not quite certain of what was done.

The last question is then what should be logged and how to keep a log. In my logs I usually note things such as:

program versions
program options
location of files
file versions. Calculating a checksum can go a long way; use for instance md5sum.
URLs for where files were downloaded from, together with the download date
thoughts about results and solutions and discussions about how choices were made.
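
The items above can be collected into one dated entry in a plain-text log. What follows is only a minimal sketch in Python, not a description of how my own log is produced: the function names, the entry layout, and the example command are all hypothetical, and the checksum is computed with Python's hashlib rather than the md5sum program.

```python
import hashlib
from datetime import date
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def format_log_entry(command, program_version, files, note,
                     source_url=None, today=None):
    """Build one dated plain-text entry (the layout is a hypothetical choice)."""
    today = today or date.today()
    lines = [
        f"== {today.isoformat()} ==",
        f"command: {command}",
        f"version: {program_version}",
    ]
    if source_url:
        lines.append(f"downloaded from: {source_url} on {today.isoformat()}")
    for path in files:
        # Record the absolute location and a checksum for each file.
        lines.append(f"file: {Path(path).resolve()}  md5: {md5_of_file(path)}")
    lines.append(f"note: {note}")
    return "\n".join(lines) + "\n\n"


def append_to_log(log_path, entry):
    """Append the entry to the log file; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(entry)
```

Calling format_log_entry with a made-up command such as "mytool --threshold 3 input.fasta", the tool's version string, the input and result files, and a sentence on why the threshold was chosen gives an entry that can be passed to append_to_log. Appending rather than rewriting keeps the journal in date order.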

For a long time my own logs have simply been a dated journal where I copy-paste commands, links to files, md5sums of input and result files, and where I discuss with myself the reasons for my decisions. I keep this in a plain text file. I have tried other solutions that allowed me to paste in pictures and import PDFs and other documents, but the plain text file is what has stuck with me to this day. This file can be read on any computer, does not require special software to open, and is easy to keep track of. I know of people who use Evernote for this, and others who use TiddlyWiki. The technical solution behind a log is in my opinion not all that important. The really important thing is that it should be something that is easy for you to use, otherwise it just becomes another barrier to writing things down. Keep it simple, keep it easy, and in the end the log will work for you.

2 Replies to “Logging your work”

Richard Smith says:

October 18, 2013 at 15:52

I also learned this lesson in the first year of my PhD. And like you I keep a daily log in plain-text format (in my case Markdown), one file per week. I would go a step further and say that if you’re running a bunch of commands, make them a script. Even if you don’t think you’ll run them again, if you make a mistake in the first run you can just tweak the script and do the whole thing over. It’s great for repeatability and for allowing someone else to continue your work.

Another thing I always do these days is to make all my scripts output a protocol. That is, a plain text document with a timestamp that logs the command used, the time and resources consumed, etc.
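
A protocol of this kind could look roughly like the following Python sketch. It is an illustration under assumptions, not the commenter's actual setup: the function name run_with_protocol and the record layout are hypothetical, and only wall-clock time is recorded, not other resources.

```python
import shlex
import subprocess
import sys
import time
from datetime import datetime


def run_with_protocol(command, protocol_path):
    """Run a command (a list of arguments) and append a record to a protocol file."""
    started = datetime.now().isoformat(timespec="seconds")
    t0 = time.perf_counter()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - t0
    with open(protocol_path, "a", encoding="utf-8") as protocol:
        protocol.write(f"started: {started}\n")
        protocol.write(f"command: {shlex.join(command)}\n")
        protocol.write(f"exit code: {result.returncode}\n")
        protocol.write(f"wall time: {elapsed:.2f} s\n\n")
    return result


if __name__ == "__main__":
    # Example: run a trivial Python one-liner and record it in protocol.txt.
    outcome = run_with_protocol(
        [sys.executable, "-c", "print('analysis done')"], "protocol.txt"
    )
    print(outcome.stdout, end="")
```

Each run appends a new record, so the protocol file doubles as a history of how often, and with which arguments, the analysis was repeated.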

Wei Shen says:

May 8, 2014 at 09:05

Good practice. Thank you.

These points are also discussed in Ten Simple Rules for Reproducible Computational Research, published in PLOS Computational Biology.
