Basic bioinformatics python course, part II

This is the second part of the python bioinformatics course that I have taught biologists. This module is about control flow and how to handle input and output. Control flow is needed in mainly two different situations, either that a decision based on data has to be made, or that a piece of code should be repeated.  In Python, decisions are made using an IF statement, while iterations (repeating code) are done with either a FOR loop, or a WHILE loop. How to handle input and output from files is also described –  in most cases that is where the data in question is to be found, and it is easier to keep track of results if the program prints the results to a file.

[gview file=”http://blog.karinlag.no/wp-content/uploads/2013/10/FlowIO.ppt”]

Logging your work

I often collaborate with biologists on different projects. Sometimes I do most of the bioinformatics stuff, but most of the time I try to help them do the work themselves. There are two main reasons for why I prefer this route. First of all – self preservation. There are too many projects out there that are interesting. If I were to do all of them myself, I would drown. Second – I enjoy teaching. It is fun seeing somebody understanding things and managing to do something new.

There is however one thing that I try to teach that I am beginning to think can only truly be learned after having botched things up. This is the importance of logging your work. I myself had to learn this the hard way. During my early PhD years I designed a tiling microarray chip for one specific bug. Six months later, I was asked to do the same thing again, but for a different bug. I thought that this would be a breeze. After all, I had done this before, hadn’t I? I went back to my notes, and to my horror discovered that I could not for the life of me reproduce the files that I had produced for the initial bug. I did know the programs that I had used, and also some of the settings, but no matter how I tweaked things, I could not get the same results. Since my previous design had not yet been put into production, I ended up redoing the entire thing for both bugs, and this time I meticulously wrote down the entire process. This time I knew that if we got good results and could publish on it, I would actually be able to tell how I designed the chip and why.

I often tell this story when I talk with biologists about logging computational work. I usually get a lot of nods from those listening, and I know that at least some of them will actually start logging their work. I am uncertain though of how many actually stick with this habit. Many people seem to think that they can trust their own memories. Don’t get me wrong – there are probably people out there who are capable of remembering in great detail how they did an analysis. But, I do not believe that this goes for the great majority of scientists. I believe that for most people, the only way of keeping track of what was done and why is to write it down.

An additional complication is that I believe that many when they first get their data do not really see a reason to log what they do. Most people start just exploring their data, making some graphs and tables just to see what the data looks like. In my opinion, this exploratory data analysis phase is vital – it gives the researcher a feel for the data that in my experience can be essential to discovering errors in both the data and the analyses. However, I think that for many, this exploratory data analysis phase silently and without fanfare slides over into a final production phase. Results that were initially produced in a “let’s just see what this looks like” fashion end up being used as figures and tables in the final paper without a real track record of where the results came from.

Creating a new habit can be difficult. Writing down what is done to the data and why can seem tedious and may seem like just a waste of time. However, instead of just saying “log your work” in a stern voice, I thought I would hold out some of the more tangible benefits that a good log can provide. Your mileage may vary, but if there is no log, these benefits will certainly not be available.

  • Error detection. If you know what you did, it is easier later to discover what went wrong if there is something in the results that do not add up. It can be very easy to write 2 instead of 3 in an option setting, and when working with sequences, ATGGC is very close to ATGCC.
  • Internal reuse of methods. Maybe you have a different data set to run on, or just want to change some small elements in the analysis. If the current procedure is already written down, reusing and changing it is a lot easier. For some people this spells writing a script for running an analysis, but even just a cut-and-paste sequence of commands can go a long way.
  • Writing the materials and methods section. If the results are good, you will want to publish on them. If there is a written log stating how the results were produced, writing the M&M section should be a walk in the park.
  • Defending the results in reply to a reviewer. A reviewer might ask questions about the analyses. If the log files detail not only how the results were produced, but also why various decisions were made, it is easier to respond to questions about the whys and the wherefores of the analysis.
  • Reproducibility. In theory, all science should be reproducible. If it is not reproducible for the one creating the results in the first place, nobody else can reproduce it either. If your work is reproducible by others it might not benefit you directly here and now, but may increase the citation rate of your paper. Many people dislike citing papers where they are not quite certain of what was done.

The last question is then what should be logged and how to keep a log. In my logs I usually note things such as:

  • program versions
  • program options
  • location of files
  • file versions. Calculating a check sum can go a long way – use for instance md5sum.
  • urls for where files were downloaded from, together with the download date
  • thoughts about results and solutions and discussions about how choices were made. 

For a long time my own logs have simply been a dated journal where I copy-paste commands, links to files, md5sums of input and result files, and where I discuss with myself the reason for my decisions. I keep this in a plain text file. I have tried other solutions that allowed me to paste in pictures, and which would let me import pdfs and other documents, but the plain text file still sticks with me to this day. This file can be read on any computer, does not require special software to open, and is easy to keep track of. I do know of those that use Evernote for this, and others again that use TiddlyWiki. The technical solution behind a log is in my opinion not all that important. The really important thing is that it should be something that is easy for you to use, otherwise it just becomes another barrier to writing things down. Keep it simple, keep it easy, and in the end the log will work for you.

‘Sorting out Sorting’

There are times when my nerd shows more than usual. I recently found a movie on YouTube that that brought back a lot of nerdy memories from my studies. I got my bachelor and masters degree at the University of Bergen. I really liked chemistry and biology, but I also really liked computers. My father had shown me the joy that could be found in putting together a computer without the manual without any blue smoke appearing. The consequence of this was that I studied both molecular biology and computer science.

The movie in case is one that used to be shown in the last lecture in the ‘Data Structures and Algorithms’ course. This movie was made in 1981 and illustrates nine different sorting algorithms using fairly hefty graphics for the time and has a very distinct plink-plonk sound track. The tradition among the students was that when this movie was shown, we would show up with biscuits in the shape of letters, and rødbrus, which is a soda primarily made for children. We would then during the movie sort the letter biscuits using the algorithm that was currently shown in the film.  At the end, when all of the algorithms were shown at once and are racing against each other, we would all cheer for bubblesort. I do believe that somebody at some point actually made a banner in support of bubblesort.

So – in case you have been sitting there wondering which sorting algorithm is the fastest – sit back and enjoy. For my part, I have to go find some biscuits and rødbrus.

RNAmmer 1.2 install issues

RNAmmer is getting on in years, but it is still heavily used, something that we, the authors deeply appreciate. However, it is not always easy to install. Here, I describe what needs to be done in order to get it up and running.

Path changes

The changes that have to be introduced are to be found in this section:

## PROGRAM CONFIGURATION BEGIN

# the path of the program
my $INSTALL_PATH = "/usr/cbs/bio/src/rnammer-1.2";

# The library in which HMMs can be found
my $HMM_LIBRARY = "$INSTALL_PATH/lib";
my $XML2GFF = "$INSTALL_PATH/xml2gff";
my $XML2FSA = "$INSTALL_PATH/xml2fsa";

# The location of the RNAmmer core module
my $RNAMMER_CORE     = "$INSTALL_PATH/core-rnammer";

# path to hmmsearch of HMMER package
chomp ( my $uname = `uname`);
my $HMMSEARCH_BINARY;
my $PERL;
if ( $uname eq "Linux" ) {
        $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch";
        $PERL = "/usr/bin/perl";
} elsif ( $uname eq "IRIX64" ) {
        $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/irix64/hmmsearch";
        $PERL = "/usr/sbin/perl";
} else {
        die "unknown platform\n";
}

The program was in the first place written to be run on the servers at the Danish Technical University, hence the $INSTALL_PATH setting. This should be set to wherever you keep your RNAmmer installation. In my case, I am setting it to /home/karinlag/projects/rnammer, since I am here having it as a local install in my home directory.

The next thing that has to be done, is to get the right HMMer installation and to figure out where perl is.

You will need version 2.3 of HMMer, which you can download from this location. Download it, and read the INSTALL instructions. It should install cleanly on most *nix systems.

I installed hmmer-2.3 in /home/karinlag/src, where it created the directory hmmer-2.3. Inside the src directory you will find the hmmsearch program. Set the $HMMSEARCH_BINARY variable to point to the hmmsearch program. Note: you need to check what the command uname tells you to figure out what system you have so that you know which of the if clauses to modify things in. If it does not say Linux or IRIX64 (which is unlikely these days), you will need to set either the Linux string or the IRIX64 string to what you have, and set the paths below accordingly.

You also need to check that you have the right perl path. You can figure that out by doing ‘which perl’.

You should now be able to do

perl rnammer -S bac -m lsu,ssu,tsu -gff - example/ecoli.fsa

and get results.

Posix errors

You may end up with errors that say something along the lines of 

FATAL: POSIX threads support is not compiled into HMMER; --cpu doesn't have any effect

If you get this, you need to find the following in the core-rnammer script:

system sprintf('%s --cpu 1 --compat ...... and so on

Remove the two instances of –cpu 1 (not the whole sentence, just ‘–cpu 1’), and you should be good to go.

XML/Simple.pm

You might end up with having RNAmmer complain about not being able to find XML/Simple.pm in @INC. To solve this, you need to install perl-XML-Simple. Installing perl modules is something I consider to be deep voodoo, so I won’t even try to describe how to do that. Refer to your system to figure that one out.

Other errors?

If you discover other errors than those I have described here, let me know in the comments!

Basic bioinformatics python course, part I

bassibook

I have on several occasions had the privilege of teaching basic programming to biology students. My preferred language in this situation as in many others is python. I have also been fortunate enough to find a book which I think does a fairly good job of teaching basic python in a way that biologists find useful. In this context that mostly means dealing with sequences in a sensible way. The book in question is “Python for Bioinformatics” by Sebastian Bassi.

The only note here is that there are some spelling mistakes in it, and that it is from 2009. Python has progressed to Python version 3 now, whereas the book is at version 2. However, for a beginning programmer, this should not make too much of a difference.

I am here putting out the slides that I used for a one-day intro course for biologists. The course is very interactive, meaning that in the slides there are  many short exercises which are followed by the answer. I am in this post putting out the first set of slides that deal with the basics, the rest will follow during the next couple of weeks.

Note: I have tried to ensure that these slides are bug free, but there are bound to be some mistakes somewhere. Please let me know if you spot any!

Enjoy!

Part 1: The basics

The first lesson begins with discussing programming a bit, and the two modes in which python can be used – interactively and batch mode. I then go through the basic datatypes in python, i.e. what kind of “things” that are available. I cover how to use python as a calculator, how to work with strings, and also what a list and a dictionary is and how to use them.

[gview file=”http://blog.karinlag.no/wp-content/uploads/2013/10/Basics.pdf” save=”0″]