Lecture on bacterial typing and whole genome data

A week ago, I was at a workshop in Denmark. This was for a Nordic Working Group for Microbiology and Animal Health and Welfare (NMDD) meeting on Source attribution of Campylobacter in the Nordic countries. We were around 12 people or thereabouts, mostly from the modelling side of the source attribution table.

Since starting my new job at the Norwegian Veterinary Institute in July, I have finally had time to focus on something that I have found quite fascinating for some time, and that is how to use whole genome data for tracing bacterial infections. The Campylobacter source attribution project has as its goal to figure out how to use MLST data gathered from the Nordic countries to elucidate which reservoirs the human Campylobacter cases in these countries stem from. This merges very nicely with my interest in using whole genome data for such purposes. My contribution to this workshop was a presentation concerning how whole genome sequencing is making its way into bacterial typing. I am including the slides from this meeting in this post.

Through this meeting I got a very useful insight into how the modelling side of these issues works. However, I also discovered that not everybody present was necessarily completely aware of the “shifty” nature of the bacterial genome, and by extension the characteristics of the MLST data the modelling within the project is done on. To alleviate that, I added more about what a bacterial genome actually is in the beginning, and also added more about horizontal gene transfer towards the end. In working with these things on the eve of lecturing, I was very happy to have the assistance of the Twitter community, which helped me dig out details such as the rate of horizontal gene transfer in Campylobacter (heck, it apparently even happens with core genes, can’t we trust anything anymore?), which proved very useful in the discussions.

Comments and thoughts are very welcome!


Teaching Software Carpentry workshops – some tricks of the trade

These days I am gearing up to teach two more Software Carpentry workshops, one in Wageningen, Netherlands, and one in Oslo, Norway. In the Netherlands workshop I will be teaching a module that I haven’t even looked at before. This led me to think about the things I do to prepare for a workshop. So, here is a list (in no particular order) of things that I do that others might find useful.

  • Go through the instructor checklist. Have a look at the other checklists too, that helps with figuring out what you can expect of the other parties involved in the workshop.
  • Recently, all of the workshop modules have been put into their own github repos. Go sign up for notifications for those that you are teaching. It is highly likely that discussions about the material will prove useful. These can contain both information about technical issues and about how to teach that particular module.
  • Go through your module(s) on as many platforms as are available to you. If you are thusly inclined, consider creating a virtual machine or two and go through both the installation procedure and the module there. Remember, this takes time, so start before you think you have to; there will always be weird hiccups.
  • Print out a copy of the lessons on paper and make notes on them as you go along. Take them with you to the workshop. During my first workshop I did not have a printout, and it was not a pleasant experience trying to switch back and forth between windows. I don’t know if I or the students ended up being the more confused.
  • Have a look at the wiki for technical issues and familiarize yourself with the latest technical annoyances.  Ensure that you have an easy way to get back to it again during the workshop. I have forgotten where it is a couple of times, and it was equally annoying each time having to spend time figuring out where it was.
  • Make sure that the host supplies stickies, and consider taking a backup stash with you in case the host misplaces them or simply did not get them because they didn’t believe in them. Ensure that you have at least twice as many stickies as students, sometimes they lose them, sometimes they spill coffee on them, sometimes they distractedly end up tearing them into tiny tiny little pieces. You get the picture.
  • During the workshop – USE THE STICKIES! They are a lifesaver. If you have not taught with them before, just give them one single go and that should be enough to convince you. It is a lot easier to keep track of where people are with them than without, you can keep a higher speed through the material without losing anybody, and it is a lot easier to see who needs help. It also saves students sore shoulders since they don’t have to keep their hands up in the air until they fall off. On a more serious note, I suspect students ask for help more quickly with stickies since the overhead cost associated with it is reduced – it is not very taxing to put a stickie on your screen.
  • When you are live coding (typing on your computer) for the entire lesson it is tempting to sit down. Consider teaching standing up instead. It helps with speaking clearly and loudly enough so that people can hear. I also suspect that instructors may be quicker to go and help people when teaching standing up, because you don’t actually have to get up first. If you decide to teach standing up, tell the organizer so that they can fix something to have the computer on.
  • Bring good walking shoes. If you enjoy wearing heels, leave them at home. You are likely to do a lot of standing up and walking about, both during the workshop and in the evenings. You do not want to end up teaching with blisters. Also, you are likely to be walking around in a room with a lot of extension cords and leads lying on the floor. The risk of tripping over something is already higher than normal.
  • Bring throat lozenges or cough drops or whatever they are called, and a bottle of water. You will end up speaking a lot more than you are used to, which might lead to a sore throat, coughing and in a worst case scenario, losing your voice. I once got a coughing fit while teaching and it was not a fun experience.
  • If you can, try to get together with the helpers, the other instructors and the organizers the evening before the workshop. It really helps to have met before the workshop. Everybody, especially the helpers, is bound to have questions about things, questions that won’t have occurred to them until they are actually talking with others involved in the workshop. This is also good for giving last minute information, ensuring that everybody knows where and when to show up, organizing transport etc.
  • Ensure that you get to the workshop in plenty of time in the morning. The building you are teaching in might be confusing to navigate, so give yourself enough time to get there. You will then also have time to set up your own computer, sort out your papers etc.
  • Last but not least: have fun!

So there you have it!

The one where I went to Sweden

I spent some days two weeks ago in Stockholm, Sweden. Lex Nederbragt and I were invited by SciLifeLab to teach a Software Carpentry workshop there. This coincided with the very first PyCon Sweden Conference, and as the organizers would have it, I got to present a talk.

The workshop

The workshop went very well. Lex and I taught the by-now fairly well known novice workshop (if you want one at your institution, let them know!). Oxana Sachenkova, the local organizer, had also set up an intermediate workshop. The teachers in that one were Konrad Hinsen and Nelle Varoquaux, both flying in from Paris. Their workshop focused more on object oriented programming and intermediate git use. It was great meeting them; the only sad thing is that I could not sit in on their workshop.

The division of labor between Lex and me has until now been that he teaches shell and unit testing, while I teach git and python. This time I taught both these parts from the new lesson material that has been developed. I had taught the git lesson once before, so that material was well known to me. I think this lesson is reasonably easy to teach; the real challenge is to convey to the students why version control is useful at all. At this stage I am leaning towards most people not really understanding the need for version control before they have either messed up their work pretty badly, or have become involved in a joint development project.

I had not taught the python lessons before. These now take place entirely in the IPython Notebook. The first time I went through them, I actually wondered if I should return to the old lesson material, if nothing else because on the printout I had somewhere around 50 pages to go through. On the second run through, however, I realized that the notebook is a game changer. With the notebook, I could have the students edit and copy-paste code from earlier in the lesson, which would reduce the typing time and hence the teaching time dramatically. There were still things that I cut from this lesson – I did for example not go through the python call stack, simply because I still think this is too complicated for novices. Instead, I teach them the basic tenet “What happens in a function, stays in a function”, and that does seem to stick.
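
To make the “What happens in a function, stays in a function” rule concrete, here is a minimal sketch in Python 3. It is not taken from the lesson material; the names double, local_total and number are made up for illustration.

```python
def double(value):
    # 'value' and 'local_total' exist only while the function runs.
    local_total = value * 2
    value = 0  # changing the local name does not touch the caller's variable
    return local_total


number = 21
doubled = double(number)

# The caller's variable is untouched by what happened inside the function.
unchanged = (number == 21)

# Names created inside the function are not visible out here.
try:
    local_total
except NameError:
    leaked = False
else:
    leaked = True
```

After running this, number is still 21, doubled holds the returned value, and leaked is False because local_total never escaped the function.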

The conference and my talk

Due to teaching I only got to attend the last day of the conference. The programme looked really nice, and I got to see some really great talks. The morning of the last day opened with Laurens Van Houtven speaking about cryptography, and Jackie Kazil speaking about how she started using programming in her journalism and how that led her to new pastures. After lunch there were several other talks, most of which were pretty technical. Such talks can be really good, but to me they lose their value when they don’t even have a 3 minute “subject of my talk for dummies” intro.

My talk was at the end of the day, and was entitled “Python and Biology: a shotgun wedding” (pardon the pun, when the title appeared in my head, resistance was futile). The background for the talk was that I have several times during the last couple of years helped people – primarily biologists – start programming. Naturally, as opinionated as I am, I have ended up with some do’s and don’ts on where to start. I also included a bit of background on why life scientists have had to get into this game, and also showed some examples. I have included the slides below.

The talk seemed to be fairly well received – it was however aimed at novices, and there did not seem to be too many of those in attendance. I did however see some people nodding vigorously in the front, and got some really nice questions at the end, so all in all I think it went over well.



Basic bioinformatics python course, part II

This is the second part of the python bioinformatics course that I have taught biologists. This module is about control flow and how to handle input and output. Control flow is needed in mainly two situations: when a decision has to be made based on data, or when a piece of code should be repeated. In Python, decisions are made using an IF statement, while iterations (repeating code) are done with either a FOR loop or a WHILE loop. How to handle input and output from files is also described – in most cases that is where the data in question is to be found, and it is easier to keep track of results if the program prints the results to a file.
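
As a rough illustration of these building blocks, here is a minimal sketch in Python 3. It is not taken from the course slides: the function names, the 0.5 threshold and the tab-separated report format are all assumptions made for the example.

```python
def classify_gc(sequence, threshold=0.5):
    """Return 'GC-rich' or 'AT-rich' depending on the GC fraction."""
    if not sequence:
        raise ValueError("empty sequence")
    gc_count = 0
    for base in sequence.upper():      # FOR loop: repeat once per base
        if base in "GC":               # IF statement: decide based on data
            gc_count += 1
    if gc_count / len(sequence) >= threshold:
        return "GC-rich"
    return "AT-rich"


def first_start_codon(sequence):
    """Return the position of the first ATG, or -1 if there is none."""
    position = 0
    while position <= len(sequence) - 3:   # WHILE loop: repeat until done
        if sequence[position:position + 3] == "ATG":
            return position
        position += 1
    return -1


def write_report(sequences, path):
    """Write one tab-separated line per sequence to a file."""
    with open(path, "w") as handle:
        for name, seq in sequences.items():
            handle.write(f"{name}\t{classify_gc(seq)}\t{first_start_codon(seq)}\n")


def read_report(path):
    """Read the report back in as a list of [name, class, position] lists."""
    with open(path) as handle:
        return [line.rstrip("\n").split("\t") for line in handle]
```

Calling write_report({"seq1": "CCATGA"}, "report.tsv") would write a single line for seq1, and read_report("report.tsv") reads it back as a list of strings.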

Logging your work

I often collaborate with biologists on different projects. Sometimes I do most of the bioinformatics stuff, but most of the time I try to help them do the work themselves. There are two main reasons for why I prefer this route. First of all – self preservation. There are too many projects out there that are interesting. If I were to do all of them myself, I would drown. Second – I enjoy teaching. It is fun seeing somebody understanding things and managing to do something new.

There is however one thing that I try to teach that I am beginning to think can only truly be learned after having botched things up. This is the importance of logging your work. I myself had to learn this the hard way. During my early PhD years I designed a tiling microarray chip for one specific bug. Six months later, I was asked to do the same thing again, but for a different bug. I thought that this would be a breeze. After all, I had done this before, hadn’t I? I went back to my notes, and to my horror discovered that I could not for the life of me reproduce the files that I had produced for the initial bug. I did know the programs that I had used, and also some of the settings, but no matter how I tweaked things, I could not get the same results. Since my previous design had not yet been put into production, I ended up redoing the entire thing for both bugs, and this time I meticulously wrote down the entire process. This time I knew that if we got good results and could publish on it, I would actually be able to tell how I designed the chip and why.

I often tell this story when I talk with biologists about logging computational work. I usually get a lot of nods from those listening, and I know that at least some of them will actually start logging their work. I am uncertain though of how many actually stick with this habit. Many people seem to think that they can trust their own memories. Don’t get me wrong – there are probably people out there who are capable of remembering in great detail how they did an analysis. But, I do not believe that this goes for the great majority of scientists. I believe that for most people, the only way of keeping track of what was done and why is to write it down.

An additional complication is that I believe many people, when they first get their data, do not really see a reason to log what they do. Most people start just exploring their data, making some graphs and tables just to see what the data looks like. In my opinion, this exploratory data analysis phase is vital – it gives the researcher a feel for the data that in my experience can be essential to discovering errors in both the data and the analyses. However, I think that for many, this exploratory data analysis phase silently and without fanfare slides over into a final production phase. Results that were initially produced in a “let’s just see what this looks like” fashion end up being used as figures and tables in the final paper without a real track record of where the results came from.

Creating a new habit can be difficult. Writing down what is done to the data and why can seem tedious and may seem like just a waste of time. However, instead of just saying “log your work” in a stern voice, I thought I would hold out some of the more tangible benefits that a good log can provide. Your mileage may vary, but if there is no log, these benefits will certainly not be available.

  • Error detection. If you know what you did, it is easier later to discover what went wrong if there is something in the results that does not add up. It can be very easy to write 2 instead of 3 in an option setting, and when working with sequences, ATGGC is very close to ATGCC.
  • Internal reuse of methods. Maybe you have a different data set to run on, or just want to change some small elements in the analysis. If the current procedure is already written down, reusing and changing it is a lot easier. For some people this spells writing a script for running an analysis, but even just a cut-and-paste sequence of commands can go a long way.
  • Writing the materials and methods section. If the results are good, you will want to publish on them. If there is a written log stating how the results were produced, writing the M&M section should be a walk in the park.
  • Defending the results in reply to a reviewer. A reviewer might ask questions about the analyses. If the log files detail not only how the results were produced, but also why various decisions were made, it is easier to respond to questions about the whys and the wherefores of the analysis.
  • Reproducibility. In theory, all science should be reproducible. If it is not reproducible for the one creating the results in the first place, nobody else can reproduce it either. If your work is reproducible by others it might not benefit you directly here and now, but may increase the citation rate of your paper. Many people dislike citing papers where they are not quite certain of what was done.

The last question is then what should be logged and how to keep a log. In my logs I usually note things such as:

  • program versions
  • program options
  • location of files
  • file versions. Calculating a check sum can go a long way – use for instance md5sum.
  • urls for where files were downloaded from, together with the download date
  • thoughts about results and solutions and discussions about how choices were made. 
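
For those who prefer to script part of this, the sketch below shows one way to compute a checksum and append a dated entry to a plain text log in Python 3. It is only an illustration: the function names and the layout of the entry are my own assumptions, and hashlib's MD5 digest is used because it gives the same value that md5sum prints.

```python
import datetime
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large sequence files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_log_entry(log_path, command, files=(), note=""):
    """Append a dated entry with the command, file checksums and a note."""
    today = datetime.date.today().isoformat()
    lines = [f"== {today} ==", f"command: {command}"]
    for path in files:
        lines.append(f"md5 {md5_of_file(path)}  {path}")
    if note:
        lines.append(f"note: {note}")
    with open(log_path, "a") as handle:   # "a" appends, never overwrites
        handle.write("\n".join(lines) + "\n\n")
```

A call such as append_log_entry("log.txt", "perl rnammer -S bac ...", files=["example/ecoli.fsa"], note="first test run") adds one entry to the end of log.txt; the log file name and the note are placeholders.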

For a long time my own logs have simply been a dated journal where I copy-paste commands, links to files, md5sums of input and result files, and where I discuss with myself the reason for my decisions. I keep this in a plain text file. I have tried other solutions that allowed me to paste in pictures, and which would let me import pdfs and other documents, but the plain text file still sticks with me to this day. This file can be read on any computer, does not require special software to open, and is easy to keep track of. I do know of those that use Evernote for this, and others again that use TiddlyWiki. The technical solution behind a log is in my opinion not all that important. The really important thing is that it should be something that is easy for you to use, otherwise it just becomes another barrier to writing things down. Keep it simple, keep it easy, and in the end the log will work for you.

‘Sorting out Sorting’

There are times when my nerd shows more than usual. I recently found a movie on YouTube that brought back a lot of nerdy memories from my studies. I got my bachelor’s and master’s degrees at the University of Bergen. I really liked chemistry and biology, but I also really liked computers. My father had shown me the joy that could be found in putting together a computer without the manual without any blue smoke appearing. The consequence of this was that I studied both molecular biology and computer science.

The movie in question is one that used to be shown in the last lecture in the ‘Data Structures and Algorithms’ course. This movie was made in 1981 and illustrates nine different sorting algorithms using fairly hefty graphics for the time and has a very distinct plink-plonk sound track. The tradition among the students was that when this movie was shown, we would show up with biscuits in the shape of letters, and rødbrus, which is a soda primarily made for children. We would then during the movie sort the letter biscuits using the algorithm that was currently shown in the film. At the end, when all of the algorithms were shown at once and were racing against each other, we would all cheer for bubblesort. I do believe that somebody at some point actually made a banner in support of bubblesort.

So – in case you have been sitting there wondering which sorting algorithm is the fastest – sit back and enjoy. For my part, I have to go find some biscuits and rødbrus.

RNAmmer 1.2 install issues

RNAmmer is getting on in years, but it is still heavily used, something that we, the authors, deeply appreciate. However, it is not always easy to install. Here, I describe what needs to be done in order to get it up and running.

Path changes

The changes that have to be introduced are to be found in this section:


# the path of the program
my $INSTALL_PATH = "/usr/cbs/bio/src/rnammer-1.2";

# The library in which HMMs can be found
my $XML2GFF = "$INSTALL_PATH/xml2gff";
my $XML2FSA = "$INSTALL_PATH/xml2fsa";

# The location of the RNAmmer core module
my $RNAMMER_CORE     = "$INSTALL_PATH/core-rnammer";

# path to hmmsearch of HMMER package
chomp ( my $uname = `uname`);
my $PERL;
if ( $uname eq "Linux" ) {
        $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch";
        $PERL = "/usr/bin/perl";
} elsif ( $uname eq "IRIX64" ) {
        $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/irix64/hmmsearch";
        $PERL = "/usr/sbin/perl";
} else {
        die "unknown platform\n";
}

The program was originally written to be run on the servers at the Danish Technical University, hence the $INSTALL_PATH setting. This should be set to wherever you keep your RNAmmer installation. In my case, I am setting it to /home/karinlag/projects/rnammer, since I keep it as a local install in my home directory.

The next thing that has to be done is to get the right HMMer installation and to figure out where perl is.

You will need version 2.3 of HMMer, which you can download from this location. Download it, and read the INSTALL instructions. It should install cleanly on most *nix systems.

I installed hmmer-2.3 in /home/karinlag/src, where it created the directory hmmer-2.3. Inside the src directory you will find the hmmsearch program. Set the $HMMSEARCH_BINARY variable to point to the hmmsearch program. Note: run the uname command to find out what system you have, so that you know which branch of the if statement to modify. If it says neither Linux nor IRIX64 (the latter is unlikely these days), you will need to change either the Linux string or the IRIX64 string to what uname reports, and set the paths in that branch accordingly.

You also need to check that you have the right perl path. You can figure that out by doing ‘which perl’.
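
If you would rather collect these values in one go, the following Python 3 sketch gathers them. It is not part of RNAmmer: the function name is made up, and the hmmsearch path in the example call is a placeholder that you must replace with your own. It relies on platform.system(), which on Linux reports the same name as uname, and on shutil.which, which does the same lookup as ‘which perl’.

```python
import os
import platform
import shutil


def rnammer_settings(install_path, hmmsearch_path):
    """Collect the values that the path section of the rnammer script needs."""
    return {
        # The string to compare against in the if/elsif test on $uname.
        "uname": platform.system(),
        "INSTALL_PATH": install_path,
        "HMMSEARCH_BINARY": hmmsearch_path,
        # None if perl is not found on the PATH.
        "PERL": shutil.which("perl"),
        # Quick sanity check that the hmmsearch path points at a program.
        "hmmsearch_is_executable": os.path.isfile(hmmsearch_path)
        and os.access(hmmsearch_path, os.X_OK),
    }


if __name__ == "__main__":
    # The second path is a placeholder; substitute your own location.
    settings = rnammer_settings(
        "/home/karinlag/projects/rnammer",
        "/path/to/your/hmmsearch",
    )
    for name, value in settings.items():
        print(f"{name}: {value}")
```

The printed values still have to be typed into the rnammer script by hand; the sketch only saves you from looking them up one at a time.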

You should now be able to do

perl rnammer -S bac -m lsu,ssu,tsu -gff - example/ecoli.fsa

and get results.

Posix errors

You may end up with errors that say something along the lines of 

FATAL: POSIX threads support is not compiled into HMMER; --cpu doesn't have any effect

If you get this, you need to find the following in the core-rnammer script:

system sprintf('%s --cpu 1 --compat ...... and so on

Remove the two instances of --cpu 1 (not the whole statement, just ‘--cpu 1’), and you should be good to go.
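
If you prefer not to edit the file by hand, the removal can be sketched in Python 3 as below. This is an assumption-laden helper rather than anything shipped with RNAmmer: the function names are made up, it assumes the option is always written as ‘--cpu 1’, and it keeps a .bak copy of the script in case something goes wrong.

```python
import re
import shutil

# Matches '--cpu 1' plus one leading space, but not e.g. '--cpu 12'.
CPU_OPTION = re.compile(r" ?--cpu\s+1\b")


def strip_cpu_option(script_text):
    """Return (new_text, number_removed) with every '--cpu 1' taken out."""
    return CPU_OPTION.subn("", script_text)


def patch_script(path):
    """Remove '--cpu 1' from the file at path, keeping a '.bak' copy."""
    shutil.copy2(path, path + ".bak")
    with open(path) as handle:
        original = handle.read()
    patched, removed = strip_cpu_option(original)
    with open(path, "w") as handle:
        handle.write(patched)
    return removed
```

Running patch_script on your copy of core-rnammer should report 2 removals if the script matches the description above; any other number is a sign that the file should be inspected by hand.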


You might end up with having RNAmmer complain about not being able to find XML/Simple.pm in @INC. To solve this, you need to install perl-XML-Simple. Installing perl modules is something I consider to be deep voodoo, so I won’t even try to describe how to do that. Refer to your system to figure that one out.

Other errors?

If you discover other errors than those I have described here, let me know in the comments!

Basic bioinformatics python course, part I


I have on several occasions had the privilege of teaching basic programming to biology students. My preferred language in this situation as in many others is python. I have also been fortunate enough to find a book which I think does a fairly good job of teaching basic python in a way that biologists find useful. In this context that mostly means dealing with sequences in a sensible way. The book in question is “Python for Bioinformatics” by Sebastian Bassi.

The only note here is that there are some spelling mistakes in it, and that it is from 2009. Python has progressed to Python version 3 now, whereas the book is at version 2. However, for a beginning programmer, this should not make too much of a difference.

I am here putting out the slides that I used for a one-day intro course for biologists. The course is very interactive, meaning that the slides contain many short exercises, each followed by the answer. In this post I am putting out the first set of slides, which deal with the basics; the rest will follow during the next couple of weeks.

Note: I have tried to ensure that these slides are bug free, but there are bound to be some mistakes somewhere. Please let me know if you spot any!


Part 1: The basics

The first lesson begins with discussing programming a bit, and the two modes in which python can be used – interactively and batch mode. I then go through the basic datatypes in python, i.e. what kinds of “things” are available. I cover how to use python as a calculator, how to work with strings, and also what lists and dictionaries are and how to use them.
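
To give a flavour of these datatypes, here is a small sketch that is not taken from the slides. It is written for Python 3 (the book uses Python 2, where dividing two integers with / behaves differently), and the sequence and the tiny codon table are made up for the example.

```python
# Python as a calculator
gc_fraction = (3 + 2) / 10          # true division in Python 3

# Strings: a DNA sequence is just text
dna = "ATGGCC"
length = len(dna)                   # number of bases
rna = dna.replace("T", "U")         # transcribe by swapping T for U
first_codon = dna[0:3]              # slicing: bases 0, 1 and 2

# Lists: an ordered collection that can grow
codons = [dna[0:3], dna[3:6]]
codons.append("TAA")

# Dictionaries: look up a value by its key
codon_table = {"ATG": "Met", "GCC": "Ala", "TAA": "Stop"}
first_amino_acid = codon_table[codons[0]]
```

Typing these lines one at a time in the interactive interpreter and inspecting each variable afterwards is a good way to get a feel for what each datatype does.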