“Education is our passport to the future, for tomorrow belongs to those who prepare for it today” – Malcolm X
Once upon a time I was labouring about with pipettes in hand, cutting and pasting plasmids, asking questions about the edibility of my agarose gels. I eventually parted ways with the wet lab and moved into teaching myself Computational skills that I could apply to Biological questions. The journey has been exciting, challenging and ultimately fulfilling. There is a persistence that the auto-didactic approach requires; a humility and continually curious energy. Given that Biology proper is increasingly pulling from the Computational spheres in order to effectively pursue certain questions, it is becoming increasingly important to be literate in these spheres. Even a basic understanding of key concepts will take one quite far.
I have pulled from several individuals for inspiration and guidance - some of them proficient by training, others self-taught. If not for their willingness to share both their mistakes and their successes, I believe my path would have been substantially more winding than it has been already. It does not require great effort to refer others to material which will help them. And so I want to pay it forward, putting together a list of courses, books, essays, readings and so on, which I have personally used (and hereby endorse), in understanding Computational Biology, along with all the auxiliary disciplines which mesh into the field (Probability, Statistics, Mathematics, Logic, Information Theory, Computer Science etc.).
There is a parallel (though somewhat outdated) repository on GitHub which contains the same information as here.
The Command Line - Shell
The Unix Workbench
The fundamental starting point in computational biology is the command line
interface (CLI). It may terrify you at first, but please persist, and move
forward with excitement as you begin traversing your system in a completely new
way. Believe me, right now things may seem so rosy and magical as you whiz
around in your graphical user interface (GUI), coddled by the ease of point and
click, but with enough time, you will begin dreading the clunkiness of some GUI
programs. The majority of tools in computational biology are built around a
Unix/Linux system, and as such these are generally the very basic requirements.
Embarking on a journey of learning Unix-based systems, in my opinion, comes
with committing to a philosophy of open software and freedom of access to
information. This is fundamentally about love. Love of education, love of
knowledge, love of others, and a love of those that come after us. It is not
surprising that almost every resource I have come across has offered a free
version alongside paid options.
You will begin seeing how fundamental and necessary these programs are for the proper functioning of modern day Scientific research. Hopefully you will appreciate how beautifully efficient they are, and the elegance of simple bash programs that aim to do one thing well.
You likely want to begin here, and take yourself so far, probably stopping right before you reach the vim vs. emacs wars. After you’re proficient in simple processes, then you can spend your precious time pondering whether org-mode is worth trying.
- The Unix Workbench by Sean Kross.
- https://github.com/seankross/the-unix-workbench
- Sean also kindly provides a Coursera unit which follows the structure of the book - highly recommended also.
Learn Enough Command Line to Be Dangerous
This is a companion resource to Unix Workbench. It begins at the same skill
level (beginner), and like Unix Workbench, works through the essentials, with a
focus on pragmatism. Having a go at the exercises is worthwhile, and creating
a personal cheat sheet of sorts is also not a bad idea. Straightforward,
stimulating and very helpful for the beginner.
- Learn Enough Command Line to Be Dangerous by Michael Hartl.
- Free online.
Data Carpentry
Probably the most basic of all the introductory resources for getting into data analysis and bioinformatics. The lessons are very short - often ~10 minutes per page - with explanations of key concepts which may appear mundane and basic to experienced users, but contain just the right amount of information to ease beginners into the approach. I liked the short section on Cloud Genomics. If you’d like to venture into other fields which utilise similar data analytic approaches, such as Epidemiology or Ecology, they have carpentry courses for these too.
Computational Biology
Books and Textbooks
Biostar Handbook
One of the most empowering developments to come out of
bioinformatics/computational biology education is
Biostars. I would put my money on it that the word
“Biostars” is familiar to every single student in this field. It is the legendary
forum for finding answers to what seems to be every practical question you
could think of. Istvan Albert and his colleagues
have gone ahead and condensed their understanding into the Biostar Handbook,
which, at a very reasonable price (especially when it comes to textbooks)
brings you up to speed, and in my opinion, gets you very close to being
competent and independent. What I love most about this book is its brutal
honesty, its transparency and its emphasis on diligence and patience. They have
slowly started breaking the book up into more specific and parcelized
mini-books that focus on a particular topic. Another big benefit of this is
that Istvan generously provides full access to his University lectures, which
are also terrific.
- Most recommended for beginners/entry level.
- https://www.biostarhandbook.com
Computational Biology: A Hypertextbook
In searching for a beginner level computational biology text that was up to
date with more recent developments in the field, I found it surprisingly
difficult to track anything down that was released less than 5 years ago.
Either the texts were focused on a specific section of comp. bio., such as
sequence matching algorithms, or they were highly recommended relics released
at the advent of high throughput sequencing - classics indeed, but insufficient
for a wholesome entry into the current field. Scrolling through Twitter I saw a
recommendation for Scott Kelley’s and Dennis Didulo’s “Computational Biology: A
Hypertextbook” - it seemed to tick all the boxes. Despite the lack of reviews
and anecdotes, I took the chance and purchased the e-book. Given its main
selling point is that it is a Hypertext-book, I figured using the print form
would be too clunky. So far I am very happy with the purchase - it is
generalised enough for a beginner, yet still specific in ways that Biostars
isn’t. For example it has chapters on essentials such as how exactly the
Smith-Waterman algorithm functions and so on. An excellent reference text which
deserves more attention! My biggest gripe here is not with the textbook
exactly, but rather with the ebook medium which it uses - a service called
VitalSource, an underpowered platform which allows you to purchase books on how
to use Linux, only to then realise that their standalone offline apps do not
come with Linux compatibility!! How you would open the book on Linux if you
lacked an internet connection is a mystery to me. Either way - the book is great.
- Recommended for beginners/entry level.
- https://www.amazon.com.au/Computational-Biology-Hypertextbook-Scott-Kelley/dp/1683670027
- The Kelley lab.
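Since Smith-Waterman comes up above as an example of an essential worth understanding, here is a minimal sketch of the idea in Python. It is not taken from the book. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary choices for illustration, and it uses a simple linear gap penalty rather than the affine penalties that real aligners usually use. The core trick is that no cell of the scoring matrix may drop below zero, which is what makes the alignment local.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring parameters are illustrative assumptions, not canonical values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a fresh alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("TTTACGTAAA", "GGACGTCC")
    print(score)
    print(top)
    print(bottom)
```

For anything beyond toy sequences you would reach for an established aligner; the point of writing it out once by hand is to see where the numbers in the matrix come from.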
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
A classic in the field, Biological Sequence Analysis imports quantitative
methodology to answer interesting and challenging Biological questions. An
extremely relevant book which has aged extremely well. It was published in 1998, so it has
some serious mileage. The name speaks for itself; within it, you will find writings on RNA structure co-variation analysis, sequence alignment algorithms and so on. Pairing this text with the other works listed here on Probability and Statistics, you may come to realise how clever and useful the probabilistic framework is. Many of the approaches this book discusses deserve the word essential, as they have built the field of Biology as we know it today. I came to it through reading Sean Eddy’s blog, and through using programs he has been involved in such as Infernal and R-scape. I will say this - they dive right into it, and the recommended “map” for reading the book is super cool. It stands as my go-to reference text on most things computational-algorithmic, particularly when the detail matters and I’m not searching for a quick and dirty method from internet forums. Everybody should have a copy close by.
- Paid, may be some free copies floating around though! It is very affordable considering what you get for your money.
- You won’t work through this book in a few weeks, it’ll take patience and you’ll most often have to go away and think on the material, test it out, see it in action, then come back for more.
- Late beginner all the way to advanced.
Courses
Applied Computational Genomics at the University of Utah (2020/2021)
If you’ve had to play around with *.bam and *.bed files then you’ve very likely come in contact with the excellent program called bedtools. Almost every time I need to use this program I discover something new about it, it just keeps on giving. The brain behind bedtools is Aaron Quinlan, a leading Computational Biologist out of the University of Utah. He offers a semester long, completely free course which lives on GitHub. Do you notice the trend here? World class thinkers who are willing to spread their knowledge, help others, and move Science forward, all without paywalls. Pay it forward if you ever get the chance! This is a great, in depth course which has many practical tutorials embedded within it. The homework is challenging and fulfilling - I have learned a lot here. Did I mention that Aaron is a terrific, down to earth teacher?
- Beginner/ entry level.
- Head over to https://github.com/quinlan-lab/applied-computational-genomics
Foundations of Computational and Systems Biology
I can’t be the only one whose jaw hits the floor when they see how rich MIT
OpenCourseWare has become, and how far back their content reaches in time.
There are perhaps no better examples of the spirit of education than this
initiative. Free lectures by some of the world’s top thinkers? You’ve gotta be
kidding me. MIT OpenCourseWare was made for auto-didacts, there is little more
you could ask for when seeking to educate yourself. Detailed course structures
and trajectories, additional recommended readings, good quality videos, and no
pay walls - yes!
This course is run by a couple of great educators (Christopher Burge, David Gifford & Ernest Fraenkel), who are also highly capable Scientists in their own right. For one, Chris Burge is one of the pioneers of ab initio gene prediction, a highly successful paradigm which allowed us to understand and annotate much of the early high throughput sequencing data. He is also centrally involved in the popular “Mixture of Isoforms” (MISO) package. I recommend watching each lecture closely and definitely reading the accompanying writings. This is quite the intensive program if you decide to apply yourself, and it covers a sufficiently broad sweep of the field to give you the confidence to move forwards.
- Late beginner/intermediate level.
- https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/
Case Studies in Functional Genomics
Rafael Irizarry is an analyst and statistician whom I stumbled upon early in my self-study, and have been following ever since. His honesty, breadth of knowledge and sober perspectives on fundamental problems in computational analysis are extremely valuable. He’s far from a hype man in an industry of hyperbole. Given his basis in statistics, his courses are all rich in quantitative information, so I would say this is a good place to start for the more advanced students.
Mathematics, Probability and Statistics
Coming from Biology, a field closely wedded to the qualitative aspects of Scientific inference, where our formal training for the most part omits many of the approaches utilised by the ‘harder sciences’, the transition to the quantitative world has perhaps been the most challenging part of this journey. In some respects, you must undergo a great change in how you approach problems, how you approach data and measures, and your relationship to truth and validity. Much of this can be uncomfortable, as you must confront the fragility of your prior approaches to questions. Personally, this is an ongoing project which demands a lot of effort and grit. I have immensely appreciated this change in my thought processes, and am very grateful that it has taken place. The world is a bigger place now than it ever was. A world where precision, consistency, and repetition are emphasised. You will likely develop an obsession with priors, and with starting assumptions. Sometimes before we can even approach a problem, we must sketch out some vague axioms we believe to be important. Unfortunately, these topics are taught in notoriously bland ways across campuses - they are introduced prematurely, are forced, and very often the student barely has any confidence in their own logic and reasoning. Such courses may at times skip over the very basic reason for using tools such as probability theory in the first place: to make better decisions in the presence of uncertainty.
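To make "priors" and "decisions under uncertainty" a little more concrete, here is a small sketch of Bayes' theorem applied to a diagnostic test. All of the numbers are hypothetical, chosen only for illustration, and the function name `posterior` is my own. The takeaway is how strongly the starting assumption (the prior) shapes the answer.

```python
def posterior(prior, sensitivity, specificity):
    """Probability of having the condition given a positive test (Bayes' theorem).

    prior:       P(condition) before seeing the test result
    sensitivity: P(positive | condition)
    specificity: P(negative | no condition)
    """
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical test: 95% sensitive, 95% specific.
    for prior in (0.01, 0.10, 0.50):
        print(f"prior={prior:.2f} -> posterior={posterior(prior, 0.95, 0.95):.3f}")
```

With a rare condition (a prior of 1%), a positive result from this fairly good test still leaves the probability of having the condition at only around 16%, because false positives from the large healthy population outnumber the true positives.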
Books and Textbooks
Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking
I believe this is the most comfortable introduction to conceptualising
problems, and answers, in a more quantitative manner. As the title outlines,
this book is almost purely built on intuitive explanations of key, widely used
procedures in statistics. It explains when you would do x, why you would do it,
why you wouldn’t, and more appropriately, the assumptions and biases associated
with it. If you’re afraid of equations (overcome this fear as soon as possible,
there is nothing to fear), you’ll be given that first bit of confidence that
should then give you the enthusiasm and energy to continue developing. The
language here is very clear, very direct, very concise and to the point. The
text is very honest, and works hard to provide bountiful examples of both good
and bad uses of statistics in the literature. You will enjoy working through
it, trust! The author Harvey Motulsky is the founder of
GraphPad Statistical Analysis software, so
you may have already come across his creations without knowing! There is also a
smaller streamlined text from Harvey called “Essential Biostatistics”, which is
also worth reading if you just need a straightforward and reduced explanation.
One last note, this book leans heavily towards biomedical research, and so most
of the examples pull directly from this field.
- PERFECT for absolute beginners. Great reference to have on hand for those more familiar.
- I must admit, the book is not cheap and the smaller “Essential” book is just not worth buying - it is a thin ~150 pages of widely spaced formatting. I used library copies for both until I had enough savings to purchase the larger of the two books. If you prefer eBooks then I’d go looking.
Modern Statistics for Modern Biology
Chances are that you’ve progressed enough in your education that you’re now
dealing with some very real datasets. And as such, you likely have many
outstanding questions directly related to the data on your hands, the answers
to which are scattered all across the internet, and within various texts. This
book will probably come to you as a revelation. It covers much of the same
core content on statistical inference as volumes of analogous books in this
category do, but the biggest difference is its almost singular focus on the
application to biological datasets, very likely ones similar to those that you
have on your hands right now. It’s clear, filled with witty and insightful
comments, and even provides some of the history behind the stats. You may
choose to buy a physical or digital copy of the book, or use it online,
completely free - once again, given the quality of the book, it’s hard to
believe that you can access it free of charge; that’s love. Another really
impressive aspect of this is that it was completely written in R(markdown),
and the source code for compiling the book is open. As it was written in R,
all of the analyses are also undertaken in R, giving you that visceral feel
for the analysis. Bravo Susan and Wolfgang, you
are leading the way. I purchased a physical copy to support the authors, but I
mainly use the online version as frankly it’s easier that way.
- Beginner to Intermediate with a keen interest.
- Paid or free, you choose!
Probability Theory: The Logic of Science
The more I have sunk into Statistics, the more I have fallen in love with
Probability Theory. This one is a real heavyweight for you, one of those books
you put on a shrine and hope that someday you’ll be able to comprehend its
entirety. It marries Bayesian and Frequentist approaches, and makes evident
that probability theory can be approached as an extension of Logic - as another
tool of fundamental reasoning, giving the ability to make decisions under
uncertainty. This book, despite its density, its mystery, reads beautifully.
I only wish that Jaynes was still with us so that I could perhaps one day sit
in on one of his lectures. There is an honesty here which reminds me of Taleb,
so you’ll find yourself cracking up throughout. You’ll need to put your savings
together for this one! But if you’d like to get deep into Probability and have
some loose coin then this is a great option.
- Intermediate to Advanced level - have to be very keen to work through it.
- You’ll need to brush up on your Math if you’re like me, go at your own pace.
- Used in Elena Rivas’s Mathematics in Biology course (see below).
- Here.
Mathematics for the Non-Mathematician
A great book. Cannot recommend it enough. There are many, many positive reviews
all over the web, I will just say, go and get it. Morris Kline has a way of
explaining cryptic things in simple ways and for me personally, learning about
the History of a concept helps me to understand it a great deal. A joy to read.
- Paid, but you should be able to find it from second hand book sellers such as Thrift Books, Abe Books, World of Books etc.
No Bullshit Guide to Math & Physics/Linear Algebra.
Yes. No bullshit. Teach yourself linear algebra and maths. Go ahead. Yes.
The Art of Problem Solving Vol 1.
Richard Rusczyk and Sandor Lehoczky have written perhaps the best starting point for getting comfortable with math. A huge emphasis on exercises and practice. Basic high school math is the only prerequisite.
Introduction to Counting and Probability
If you’re looking for a bare bones, absolutely novice introduction to Probability
theory, this is an amazing option. It is from the same series as The Art of
Problem Solving mentioned above. Well worth the effort.
Book of Proof
By Richard Hammack. A book of this type can very well change your life by revealing the beauty and simplicity of deductive procedures. A lovely work. Richard has chosen to provide this book free of charge online. Pass it on!
Courses
Mathematics in Biology
I greatly admire the work of Elena Rivas. I
think she is an excellent Scientist (under-appreciated too), so when I saw that
she was running a semester long Mathematics in Biology course at Harvard, I
was over the moon! My excitement soon transformed into humility, as the course
structure showed me just how much further I have to go in my own knowledge base.
It takes long enough to replace the brainwashing that Biologists go through in
our undergraduate days, being taught to freeze in horror at anything remotely
mathematical and quantitative. Courses such as this allow one to lay a new
foundation for Mathematics, and understand that so many Biological phenomena
can be better understood with the tools of Math. It is comprehensive and
rigorous, and is everything you expect the course to be. Another one of the
world’s leading thinkers, providing content to the world without reservations.
- Completely free
- Head over to the course website.
Khan Academy
Khan Academy is pretty self-explanatory - I think everybody under the age of 50
has watched one of their videos at some point in their life. Their dedicated
website gives you a terrific and rather comprehensive introduction to
Mathematics and Statistics at most skill levels. I have no issues admitting
that I used, and still use Khan for self education. Plus, Khan is a great human
being, so why not.
Programming Languages
R
I think it’s safe to simply point one in the direction of Hadley Wickham and his collaborators. You’ll find his works referenced by almost all introductory courses and websites on R. Head over to his main site and have a browse. The philosophy behind the tidyverse and tidy data is a breath of clarity. I’m just a beginner in R myself, as I was able to do 80% of my ~2 years worth of analysis in basic shell programming, thanks to the immense catalogue of free computational tools. To be blunt… as the joke goes, you can probably replace most computationalists with an automated bedtools applet.
Coursera
Quite inexpensive, and they do have financial assistance, which I have no shame in admitting I took advantage of on numerous occasions during my studies.
Introduction to Bioconductor
Bioconductor is the main R repository of analytic packages used by scientists, and most specifically, biologists. Its integration and organization will make you wonder where conda went wrong (HA!).
Bioconductor for Genomics
Although I haven’t taken this course yet, after perusing its materials and noting its popularity, I’ve got it saved for a future date as it covers the essentials that one needs to get up and running in R, analysing datasets and making sense of our questions at hand.
Bioconductor for Genomic Data Science
Exploring, Visualizing, and Modeling Big Data with R
Another resource which is on my to-do list. What attracted me to this was the emphasis on Big Data. Large datasets can be really tricky to wrangle and manipulate, and given that R stores most data in memory, it pays off to know how to avoid crashing your system, making your own and your coworkers’ lives a little easier.
https://okanbulut.github.io/bigdata/
Grammar of Graphics
The best introductory two-part video on plotting in R - using the marvelous ggplot2 package. ggplot2 workshop part 1 by Thomas Lin Pedersen.
Julia
Another fancy language, one of another million that you’ve heard of.. oh boy.. Trust, Julia is phenomenal - do a bit of background reading as to why it’s a worthwhile option for you. For me, it’s the speed and power, the supportive community for Science and the quantitative fields, and the fact that it’s easy to learn.
Think Julia: How to Think like a Computer Scientist
Perhaps the best introductory book to Julia and the thought patterns which programming gradually develops in one. Those who are familiar with Python may notice the
similarity of the title to another work “Think Python…” – Think Julia is
actually this exact book translated into Julia!
Julia for Data Science
Took Bogumil’s short course…. loved his DataFrames package…. bought his textbook…. put a smile on my face.
Coursera
Nextflow
Reproducibility, portability, containerisation and workflows. Four words which have rightly generated much activity in recent years. The basic idea is to create a closed, referential and singular ‘container’ which can be shared easily, run comfortably by both computationalists and wet lab experimentalists, and most importantly, on most operating systems and architectures with minimal leakage and requirements for local dependencies. If I send you my pipeline the goal is for you to be able to run it virtually anywhere, and for it to behave in the exact same way each and every time, across every single run. If you have any experience in chasing datasets and programs from published literature, you’ll know just how murky the waters can get.
Nextflow Carpentries
I have to admit that the “introductory” Nextflow tutorials on the official Nextflow site are woefully ambiguous and simplistic. I found the Carpentries website a lot more forgiving.
Blogs and Pages to Follow
Coming soon.
http://compeau.cbd.cmu.edu/home/teaching/great-ideas-in-computational-biology/