October 8, 2022

Computational Biology for Autodidacts

“Education is our passport to the future, for tomorrow belongs to those who prepare for it today” – Malcolm X

Once upon a time I was labouring about with pipettes in hand, cutting and pasting plasmids, asking questions about the edibility of my agarose gels. I eventually parted ways with the wet lab and moved into teaching myself Computational skills that I could apply to Biological questions. The journey has been exciting, challenging and ultimately fulfilling. There is a persistence that the auto-didactic approach requires: a humility and a continually curious energy. Given that Biology proper is increasingly pulling from the Computational spheres in order to effectively pursue certain questions, it is becoming ever more important to be literate in these spheres. Even a basic understanding of key concepts will take one quite far.

I have pulled from several individuals for inspiration and guidance - some of them proficient by training, others self-taught. If not for their willingness to share both their mistakes and their successes, I believe my path would have been substantially more winding than it has been already. It does not require great effort to refer others to material which will help them. And so I want to pay it forward by putting together a list of courses, books, essays, readings and so on which I have personally used (and hereby endorse) in understanding Computational Biology, along with all the auxiliary disciplines which mesh into the field (Probability, Statistics, Mathematics, Logic, Information Theory, Computer Science etc.).

There is a parallel (though somewhat outdated) repository over at GitHub which contains the same information as this page.

The Command Line - Shell

The Unix Workbench
The fundamental starting point in computational biology is the command line interface (CLI). It may terrify you at first, but please persist, and move forward with excitement as you begin traversing your system in a completely new way. Believe me, right now things may seem so rosy and magical as you whiz around in your graphical user interface (GUI), coddled by the ease of point and click, but with enough time you will begin dreading the clunkiness of some GUI programs. The majority of tools in computational biology are built around a Unix/Linux system, and as such a working knowledge of these systems is the most basic requirement. Embarking on a journey of learning Unix-based systems, in my opinion, comes with committing to a philosophy of open software and freedom of access to information. This is fundamentally about love. Love of education, love of knowledge, love of others, and a love of those that come after us. It is not surprising that almost every resource I have come across has offered a free version alongside paid options.

You will begin seeing how fundamental and necessary these programs are for the proper functioning of modern-day Scientific research. Hopefully you will appreciate how beautifully efficient they are, and the elegance of simple bash programs that aim to do one thing well.

You likely want to begin here, and take yourself so far, probably stopping right before you reach the vim vs. emacs wars. After you’re proficient in simple processes, then you can spend your precious time pondering whether org-mode is worth trying.

Learn Enough Command Line to Be Dangerous
This is a companion resource to Unix Workbench. It begins at the same skill level (beginner) and, like Unix Workbench, works through the essentials with a focus on pragmatism. Having a go at the exercises is worthwhile, and creating a personal cheat sheet of sorts is also not a bad idea. Straightforward, stimulating and very helpful for the beginner.

Data Carpentry

Probably the most basic of all the introductory resources for getting into data analysis and bioinformatics. The lessons are very short - often ~10 minutes per page - and explain key concepts which may appear mundane to experienced users, but contain just the right amount of information to ease beginners into the approach. I liked the short section on Cloud Genomics. If you’d like to venture into other fields which utilise similar data analytic approaches, such as Epidemiology or Ecology, then they also have some carpentry courses in these too.

Computational Biology

Books and Textbooks

Biostar Handbook
One of the most empowering developments to come out of bioinformatics/computational biology education is Biostars. I would put money on the word “Biostars” being familiar to every single student in this field. It is the legendary forum for finding answers to what seems to be every practical question you could think of. Istvan Albert and his colleagues have gone ahead and condensed their understanding into the Biostar Handbook, which, at a very reasonable price (especially when it comes to textbooks), brings you up to speed and, in my opinion, gets you very close to being competent and independent. What I love most about this book is its brutal honesty, its transparency and its emphasis on diligence and patience. They have slowly started breaking the book up into more specific mini-books that each focus on a particular topic. Another big benefit is that Istvan generously provides full access to his University lectures, which are also terrific.

Computational Biology: A Hypertextbook
In searching for a beginner level computational biology text that was up to date with more recent developments in the field, I found it surprisingly difficult to track anything down that was released less than 5 years ago. Either the texts were focused on a specific section of comp. bio., such as sequence matching algorithms, or they were highly recommended relics released at the advent of high throughput sequencing - classics indeed, but insufficient for a well-rounded entry into the current field. Scrolling through Twitter I saw a recommendation for Scott Kelley and Dennis Didulo’s “Computational Biology: A Hypertextbook” - it seemed to tick all the boxes. Despite the lack of reviews and anecdotes, I took the chance and purchased the e-book. Given its main selling point is that it is a Hypertext-book, I figured using the print form would be too clunky. So far I am very happy with the purchase - it is generalised enough for a beginner, yet still specific in ways that Biostars isn’t. For example, it has chapters on essentials such as how exactly the Smith-Waterman algorithm functions. An excellent reference text which deserves more attention! My biggest gripe here is not with the textbook exactly, but rather with the ebook medium which it uses - a service called VitalSource, an underpowered platform which allows you to purchase books on how to use Linux, only to then realise that their standalone offline apps do not come with Linux compatibility!! How you would open the book on Linux if you lacked an internet connection is a mystery to me. Either way - the book is great.
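For a flavour of what such a chapter walks through, here is a minimal Python sketch of Smith-Waterman local alignment. It is not taken from the book: the scoring values (match = 2, mismatch = -1, gap = -2) and the simple linear gap penalty are assumptions picked purely for illustration, and real tools use substitution matrices and affine gaps.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of strings a and b with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b).
    Scoring values are illustrative assumptions, not canonical defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 is what makes the alignment local.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("XXACGTXX", "YYACGTYY")
print(score, top, bottom)
```

The shared "ACGT" core is picked out while the unrelated flanks are ignored, which is the whole point of a local (rather than global) alignment.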

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
A classic in the field, Biological Sequence Analysis imports quantitative methodology to answer interesting and challenging Biological questions. A highly relevant book which has aged extremely well. It was published in 1998.

Courses

Applied Computational Genomics at the University of Utah (2020/2021)
If you’ve had to play around with *.bam and *.bed files then you’ve very likely come in contact with the excellent program called bedtools. Almost every time I need to use this program I discover something new about it; it just keeps on giving. The brain behind bedtools is Aaron Quinlan, a leading Computational Biologist out of the University of Utah. He offers a semester-long, completely free course which lives on GitHub. Do you notice the trend here? World class thinkers who are willing to spread their knowledge, help others, and move Science forward, all without paywalls. Pay it forward if you ever get the chance! This is a great, in-depth course which has many practical tutorials embedded within it. The homework is challenging and fulfilling - I have learned a lot here. Did I mention that Aaron is a terrific, down to earth teacher?
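If you have never touched interval files, the core question a tool like bedtools answers is "which regions in file A overlap regions in file B?". Below is a naive Python sketch of that idea on BED-style coordinates (0-based start, half-open end). It is only a conceptual illustration with made-up intervals, not how bedtools is implemented, and the function name is hypothetical.

```python
def intersect(a_intervals, b_intervals):
    """Return the overlapping portions of two interval lists.

    Each interval is (chrom, start, end) in BED convention:
    0-based start, half-open end. Naive all-against-all scan per
    chromosome, fine for toy data but far too slow for real genomes.
    """
    # Group B by chromosome so we only compare like with like.
    b_by_chrom = {}
    for chrom, start, end in b_intervals:
        b_by_chrom.setdefault(chrom, []).append((start, end))

    overlaps = []
    for chrom, a_start, a_end in a_intervals:
        for b_start, b_end in b_by_chrom.get(chrom, []):
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # half-open: touching ends do not overlap
                overlaps.append((chrom, start, end))
    return overlaps


# Made-up example intervals
a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 60)]
b = [("chr1", 150, 350), ("chr2", 60, 70)]
print(intersect(a, b))
```

Note that the chr2 pair shares only a boundary (60) and so is not reported, which is a direct consequence of the half-open coordinate convention.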

Foundations of Computational and Systems Biology
I can’t be the only one whose jaw hits the floor when they see how rich MIT OpenCourseWare has become, and how far back their content reaches in time. There are perhaps no better examples of the spirit of education than this initiative. Free lectures by some of the world’s top thinkers? You’ve gotta be kidding me. MIT OpenCourseWare was made for auto-didacts; there is little more you could ask for when seeking to educate yourself. Detailed course structures and trajectories, additional recommended readings, good quality videos, and no paywalls - yes!

This course is run by a couple of great educators (Christopher Burge, David Gifford & Ernest Fraenkel), who are also highly capable Scientists in their own right. For one, Chris Burge is one of the pioneers of ab initio gene prediction, a highly successful paradigm which allowed us to understand and annotate much of the early high throughput sequencing data. He is also centrally involved in the popular “Mixture of Isoforms” (MISO) package. I recommend watching each lecture closely and definitely reading the accompanying writings. This is quite the intensive program if you decide to apply yourself, and it covers a sufficiently broad sweep of the field to give you the confidence to move forwards.

Case Studies in Functional Genomics

Rafael Irizarry is an analyst and statistician whom I stumbled upon early in my self-study, and have been keeping up with ever since. His honesty, breadth of knowledge and sober perspectives on fundamental problems in computational analysis are extremely valuable. He’s far from a hype man in an industry of hyperbole. Given his basis in statistics, his courses are all rich in quantitative information, so I would say this is a good place to start for the more advanced students.

Mathematics, Probability and Statistics

Coming from Biology, a field closely wedded to the qualitative aspects of Scientific inference, where our formal training for the most part omits many of the approaches utilised by the ‘harder sciences’, the transition to the quantitative world has perhaps been the most challenging part of this journey. In some respects, you must undergo a great change in how you approach problems, how you approach data and measures, and your relationship to truth and validity. Much of this can be uncomfortable, as you must confront the fragility of your prior approaches to questions. Personally, this is an ongoing project which demands a lot of effort and grit. I have immensely appreciated this change in my thought processes, and am very grateful that it has taken place. The world is a bigger place now than it ever was. A world where precision, consistency, and repetition are emphasised. You will likely develop an obsession with priors, and with starting assumptions. Sometimes before we can even approach a problem, we must sketch out some vague axioms we believe to be important. Unfortunately, these topics are taught in notoriously bland ways across campuses - they are introduced prematurely, they are forced, and very often the student barely has any confidence in their own logic and reasoning. Such courses may at times skip over the very basic reason for using tools such as probability theory in the first place: to make better decisions in the presence of uncertainty.
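As a small illustration of why priors end up mattering so much, here is a Python sketch of Bayes' rule applied to a diagnostic test. The numbers (1% prevalence, 95% sensitivity, 90% specificity) are invented for the example and the function name is hypothetical.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' rule.

    prior        P(condition) before seeing the test result
    sensitivity  P(positive | condition)
    specificity  P(negative | no condition)
    """
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Invented numbers: a rare condition and a reasonably good test.
p = posterior(prior=0.01, sensitivity=0.95, specificity=0.90)
print(f"P(condition | positive) = {p:.3f}")  # prints 0.088
```

Even with a decent test, a positive result here leaves the probability under 9%, because the condition was rare to begin with. Swap the prior for 0.5 and the same test result pushes the posterior above 90%, which is exactly the kind of shift in intuition that the resources below help to build.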

Books and Textbooks

Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking
I believe this is the most comfortable introduction to conceptualising problems, and answers, in a more quantitative manner. As the title outlines, this book is almost purely built on intuitive explanations of key, widely used procedures in statistics. It explains when you would do x, why you would do it, why you wouldn’t, and more importantly, the assumptions and biases associated with it. If you’re afraid of equations (overcome this fear as soon as possible, there is nothing to fear), you’ll be given that first bit of confidence that should then give you the enthusiasm and energy to continue developing. The language here is very clear, direct and concise. The text is very honest, and works hard to provide bountiful examples of both good and bad uses of statistics in the literature. You will enjoy working through it, trust! The author Harvey Motulsky is the founder of the GraphPad statistical analysis software, so you may have already come across his creations without knowing! There is also a smaller streamlined text from Harvey called “Essential Biostatistics”, which is also worth reading if you just need a straightforward and reduced explanation. One last note: this book leans heavily towards biomedical research, and so most of the examples pull directly from this field.

Modern Statistics for Modern Biology
Chances are that you’ve progressed enough in your education that you’re now dealing with some very real datasets. And as such, you likely have many outstanding questions directly related to the data on your hands, the answers to which are scattered all across the internet, and within various texts. This book will probably come to you as a revelation. It covers much of the same core content on statistical inference as volumes of analogous books in this category do, but the biggest difference is its almost singular focus on the application to biological datasets, very likely ones similar to those that you have on your hands right now. It’s clear, filled with witty and insightful comments, and even provides some of the history behind the stats. You may choose to buy a physical or digital copy of the book, or use it online, completely free - once again, given the quality of the book, it’s hard to believe that you can access it free of charge; that’s love. Another really impressive aspect of this is that it was completely written in R(markdown), and the source code for compiling the book is open. As it was written in R, all of the analyses are also undertaken in R, giving you that visceral feel for the analysis. Bravo Susan and Wolfgang, you are leading the way. I purchased a physical copy to support the authors, but I mainly use the online version as frankly it’s easier that way.

Probability Theory: The Logic of Science
The more I have sunk into Statistics, the more I have fallen in love with Probability Theory. This one is a real heavyweight for you, one of those books you put on a shrine and hope that someday you’ll be able to comprehend in its entirety. It marries Bayesian and Frequentist approaches, and makes evident that probability theory can be approached as an extension of Logic - as another tool of fundamental reasoning, giving the ability to make decisions under uncertainty. This book, despite its density and its mystery, reads beautifully. I only wish that Jaynes was still with us so that I could perhaps one day sit in on one of his lectures. There is an honesty here which reminds me of Taleb, so you’ll find yourself cracking up throughout. You’ll need to put your savings together for this one! But if you’d like to get deep into Probability and have some loose coin then this is a great option.

Mathematics for the Non-Mathematician
A great book. Cannot recommend it enough. There are many many positive reviews all over the web, I will just say, go and get it. Morris Kline has a way of explaining cryptic things in simple ways and for me personally, learning about the History of a concept helps me to understand it a great deal. A joy to read.

No Bullshit Guide to Math & Physics/Linear Algebra.
Yes. No bullshit. Teach yourself linear algebra and maths. Go ahead. Yes.

The Art of Problem Solving Vol 1
Richard Rusczyk and Sandor Lehoczky have written perhaps the best starting point for getting comfortable with math. A huge emphasis on exercises and practice. Basic high school math is the only prerequisite.

Introduction to Counting and Probability
If you’re looking for a bare bones, absolutely novice introduction to Probability theory, this is an amazing option. It is from the same series as The Art of Problem Solving mentioned above. Well worth the effort.

Book of Proof
By Richard Hammack. A book of this type can very well change your life by revealing the beauty and simplicity of deductive procedures. A lovely work. Richard has chosen to provide this book free of charge online. Pass it on!

Courses

Mathematics in Biology
I greatly admire the work of Elena Rivas. I think she is an excellent Scientist (under-appreciated too), so when I saw that she was doing a semester-long Mathematics for Biology course from Harvard, I was over the moon! My excitement soon transformed into humility, as the course structure showed me just how much further I have to go in my own knowledge base. It takes long enough to replace the brainwashing that Biologists go through in our undergraduate days, being taught to freeze in horror at anything remotely mathematical and quantitative. Courses such as this allow one to lay a new foundation for Mathematics, and understand that so many Biological phenomena can be better understood with the tools of Math. It is comprehensive and rigorous, and is everything you expect the course to be. Another one of the world’s leading thinkers, providing content to the world without reservations.

Khan Academy
Khan Academy is pretty self-explanatory - I think everybody under the age of 50 has watched one of their videos at some point in their life. Their dedicated website gives you a terrific and rather comprehensive introduction to Mathematics and Statistics at most skill levels. I have no issues admitting that I used, and still use, Khan for self-education. Plus, Khan is a great human being, so why not.

Programming Languages

R

I think it’s safe to simply point one in the direction of Hadley Wickham and his collaborators. You’ll find his works referenced by almost all introductory courses and websites on R. Head over to his main site and have a browse. The philosophy behind the tidyverse and tidy data is a breath of clarity. I’m just a beginner in R myself, as I was able to do 80% of my ~2 years’ worth of analysis in basic shell programming, thanks to the immense catalogue of free computational tools. To be blunt… as the joke goes, you can probably replace most computationalists with an automated bedtools applet.

Coursera

Quite inexpensive, and they do have financial assistance, which I have no shame in admitting I took advantage of on numerous occasions during my studies.

Introduction to Bioconductor

Bioconductor is the main R repository of analytic packages used by scientists, and most specifically, biologists. Its integration and organization will make you wonder where conda went wrong (HA!).

Bioconductor for Genomics

Although I haven’t taken this course yet, having perused its materials and noted its popularity, I’ve got it saved for a future date, as it covers the essentials that one needs to get up and running in R, analysing datasets and making sense of our questions at hand.

Bioconductor for Genomic Data Science

Exploring, Visualizing, and Modeling Big Data with R

Another resource which is on my to-do list. What attracted me to this was the emphasis on Big Data. Large datasets can be really tricky to wrangle and manipulate, and given that R stores most data in memory, it pays to know how to avoid crashing your system, making your and your coworkers’ lives a little easier.

https://okanbulut.github.io/bigdata/
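The underlying idea of not holding everything in memory at once is not specific to R. As a rough, language-agnostic illustration (written in Python here, and not drawn from the course above), the sketch below computes the mean of one column by reading a CSV one row at a time. The column names and the tiny in-memory "file" are made up for the demo.

```python
import csv
import io


def streaming_mean(handle, column):
    """Mean of a numeric CSV column, reading one row at a time.

    Only a running total and a count are kept in memory, so the file
    can be far larger than available RAM.
    """
    total, count = 0.0, 0
    for row in csv.DictReader(handle):
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Stand-in for a large file on disk; with a real file you would pass
# an open file handle instead of io.StringIO.
demo = io.StringIO("gene,count\nA,10\nB,20\nC,60\n")
print(streaming_mean(demo, "count"))  # prints 30.0
```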

Grammar of Graphics

The best introductory two-part video on plotting in R, using the marvelous ggplot2 package: ggplot2 workshop part 1 by Thomas Lin Pedersen.

Julia

Another fancy language, one of another million that you’ve heard of.. oh boy.. Trust, Julia is phenomenal - do a bit of background reading as to why it’s a worthwhile option for you. For me, it’s the speed and power, the supportive community for Science and the quantitative fields, and the fact that it’s easy to learn.

Think Julia: How to Think like a Computer Scientist
Perhaps the best introductory book to Julia and the thought patterns which programming gradually develops in one. Those who are familiar with Python may notice the similarity of the title to another work “Think Python…” – Think Julia is actually this exact book translated into Julia!

Julia for Data Science

https://juliadatascience.io/

Took Bogumil’s short course…. loved his DataFrames package…. bought his textbook…. put a smile on my face.

Coursera

Nextflow

Reproducibility, portability, containerisation and workflows. Four words which have rightly generated much activity in recent years. The basic idea is to create a closed, self-contained ‘container’ which can be shared easily, run comfortably by both computationalists and wet lab experimentalists, and most importantly, run on most operating systems and architectures with minimal leakage and minimal requirements for local dependencies. If I send you my pipeline, the goal is for you to be able to run it virtually anywhere, and for it to behave in the exact same way each and every time, across every single run. If you have any experience in chasing datasets and programs from published literature, you’ll know just how murky the waters can get.

Nextflow Carpentries
I have to admit that the “introductory” Nextflow tutorials on the official Nextflow site are woefully ambiguous and simplistic. I found the Carpentries website a lot more forgiving.

Blogs and Pages to Follow

Coming soon.

http://compeau.cbd.cmu.edu/home/teaching/great-ideas-in-computational-biology/
