Bioinformatician’s Toolkit for Creating Reproducible Workflows

Reproducibility is key in bioinformatics workflows. The integrity of your work depends on your code producing the same results every time you run it, so every bioinformatician should build reproducibility into everything they do.

Back when I learned how to code, there wasn’t really a standard method for maintaining reproducibility. I first started coding in Perl, which allows for automation, but data integrity checks were simply not robust. Luckily, given the expansion of this field, there are now many tools you can use to automate your workflows while maximizing reproducibility. Here are some of our favourites.

The Good: Shell Scripting

Though we don’t recommend that you rely on shell scripting for creating reproducible workflows, it certainly merits a mention for automation. Shell scripting allows you to create files that hold the commands you wish to run. When you run the file, each command is performed in turn exactly the way you intend it.

Pros: Shell scripting is something that anyone can do without requiring any additional software (except, of course, for those you want to run from your shell script). You can also do some variable assignment and basic data formatting using regex.

Cons: Shell scripting offers only limited integrity checking between steps. By default, a script carries on after a command fails (unless you opt in to stricter behaviour with set -e), and nothing verifies that the output of step 1 is what step 2 expects, so the subsequent steps can fail or, worse, run on bad data. You need to be really confident in each step for shell scripting to be your go-to for automation.

The Better: Conda Environments

Every bioinformatician should have some experience with Conda. If you haven’t, then it’s never too early to start! Conda is an open source package and environment management system that allows you to keep track of your dependencies to assist in creating reproducible workflows.

At some point during your work, you will realise that each program you need relies on a specific Python package version. In a native environment, you can only have one version of each package installed. With Conda, you can create an environment for each workflow and specify your dependencies in each, kind of like the Python virtualenv system we described earlier.

Pros: You have exquisite control over your dependencies for each environment. You can also choose the Python version for each environment that you create. Each time you run your code within the proper environment, you know you are using the same packages as last time.

Cons: Moving from a native Python environment to Conda takes a lot of work up front. You will need to set up all of your environments and dependencies from scratch.
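If you want to confirm that an environment really does contain the same packages as last time, a small check at the start of a script can help. The sketch below uses Python's standard importlib.metadata module. The function name find_mismatches and the idea of passing pins as a dictionary are assumptions for illustration, not part of Conda itself.

```python
from importlib import metadata


def find_mismatches(pins, lookup=metadata.version):
    """Compare pinned versions against what is installed in this environment.

    `pins` maps package names to the exact version string you expect.
    Returns {name: (wanted, found)} for each package that differs;
    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty result means every pinned package matched. The lookup parameter is there only so the function can be exercised without touching the real environment.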

The Best: Snakemake

Snakemake is the holy grail when it comes to creating reproducible workflows for bioinformaticians. As a dedicated workflow manager, it assists in creating automation as well as reproducibility at scale. Snakemake offers a human readable method of setting up and running your workflows, extending the Python programming language. All you need to do is set up rules that specify the inputs, outputs and commands.
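To make the idea of a rule concrete, here is a plain-Python sketch of what "inputs, outputs and commands" buys you. It is not Snakemake syntax or Snakemake's implementation: the Rule class and the needs_run function are hypothetical names, and the check is a simple make-style timestamp comparison, which is narrower than what a real workflow manager does.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Rule:
    """A hypothetical rule record: what goes in, what comes out, and how."""
    name: str
    inputs: list
    outputs: list
    command: str


def needs_run(rule):
    """Return True if any output is missing or older than the newest input.

    Raises FileNotFoundError if an input file does not exist.
    """
    outputs = [Path(p) for p in rule.outputs]
    if any(not p.exists() for p in outputs):
        return True
    oldest_output = min(p.stat().st_mtime for p in outputs)
    newest_input = max(
        (Path(p).stat().st_mtime for p in rule.inputs), default=0.0
    )
    return newest_input > oldest_output
```

Because each rule declares its files up front, a workflow manager can work out which steps are already up to date and which must be rerun, something a linear shell script cannot do on its own.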

Snakemake also integrates with Conda, for example through Miniconda, the minimal installer for Conda: each rule can point to its own Conda environment. This means that you also benefit from Conda's dependency management and avoid dependency hell. It will save you many headaches, trust us!

Pros: Easy to use, human readable Snakefiles provide exquisite control over your workflows. Support for parallelization and integration with Miniconda allows for seamless scaling when creating reproducible workflows.

Cons: There is little to criticize here, especially for those new to bioinformatics. Advanced users who rely on container technologies such as Singularity should check how well the container support in their Snakemake version fits their setup; recent releases can run rules inside containers. Other workflow managers that we haven't covered here, such as Nextflow, also support containers.

Now You Are Ready to Start Creating Reproducible Workflows in Bioinformatics!

This has been a very brief look into the types of options you have for creating reproducible workflows in your bioinformatics work. If you haven’t started using Conda or Snakemake yet, we urge you to become familiar with them now, as they will make your life much easier in the future. Happy coding!