FastQForward just got faster!

The USTAR Center for Genetic Discovery has used the FastQForward pipeline to analyze over a thousand human genome sequences in collaboration with projects like the Utah Genome Project and the Utah Pediatric Cardiac Genomics Consortium.  When a patient suffering from a genetic disorder gets their genome sequenced, the automated FastQForward pipeline receives their sequence data, performs quality control, aligns the sequence against a reference for comparison, determines which genetic variants are present in the sample, and ranks the variants in order of which ones are most likely to cause disease. Researchers and clinicians then use this information to discover the genetic cause of the patient’s disease and help steer the course of treatment.

Due to the launch of new large-scale genomic medicine initiatives worldwide, more than a million human genomes will likely be sequenced in the next decade, and analysis pipelines must get faster to handle this massive data influx.

Carson Holt, PhD. Chief Pipeline Architect for FastQForward and Senior Software Developer for the Yandell Lab and USTAR Center for Genetic Discovery.

Carson Holt, Senior Developer in the Yandell Laboratory and Chief Pipeline Architect for the USTAR Center for Genetic Discovery, has made FastQForward run much faster. The souped-up pipeline can now analyze a complete human genome sequence in about 8 minutes. That’s fast. Here’s what Carson has to say about his modifications to FastQForward.

Carson Holt: My latest stats for FastQForward: FASTQ to BAM to VCF on a 30x whole genome (NA12878) in about 8 minutes. We are now by far the fastest analysis pipeline in existence. We are more than twice as fast as the DRAGEN genome analysis on a chip (a hardware-based solution to whole genome analysis that made headlines last year for being more than 20x faster than the fastest software-based solution). The stats for the figure (below) were produced using all 1232 CPUs we have on the Utah CHPC cluster, and the pipeline was run the same way that the DRAGEN chip group ran their benchmarks (their stats and the GATK comparison come directly from their website).

Mary Anne Karren: What did you do to achieve this improvement in the FastQForward pipeline?

Carson Holt: FastQForward is based on a framework I developed called Parallel::Architect. This framework allows you to write virtually any kind of analysis pipeline to be extremely fast and parallelized. I’ve worked to improve the intercommunication between processes in this framework, and I’ve balanced the size of the steps taken by FastQForward, so work can be distributed efficiently to hundreds and even thousands of CPUs. I also built in optimizations for read/write I/O operations to take advantage of high-throughput storage platforms like Lustre. When you optimize for these types of storage solutions, single large files can be spread across multiple servers and disks. Then, if you stagger read/write operations across processes so that they read and write as a group rather than individually, you can get data throughputs that are orders of magnitude greater than what you would get from a local disk. This is because each process is technically hitting independent servers and independent disks even though they are all reading the same file, so you get much greater throughput and faster response times.
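
The group-wise I/O idea described above can be illustrated with a minimal sketch. This is not FastQForward's or Parallel::Architect's code: the function names, the 1 MiB stripe size, and the worker count are all assumptions chosen for illustration, and threads stand in for the separate processes of a real cluster job. Each worker is given its own contiguous, stripe-aligned byte range of one shared file, so that on a striped file system such as Lustre the workers would tend to land on different servers and disks.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Assumed stripe size (1 MiB); the real value depends on how the file system is configured.
STRIPE_SIZE = 1 << 20


def stripe_aligned_ranges(file_size, n_workers, stripe_size=STRIPE_SIZE):
    """Split [0, file_size) into contiguous (offset, length) ranges, at most one
    per worker, whose boundaries fall on stripe boundaries."""
    n_stripes = -(-file_size // stripe_size)  # ceiling division
    base, extra = divmod(n_stripes, n_workers)
    ranges, stripe = [], 0
    for worker in range(n_workers):
        count = base + (1 if worker < extra else 0)
        if count == 0:
            continue  # more workers than stripes: this worker gets nothing
        offset = stripe * stripe_size
        length = min(count * stripe_size, file_size - offset)
        ranges.append((offset, length))
        stripe += count
    return ranges


def read_range(path, offset, length):
    """Read one byte range using positional reads, so workers share no file offset."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while length > 0:
            data = os.pread(fd, length, offset)
            if not data:
                break
            chunks.append(data)
            offset += len(data)
            length -= len(data)
        return b"".join(chunks)
    finally:
        os.close(fd)


def parallel_read(path, n_workers=4, stripe_size=STRIPE_SIZE):
    """Read a whole file as a group of workers, each covering its own range."""
    ranges = stripe_aligned_ranges(os.path.getsize(path), n_workers, stripe_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda r: read_range(path, *r), ranges)
        return b"".join(parts)
```

On a local disk this brings little benefit; the point of the sketch is only the partitioning, which keeps workers from contending for the same region of the file.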

Mary Anne Karren:  How much faster is FastQForward now than it was before?

Carson Holt: Previously FastQForward could process a whole genome in about an hour, which is still extremely fast considering that other pipelines can take days. It did this by running on about 300-400 CPUs. But changes I made now allow it to run on well over 1000 CPUs, which, together with other efficiency gains, made it possible to process a whole genome in less than 10 minutes. Of course, I’ve also built other features and more flexibility into the pipeline. For example, if researchers prefer to run the GATK best practices workflow instead of FreeBayes (the current default variant caller in FastQForward), they can do that with a simple command-line flag. It takes just under 20 minutes to run the full GATK best practices workflow, compared to 8 minutes for the FreeBayes workflow, because it involves a number of additional computationally intensive steps, but that’s still much faster than any other pipeline out there.
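
A rough back-of-the-envelope check of those figures, using only numbers quoted in this interview: about 60 minutes on 300-400 CPUs before, and about 8 minutes on 1232 CPUs now. The midpoint of 350 CPUs is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the interview; 350 is an assumed midpoint of "300-400 CPUs".
old_minutes, old_cpus = 60, 350
new_minutes, new_cpus = 8, 1232

speedup = old_minutes / new_minutes        # wall-clock speedup
cpu_ratio = new_cpus / old_cpus            # growth in CPU count
old_cpu_minutes = old_minutes * old_cpus   # total CPU time per genome, before
new_cpu_minutes = new_minutes * new_cpus   # total CPU time per genome, now
per_cpu_gain = old_cpu_minutes / new_cpu_minutes

print(f"wall-clock speedup: {speedup:.1f}x")
print(f"CPU count ratio:    {cpu_ratio:.1f}x")
print(f"CPU-minutes: {old_cpu_minutes} -> {new_cpu_minutes} ({per_cpu_gain:.1f}x less)")
```

Under these assumptions the wall-clock speedup (about 7.5x) is larger than the increase in CPU count (about 3.5x), which is consistent with the efficiency gains Carson mentions beyond simply adding CPUs.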

Mary Anne Karren:  How will this speed-up affect the Utah Genome Project and other projects using the FastQForward pipeline?

Carson Holt: It will allow us to present most results back to our partners and collaborators the same day. When you also consider that these results will immediately be available to fast visualization tools like iobio, you can see how this type of ultrafast analysis empowers researchers to make decisions about the direction of a study within just minutes of receiving the sequencing data.

With FastQForward you can process one sample extremely fast or multiple samples at a slower pace. This better matches how datasets tend to trickle off the sequencers, and it also gives researchers the flexibility to prioritize individual samples in a sort of emergency-room fashion.
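
The "emergency room" idea can be sketched as a priority queue. This only illustrates the scheduling concept and is not how FastQForward itself schedules work; the class name, priority values, and sample IDs are hypothetical.

```python
import heapq
import itertools


class SampleQueue:
    """Hands out samples by priority (lower number = more urgent),
    first come, first served within the same priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, sample_id, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), sample_id))

    def next_sample(self):
        """Return the most urgent waiting sample, or None if nothing is waiting."""
        if not self._heap:
            return None
        _, _, sample_id = heapq.heappop(self._heap)
        return sample_id


queue = SampleQueue()
queue.submit("routine-A")
queue.submit("routine-B")
queue.submit("urgent-C", priority=0)  # arrives last, but is processed first

order = []
while (sample := queue.next_sample()) is not None:
    order.append(sample)
print(order)  # ['urgent-C', 'routine-A', 'routine-B']
```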

Mary Anne Karren:  Where are the bottlenecks to discovery now?

Carson Holt: Current sequencing technology is geared to produce large volumes of data from multiple genomes simultaneously while not actually giving results for any single genome all that fast. In fact, it usually takes a week or more for any single genome to finish sequencing. Analysis pipelines then take several days to process the sequencing data for each genome individually. The result is that you have a large volume of data (enough to easily overwhelm most analysis pipelines), but it doesn’t come in a timely manner. Having lapses of a month or more between DNA extraction and final results limits the utility of whole genome analysis from a clinical perspective. But having ultrafast analysis puts the bottleneck back on sequencing. The goal is then not to process even larger volumes of data, but rather to process smaller volumes faster. If a sequencer can finish a single genome within a day, then ultrafast analysis like FastQForward would allow for same-day and next-day clinical diagnostics, which could have a major impact on research and healthcare.

FastQForward speed comparison

Pipeline       FastQForward   DRAGEN GP 1.3   BWA-MEM 0.7.9a/GATK-HC 3.1.1
Decompress     6m 43s         9m 45s          18m 23s
Map/Align                                     4h 18m 37s
Sort/Dedup                    10m 12s
Compress
Variant Call   1m 31s                         4h 24m
Total          8m 14s         19m 57s         9h 1m
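
The totals in the comparison above can be turned into relative run times with a few lines of arithmetic. The sketch below uses only the values listed in the table; the assignment of individual stage times to pipelines is an assumption, made on the basis that each pipeline's listed stage times add up to its total, and the helper name is illustrative.

```python
import re


def to_seconds(text):
    """Convert a duration such as '4h 18m 37s' or '9h 1m' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


# "Total" row of the comparison, in column order.
totals = {
    "FastQForward": "8m 14s",
    "DRAGEN GP 1.3": "19m 57s",
    "BWA-MEM/GATK-HC": "9h 1m",
}

# Listed stage times, assigned to pipelines on the assumption that
# each pipeline's stage times sum to its total.
stages = {
    "FastQForward": ["6m 43s", "1m 31s"],
    "DRAGEN GP 1.3": ["9m 45s", "10m 12s"],
    "BWA-MEM/GATK-HC": ["18m 23s", "4h 18m 37s", "4h 24m"],
}

baseline = to_seconds(totals["FastQForward"])
relative = {}
for name, total in totals.items():
    total_s = to_seconds(total)
    stage_s = sum(to_seconds(t) for t in stages[name])
    relative[name] = total_s / baseline
    print(f"{name}: total {total_s}s, stages sum to {stage_s}s, "
          f"{relative[name]:.1f}x FastQForward's time")
```

With these totals, DRAGEN takes about 2.4 times as long as FastQForward, and the BWA-MEM/GATK pipeline about 66 times as long.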

 

Note:  Computational hardware for the USTAR Center for Genetic Discovery is housed and maintained at the University of Utah’s Center for High Performance Computing (CHPC).  The expertise and resources provided by the CHPC are essential for this work.

Questions? Contact Mary Anne Karren, USTAR Center for Genetic Discovery
makarren@genetics.utah.edu