bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data

Ilkay Altintas, Daniel Crawl

Citation
Ilkay Altintas, Daniel Crawl. "bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data". Talk or presentation, 16 October 2015; Presented at the Eleventh Biennial Ptolemy Miniconference, Berkeley.

Abstract
The enormous data growth due to next-generation gene sequencing places unprecedented demands on traditional single-processor bioinformatics algorithms. Efficient and comprehensive analysis of the generated data requires distributed and parallel processing capabilities. Based on this motivation and our previous experience in bioinformatics and distributed scientific workflows, we have created a Kepler suite called “bioKepler” (biokepler.org) that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. To develop such an environment, we build scientific workflow components that execute a set of bioinformatics tools using distributed data-parallel execution patterns. Once customized, these components are executed on multiple distributed platforms, including various Cloud and HPC computing platforms. bioKepler contains a set of Kepler actors called “bioActors”. There are two sets of bioActors: bioKepler and Bio-Linux. The bioActors inside bioKepler are configured with specific inputs, outputs, parameters, and documentation. These bioActors are specialized for running bioinformatics tools along with Kepler directors for distributed data-parallel (DDP) execution on Hadoop, Spark, and Stratosphere engines. Over 40 example workflows (http://biokepler.org/demos) demonstrating how to use these actors and directors have been packaged in the first release of bioKepler. The Bio-Linux bioActors, however, are mostly “unconfigured”, with only a single input and output. These actors represent most of the 500+ biology-related tools that come with the Bio-Linux distribution (environmentalomics.org/bio-linux/). Because of this large number of bioActors, the current bioKepler release doesn’t provide configured inputs, outputs, etc., for each of them. However, users can manually create and configure them using the Execution Choice configuration dialog.
The bioActors under either bioKepler or Bio-Linux are grouped into major categories such as Alignment, Assembly, Clustering, RNA-seq, etc. To locate a bioActor, e.g., bwa, you can browse the bioActor tree or search directly from the Kepler search panel. Additionally, we have created bioKepler virtual machine images for Amazon EC2 and Docker, which include the bioKepler suite and 500+ bioinformatics tools.

Electronic downloads

Altintas_BioKepler_PtolemyMiniconference_16Oct15.pdf · application/pdf · 4379 kbytes

Citation formats

HTML

Ilkay Altintas, Daniel Crawl. <a
href="http://chess.eecs.berkeley.edu/pubs/1130.html"><i>bioKepler:
A Comprehensive Bioinformatics Scientific Workflow Module
for Distributed Analysis of Large-Scale Biological
Data</i></a>, Talk or presentation, 16 October 2015;
Presented at the <a
href="http://ptolemy.eecs.berkeley.edu/conferences/15/"
>Eleventh Biennial Ptolemy Miniconference</a>,
Berkeley.

Plain text

Ilkay Altintas, Daniel Crawl. "bioKepler: A
Comprehensive Bioinformatics Scientific Workflow Module for
Distributed Analysis of Large-Scale Biological Data".
Talk or presentation, 16 October 2015; Presented at the
Eleventh Biennial Ptolemy Miniconference, Berkeley.

BibTeX

@presentation{AltintasCrawl15_BioKeplerComprehensiveBioinformaticsScientificWorkflow,
    author = {Ilkay Altintas and Daniel Crawl},
    title = {bioKepler: A Comprehensive Bioinformatics
              Scientific Workflow Module for Distributed
              Analysis of Large-Scale Biological Data},
    day = {16},
    month = {October},
    year = {2015},
    note = {Presented at the Eleventh Biennial Ptolemy
              Miniconference, Berkeley},
    abstract = {The enormous data growth due to next-generation
              gene sequencing places unprecedented demands on
              traditional single-processor bioinformatics
              algorithms. Efficient and comprehensive analysis
              of the generated data requires distributed and
              parallel processing capabilities. Based on this
              motivation and our previous experiences in
              bioinformatics and distributed scientific
              workflows, we have created a Kepler suite called
              “bioKepler” (biokepler.org) that facilitates
              the development of Kepler workflows for integrated
              execution of bioinformatics applications in
              distributed environments. To develop such an
              environment, we build scientific workflow
              components to execute a set of bioinformatics
              tools using distributed data-parallel execution
              patterns. Once customized, these components are
              executed on multiple distributed platforms
              including various Cloud and HPC computing
              platforms. bioKepler contains a set of Kepler
              actors called “bioActors”. There are two sets
              of bioActors: bioKepler and Bio-Linux. The
              bioActors inside bioKepler are configured with
              specific inputs, outputs, parameters, and
              documentation. These bioActors are specialized for
              running bioinformatics tools along with Kepler
              directors for distributed data-parallel (DDP)
              execution on Hadoop, Spark, and Stratosphere
              engines. Over 40 example workflows
              (http://biokepler.org/demos) demonstrating how to
              use these actors and directors have been packaged
              in the first release of bioKepler. The Bio-Linux
              bioActors, however, are mostly “unconfigured”,
              with only a single input and output. These actors
              represent most of the 500+ biology-related tools
              that come with the Bio-Linux distribution
              (environmentalomics.org/bio-linux/). Due to this
              large number of bioActors, the current bioKepler
              release doesn’t provide configured inputs,
              outputs, etc., for each of these bioActors.
              However, users can manually create and configure
              them by using the Execution Choice configuration
              dialog. The bioActors under either bioKepler or
              Bio-Linux are grouped into major categories such
              as Alignment, Assembly, Clustering, RNA-seq, etc.
              To locate a bioActor, e.g., bwa, you can browse
              the bioActor tree or directly search from the
              Kepler search panel. Additionally, we have created
              bioKepler virtual machine images for Amazon EC2
              and Docker, which include the bioKepler suite and
              500+ bioinformatics tools.},
    URL = {http://chess.eecs.berkeley.edu/pubs/1130.html}
}

Posted by Christopher Brooks on 19 Oct 2015.
Groups: ptolemy

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.