bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data
Ilkay Altintas, Daniel Crawl

Citation
Ilkay Altintas, Daniel Crawl. "bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data". Talk or presentation, 16, October, 2015; Presented at the Eleventh Biennial Ptolemy Miniconference, Berkeley.

Abstract
The enormous data growth due to next-generation gene sequencing places unprecedented demands on traditional single-processor bioinformatics algorithms. Efficient and comprehensive analysis of the generated data requires distributed and parallel processing capabilities. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we have created a Kepler suite called “bioKepler” (biokepler.org) that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. To develop such an environment, we build scientific workflow components to execute a set of bioinformatics tools using distributed data-parallel execution patterns. Once customized, these components are executed on multiple distributed platforms including various Cloud and HPC computing platforms. bioKepler contains a set of Kepler actors, called “bioActors”. There are two sets of bioActors: bioKepler and Bio-Linux. The bioActors inside bioKepler are configured with specific inputs, outputs, parameters, and documentation. These bioActors are specialized for running bioinformatics tools along with Kepler directors for distributed data-parallel (DDP) execution on Hadoop, Spark, and Stratosphere engines. Over 40 example workflows (http://biokepler.org/demos) demonstrating how to use these actors and directors have been packaged in the first release of bioKepler. The Bio-Linux bioActors, however, are mostly “unconfigured” with only a single input and output. These actors represent most of the 500+ biology-related tools that come with the Bio-Linux distribution (environmentalomics.org/bio-linux/). Due to this large number of bioActors, the current bioKepler release doesn’t provide configured inputs, outputs, etc., for each of these bioActors. However, users can manually create and configure them by using the Execution Choice configuration dialog.
The bioActors under either bioKepler or Bio-Linux are grouped into major categories such as Alignment, Assembly, Clustering, RNA-seq, etc. To locate a bioActor, e.g., bwa, you can browse the bioActor tree or directly search from the Kepler search panel. Additionally, we have created bioKepler virtual machine images for Amazon EC2 and Docker, which include the bioKepler suite and 500+ bioinformatics tools.
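
The distributed data-parallel execution pattern mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the bioKepler or Kepler API: the function names (`parse_fasta`, `partition`, `gc_content`, `run_data_parallel`) are hypothetical, a local thread pool stands in for an engine such as Hadoop or Spark, and a GC-content calculation stands in for a real bioinformatics tool such as bwa. The sketch only shows the general idea of splitting input records into partitions, processing each partition independently, and merging the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (sequence id, sequence); a hypothetical record type


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (id, sequence) records."""
    records: List[Record] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def partition(records: List[Record], n_parts: int) -> List[List[Record]]:
    """Split the records into at most n_parts non-empty partitions."""
    parts = [records[i::n_parts] for i in range(n_parts)]
    return [part for part in parts if part]


def gc_content(part: List[Record]) -> Dict[str, float]:
    """Stand-in for a per-partition tool run: GC fraction of each sequence."""
    return {
        name: (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        for name, seq in part
    }


def run_data_parallel(records: List[Record], n_parts: int = 2) -> Dict[str, float]:
    """Process each partition independently, then merge the partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_results = list(pool.map(gc_content, parts))
    merged: Dict[str, float] = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


if __name__ == "__main__":
    fasta = ">read1\nGGCC\n>read2\nATAT\n>read3\nATGC\n"
    result = run_data_parallel(parse_fasta(fasta), n_parts=2)
    for name in sorted(result):
        print(name, result[name])
```

In a real distributed setting the partitions would be shipped to separate nodes and the per-partition step would invoke an external tool; the split, process, and merge structure stays the same.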

Citation formats  
  • HTML
    Ilkay Altintas, Daniel Crawl. <a
    href="http://chess.eecs.berkeley.edu/pubs/1130.html"><i>bioKepler:
    A Comprehensive Bioinformatics Scientific Workflow Module
    for Distributed Analysis of Large-Scale Biological
    Data</i></a>, Talk or presentation,  16,
    October, 2015; Presented at the <a
    href="http://ptolemy.eecs.berkeley.edu/conferences/15/"
    >Eleventh Biennial Ptolemy Miniconference</a>,
    Berkeley.
  • Plain text
    Ilkay Altintas, Daniel Crawl. "bioKepler: A
    Comprehensive Bioinformatics Scientific Workflow Module for
    Distributed Analysis of Large-Scale Biological Data".
    Talk or presentation, 16, October, 2015; Presented at the
    Eleventh Biennial Ptolemy Miniconference, Berkeley.
  • BibTeX
    @presentation{AltintasCrawl15_BioKeplerComprehensiveBioinformaticsScientificWorkflow,
        author = {Ilkay Altintas and Daniel Crawl},
        title = {bioKepler: A Comprehensive Bioinformatics
                  Scientific Workflow Module for Distributed
                  Analysis of Large-Scale Biological Data},
        day = {16},
        month = {October},
        year = {2015},
        note = {Presented at the Eleventh Biennial Ptolemy
                  Miniconference, Berkeley},
        abstract = {The enormous data growth due to next-generation
                  gene sequencing places unprecedented demands on
                  traditional single-processor bioinformatics
                  algorithms. Efficient and comprehensive analysis
                  of the generated data requires distributed and
                  parallel processing capabilities. Based on this
                  motivation and our previous experiences in
                  bioinformatics and distributed scientific
                  workflows, we have created a Kepler suite called
                  “bioKepler” (biokepler.org) that facilitates
                  the development of Kepler workflows for integrated
                  execution of bioinformatics applications in
                  distributed environments. To develop such an
                  environment, we build scientific workflow
                  components to execute a set of bioinformatics
                  tools using distributed data-parallel execution
                  patterns. Once customized, these components are
                  executed on multiple distributed platforms
                  including various Cloud and HPC computing
                  platforms. bioKepler contains a set of Kepler
                  actors, called “bioActors”. There are two sets
                  of bioActors: bioKepler and Bio-Linux. The
                  bioActors inside bioKepler are configured with
                  specific inputs, outputs, parameters, and
                  documentation. These bioActors are specialized for
                  running bioinformatics tools along with Kepler
                  directors for distributed data-parallel (DDP)
                  execution on Hadoop, Spark, and Stratosphere
                  engines. Over 40 example workflows
                  (http://biokepler.org/demos) demonstrating how to
                  use these actors and directors have been packaged
                  in the first release of bioKepler. The Bio-Linux
                  bioActors, however, are mostly “unconfigured”
                  with only a single input and output. These actors
                  represent most of the 500+ biology-related tools
                  that come with the Bio-Linux distribution
                  (environmentalomics.org/bio-linux/). Due to this
                  large number of bioActors, the current bioKepler
                  release doesn’t provide configured inputs,
                  outputs, etc., for each of these bioActors.
                  However, users can manually create and configure
                  them by using the Execution Choice configuration
                  dialog. The bioActors under either bioKepler or
                  Bio-Linux are grouped into major categories such
                  as Alignment, Assembly, Clustering, RNA-seq, etc.
                  To locate a bioActor, e.g., bwa, you can browse
                  the bioActor tree or directly search from the
                  Kepler search panel. Additionally, we have created
                  bioKepler virtual machine images for Amazon EC2
                  and Docker, which include the bioKepler suite and
                  500+ bioinformatics tools.},
        URL = {http://chess.eecs.berkeley.edu/pubs/1130.html}
    }
    

Posted by Christopher Brooks on 19 Oct 2015.
Groups: ptolemy