
A Distributed Data-Parallel Execution Framework in the Kepler Scientific Workflow System
Ilkay Altintas, Daniel Crawl, Jianwu Wang

Citation
Ilkay Altintas, Daniel Crawl, Jianwu Wang. "A Distributed Data-Parallel Execution Framework in the Kepler Scientific Workflow System". Talk or presentation, 16, October, 2015; Presented at the Eleventh Biennial Ptolemy Miniconference, Berkeley.

Abstract
Kepler is a scientific workflow system that is built on the Ptolemy II framework. Kepler provides a graphical user interface that allows users to easily design scientific workflows by dragging and dropping actors that implement different operations and linking them together to create the steps necessary for a specific analysis workflow.

Data parallelism describes the parallel execution of multiple tasks in an application or workflow, where different tasks independently process different parts of the same dataset. As a special type of data parallelism, distributed data-parallelism (DDP) executes tasks in parallel on partitioned data across distributed computing nodes. DDP patterns, e.g., Map, Reduce, Match, CoGroup, and Cross, provide opportunities to facilitate Big Data applications and associated analysis workflows. These patterns: (i) support data distribution and parallel processing of distributed data on multiple nodes/cores; (ii) provide a higher-level programming model that facilitates parallelization of user programs; (iii) follow a “moving computation to data” principle that reduces data movement overhead; (iv) offer good scalability and performance acceleration when executing on distributed resources; (v) support runtime features such as fault tolerance; and (vi) are simpler to program against than traditional parallel programming interfaces such as MPI and OpenMP. Existing single-node programs can be executed in parallel, without modification, using DDP patterns.
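
To make the Map and Reduce patterns concrete, the following is a minimal, engine-agnostic word-count sketch in plain Java. It only illustrates the pattern semantics (independent per-record processing, then per-key combination); the class and variable names are illustrative and are not the Kepler DDP actor API.

    // Minimal word-count sketch of the Map and Reduce pattern semantics.
    // Plain Java, single node; names are illustrative, not the Kepler DDP API.
    import java.util.*;
    import java.util.stream.*;

    public class MapReduceSketch {
        public static void main(String[] args) {
            List<String> lines = Arrays.asList("a b a", "b c");

            // Map: each record (line) is processed independently, so records can
            // be partitioned across nodes/cores and mapped in parallel.
            // The result is a set of (word, 1) pairs grouped by key.
            Map<String, List<Integer>> grouped = lines.stream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .collect(Collectors.groupingBy(w -> w,
                            Collectors.mapping((String w) -> 1, Collectors.toList())));

            // Reduce: all values sharing a key are combined; each key's group
            // could be reduced on a different node.
            Map<String, Integer> counts = new HashMap<>();
            grouped.forEach((word, ones) ->
                    counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

            System.out.println(counts);  // e.g. {a=2, b=2, c=1}
        }
    }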

The DDP suite in Kepler supports DDP operations on distributed computational resources. The suite includes a set of special composite actors, called DDP pattern actors, that provide an interface to data-parallel patterns. Each DDP pattern actor corresponds to a particular data-parallel pattern, and actors have been implemented for Map, Reduce, CoGroup, Match, and Cross.

Similar to other actors, the DDP actors can be linked together to form a chain of tasks, and can be nested hierarchically as part of larger workflows. Each DDP actor corresponds to a single pattern, thereby allowing greater flexibility in expressing workflow logic. A DDP pattern actor provides two important pieces of information: the data-parallel pattern, and the tasks to perform once the pattern has been applied. The type of data-parallel pattern specifies how the data is grouped among computational resources. Once the data is grouped, a set of processing steps is applied to it. DDP pattern actors provide two mechanisms for specifying these steps: choosing a predefined function or creating a sub-workflow.
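
As an illustration of how a pattern prepares data before the user-defined steps run, the sketch below shows CoGroup-style grouping in plain Java: records from two keyed inputs are collected per key, and the user task only sees the two per-key groups. The names and signatures here are hypothetical and are not the Kepler actor interface.

    // Illustrative CoGroup-style grouping in plain Java; not the Kepler actor API.
    import java.util.*;
    import java.util.function.BiFunction;

    public class CoGroupSketch {
        // The pattern groups records from both inputs by key; the user-defined
        // task then runs once per key on the two groups. Because keys are
        // independent, each key can be handled on a different node.
        static <K, A, B, R> Map<K, R> coGroup(Map<K, List<A>> in1,
                                              Map<K, List<B>> in2,
                                              BiFunction<List<A>, List<B>, R> task) {
            Set<K> keys = new HashSet<>(in1.keySet());
            keys.addAll(in2.keySet());
            Map<K, R> out = new HashMap<>();
            for (K key : keys) {
                out.put(key, task.apply(in1.getOrDefault(key, Collections.emptyList()),
                                        in2.getOrDefault(key, Collections.emptyList())));
            }
            return out;
        }

        public static void main(String[] args) {
            Map<String, List<Integer>> sales = new HashMap<>();
            sales.put("a", Arrays.asList(3, 5));
            sales.put("b", Arrays.asList(2));
            Map<String, List<Integer>> returns = new HashMap<>();
            returns.put("a", Arrays.asList(1));

            // User task: net quantity per key = sum(sales) - sum(returns).
            Map<String, Integer> net = coGroup(sales, returns, (s, r) ->
                    s.stream().mapToInt(Integer::intValue).sum()
                            - r.stream().mapToInt(Integer::intValue).sum());
            System.out.println(net);  // e.g. {a=7, b=2}
        }
    }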

The DDP framework also includes I/O actors for reading and writing data between the data-parallel patterns and the underlying storage system. The DDPDataSource actor specifies the location of the input data in the storage system as well as how to partition the data. Similarly, the DDPDataSink actor specifies the data output location and how to merge the data. The partitioning and merging methods depend on the format of the data and are application-specific.
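
For example, for a line-oriented text format, partitioning can simply split the input into line-aligned chunks and merging can concatenate the partial outputs, while other formats need different logic. The sketch below is purely illustrative; the actual DDPDataSource and DDPDataSink actors delegate this work to format-specific classes, and the method names here are hypothetical.

    // Purely illustrative partition/merge sketch for a line-oriented text format.
    import java.util.*;

    public class TextPartitionSketch {
        // Partition: split the input into roughly equal, line-aligned chunks so
        // that each chunk can be processed on a different node.
        static List<List<String>> partition(List<String> lines, int numSplits) {
            List<List<String>> splits = new ArrayList<>();
            int chunkSize = (int) Math.ceil(lines.size() / (double) numSplits);
            for (int i = 0; i < lines.size(); i += chunkSize) {
                splits.add(new ArrayList<>(
                        lines.subList(i, Math.min(i + chunkSize, lines.size()))));
            }
            return splits;
        }

        // Merge: for plain text, merging the partial outputs is concatenation;
        // other formats (e.g. sorted or binary data) need different logic.
        static List<String> merge(List<List<String>> parts) {
            List<String> merged = new ArrayList<>();
            parts.forEach(merged::addAll);
            return merged;
        }

        public static void main(String[] args) {
            List<String> lines = Arrays.asList("r1", "r2", "r3", "r4", "r5");
            List<List<String>> splits = partition(lines, 2);
            System.out.println(splits);         // [[r1, r2, r3], [r4, r5]]
            System.out.println(merge(splits));  // [r1, r2, r3, r4, r5]
        }
    }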

The DDP suite contains DDPDirector, which executes workflows composed of DDP patterns and I/O actors on a Big Data engine. Three engines are supported in DDP suite version 1.1: Hadoop (version 2.2.0), Stratosphere (version 0.4), and Spark (version 1.1.0). When more than one execution engine is available, flexible DDP workflow execution can be achieved by simply switching the engine within the DDP sub-workflow.

Citation formats  
  • HTML
    Ilkay Altintas, Daniel Crawl, Jianwu Wang. <a
    href="http://chess.eecs.berkeley.edu/pubs/1126.html"><i>A
    Distributed Data-Parallel Execution Framework in the Kepler
    Scientific Workflow System</i></a>, Talk or
    presentation,  16, October, 2015; Presented at the <a
    href="http://ptolemy.eecs.berkeley.edu/conferences/15/"
    >Eleventh Biennial Ptolemy Miniconference</a>,
    Berkeley.
  • Plain text
    Ilkay Altintas, Daniel Crawl, Jianwu Wang. "A
    Distributed Data-Parallel Execution Framework in the Kepler
    Scientific Workflow System". Talk or presentation,  16,
    October, 2015; Presented at the <a
    href="http://ptolemy.eecs.berkeley.edu/conferences/15/"
    >Eleventh Biennial Ptolemy Miniconference</a>,
    Berkeley.
  • BibTeX
    @presentation{AltintasCrawlWang15_DistributedDataParallelExecutionFrameworkInKeplerScientific,
        author = {Ilkay Altintas and Daniel Crawl and Jianwu Wang},
        title = {A Distributed Data-Parallel Execution Framework in
                  the Kepler Scientific Workflow System},
        day = {16},
        month = {October},
        year = {2015},
        note = {Presented at the <a
                  href="http://ptolemy.eecs.berkeley.edu/conferences/15/"
                  >Eleventh Biennial Ptolemy Miniconference</a>,
                  Berkeley},
        abstract = {Kepler is a scientific workflow system that is
                  built on the Ptolemy II framework. Kepler provides
                  a graphical user interface that allows users to
                  easily design scientific workflows by simply
                  dragging and dropping actors implementing
                  different operations and linking them together to
                  create the steps necessary for a specific analysis
                  workflow. <p>Data parallelism describes parallel
                  execution of multiple tasks in an application or
                  workflow while different tasks processing
                  independently on different parts of the same
                  dataset. As a special type of data parallelism,
                  distributed data-parallelism (DDP) is to execute
                  tasks in parallel with partitioned data on
                  distributed computing nodes. DDP patterns, e.g.,
                  Map, Reduce, Match, CoGroup, and Cross, provide
                  opportunities to facilitate Big Data applications
                  and associated analysis workflows. These patterns:
                  (i) support data distribution and parallel data
                  processing with distributed data on multiple
                  nodes/cores; (ii) provide a higher-level
                  programming model to facilitate user program
                  parallelization; (iii) follow a “moving
                  computation to data” principle that reduces data
                  movement overheads; (iv) have good scalability and
                  performance acceleration when executing on
                  distributed resources; (v) support runtime
                  features such as fault-tolerance; and (vi)
                  simplify the difficulty for parallel programming
                  in comparison to traditional parallel programming
                  interfaces such MPI and OpenMP. Existing single
                  node programs could be executed in parallel
                  without modification using DDP patterns. <p>The
                  DDP suite in Kepler supports DDP operations on
                  distributed computational resources. The suite
                  includes a set of special composite actors, called
                  DDP pattern actors, that provide an interface to
                  data-parallel patterns. Each DDP pattern actor
                  corresponds to a particular data-parallel pattern,
                  and actors have been implemented for Map, Reduce,
                  CoGroup, Match, and Cross. <p>Similar to other
                  actors, the DDP actors can be linked together to
                  form a chain of tasks, and can be nested
                  hierarchically as part of larger workflows. Each
                  DDP actor corresponds to a single pattern thereby
                  allowing greater flexibility to express workflow
                  logic. A DDP pattern actor provides two important
                  pieces of information: the data-parallel pattern,
                  and the tasks to perform once the pattern has been
                  applied. The type of data-parallel pattern
                  specifies how the data is grouped among
                  computational resources. Once the data is grouped,
                  a set of processing steps is applied to the data.
                  DDP pattern actors provide two mechanisms
                  specifying these steps: choose a predefined
                  function or create a sub-workflow. <p>The DDP
                  framework also includes I/O actors for reading and
                  writing data between the data-parallel patterns
                  and the underlying storage system. The
                  DDPDataSource actor specifies the location of the
                  input data in the storage system as well as how to
                  partition the data. Similarly, the DDPDataSink
                  actor specifies the data output location and how
                  to merge the data. The partitioning and merging
                  methods depend on the format of the data and are
                  application-specific. <p>The DDP suite contains
                  DDPDirector, which executes workflows composed of
                  DDP patterns and I/O actors on a Big Data engine.
                  Three engines are supported in DDP suite version
                  1.1: Hadoop (version 2.2.0), Stratosphere (version
                  0.4), and Spark (version 1.1.0). If more than one
                  execution engine is available, you can achieve
                  flexible DDP workflow execution by simply
                  switching the engine within the DDP sub-workflow. },
        URL = {http://chess.eecs.berkeley.edu/pubs/1126.html}
    }
    

Posted by Christopher Brooks on 19 Oct 2015.
Groups: ptolemy

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.
