Workflow Recovery for Different Models of Computation and Models of Provenance
Sven Koehler, Bertram Ludaescher, Timothy McPhillips

Citation
Sven Koehler, Bertram Ludaescher, Timothy McPhillips. "Workflow Recovery for Different Models of Computation and Models of Provenance". Talk or presentation, 16 February 2011; Presented at the Ninth Biennial Ptolemy Miniconference, Berkeley, CA.

Abstract
An increasing number of scientific workflows contain long-running jobs, but there has been little research on how to recover a workflow execution after the workflow execution engine itself has failed. Most related work considers state within a fault-tolerance framework in which the workflow engine itself does not fail. Other approaches do not deal with stateful actors and can therefore rarely be used in practice. We describe how to save the internal state of actors and use it to recover a workflow execution engine from the provenance information recorded up to the failure. We demonstrate to what degree the model of computation influences the recovery, and we use the provenance currently recorded by other approaches, such as the Kepler provenance recorder, to resume a workflow execution. We show that while DAGMan workflows are easy to recover, SDF- and PN-directed workflows in Kepler are more complex to recover and offer more possibilities to optimize execution and recovery. In a serially scheduled SDF workflow, at most one actor invocation can be affected by a workflow engine crash, while in PN possibly all actors could be affected. We explore which models of provenance are suitable for a non-trivial (i.e., optimized) resume of a workflow execution and then present extended provenance models that are well suited to speeding up workflow recovery in the context of different models of computation. Modeling PN workflows using the iteration provided by the framework instead of a loop construct within the firing method could speed up workflow recovery dramatically. On the other hand, the conventional read/write provenance model was found to be insufficient for an efficient recovery of various models of computation because it lacks information about the internal state of actors. Our goal is to find more generic provenance models that offer good performance for a restore as well as during execution and that will also be practical in a distributed workflow execution. We argue that concepts of global time and globally unique IDs should be avoided wherever possible. Finally, we show how provenance queries are used to determine the status of a workflow execution and to extract the information that must be restored in order to resume.
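
The recovery idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the Kepler or Ptolemy APIs: the `RunningSum` actor, the `ProvenanceLog` class, its record layout, and the `run` function are all hypothetical names invented for illustration. The sketch assumes a single stateful actor fired serially (as in a serially scheduled SDF workflow) and a log that records the actor's internal state alongside each completed firing, which is the information the abstract says the plain read/write provenance model lacks.

```python
from dataclasses import dataclass, field


@dataclass
class RunningSum:
    """Hypothetical stateful actor: emits the running sum of its inputs."""
    total: int = 0

    def fire(self, token: int) -> int:
        self.total += token
        return self.total


@dataclass
class ProvenanceLog:
    """Hypothetical append-only log, assumed to survive an engine crash."""
    records: list = field(default_factory=list)

    def started(self, step: int) -> None:
        self.records.append(("start", step, None, None))

    def completed(self, step: int, output: int, state: int) -> None:
        # Record the output token together with the actor's internal state.
        self.records.append(("done", step, output, state))


def run(inputs, log, crash_at=None):
    """Fire the actor once per input, in serial order, resuming from the log."""
    actor = RunningSum()
    done = [r for r in log.records if r[0] == "done"]
    outputs = [r[2] for r in done]      # completed firings are replayed, not recomputed
    if done:
        actor.total = done[-1][3]       # restore state after the last completed firing
    for step in range(len(done), len(inputs)):
        log.started(step)
        if step == crash_at:
            raise RuntimeError("simulated engine crash")  # firing left incomplete
        out = actor.fire(inputs[step])
        log.completed(step, out, actor.total)
        outputs.append(out)
    return outputs


inputs = [1, 2, 3, 4]
log = ProvenanceLog()
try:
    run(inputs, log, crash_at=2)        # engine crashes during the third firing
except RuntimeError:
    pass

# Count firings that were started but never completed before the crash.
in_flight = (sum(r[0] == "start" for r in log.records)
             - sum(r[0] == "done" for r in log.records))
resumed = run(inputs, log)              # resume using only the log
uninterrupted = run(inputs, ProvenanceLog())
```

In this toy setting only the one in-flight firing has to be repeated, and the resumed run yields the same outputs as an uninterrupted one. Without the saved `total`, the engine would have to re-fire the actor from the first input to rebuild its state.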

Citation formats  
  • HTML
    Sven Koehler, Bertram Ludaescher, Timothy McPhillips. <a
    href="http://chess.eecs.berkeley.edu/pubs/812.html"><i>Workflow
    Recovery for Different Models of Computation and Models of
    Provenance</i></a>, Talk or presentation, 16 February
    2011; Presented at the <a
    href="http://ptolemy.eecs.berkeley.edu/conferences/11"
    >Ninth Biennial Ptolemy Miniconference</a>,
    Berkeley, CA.
  • Plain text
    Sven Koehler, Bertram Ludaescher, Timothy McPhillips.
    "Workflow Recovery for Different Models of Computation
    and Models of Provenance". Talk or presentation, 16
    February 2011; Presented at the Ninth Biennial Ptolemy
    Miniconference, Berkeley, CA.
  • BibTeX
    @presentation{KoehlerLudaescherMcPhillips11_WorkflowRecoveryForDifferentModelsOfComputationModels,
        author = {Sven Koehler and Bertram Ludaescher and Timothy
                  McPhillips},
        title = {Workflow Recovery for Different Models of
                  Computation and Models of Provenance},
        day = {16},
        month = {February},
        year = {2011},
        note = {Presented at the Ninth Biennial Ptolemy
                  Miniconference, Berkeley, CA.},
        abstract = {An increasing number of scientific workflows
                  contain long running jobs but there has been
                  little research on how to recover the workflow
                  execution after the workflow execution engine
                  itself failed. Most related work considers state
                  within a fault tolerance framework where the
                  workflow engine itself is not failing. Other
                  approaches do not deal with stateful actors and
                  can therefore rarely be used in practice. We describe how to
                  save and use internal state of actors for a
                  recovery of a workflow execution engine from
                  provenance information recorded up to the failure.
                  We demonstrate to what degree the model of
                  computation influences the recovery and we use the
                  provenance currently recorded by other approaches
                  such as the Kepler provenance recorder in order to
                  resume a workflow execution. We show that while
                  DAGMan workflows are easy to recover, SDF and PN
                  directed workflows in Kepler are more complex to
                  recover and offer more possibilities to optimize
                  execution and recovery. In a serially scheduled
                  SDF workflow at most one actor invocation can be
                  affected by a workflow engine crash while in PN
                  possibly all actors could be affected. We explore
                  which models of provenance are suitable for a
                  non-trivial (i.e., optimized) resume of a workflow
                  execution and then present extended provenance
                  models that are well suited for speeding up the
                  workflow recovery in context of different models
                  of computation. Modeling PN workflows using the
                  iteration provided by the framework instead of a
                  loop construct within the firing method could
                  speed up workflow recovery dramatically. On the
                  other hand the conventional read/write provenance
                  model was found to be insufficient for an efficient
                  recovery of various models of computation because
                  it lacks information about the internal state of
                  actors. Our goal is to find more generic
                  provenance models that offer a good performance
                  for a restore as well as during execution and that
                  will also be practical in a distributed workflow
                  execution. We argue that concepts of global time
                  and global unique IDs should be avoided wherever
                  possible. Finally, we show how provenance queries
                  are used to determine the status of a workflow
                  execution as well as to extract important
                  information that needs to be restored in order to
                  do a resume. },
        URL = {http://chess.eecs.berkeley.edu/pubs/812.html}
    }
    

Posted by Christopher Brooks on 18 Feb 2011.
Groups: ptolemy
Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.
