Second Provenance Challenge -- CESNET

Participating Team

  • Short team name: CESNET
  • Participant names: Frantisek Dvorak, Jiri Filipovic, Ales Krenek, Ludek Matyska, Milos Mulac, Jiri Sitera, Zdenek Sustr
  • Project URL: http://egee.cesnet.cz/en/JRA1/
  • Reference to first challenge results (if participated): CESNET

Differences from First Challenge

Note here any changes in your provenance representation, workflow enactment or system since the first challenge. Alternatively, if you did not participate in the first challenge, please provide the same details as were required for those who did (particularly workflow representation and provenance representation).

Implicit workflow representation

The CESNET implementation of the First Provenance Challenge relied on an explicit representation of the workflow structure, extracted from the native workflow representation in gLite: the dependencies among DAG subjobs that the user specifies on submission. These dependencies were decoded, recorded as ancestor and successor attributes of the DAG subjobs, and used in the query implementation.

This restriction is relaxed in the Second Challenge. Instead, the dependence between two workflow processes is inherited from data: process A is marked as an ancestor of B (and, conversely, B as a successor of A) if there is a data file F that is an output of A and an input of B. Logical filenames are used for this purpose (the name attribute of the file elements in the format definition below, not the physical filenames given in the url elements).

For the purpose of the challenge we implement this process in an external "sew" script. The script is seeded with one or more process identifiers and queries JP recursively; data dependences (common input-output files) are traversed in both directions until the complete graph closure is found. The dependences found are recorded with the processes as the ancestor and successor attributes of the first challenge, so the implementation of the challenge queries remains unchanged in this respect.
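The dependence inference and closure search described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual sew script: it works on an in-memory dictionary instead of querying JP, and the function names and job identifiers are hypothetical.

```python
from collections import defaultdict, deque

def infer_links(jobs):
    """Derive ancestor/successor sets from shared logical file names.

    `jobs` maps a job id to {"inputs": set of names, "outputs": set of names}.
    """
    producers = defaultdict(set)
    for job_id, job in jobs.items():
        for name in job["outputs"]:
            producers[name].add(job_id)

    ancestors = {job_id: set() for job_id in jobs}
    successors = {job_id: set() for job_id in jobs}
    for job_id, job in jobs.items():
        for name in job["inputs"]:
            # Every producer of an input file is an ancestor of this job.
            for producer in producers.get(name, ()):
                ancestors[job_id].add(producer)
                successors[producer].add(job_id)
    return ancestors, successors

def closure(seeds, ancestors, successors):
    """Collect every job reachable from the seeds in either direction."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbour in ancestors[current] | successors[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Hypothetical sample data; the logical names only imitate the urn: style.
jobs = {
    "align_warp-1": {"inputs": {"urn:challenge:anatomy1.img"},
                     "outputs": {"urn:challenge:anatomy1.warp"}},
    "reslice-1": {"inputs": {"urn:challenge:anatomy1.warp"},
                  "outputs": {"urn:challenge:resliced1.img"}},
    "unrelated": {"inputs": {"urn:challenge:other.img"},
                  "outputs": {"urn:challenge:other.out"}},
}
ancestors, successors = infer_links(jobs)
reachable = closure(["reslice-1"], ancestors, successors)
```

Seeding the search with a single process is enough to recover its whole connected workflow, while jobs that share no files with it (the "unrelated" entry here) stay outside the closure.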

Currently the script is invoked on demand. However, it can be turned into a part of the JP infrastructure: an agent that subscribes to notifications on input/output file assignments to processes and generates the workflow dependencies automatically. The mechanism for generating such notifications is already available in JP, where it is used in the communication between JP Primary storage and JP Index server.

Query implementation

The query implementation remains unchanged from the first challenge, except for the small adaptations described in the following paragraphs.

Executable naming

The First Challenge query scripts used hardcoded executable names. This was not a problem because the names matched exactly the values recorded by our implementation of the workflow.

However, the naming varies among the teams, e.g. it may or may not contain the absolute path to the executable. Therefore the scripts had to be parametrized so that they can be run with the names appropriate for the particular data source.
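A parametrized query script of this kind might look like the sketch below. The option names, the helper function, and the sample paths are hypothetical; they only illustrate passing the executable name in from outside instead of hardcoding it.

```python
import argparse

def build_parser():
    """Command-line options for a hypothetical query script."""
    parser = argparse.ArgumentParser(description="Provenance query (sketch)")
    parser.add_argument("--align-warp", default="align_warp",
                        help="align_warp executable name as recorded in the data")
    parser.add_argument("--softmean", default="softmean",
                        help="softmean executable name as recorded in the data")
    return parser

def jobs_running(jobs, executable):
    """Select jobs whose recorded program name equals the given name exactly."""
    return [job_id for job_id, program in jobs.items() if program == executable]

# Data from a source that records absolute paths (the paths are made up).
jobs = {"job-1": "/opt/air/bin/align_warp", "job-2": "/opt/air/bin/softmean"}
args = build_parser().parse_args(["--align-warp", "/opt/air/bin/align_warp"])
selected = jobs_running(jobs, args.align_warp)
```

With the default bare name the same data would match nothing, which is why the name has to be supplied per data source.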

Timestamps

JP starts gathering data on a job at virtually the same time as the job is submitted to the Grid. Therefore, during the First Challenge, we could use the time of job registration with JP to approximate the job run time quite accurately. (Queries on the exact execution time were not implemented in JP at that time.)

This is no longer true in the Second Challenge. The job is registered with JP when the data are imported, i.e. typically much later than its real execution.

The query scripts were adjusted to use the true execution time.

Provenance Data for Workflow Parts

Give links here to your provenance data files for the workflow parts of the challenge: three parts for the original workflow and three parts for the modified workflow (as per provenance query 7). The data files could be attached to the results page.

Challenge data format

For the purpose of the Challenge, data are exported from Job Provenance in an XML format conforming to a schema available here.

The format is custom-made specifically for the Challenge in order to facilitate data exchange with other teams; nevertheless, it is a full-featured export format from Job Provenance:

  • it is generated automatically from the data available in JP after running the First Challenge workflow, without any manual intervention,
  • virtually all information in JP is included, even though not all of it may be needed for the Second Challenge,
  • the exported files can be taken "as is" and imported back into JP, resulting in equivalent functionality.

An export utility used to generate the exchange files with JP queries is available here.

Commented example

Here we show an example of the data format. The example was hand-edited for better readability.

<?xml version="1.0"?>

<workflow xmlns="http://egee.cesnet.cz/en/Schema/JP/Challenge2">
   <exportedStages>1 2</exportedStages>

   <job id="https://skurut1.cesnet.cz:9000/yM3sz8v6WCIPgi5-0m8L4w">
      <owner>/DC=cz/DC=cesnet-ca/O=Masaryk University/CN=Ales Krenek</owner>
      <regtime>2006-07-11T12:22:34</regtime>

<!-- input and output files of this job -->
      <inputs>
         <file name="urn:challenge:anatomy1.img">
            <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.img</url>
            <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.hdr</url>
         </file>
      </inputs>

      <outputs>
         <file name="urn:challenge:anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp">
            <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp</url>
         </file>
      </outputs>

<!-- workflow structure: jobs that precede and follow this one in the workflow -->

      <ancestors>
<!-- empty for stage 1 -->
      </ancestors>

      <successors>
<!-- note the reference to the other job below -->
         <jobid>https://skurut1.cesnet.cz:9000/wdWQHL0-RXkd3VeNcSrTaw</jobid>
      </successors>

<!-- gLite middleware processing and job execution details -->
      <gliteJobRecord>
<!-- omitted for readability --> 
      </gliteJobRecord>

<!-- user annotations, including Challenge-specific; only the latter are shown -->
      <annotations>
         <annotation>
            <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_STAGE</name>
            <value>1</value>
         </annotation>
         <annotation>
            <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PROGRAM</name>
            <value>align_warp</value>
         </annotation>
         <annotation>
            <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
            <value>-m 12</value>
         </annotation>
         <annotation>
            <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
            <value>-q</value>
         </annotation>
         <annotation>
            <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_HEADER</name>
            <value>global_maximum=4095</value>
         </annotation>
      </annotations>
   </job>

   <job id="https://skurut1.cesnet.cz:9000/wdWQHL0-RXkd3VeNcSrTaw">

<!-- another job in the workflow, omitted -->

   </job>

<!-- further jobs follow -->

</workflow>

The root element of the file is workflow, corresponding to an entire exported workflow or to its parts as given by the Challenge definition. The stages present in the file are listed in exportedStages.
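As an illustration of how a consumer might read this format, the sketch below parses a shortened copy of the example with Python's standard xml.etree.ElementTree module. The function name and the returned dictionary layout are our own choices for this sketch, not part of the JP tools.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example above; "jp" is an arbitrary local prefix.
NS = {"jp": "http://egee.cesnet.cz/en/Schema/JP/Challenge2"}

SAMPLE = """<workflow xmlns="http://egee.cesnet.cz/en/Schema/JP/Challenge2">
   <exportedStages>1 2</exportedStages>
   <job id="https://skurut1.cesnet.cz:9000/yM3sz8v6WCIPgi5-0m8L4w">
      <inputs>
         <file name="urn:challenge:anatomy1.img">
            <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.img</url>
            <url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.hdr</url>
         </file>
      </inputs>
      <outputs>
         <file name="urn:challenge:anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp"/>
      </outputs>
   </job>
</workflow>
"""

def read_workflow(text):
    """Return the exported stages and, per job id, its logical file names."""
    root = ET.fromstring(text)
    stages = root.findtext("jp:exportedStages", default="", namespaces=NS).split()
    jobs = {}
    for job in root.findall("jp:job", NS):
        jobs[job.get("id")] = {
            "inputs": [f.get("name") for f in job.findall("jp:inputs/jp:file", NS)],
            "outputs": [f.get("name") for f in job.findall("jp:outputs/jp:file", NS)],
        }
    return stages, jobs

stages, jobs = read_workflow(SAMPLE)
```

Only the logical names are collected here; the nested url elements with physical locations are ignored in this sketch.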

The remaining second-level elements are job elements, representing the individual processes in the workflow. Each is assigned a unique ID as soon as it is processed by the gLite middleware. Besides general metadata (owner and registration time), the data are organized in the following sections:

Inputs and outputs

file elements refer to the concrete inputs and outputs of the job. The name attribute is a URI identifying the particular file uniquely. As we did not follow any given file naming scheme in Challenge 1, custom urn: identifiers are shown in the example. However, any suitable file identifier can be used instead.

The name of the input file of the job shown has no suffix, as it is an input of the entire workflow and only a single set of inputs was given. In contrast, the output file name contains a unique suffix, indicating that the file was generated by a particular workflow run.

As some of the files in the Challenge workflow are in fact collections of files (.img and .hdr files), we use nested url elements (which may occur multiple times) to also denote the physical file locations.

Workflow structure

The structure of the workflow is denoted by links between job elements using their unique identifiers, grouped in ancestors and successors. These links are present in the exported format regardless of whether their targets are exported in this part of the workflow.

The links are sufficient to "stitch" together separately exported workflow parts in a unique and reliable way. However, if they are not available explicitly, they can still be reconstructed by searching for matching inputs and outputs of the jobs.
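The stitching of separately exported parts can be illustrated with a small sketch. It assumes each part has already been loaded into a dictionary keyed by job ID; the function names and the short IDs "A" and "B" are hypothetical.

```python
def dangling_links(part):
    """Return successor ids referenced in a part but not exported in it."""
    present = set(part)
    return {target
            for job in part.values()
            for target in job["successors"]
            if target not in present}

def stitch(*parts):
    """Merge separately exported parts; job ids are unique, so keys never clash."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

# Job "A" is exported in part 1 and points to job "B", exported in part 2.
part1 = {"A": {"successors": ["B"], "ancestors": []}}
part2 = {"B": {"successors": [], "ancestors": ["A"]}}

unresolved_before = dangling_links(part1)
unresolved_after = dangling_links(stitch(part1, part2))
```

A link that points outside one part is resolved simply by the presence of the target job in the merged set, which is why no file name matching is needed when the links are exported explicitly.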

Job processing details

gliteJobRecord contains details on the processing of the job in the gLite middleware. It conforms to the schema originally defined for the purpose of computing job statistics in the EGEE project.

These data are virtually irrelevant for the Challenge and are therefore omitted from this example. However, they are present in the full exported data below.

The contained elements are either described within the schema or they are self-explanatory.

User annotations

JP allows the user to add arbitrary "namespace:name = value" annotations to a job, where "value" can have an arbitrarily complex XML structure. The same "name" can also occur multiple times. The annotations can be added either during job execution (usually via L&B, the gLite service that tracks the job during its active life) or later via the native JP interface.

The annotations of particular interest for the Challenge are shown above. They correspond to tags recorded and described in Challenge 1, with the exception of IPAW_INPUT and IPAW_OUTPUT which are mapped specifically in this format.
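Because the same annotation name may occur several times, a reader of the format should keep a list of values per name. The sketch below shows one way to do this with Python's standard library; it handles plain text values only (not nested XML), the job ID is a placeholder, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"jp": "http://egee.cesnet.cz/en/Schema/JP/Challenge2"}
TAG = "http://egee.cesnet.cz/en/WSDL/jp-lbtag:"

SAMPLE = """<job xmlns="http://egee.cesnet.cz/en/Schema/JP/Challenge2" id="example-job">
   <annotations>
      <annotation>
         <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PROGRAM</name>
         <value>align_warp</value>
      </annotation>
      <annotation>
         <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
         <value>-m 12</value>
      </annotation>
      <annotation>
         <name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
         <value>-q</value>
      </annotation>
   </annotations>
</job>
"""

def read_annotations(job):
    """Collect annotation values per name, preserving document order."""
    values = defaultdict(list)
    for annotation in job.findall("jp:annotations/jp:annotation", NS):
        name = annotation.findtext("jp:name", default="", namespaces=NS)
        value = annotation.findtext("jp:value", default="", namespaces=NS)
        values[name].append(value)
    return dict(values)

tags = read_annotations(ET.fromstring(SAMPLE))
```

Here the two IPAW_PARAM values end up in one list, in the order in which they appear in the file.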

Full workflow data

Original workflow

Modified workflow: not addressed in this challenge.

Model Integration Results

In order to get a better understanding of the issues of translation between the provenance data models, we extend the challenge specification into two stages:

  • translation and evaluation of homogeneous workflows (i.e. data recorded in one provenance system only)
  • evaluation of heterogeneous workflows (combining data from multiple systems, as requested by the original specification)

In both stages the available data were translated and imported into JP, and the challenge queries were run. This approach allows us to focus separately on issues specific to the translation of data from a particular system, while discussing independently the issues arising intrinsically from the combinations (not many, actually).

The translation and import process

Translation and eventual combination of the provenance data (see Translation tools below) is done in the following steps:
  1. separate translation of parts of the workflow from their native format to our format (as defined above)
  2. unification of the input and output file names of the softmean process (part 2) to match outputs and inputs of parts 1 and 3
  3. adjustment of all output filenames with a unique suffix
  4. assignment of new unique id's to all the workflow processes
  5. import of the adjusted files into JP (also
  6. run the sew script to determine dependences between processes

Steps 2--4 are rather artificial and serve the purpose of the challenge only.

Unification of the names of softmean inputs/outputs is necessary to trigger the inheritance of dependences. If all the provenance systems had gathered data on the same workflow execution, the matching filenames in all the parts of the workflow would be the same as well.

Similarly, adding the unique suffix to all filenames allows us to run multiple imports of the same input data without the need to purge the JP database between attempts. The same holds for assigning new unique IDs to the imported processes in step 4.
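Steps 3 and 4 can be sketched as follows. This is a minimal illustration rather than the actual import tooling: it assumes the jobs are held in a dictionary, appends the suffix to the end of the name, and leaves inputs of the entire workflow (files produced by no job) untouched. The function names and sample file names are hypothetical.

```python
import uuid

def add_suffix(name, suffix):
    """Append the run-specific suffix to a logical file name."""
    return f"{name}_{suffix}"

def prepare_for_import(jobs, suffix=None):
    """Give every job a fresh id and every produced file a fresh name."""
    suffix = suffix or uuid.uuid4().hex[:8]
    produced = {name for job in jobs.values() for name in job["outputs"]}
    prepared = {}
    for old_id, job in jobs.items():
        prepared[f"{old_id}-{suffix}"] = {
            # Inputs are renamed only if some job produced them, so that
            # matching output/input names stay identical after renaming.
            "inputs": {add_suffix(n, suffix) if n in produced else n
                       for n in job["inputs"]},
            "outputs": {add_suffix(n, suffix) for n in job["outputs"]},
        }
    return prepared

jobs = {
    "align": {"inputs": {"anatomy1.img"}, "outputs": {"warp1"}},
    "reslice": {"inputs": {"warp1"}, "outputs": {"resliced1"}},
}
prepared = prepare_for_import(jobs, suffix="run1")
```

Renaming outputs and the inputs that consume them with the same suffix keeps the file-based dependences intact, so the later dependence inference still finds them.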

As a side effect, step 6 produces a graph representation of the imported data. These graphs are shown in the results section below.

Homogeneous workflows

ES3

Import graph

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. Impossible, missing align_warp parameters
  5. Impossible, missing global maximum parameter
  6. Impossible, missing align_warp parameters
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, missing studyModality annotation

TODO:

  • what are the additional three processes (coming from stage 3) in the graph?
  • upload query #6 results

Karma

Import graph

The graph is more complicated due to duplicated arcs. They are caused by the use of different logical names for the .img and .hdr files of a pair (unlike the CESNET format, which groups them together under a single logical name). Otherwise the graph matches expectations exactly.
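The effect can be illustrated with a short sketch that collapses such arcs by mapping both files of a pair to one common name. We did not apply this to the imported data; the function names, job names, and urn:example: file names are hypothetical.

```python
import os

def logical_group(name):
    """Map the .img and .hdr names of a pair to one common name."""
    stem, ext = os.path.splitext(name)
    return stem if ext in (".img", ".hdr") else name

def collapse_arcs(arcs):
    """Drop arcs that differ only in the .img/.hdr half of a file pair.

    `arcs` is an iterable of (producer, consumer, logical_name) triples.
    """
    return sorted({(src, dst, logical_group(name)) for src, dst, name in arcs})

arcs = [
    ("reslice-1", "softmean", "urn:example:resliced1.img"),
    ("reslice-1", "softmean", "urn:example:resliced1.hdr"),
    ("reslice-2", "softmean", "urn:example:resliced2.img"),
    ("reslice-2", "softmean", "urn:example:resliced2.hdr"),
]
collapsed = collapse_arcs(arcs)
```

Four arcs shrink to two, one per producer, which corresponds to the graph obtained when the pair is kept under a single logical name.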

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. OK, output
  5. Impossible, missing global maximum parameter
  6. OK, output
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Not implemented. studyModality annotation is present, should be doable

TODO: more comments on Q9

MyGrid

Import graph

The graph contains a number of "producer" nodes (see Translation Details below); a manually adjusted version (with these nodes removed) meets expectations.

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. Not implemented, information on align_warp parameters is present but not processed by our translator
  5. Impossible, global maximum parameter may be present in the j.0:global tag, however, the name is not unique, so the translator can't rely on it
  6. Not implemented, information on align_warp parameters is present but not processed by our translator
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Not implemented. studyModality annotation is present, should be doable

SDG

Import graph

The graph contains the first row of "producer" jobs, otherwise it matches expectations.

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. OK, output
  5. OK, output
  6. OK, output
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, missing studyModality annotation

MINDSWAP

Import graph

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. OK, output (wrong parameter format in MINDSWAP)
  5. Impossible, ipaw_header missing
  6. OK, output
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, missing studyModality annotation

Heterogeneous workflows

Most of the challenge queries are affected by availability of data in a particular part of the workflow. Therefore, in general, the results of heterogeneous queries follow the results of the homogeneous queries on the involved provenance system.

In particular:

  • Q4, Q6: align_warp parameters, follow results of workflow part 1
  • Q5: global maximum parameter, workflow part 1 again
  • Q9: studyModality annotation, part 3
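This rule can be written down as a small lookup. The table below is transcribed from the homogeneous summaries on this page, with "impossible" and "not implemented" both recorded as False; the names DECIDING_PART, HOMOGENEOUS, and expected are introduced only for this sketch.

```python
# Query -> workflow part (1-based) whose data decides the outcome.
DECIDING_PART = {"Q4": 1, "Q5": 1, "Q6": 1, "Q9": 3}

# Outcomes from the homogeneous summaries (True = "OK").
HOMOGENEOUS = {
    "ES3":      {"Q4": False, "Q5": False, "Q6": False, "Q9": False},
    "Karma":    {"Q4": True,  "Q5": False, "Q6": True,  "Q9": False},
    "MyGrid":   {"Q4": False, "Q5": False, "Q6": False, "Q9": False},
    "SDG":      {"Q4": True,  "Q5": True,  "Q6": True,  "Q9": False},
    "MINDSWAP": {"Q4": True,  "Q5": False, "Q6": True,  "Q9": False},
}

def expected(systems, query):
    """Predict a heterogeneous result from the system supplying the deciding part.

    `systems` lists the provenance systems for parts 1, 2 and 3 in order.
    """
    part = DECIDING_PART[query]
    return HOMOGENEOUS[systems[part - 1]][query]

karma_sdg_mindswap = ["Karma", "SDG", "MINDSWAP"]
es3_mygrid_sdg = ["ES3", "MyGrid", "SDG"]
```

For example, expected(karma_sdg_mindswap, "Q5") is False because part 1 comes from Karma, which lacks the global maximum parameter.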

CESNET-Karma-SDG

Import graph

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. OK, output
  5. OK, output
  6. OK, output
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, studyModality annotation missing in SDG data

ES3-MyGrid-SDG

Import graph

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. ipaw_param not present in ES3
  5. ipaw_header not present in ES3
  6. ipaw_param not present in ES3
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, studyModality annotation missing in SDG data

MyGrid-ES3-SDG

Import graph

The graph contains a number of "producer" nodes from MyGrid.

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. ipaw_param not present in MyGrid
  5. ipaw_header not present in MyGrid
  6. ipaw_param not present in MyGrid
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, studyModality annotation missing in SDG data

Karma-SDG2-MINDSWAP2

Import graph

Provenance Query summary:

  1. OK, output
  2. OK, output
  3. OK, output
  4. OK, output
  5. ipaw_header not present in Karma
  6. OK, output
  7. Not addressed in Challenge 2
  8. Out of scope of JP
  9. Impossible, studyModality annotation missing in MINDSWAP data

Translation Details

Describe details regarding how data models were translated (or otherwise used to answer the query following the team's approach), any data which was absent from a downloaded model, and whether this affected the possibility of translation or successful provenance query, and any data which was excluded in translation from a downloaded model because it was extraneous

The sections below briefly describe issues that arose from translating the data of particular provenance systems and importing them into JP. The list does not cover all the participating teams. We were not able to put the necessary effort into evaluating all of them, so we chose a more or less random sample, based on a very subjective and brief look at the provided data. Therefore we cannot provide any serious assessment of the data formats of systems that are not listed in this section.

Translation tools

For the sake of easy repeatability of the data translation experiments, we implemented fully automated procedures for translating the data formats and importing the results into JP. This is done for both homogeneous and heterogeneous workflows.

Our CVS repository is organized as follows:

  • export/: JP export and import utilities, the "sew" script for inheriting the dependences, and common code for the automated translations
  • directories for a single provenance system: conversion tools for the particular format, and the specific parts of the automatic translation and import of homogeneous workflows
  • directories for a combination of three provenance systems: specific code for the translation and import of that particular heterogeneous workflow

JP assigns a job owner (an X509 certificate subject) to each process. There seems to be no analogy in the other formats, so we supplied the value as a parameter of the translators.

Most of the formats don't explicitly include information on the part of the workflow (which matches the notion of stage in our format). This was also supplied as an additional parameter of the translator.

ES3

  • Different logical names are used for the files of an .hdr and .img pair (although we understand these files to be tightly coupled). Consequently, duplicate dependences among workflow processes are detected.
  • File names are not consistent across the boundaries of the workflow parts (e.g. reslice outputs are not the same as softmean inputs). We believe this to be an artifact of the challenge data rather than a feature of the system, though, and we fixed the problem by manually renaming the files accordingly.
  • Arguments of align_warp seem to be defined according to the Challenge 1 example; however, these data are missing in Challenge 2.
  • The global maximum parameter and studyModality annotation are not supported, therefore queries 5 and 9 can't be run.

MyGrid

  • As described on the MyGrid team page, each (workflow) input and output file is represented by its own "pseudoprocess" generating it. This is also true for each file on the edge of a workflow part. Although we could probably find a sufficiently discriminating criterion to identify such processes automatically (className of the process being BeanShellProcessor versus StringConstantProcessor), we do not implement it.
  • Both the align_warp parameters and the global maximum are present in the format; however, according to our understanding their naming is ambiguous (the key of the parameter is String Value, and the global maximum seems to be encoded in Ontology:4095). Therefore we could not extract them from the format.
  • Physical filenames are not present.
  • In general, the file format is rather difficult to understand and parse.

Karma

  • global maximum is missing, making query 5 impossible
  • Explicit identifiers of the process instance were missing. We used concatenation of workflowNodeID and serviceID, believing it to be sufficiently unique.
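A sketch of such a composite identifier is given below. The separator is an assumption added for this illustration (our translator's exact form is not shown here): joining the two values with a character that occurs in neither of them keeps different pairs from producing the same string. The function name and sample values are hypothetical.

```python
def process_instance_id(workflow_node_id, service_id, separator="|"):
    """Build a process identifier from the two Karma fields.

    Assumes the separator does not occur in either field.
    """
    return f"{workflow_node_id}{separator}{service_id}"

# Plain concatenation cannot tell these two pairs apart ...
plain_a = "node1" + "0service"
plain_b = "node10" + "service"

# ... while the separated form can.
id_a = process_instance_id("node1", "0service")
id_b = process_instance_id("node10", "service")
```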

SDG

  • stage is missing; we supply its value as a parameter of the translator.
  • In general, a well understandable format.

MINDSWAP

  • global maximum is missing, making query 5 impossible.
  • There is probably a bug in the output/input files between stages 2 and 3. Reslice jobs produce image and header files, but the softmean job imports headers twice (some in the hasInputImage and some in the hasInputHeader tag) and no image.
  • Another small bug is in parameters of align_warp jobs -- "-m 12" is stored as "-m -12".
  • In general, the file format is rather difficult to understand.

Benchmarks

Describe your proposed benchmark queries, how the comparable quantities are determined, and the results of applying the benchmark to your own system

On Fri, 22 Jun 2007, Simon Miles wrote: There is nothing particular to prepare for this prior to the workshop, though having thought about possible suitable scenarios or queries that would make suitable benchmarks would be welcome when we come to discuss it.

Further Comments

Provide here further comments.

Conclusions

Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.

TODO (ljocha)

-- SimonMiles - 26 Oct 2006

-- AlesKrenek - 19 Feb 2007

Attachment Action Size Date Who Comment
out1.xml manage 30.3 K 20 Feb 2007 - 21:41 AlesKrenek Original workflow, part1
out2.xml manage 4.9 K 20 Feb 2007 - 21:45 AlesKrenek Original workflow, part2
out3.xml manage 19.2 K 20 Feb 2007 - 21:46 AlesKrenek Original workflow, part3
es3.ps manage 16.1 K 22 Jun 2007 - 11:06 AlesKrenek ES3 import graph
es3-q1.log manage 6.5 K 22 Jun 2007 - 11:34 AlesKrenek Query #1 results
es3-q2.log manage 2.0 K 22 Jun 2007 - 11:35 AlesKrenek Query #2 results
es3-q3.log manage 1.9 K 22 Jun 2007 - 11:35 AlesKrenek Query #3 results
karma.ps manage 20.5 K 22 Jun 2007 - 11:43 AlesKrenek Karma import graph
karma-q1.log manage 7.0 K 22 Jun 2007 - 11:44 AlesKrenek  
karma-q2.log manage 2.2 K 22 Jun 2007 - 11:44 AlesKrenek  
karma-q3.log manage 2.2 K 22 Jun 2007 - 11:44 AlesKrenek  
karma-q4.log manage 3.1 K 22 Jun 2007 - 11:44 AlesKrenek  
karma-q6.log manage 7.6 K 22 Jun 2007 - 11:44 AlesKrenek  
mygrid.ps manage 44.0 K 22 Jun 2007 - 12:08 AlesKrenek MyGrid import graph
mygrid2.ps manage 21.4 K 22 Jun 2007 - 12:20 AlesKrenek  
mygrid-q1.log manage 15.8 K 22 Jun 2007 - 12:29 AlesKrenek  
mygrid-q2.log manage 3.4 K 22 Jun 2007 - 12:29 AlesKrenek  
mygrid-q3.log manage 5.4 K 22 Jun 2007 - 12:29 AlesKrenek  
sdg.ps manage 28.5 K 22 Jun 2007 - 12:50 AlesKrenek  
sdg-q1.log manage 8.0 K 22 Jun 2007 - 13:02 AlesKrenek  
sdg-q2.log manage 1.3 K 22 Jun 2007 - 13:02 AlesKrenek  
sdg-q3.log manage 1.3 K 22 Jun 2007 - 13:02 AlesKrenek  
sdg-q4.log manage 1.8 K 22 Jun 2007 - 13:02 AlesKrenek  
sdg-q5.log manage 1.3 K 22 Jun 2007 - 13:03 AlesKrenek  
sdg-q6.log manage 0.6 K 22 Jun 2007 - 13:03 AlesKrenek  
cks.ps manage 15.4 K 22 Jun 2007 - 13:07 AlesKrenek CESNET-Karma-SDG import
cks-q1.log manage 5.6 K 22 Jun 2007 - 13:08 AlesKrenek  
cks-q2.log manage 1.9 K 22 Jun 2007 - 13:08 AlesKrenek  
cks-q3.log manage 3.9 K 22 Jun 2007 - 13:08 AlesKrenek  
cks-q4.log manage 13.4 K 22 Jun 2007 - 13:08 AlesKrenek  
cks-q5.log manage 1.4 K 22 Jun 2007 - 13:09 AlesKrenek  
cks-q6.log manage 1.1 K 22 Jun 2007 - 13:09 AlesKrenek  
ems-q1.log manage 9.1 K 25 Jun 2007 - 12:34 JiriSitera es3-mygrid-sdg2 query 1
ems-q2.log manage 2.0 K 25 Jun 2007 - 12:36 JiriSitera es3-mygrid-sdg2 query 2
ems-q3.log manage 4.0 K 25 Jun 2007 - 12:36 JiriSitera es3-mygrid-sdg2 query 3
mes-q1.log manage 12.8 K 25 Jun 2007 - 12:50 JiriSitera mygrid-es3-sdg2 query 1
mes-q2.log manage 2.0 K 25 Jun 2007 - 12:51 JiriSitera mygrid-es3-sdg2 query 2
mes-q3.log manage 2.0 K 25 Jun 2007 - 12:51 JiriSitera mygrid-es3-sdg2 query 3
ksm-q1.log manage 7.3 K 25 Jun 2007 - 13:04 JiriSitera karma-sdg2-mindswap2 query 1
ksm-q2.log manage 1.9 K 25 Jun 2007 - 13:05 JiriSitera karma-sdg2-mindswap2 query 2
ksm-q3.log manage 0.9 K 25 Jun 2007 - 13:05 JiriSitera karma-sdg2-mindswap2 query 3
ksm-q4.log manage 15.4 K 25 Jun 2007 - 13:05 JiriSitera karma-sdg2-mindswap2 query 4
ksm-q6.log manage 1.0 K 25 Jun 2007 - 13:06 JiriSitera karma-sdg2-mindswap2 query 6
ems.ps manage 21.5 K 25 Jun 2007 - 13:10 JiriSitera  
mes.ps manage 31.5 K 25 Jun 2007 - 13:10 JiriSitera  
ksm.ps manage 15.8 K 25 Jun 2007 - 13:11 JiriSitera  
mindswap-q1.log manage 6.3 K 25 Jun 2007 - 13:14 JiriSitera  
mindswap-q2.log manage 1.8 K 25 Jun 2007 - 13:14 JiriSitera  
mindswap-q3.log manage 2.9 K 25 Jun 2007 - 13:15 JiriSitera  
mindswap-q4.log manage 15.2 K 25 Jun 2007 - 13:15 JiriSitera  
mindswap-q6.log manage 2.3 K 25 Jun 2007 - 13:15 JiriSitera  
mindswap.ps manage 17.3 K 25 Jun 2007 - 13:16 JiriSitera  
