
Provenance Challenge



Provenance Challenge Template

In progress

Participating Team

Workflow Representation

Provide here a description of how you have encoded the Challenge workflow.

KarmaBrainAtlasWF.gif

Provenance Trace

Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)

A sample log of the provenance activities generated by the workflow/services is shown in notifications.xml.

The Karma Service API supports two kinds of provenance retrieval: Data Provenance and Process Provenance. It also supports variations of these that retrieve RecursiveDataProvenance, DataUsage, and WorkflowTrace. Results of these provenance queries on the given workflow are shown here:

  • karma.xsd: Karma v2.x schema describing provenance documents

  • workflow_trace.xml: Workflow Trace for all invocations in the ProvenanceChallengeBrainWorkflow

These query APIs form the building blocks for constructing the different "canonical" provenance queries in the challenge. Karma does not provide extensive support for annotations at the level of data products. We take the approach that the provenance system is not a generic metadata management system and should focus mainly on storing and retrieving provenance. In the LEAD project, where Karma is used, queries over generic data product metadata and provenance are achieved by pushing the provenance into the metadata for the data product and allowing the MyLEAD metadata management system to answer the "join" queries.

Limited support for queries over annotations is present and has been used to answer the challenge queries that include annotations (except for #9). Some of them have required us to query the provenance service's backend relational database, since queries over annotations are not yet supported through the service API.

Provenance Queries

For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.

Also, make sure you complete the ProvenanceQueriesMatrix.

Teams        Q1   Q2   Q3   Q4     Q5     Q6     Q7   Q8     Q9
Karma team   yes  yes  yes  yes *  yes *  yes *  yes  yes *  no

* Complete support not available through Karma's Web-Service API. SQL query on backend database required.

1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.

The getRecursiveDataProvenance API provided by the Karma provenance service allows the retrieval of the entire data provenance history of a data product. Invoking that method with the data product ID of Atlas X Graphic (in this case, 'lead:uuid:1157946992-atlas-x.gif') returns the complete process that led to its creation. The result of the provenance query is shown in recursive_data_provenance.xml.

2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

This query is performed by the client, which first invokes the getDataProvenance method on the Karma provenance service to retrieve the immediate data provenance for Atlas X Graphic. The client then calls getDataProvenance repeatedly to move up the provenance tree until the SoftmeanService is encountered in the data provenance results. The pseudo-code for the client looks like this:

PrintRecursiveDataProvenanceUntil('lead:uuid:1157946992-atlas-x.gif', 'urn:qname:...:SoftmeanService');

void PrintRecursiveDataProvenanceUntil(DataProductID dataProduct, URI processID)
1. let $dataList := [dataProduct]
2. while ($dataList != empty) do
   a. $dataProvenance = karma.getDataProvenance($dataList[0])           // get data provenance for this level
   b. Print $dataProvenance; $dataList.delete(0)                        // print process information & remove data from list
   c. if ($dataProvenance.getProducedBy() == processID) break;  // found Softmean. Stop.
   d. foreach ($inputData in $dataProvenance.getUsingData()) do 
      // get input data used by this data product. recurse up the tree using iteration
      i. $dataList.add($inputData)  
3. End

The results of this operation are shown in query2.txt.
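The client loop above can be sketched in Python. This is a minimal illustration only: the PROVENANCE dictionary and the short data product names are hypothetical stand-ins for calls to getDataProvenance, not the real Karma client API. Unlike the pseudo-code, the sketch skips the inputs of the stop process instead of breaking out of the loop, so sibling branches (if any) would still be visited.

```python
from collections import deque

# Hypothetical in-memory stand-in for karma.getDataProvenance():
# data product ID -> (producing process, input data product IDs).
PROVENANCE = {
    "atlas-x.gif": ("ConvertService", ["atlas-x.pgm"]),
    "atlas-x.pgm": ("SlicerService", ["atlas.img"]),
    "atlas.img": ("SoftmeanService", ["resliced1.img", "resliced2.img"]),
    "resliced1.img": ("ResliceService", ["warp1.warp"]),
    "resliced2.img": ("ResliceService", ["warp2.warp"]),
}


def provenance_until(data_product, stop_process, get_data_provenance=PROVENANCE.get):
    """Walk up the provenance tree iteratively, not climbing past stop_process."""
    steps = []
    pending = deque([data_product])
    seen = set()
    while pending:
        current = pending.popleft()
        if current in seen:
            continue
        seen.add(current)
        record = get_data_provenance(current)
        if record is None:
            # Workflow input: no producing process was recorded.
            continue
        produced_by, inputs = record
        steps.append((current, produced_by))
        if produced_by == stop_process:
            # Found the stop process: do not follow its inputs.
            continue
        pending.extend(inputs)
    return steps


for product, process in provenance_until("atlas-x.gif", "SoftmeanService"):
    print(product, "<-", process)
```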

3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

This query differs from #2 in that the provenance levels are relative to the file, instead of being specified explicitly as 'Softmean'. The getRecursiveDataProvenance API in the Karma provenance service has an optional parameter to specify the depth of recursion. By passing a recursion level of 3 in addition to the data product ID of Atlas X Graphic (in this case, 'lead:uuid:1157946992-atlas-x.gif'), it is possible to retrieve the data provenance for stages 3, 4, and 5. The result of the provenance query is shown in query3.xml.
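The effect of the depth parameter can be illustrated with a small Python sketch. It is not the Karma implementation: the LOOKUP table, the function name and the short data product names are assumptions made for illustration, and level 1 is taken to be the process that produced the requested data product.

```python
# Hypothetical provenance table: data product -> (producing process, inputs).
LOOKUP = {
    "atlas-x.gif": ("ConvertService", ["atlas-x.pgm"]),
    "atlas-x.pgm": ("SlicerService", ["atlas.img"]),
    "atlas.img": ("SoftmeanService", ["resliced1.img", "resliced2.img"]),
    "resliced1.img": ("ResliceService", ["warp1.warp"]),
    "resliced2.img": ("ResliceService", ["warp2.warp"]),
}


def recursive_provenance(data_product, lookup, depth=None):
    """Return (level, data product, process) tuples, nearest level first.

    depth=None follows the history to its end; depth=n stops after n levels.
    """
    results = []
    frontier = [data_product]
    seen = set()
    level = 1
    while frontier and (depth is None or level <= depth):
        next_frontier = []
        for item in frontier:
            if item in seen or item not in lookup:
                continue
            seen.add(item)
            process, inputs = lookup[item]
            results.append((level, item, process))
            next_frontier.extend(inputs)
        frontier = next_frontier
        level += 1
    return results


# Depth 3 from the atlas graphic covers convert, slicer and softmean.
for level, product, process in recursive_provenance("atlas-x.gif", LOOKUP, depth=3):
    print(level, product, process)
```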

4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.

The Karma provenance service is primarily intended as a provenance recording and querying system, and has only limited capability for recording generic metadata and annotations. Provenance activities can have annotations, and relevant activities also contain the messages that were exchanged by service and client to perform an operation. These activities are recorded in a relational database, and free-text queries on the annotations are possible using SQL. Direct SQL queries are currently not exposed to the client, but the provenance service has the capability to answer these queries as follows:

  1. SQL Query to locate align_warp invocations (invoker+invokee pairs) that match input parameter of "-m 12" that ran on a Monday
       SELECT
          invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
          invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
       FROM
          invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
       WHERE
          invokee.entity_id = invocation.invokee_id AND
          invoker.entity_id = invocation.invoker_id AND
          notifications.source_id = invocation.invokee_id AND
          notifications.notification_type = 'ServiceInvoked' AND
          invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
          notifications.notification_xml LIKE '%<ModelMenuNumber>12</ModelMenuNumber>%' AND
          DAYOFWEEK(invocation.request_receive_time) = 2; -- 1=Sunday, 2=Monday, ...
    
    In our example (assuming the workflow had been run on a Monday; it was actually run on a Sunday), this query returns:
    Entity workflow_id service_id workflow_node_id workflow_timestep
    Invokee 1 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService' 6
    Invokee 2 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService_2' 8
    Invokee 3 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService_3' 10
    Invokee 4 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService_4' 12
    Invoker - 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' - -

  2. Using the invoker and invokee information from the above query, the client can use the getProcessProvenance API to query for the description of the matching align_warp services. The result of this is shown in query4.txt.
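For readers who want to check the two predicates of the SQL query outside the database, the following Python sketch mirrors them. The function names are hypothetical and the dates are arbitrary examples; the only assumption about the database is the DAYOFWEEK numbering noted in the query comment (1 = Sunday, 2 = Monday).

```python
from datetime import datetime


def dayofweek_sunday_first(timestamp):
    """Number the weekday as the query does: 1 = Sunday, 2 = Monday, ..., 7 = Saturday."""
    # isoweekday() gives Monday = 1 ... Sunday = 7, so rotate by one day.
    return timestamp.isoweekday() % 7 + 1


def matches_query4(notification_xml, request_receive_time):
    """Mirror the LIKE and DAYOFWEEK predicates of the SQL query."""
    has_model_12 = "<ModelMenuNumber>12</ModelMenuNumber>" in notification_xml
    ran_on_monday = dayofweek_sunday_first(request_receive_time) == 2
    return has_model_12 and ran_on_monday


# Illustrative notification fragment and timestamps, not taken from notifications.xml.
sample_xml = "<Params><ModelMenuNumber>12</ModelMenuNumber></Params>"
print(matches_query4(sample_xml, datetime(2006, 9, 11, 8, 0)))  # a Monday
print(matches_query4(sample_xml, datetime(2006, 9, 10, 8, 0)))  # a Sunday
```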

5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.

In the workflow we execute, the command-line applications are wrapped by shell scripts that can perform pre- and post-processing. We incorporate a call to the scanheader utility within the wrapper for align_warp and have it include the output of scanheader in the ServiceInvoked activity's annotation. The query then becomes similar to the previous case:

  1. SQL Query to locate align_warp invocations (invoker+invokee pairs) that have annotation of "global_maximum=4095"
       SELECT
          invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
          invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
       FROM
          entity_table invokee, entity_table invoker, notification_table notifications, invocation_state_table invocation
       WHERE
          invokee.entity_id = invocation.invokee_id AND
          invoker.entity_id = invocation.invoker_id AND
          notifications.source_id = invocation.invokee_id AND
          notifications.notification_type = 'ServiceInvoked' AND
          invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
          notifications.notification_xml LIKE '%global_maximum=4095%';
    
    In our example, this query returns:
    Entity workflow_id service_id workflow_node_id workflow_timestep
    Invokee_1 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService' 6
    Invokee_2 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService_2' 8
    Invokee_3 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService_3' 10
    Invokee_4 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' 'AlignWarpService_4' 12
    Invoker_0 - 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' - -

  2. Using the invoker and invokee information from the above query, the client can start a recursive descent down the process provenance tree to look for output data files that are images generated by the convert service.
PrintRecursiveDataUsageFor(Invoker_0, Invokee_1, 'urn:qname:...:ConvertService');

void PrintRecursiveDataUsageFor(EntityID invoker, EntityID invokee, URI processID)
   // get initial process's provenance
1. let $processProv := karma.getProcessProvenance(invoker, invokee)
2. let $processList := [$processProv], $visitedDataList := [], $outputDataList := []
   // start recursing down the data usage tree iteratively
3. while ($processList != empty) do
   a. foreach ($processProv in $processList) do
          // test if any of the processes in the current list was 'ConvertService'. If so, print its output image files.
      i.  if $processProv.getInvokee().getServiceID() == processID Print $processProv.getProducingData()
          // add data products that were produced to the list of output to recurse into
      ii. Add all $processProv.getProducingData() to $outputDataList
      // we're done with these processes
   b. $processList := []
   c. foreach ($outputData in $outputDataList) do
          // get the data usage list for the output data produced
      i.  let $dataUsage := karma.getDataUsage($outputData)
          // get the process provenance for each process that used the output data and add them to process list
      ii. foreach ($usedByProcess in $dataUsage.getUsageList())
          - let $processProv := karma.getProcessProvenance($usedByProcess.invoker, $usedByProcess.invokee)
          - Add $processProv to $processList
      // we're done with these data
   d. $outputDataList := []
4. End

The results of this operation are shown in query5.txt.
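The downward traversal can be sketched in Python as follows. The PRODUCES and USED_BY tables are hypothetical in-memory stand-ins for getProcessProvenance and getDataUsage, and the invocation and file names are invented for illustration. The sketch also tracks visited invocations, an addition over the pseudo-code, so that a process reached through several data products is reported once.

```python
# Hypothetical stand-in for getProcessProvenance():
# process invocation -> (service, output data products).
PRODUCES = {
    "align_warp_1": ("AlignWarpService", ["warp1.warp"]),
    "reslice_1": ("ResliceService", ["resliced1.img", "resliced1.hdr"]),
    "softmean": ("SoftmeanService", ["atlas.img"]),
    "slicer_x": ("SlicerService", ["atlas-x.pgm"]),
    "convert_x": ("ConvertService", ["atlas-x.gif"]),
}

# Hypothetical stand-in for getDataUsage():
# data product -> process invocations that consumed it.
USED_BY = {
    "warp1.warp": ["reslice_1"],
    "resliced1.img": ["softmean"],
    "resliced1.hdr": ["softmean"],
    "atlas.img": ["slicer_x"],
    "atlas-x.pgm": ["convert_x"],
}


def outputs_downstream(start_process, target_service):
    """Collect outputs of every target_service invocation reachable from start_process."""
    found = []
    visited = set()
    frontier = [start_process]
    while frontier:
        next_frontier = []
        for process in frontier:
            if process in visited:
                continue
            visited.add(process)
            service, outputs = PRODUCES[process]
            if service == target_service:
                found.extend(outputs)
            # Follow each produced data product to the processes that used it.
            for data in outputs:
                next_frontier.extend(USED_BY.get(data, []))
        frontier = next_frontier
    return found


print(outputs_downstream("align_warp_1", "ConvertService"))
```

Passing 'SoftmeanService' as the target instead of 'ConvertService' gives the variation needed for query #6.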

6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."

This is a variation of query 4 and query 5. The SQL query used to retrieve the align_warp services that had a model menu number of 12 is the same as the query in #4, without the DAYOFWEEK predicate. Similarly, the client's recursive procedure to locate the output of all SoftmeanService invocations that were preceded by these align_warps is the one outlined in query #5, with ConvertService replaced by SoftmeanService. They're reproduced below.

  1.   SELECT
          invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
          invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
       FROM
          invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
       WHERE
          invokee.entity_id = invocation.invokee_id AND
          invoker.entity_id = invocation.invoker_id AND
          notifications.source_id = invocation.invokee_id AND
          notifications.notification_type = 'ServiceInvoked' AND
          invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
          notifications.notification_xml LIKE '%<ModelMenuNumber>12</ModelMenuNumber>%';
    

PrintRecursiveDataUsageFor(Invoker_0, Invokee_1, 'urn:qname:...:SoftmeanService');

(See Query #5 for definition)

The results of this operation are shown in query6.txt.

7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

The getWorkflowTrace API of the Karma service returns the complete workflow trace for a workflow as an XML document. Given the workflow traces for two different workflows, it is possible to do a semantic "diff" of the two documents to find the differences in the processes that were invoked and the data products used and produced. The pseudo-code for printing out the differences between two workflow traces is given below:

void PrintWorkflowTraceDiff(WorkflowTrace trace1, WorkflowTrace trace2)
   // Workflow trace is an extension of process provenance document
1. let $processProv1 := trace1 as ProcessProvenance
2. let $processProv2 := trace2 as ProcessProvenance
3. PrintProcessProvenanceDiff($processProv1, $processProv2)
   // Each step in the workflow trace is a process provenance document
4. foreach ($processProv1, $processProv2 in trace1.getTraceSteps(), trace2.getTraceSteps())
   a. PrintProcessProvenanceDiff($processProv1, $processProv2)
5. End

void PrintProcessProvenanceDiff(ProcessProvenance processProv1, ProcessProvenance processProv2)
1. Print "Diff of Processes: ", processProv1.getInvokee(), processProv2.getInvokee()
2. if (processProv1.getInvokee() != processProv2.getInvokee()) 
   a. Print "Invokees Differ: ", processProv1.getInvokee(), processProv2.getInvokee()   
3. if (processProv1.getInvoker() != processProv2.getInvoker()) 
   a. Print "Invokers Differ: ", processProv1.getInvoker(), processProv2.getInvoker()
4. if (processProv1.getStatus() != processProv2.getStatus()) 
   a. Print "Process Completion Status Differ: ", processProv1.getStatus(), processProv2.getStatus()
5. if (processProv1.getRequestReceiveTime() != processProv2.getRequestReceiveTime()) 
   a. Print "Invocation Times Differ: ", processProv1.getRequestReceiveTime(), processProv2.getRequestReceiveTime()
6. foreach ($dataProd1, $dataProd2 in processProv1.getUsingData(), processProv2.getUsingData()) 
   a. PrintDataProductDiff($dataProd1, $dataProd2)
7. foreach ($dataProd1, $dataProd2 in processProv1.getProducingData(), processProv2.getProducingData()) 
   a. PrintDataProductDiff($dataProd1, $dataProd2)
8. End

void PrintDataProductDiff(DataProduct dataProd1, DataProduct dataProd2)
   // called for both used and produced data products
1. if (dataProd1.getDataProductID() != dataProd2.getDataProductID()) // trivial. IDs always differ.
   a. Print "Data Product IDs Differ: ", dataProd1.getDataProductID(), dataProd2.getDataProductID()
2. if (dataProd1.getLocation() != dataProd2.getLocation()) 
   a. Print "Data Product Locations Differ: ", dataProd1.getLocation(), dataProd2.getLocation()
3. if (dataProd1.getTimestamp() != dataProd2.getTimestamp()) 
   a. Print "Data Product Timestamps Differ: ", dataProd1.getTimestamp(), dataProd2.getTimestamp()
4. End

The second workflow was not run and hence the query results for this are not available.
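Although no results are available, the pairwise comparison of trace steps can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the Step class is a simplified view of a process provenance record, and the service names for the second run (PgmToPpmService, PnmToJpegService) are hypothetical. Steps are paired by position, as in the pseudo-code, and an unmatched step is reported when the traces differ in length.

```python
from dataclasses import dataclass, fields
from itertools import zip_longest


@dataclass
class Step:
    """Hypothetical, simplified view of one process provenance record."""
    invokee: str
    invoker: str
    status: str


def diff_traces(trace1, trace2):
    """Return (step index, field name, value in trace1, value in trace2) tuples."""
    diffs = []
    for index, (a, b) in enumerate(zip_longest(trace1, trace2)):
        if a is None or b is None:
            # One trace has more steps than the other.
            diffs.append((index, "step", a, b))
            continue
        for field in fields(Step):
            value_a = getattr(a, field.name)
            value_b = getattr(b, field.name)
            if value_a != value_b:
                diffs.append((index, field.name, value_a, value_b))
    return diffs


run1 = [Step("SlicerService", "workflow", "completed"),
        Step("ConvertService", "workflow", "completed")]
run2 = [Step("SlicerService", "workflow", "completed"),
        Step("PgmToPpmService", "workflow", "completed"),
        Step("PnmToJpegService", "workflow", "completed")]

for difference in diff_traces(run1, run2):
    print(difference)
```

Positional pairing is the simplest choice; a diff that aligns steps by service or workflow node would give more readable output when steps are inserted, at the cost of a more involved matching step.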

8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

As noted earlier, the Karma service does not support detailed annotations at the file level, deferring to an external metadata management system such as MyLEAD. However, it allows generic annotations to be submitted as part of the provenance activities, and these can be queried. We use this facility to add metadata about the input anatomy images to the provenance activity and query it. This is again similar to queries #4, #5 and #6 in that a SQL query retrieves the invocations and we use the getProcessProvenance API of Karma to retrieve the output data products.

  1. SQL Query to locate align_warp invocations (invoker+invokee pairs) whose input data products have annotation "center=UChicago"
       SELECT
          invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
          invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
       FROM
          invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
       WHERE
          invokee.entity_id = invocation.invokee_id AND
          invoker.entity_id = invocation.invoker_id AND
          notifications.source_id = invocation.invokee_id AND
          notifications.notification_type = 'ServiceInvoked' AND
          invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
          notifications.notification_xml LIKE '%<Center>UChicago</Center>%';
    

  2. We then call getProcessProvenance on the invocations returned by the above query and print the produced data product elements. Assuming all 4 align_warp services match, the results are shown in query8.txt.

9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

The Karma service does not support complex queries such as these on the data product annotations. One way to perform this query would be to retrieve the annotations for atlas graphics with key studyModality having value speech, visual or audio using a query similar to query #8, and then to filter out the keys at the client end. However, we do not expect to answer such queries through the provenance system, and these will not be part of the provenance service API.

Suggested Workflow Variants

Suggest variants of the workflow that can exhibit capabilities that your system supports.

  • Workflows with loops.
  • Workflows whose structure changes dynamically (or, as a simpler case, workflows with conditional branches).
  • Hierarchical composition of workflows. (workflows invoking other workflows)

Suggested Queries

Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.

  • Find all [workflows | processes] with a particular execution status [completed | failed | waiting for input]
  • Show the client view and service view of the provenance and check for differences

Categorisation of queries

According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.

  • Provenance Structure
  • Annotation

Live systems

If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.

Further Comments

Provide here further comments.

Conclusions

Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.

-- YogeshSimmhan - 13 Sep 2006

Attachment Size Date Who Comment
KarmaBrainAtlasWF.gif 254.0 K 11 Sep 2006 - 08:02 YogeshSimmhan Karma's Brain Atlas Workflow Composition in BPEL using XBaya
KarmaBrainAtlasWF-bpel.xml 31.8 K 11 Sep 2006 - 08:06 YogeshSimmhan BPEL Script for Workflow
KarmaBrainAtlasWF.xwf 192.1 K 11 Sep 2006 - 08:16 YogeshSimmhan Workflow representation that can be viewed/edited/launched from XBaya
recursive_data_provenance.xml 28.5 K 12 Sep 2006 - 02:45 YogeshSimmhan Data Provenance retrieved recursively for a data product and its ancestral data products (Results of Query 1)
data_provenance.xml 1.2 K 12 Sep 2006 - 02:45 YogeshSimmhan Data Provenance retrieved for a data product
process_provenance.xml 1.8 K 12 Sep 2006 - 02:46 YogeshSimmhan Process Provenance for a single service invocation
workflow_trace.xml 23.1 K 12 Sep 2006 - 02:46 YogeshSimmhan Workflow Trace for all invocations in a workflow
karma.xsd 13.1 K 12 Sep 2006 - 02:57 YogeshSimmhan Karma v2.x schema describing provenance documents
query2.txt 5.3 K 12 Sep 2006 - 03:45 YogeshSimmhan Results of Query 2
query3.xml 17.3 K 12 Sep 2006 - 03:46 YogeshSimmhan Results of Query 3
query4.txt 7.0 K 13 Sep 2006 - 13:44 YogeshSimmhan Results of Query 4
query5.txt 0.7 K 13 Sep 2006 - 13:44 YogeshSimmhan Results of Query 5
query8.txt 0.9 K 13 Sep 2006 - 13:46 YogeshSimmhan Results of Query 8
notifications.xml 123.2 K 13 Sep 2006 - 13:46 YogeshSimmhan Sample Provenance Activity log generated by Workflow
karma.ppt 904.0 K 13 Sep 2006 - 13:47 YogeshSimmhan Presentation Draft
query6.txt 0.4 K 13 Sep 2006 - 13:55 YogeshSimmhan Results of Query 6

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.