
Provenance Challenge: UvA/VL-e

Participating Team

Team and Project Details

Workflow Representation

The workflow is implemented using a Grid-based workflow system developed at the University of Amsterdam, called WS-VLAM. The provenance, however, is handled by an initial version of PLIER (Provenance Layer Infrastructure for e-Science Resources), which is being developed by the Information Management Group at PCC UvA. Essentially, PLIER hides the details of handling and managing provenance from the WS-VLAM system while it executes the workflow. The WS-VLAM engine uses PLIER, through its API, to turn provenance data from the workflow into an OPM graph.
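The division of labour described above can be sketched roughly as follows. This is only an illustration, not the PLIER API: the names `OpmGraph`, `ProvenanceRecorder`, `process_started`, `process_finished` and `run_workflow` are hypothetical, and the edge labels are the standard OPM dependencies (`used`, `wasGeneratedBy`, `wasControlledBy`). Module parameters are recorded as agents, following the convention this page describes in its query section.

```python
from dataclasses import dataclass, field


@dataclass
class OpmGraph:
    """Tiny in-memory stand-in for an OPM graph (hypothetical structure)."""
    artifacts: set = field(default_factory=set)
    processes: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (kind, effect, cause)


class ProvenanceRecorder:
    """Hypothetical facade: the engine reports events, the recorder builds the graph."""

    def __init__(self):
        self.graph = OpmGraph()

    def process_started(self, process_id, inputs, parameters):
        self.graph.processes.add(process_id)
        for artifact in inputs:
            self.graph.artifacts.add(artifact)
            self.graph.edges.append(("used", process_id, artifact))
        # Assumption: parameters (for example file names) are stored as agents.
        for agent in parameters:
            self.graph.agents.add(agent)
            self.graph.edges.append(("wasControlledBy", process_id, agent))

    def process_finished(self, process_id, outputs):
        for artifact in outputs:
            self.graph.artifacts.add(artifact)
            self.graph.edges.append(("wasGeneratedBy", artifact, process_id))


def run_workflow(steps, recorder):
    """Stand-in for the engine loop; each module runs as a black box."""
    for step in steps:
        recorder.process_started(step["id"], step["inputs"], step["parameters"])
        # The module itself would execute here, unaware of the recorder.
        recorder.process_finished(step["id"], step["outputs"])
    return recorder.graph
```

In such a design the engine only reports what it scheduled and what came back, so the resulting graph describes the workflow as a whole and says nothing about what happens inside a module.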

The workflow clearly has to be adapted to the WS-VLAM system for its composition and execution. While performing this adaptation, we already faced multiple ways to map the command-line workflow to its WS-VLAM counterpart. It is possible, for instance, to execute some processes by farming them or to employ sequential processes instead, and to create one 'big' workflow or to use composite (hierarchical) workflows. Some of these initial alternatives are described in the slides contained in this powerpoint file.

The following figures present the different stages of generating the provenance. Figure 1 shows the WS-VLAM Composer with the design, or representation, of the workflow. The OPM graph generated by PLIER can be viewed in XML format, in Figure 2, or as a diagram, in Figure 3.

Figure 1. WS-VLAM Composer (workflow)
Figure 2. PLIER (XML tree of the OPM graph)
Figure 3. PLIER (diagram of the OPM graph)

As with the previous figures, the representations below show the WS-VLAM Composer, in Figure 4, and the PLIER tool, in Figure 5, for a more detailed workflow.

Figure 4. WS-VLAM Composer (detailed workflow)
Figure 5. PLIER (detailed workflow)

Open Provenance Model Output

The following table shows the latest OPM graphs generated by PLIER from WS-VLAM workflows.

Case (Job ID) OPM Graphs (XML format)
J062941 J062941-OPM.xml
J062942 J062942-OPM.xml
J062943 J062943-OPM.xml
J062944 J062944-OPM.xml
J062945 J062945-OPM.xml

Query Results

The PLIER repository is a relational database that scientists access indirectly through the GUI tool or programmatically via the API. Basically, the PLIER repository handles the data but relies on the GUI end-user application to retrieve or query the information elements. These tools, however, are still under development, and at present they may only be able to browse the data rather than query it. Although PLIER does not provide any low-level manipulation mechanisms, it does not prevent the user from accessing the repository through SQL commands. Thus, for the sake of clarity, the OPM data will be queried using both OQL and SQL.

Query 1

For a given detection, which CSV files contributed to it?

Our solution first notes that the WS-VLAM system is agnostic of what the processes do. The WS-VLAM engine schedules and submits each process for execution and monitors its progress. The module, in turn, executes its specific task as a black box, without being aware of the existence of the WS-VLAM system. Second, PLIER is meant to provide provenance for the workflow as a whole and, at this moment, does not consider the granularity inside the modules. Under these circumstances, if the given detection is handled inside a module, the generated provenance data by itself is insufficient to answer the query.

We can interpret this query from another point of view. In our provenance data, the filenames are provided as parameters to the modules to generate the output (or the input). The parameters are expressed in the OPM graph as Agents, each having an ID that links it to the detection and a data Value. Thus, in order to retrieve the CSV files that contributed to a detection, we perform the following query:

   SQL: select Agent.Value from AGENT where Agent.ID like '%Detection%'
Which returns:
    P2_J062945_B001_P2fits0_20081115_P2Detection.csv
Another possibility is to employ a more general query that retrieves the CSV files participating in all events.
   SQL: select * from AGENT where Agent.value like '%.csv'
Which returns:
   P2_J062945_B001_P2fits0_20081115_P2Detection.csv
   P2_J062945_B001_P2fits0_20081115_P2FrameMeta.csv
   P2_J062945_B001_P2fits0_20081115_P2ImageMeta.csv
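
For readers who want to try the two queries without access to the PLIER repository, the following is a minimal sketch using an in-memory SQLite database. The two-column AGENT table, the agent IDs, and the helper names are assumptions made for this illustration; only the file names are taken from the results above, and the real repository schema may differ.

```python
import sqlite3


def build_demo_repository():
    """Create a stand-in AGENT table; the schema and the IDs are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE AGENT (ID TEXT PRIMARY KEY, Value TEXT)")
    prefix = "P2_J062945_B001_P2fits0_20081115_"
    rows = [
        ("J062945_P2Detection_file", prefix + "P2Detection.csv"),
        ("J062945_P2FrameMeta_file", prefix + "P2FrameMeta.csv"),
        ("J062945_P2ImageMeta_file", prefix + "P2ImageMeta.csv"),
        ("J062945_JobID", "J062945"),  # a non-file parameter, for contrast
    ]
    conn.executemany("INSERT INTO AGENT (ID, Value) VALUES (?, ?)", rows)
    return conn


def files_for_keyword(conn, keyword):
    """First query: agents whose ID mentions the keyword (e.g. 'Detection')."""
    cur = conn.execute(
        "SELECT Value FROM AGENT WHERE ID LIKE ? ORDER BY Value",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in cur]


def all_csv_files(conn):
    """Second query: every agent whose value ends in '.csv'."""
    cur = conn.execute(
        "SELECT Value FROM AGENT WHERE Value LIKE '%.csv' ORDER BY Value"
    )
    return [row[0] for row in cur]
```

Calling `files_for_keyword(conn, "Detection")` on this demo data yields only the P2Detection file, while `all_csv_files(conn)` yields all three CSV files. Both helpers match on naming conventions alone, which is exactly the limitation noted in the comment that follows.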

Comment:

  • In this second view, we do not see any link between the database's internal operations and the generated OPM graph. In the implementation, the database receives a bulk file that is imported into its internal tables. After performing this task, the only information available from the database is the number of records inserted, updated, or deleted. We are answering the query based on the ID containing 'Detection' or the values ending in '.csv' (part of the input file name). Clearly, it is not always the case that a file name or its extension reflects its content.

Query 2

The user considers a table to contain values they do not expect. Was the range check ('IsMatchTableColumnRanges') performed for this table?

Query 3

Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

Suggested Workflow Variants

Suggestions for Modification of the Open Provenance Model

OPM Specification

While implementing the necessary mechanisms to import and export information using the OPM model, based on the given XML Schema (http://openprovenance.org), we faced some difficulties when parsing the XML tags. We therefore modified the schema to cope with those problems as well as to match the database model of the repository. These changes to the original XML schema are summarized below:

  • The modified schema basically contains the same definitions, but instead of using complexType constructs we employ element definitions.
  • The main problem for our parser was to unambiguously identify the concept account since the definition for AccountId and Account share the same element (account) and attribute (id).
  • The modified schema takes a generic approach to ids rather than explicitly defining ProcessId, AgentId, or ArtifactId.

To clarify these points, we have attached the revised XML schema, OPMv101.revised.xsd.
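
The account ambiguity mentioned in the list above can be illustrated with a small sketch. The XML fragment is a simplified, hypothetical excerpt (namespace omitted, element layout assumed) and is not taken from the attached schema or from the generated graphs. It shows one way a parser could tell an account definition from an account reference when both use the element name account and the attribute id: by looking at the parent element.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical OPM fragment: the same <account id="..."/> element
# appears once as a definition and once as a reference inside a process.
SAMPLE = """
<opmGraph>
  <accounts>
    <account id="acc1"/>
  </accounts>
  <processes>
    <process id="p1">
      <account id="acc1"/>
    </process>
  </processes>
</opmGraph>
"""


def classify_accounts(xml_text):
    """Split <account> elements into definitions and references by parent tag."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    definitions, references = [], []
    for elem in root.iter("account"):
        parent = parent_of[elem]
        if parent.tag == "accounts":
            definitions.append(elem.get("id"))
        else:
            references.append((parent.tag, parent.get("id"), elem.get("id")))
    return definitions, references
```

A parser that dispatches only on the element name cannot make this distinction, which is presumably why giving each concept its own element definition made parsing simpler for us to describe; the sketch above is just one workaround under the stated assumptions.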

Conclusions

-- VictorGuevara - 02 Jun 2009

Attachment              Size      Date                 Who            Comment
PC3-WSVLAM-V1.png       76.0 K    16 Apr 2009 - 09:27  VictorGuevara  WS-VLAM Workflow
PC3-WSVLAM-V2.png       89.7 K    16 Apr 2009 - 09:28  VictorGuevara  WS-VLAM Workflow - detailed
wsvlam-PC3-1.pps        2505.5 K  16 Apr 2009 - 12:46  VictorGuevara  WS-VLAM Workflow Components
PC3-Plier01.jpg         169.9 K   29 May 2009 - 15:30  VictorGuevara  Screenshot of Plier Exchange
PC3-Plier02.jpg         202.5 K   29 May 2009 - 15:47  VictorGuevara  Screenshot of Plier Exchange (Diagram)
PC3-model-wide.png      156.2 K   02 Jun 2009 - 13:24  VictorGuevara  WS-VLAM (Detailed workflow)
PC3-model-wide-OPM.png  172.2 K   02 Jun 2009 - 13:25  VictorGuevara  PLIER GUI (Detailed workflow)
PC3-model-wide-OPM.pdf  142.9 K   02 Jun 2009 - 13:29  VictorGuevara  OPM Graph (Detailed workflow [PDF])
PC3-j062941.xml         13.0 K    02 Jun 2009 - 15:39  VictorGuevara  Workflow j062941
PC3-j062942.xml         13.0 K    02 Jun 2009 - 15:40  VictorGuevara  Workflow j062942
PC3-j062943.xml         13.0 K    02 Jun 2009 - 15:41  VictorGuevara  Workflow j062943
PC3-j062944.xml         13.0 K    02 Jun 2009 - 15:41  VictorGuevara  Workflow j062944
PC3-j062945.xml         13.0 K    02 Jun 2009 - 15:42  VictorGuevara  Workflow j062945

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.