
Provenance Challenge: NCSA

Participating Team

Team and Project Details

  • Short team name: NCSA
  • Participant names: Joe Futrelle, Jim Myers, Robert Clark
  • Project Url: http://leovip217.ncsa.uiuc.edu/
  • Project Overview:
  • Relevant Publications:

Workflow Representation

We chose to implement the workflow using the provided Java code. That choice made it straightforward to use Tupelo, a semantic content repository framework being developed at NCSA, to record provenance records. In addition to a Java OPM binding, Tupelo provides a dedicated provenance API, which allowed us to weave provenance-recording calls directly into the workflow code. This provenance API produces OPM-compliant records while abstracting away from the user the means by which those observations are stored.

Approach to Recording Provenance

To help make sense of the OPM output and graphs in the following sections, we first describe our OPM naming scheme and our approach to recording provenance for this challenge.

Recording Guidelines
The following rules guided what we recorded (a small sketch illustrating how they map onto the workflow code follows this list).
  • All of the code to record provenance information is contained in LoadWorkflow.java.
  • Each new object created in LoadWorkflow.java is treated as a provenance artifact until that object's state is modified; the same object, now in a new state, is then treated as a new artifact.
  • Each conditional, loop iteration and method call in LoadWorkflow.java is treated as a process.
  • We say that a successful conditional triggers the next process in the workflow (e.g. IsCSVReadyFileExistsConditional triggers ReadCSVReadyFileProcess).
  • We also say that each loop-iteration process triggers the first process inside the body of the loop (e.g. LoopIteration0 triggers IsExistsCSVFile0).
  • The first iterator process is triggered by the CreateEmptyLoadDB process, since after returning from CreateEmptyLoadDB we enter the loop structure.
  • If the conditional within the for-each is false (i.e. the loop ends), that iterator process triggers the CompactDatabase process. In this workflow's case, Iterator3 triggers CompactDatabase, since the for-each loop runs three times.
  • We say that the last process in the body of the loop (IsMatchTableColumnRanges) triggers the next iterator process (e.g. IsMatchTableColumnRangesProcess0 triggers Iterator1).
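
To illustrate these rules, here is a minimal sketch (not our actual instrumentation; the attached example code referenced below is authoritative) of how the control flow of LoadWorkflow.java maps onto processes and trigger relationships. The recordProcess and recordTriggeredBy helpers are hypothetical stand-ins for the Tupelo provenance-API calls we actually use.

    // Hypothetical stand-ins for the actual Tupelo provenance-API calls.
    void recordProcess(String name) { /* record an OPM process with this name */ }
    void recordTriggeredBy(String process, String triggeredBy) { /* record that triggeredBy triggered process */ }

    // Sketch of the instrumented load loop; csvFiles stands for the files named in the CSV ready file.
    void instrumentedLoadSketch(String[] csvFiles) {
        recordProcess("CreateEmptyLoadDBProcess");
        String previous = "CreateEmptyLoadDBProcess";
        for (int i = 0; i < csvFiles.length; i++) {
            // Each loop iteration is itself a process, triggered by whatever process ran last
            // (CreateEmptyLoadDB for Iterator0, the last body process for later iterations).
            recordProcess("Iterator" + i);
            recordTriggeredBy("Iterator" + i, previous);
            // The iteration process triggers the first process inside the loop body.
            recordProcess("IsExistsCSVFileProcess" + i);
            recordTriggeredBy("IsExistsCSVFileProcess" + i, "Iterator" + i);
            // ... the remaining body processes follow, each triggered by its predecessor ...
            previous = "IsMatchTableColumnRangesProcess" + i;
        }
        // When the for-each conditional fails, the final iterator process triggers CompactDatabase
        // (Iterator3 in a three-file run).
        recordProcess("Iterator" + csvFiles.length);
        recordTriggeredBy("Iterator" + csvFiles.length, previous);
        recordProcess("CompactDatabaseProcess");
        recordTriggeredBy("CompactDatabaseProcess", "Iterator" + csvFiles.length);
    }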

Naming Scheme
  • A process entity representing a method call is given the same name as the method call, with Process appended as a suffix; e.g. a call to IsCSVReadyFileExists is represented by an RDF triple with subject IsCSVReadyFileExistsProcess, or by an OPM XML element with id IsCSVReadyFileExistsProcess.
  • An artifact entity representing an object generated by a method call or by user input is given the same name as the object, with Artifact appended; e.g. the user's input JobID string is represented by an RDF triple about subject JobIDArtifact, or by an OPM XML element with id JobIDArtifact.
  • Each iteration of the loop is represented by an RDF triple about subject IteratorN, where N is the number of completed iterations so far.
  • If an entity is created within the loop, we use the same naming scheme as for entities outside the loop, but with the number of completed iterations appended to the end of the name; e.g. there are RDF triples about IsExistsCSVFileOutputArtifact0, IsExistsCSVFileOutputArtifact1 and IsExistsCSVFileOutputArtifact2, and OPM XML entities with the same ids (see the short sketch after this list).
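
As a small illustration (not code from LoadWorkflow.java), the resource URIs for these names can be built with the same PC3Utilities.ns helper used in the query code below; here we assume ns returns a Tupelo Resource, as its use elsewhere on this page suggests, and that completedIterations is the loop's completed-iteration count.

        // Artifact created inside the loop: base name plus the completed-iteration count,
        // e.g. http://pc3#IsExistsCSVFileOutputArtifact2.
        Resource inLoopArtifact = PC3Utilities.ns("IsExistsCSVFileOutputArtifact" + completedIterations);
        // Entities outside the loop omit the suffix.
        Resource process = PC3Utilities.ns("IsCSVReadyFileExistsProcess");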

A small piece of code, with extensive comments explaining the process we used when recording a single OPM event in the workflow execution, is available here. Additionally, a piece of code illustrating our approach to recording provenance about the loop structure of the workflow can be found here.

Open Provenance Model Output

The OPM output for a successful run of the workflow with each dataset is available in the table below. Additionally, we have developed a tool that takes OPM output generated by Tupelo and generates a gif file of the OPM graph. This tool is similar to the opm2dot tool available in the OPM toolbox, except that its input is not formatted in the OPM XML schema. The graph produced by this tool for a successful run of the J062941 dataset can be found here.
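
The conversion tool itself is not reproduced on this page, but the following is a minimal sketch of the idea, assuming the same Tupelo Unifier and reified OPM arc properties that appear in the query code below; it emits Graphviz dot edges (rather than a gif) for the used and wasGeneratedBy arcs, and the triggered arcs could be handled the same way.

        // Emit one dot edge per "used" arc: process -> artifact.
        Unifier used = new Unifier();
        used.setColumnNames("process", "artifact");
        used.addPattern("u", Opm.USED_BY_PROCESS, "process");
        used.addPattern("u", Opm.USED_ARTIFACT, "artifact");
        context.perform(used);
        System.out.println("digraph opm {");
        for (Tuple<Resource> t : used.getResult()) {
            System.out.println("  \"" + t.get(0) + "\" -> \"" + t.get(1) + "\" [label=\"used\"];");
        }
        // Emit one dot edge per "wasGeneratedBy" arc: artifact -> process.
        Unifier generated = new Unifier();
        generated.setColumnNames("artifact", "process");
        generated.addPattern("g", Opm.GENERATED_ARTIFACT, "artifact");
        generated.addPattern("g", Opm.GENERATED_BY_PROCESS, "process");
        context.perform(generated);
        for (Tuple<Resource> t : generated.getResult()) {
            System.out.println("  \"" + t.get(0) + "\" -> \"" + t.get(1) + "\" [label=\"wasGeneratedBy\"];");
        }
        System.out.println("}");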

Case    | RDF XML (Tupelo)   | OPM XML v1.01.a
J062941 | J062941_output.rdf | J609241_output.xml
J062942 | J062942_output.rdf | J609242_output.xml
J062943 | J062943_output.rdf | J609243_output.xml
J062944 | J062944_output.rdf | J609244_output.xml
J062945 | J062945_output.rdf | J609245_output.xml

Query Results

Core Query 1
Our approach to this query is as follows. For a given detection, we can use tools provided by Tupelo to find all artifacts on which the detection causally depends that are of RDF type CSV_file and have a PathToFile property. The following code accomplishes this.
        // Match every artifact of RDF type CSV_file and bind its PathToFile value.
        Unifier u = new Unifier();
        u.setColumnNames("file", "path");
        u.addPattern("file", Rdf.TYPE, PC3Utilities.ns("CSV_file"));
        u.addPattern("file", PC3Utilities.ns("PathToFile"), "path");
        context.perform(u);
        // Each tuple pairs a CSV_file artifact with the path of the file it represents.
        for(Tuple<Resource> r : u.getResult()) {
            System.out.println(r);
        }

This yields the following output.

    [http://pc3#FileEntryArtifact1, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv]
    [http://pc3#FileEntryArtifact2, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2Detection.csv]
    [http://pc3#FileEntryArtifact0, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv]

Core Query 2
To be able to answer this query, every time an IsMatchTableColumnRanges check is performed on a particular table, we assert during the execution of the workflow that the IsMatchTableColumnRanges process has a "PerformedOn" predicate with the table name as the object.
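
The actual recording happens inside LoadWorkflow.java; purely as a sketch of what such an assertion could look like, using the context.addTriple, PC3Utilities.ns and Resource.literal calls that appear elsewhere on this page (the iteration index i and the tableName variable are assumed to be in scope in the workflow code):

        // Assert that this iteration's column-range check was performed on the named table.
        context.addTriple(
                PC3Utilities.ns("IsMatchTableColumnRangesConditional" + i),
                PC3Utilities.ns("PerformedOn"),
                Resource.literal(tableName));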

Then, to answer the query, we accept a string representing the table name in question and use Tupelo's Unifier tool to find the process that was "PerformedOn" that table.

        boolean checkPerformed = false;
        // Find any process recorded as having been "PerformedOn" the given table name s.
        Unifier u = new Unifier();
        u.setColumnNames("process");
        u.addPattern("process", PC3Utilities.ns("PerformedOn"), Resource.literal(s));
        context.perform(u);
        if(u.getResult().iterator().hasNext()) {
            String process = u.getResult().iterator().next().get(0).toString();
            // Only the IsMatchTableColumnRanges conditionals carry the "PerformedOn" property.
            assertTrue(process.substring(process.lastIndexOf("#")+1, process.length()-1).equals("IsMatchTableColumnRangesConditionals"));
            checkPerformed = true;
        }
        if(checkPerformed) {
            System.out.println("IsMatchTableColumnRanges was performed on table "+s);
        }

Answering the query in this way allows us to say not only that the check was performed on the table, but also that the check was successful. The assertion should always hold, since we recorded provenance such that any process with a "PerformedOn" property is among the IsMatchTableColumnRangesConditional processes.

Core Query 3
We took this query to mean: given an image table (an artifact), on which processes does this table causally depend?

With this understanding, we answered the query as follows. We used a Tupelo tool called a Transformer to query the graph and assert new relations on the results. In particular, we created a new relation called "InferredTrigger", derived after the fact from the recorded provenance rather than asserted through the provenance API during the workflow execution.

        // First, infer a trigger dependency directly from each recorded wasTriggeredBy arc.
        Transformer t;
        t = new Transformer();
        t.addInPattern("t", Opm.TRIGGERED_PROCESS, "p1");
        t.addInPattern("t", Opm.TRIGGERED_BY_PROCESS, "p2");
        t.addOutPattern("p1", PC3Utilities.ns("InferredTrigger"), "p2");
        context.perform(t);
        context.addTriples(t.getResult());

        // Second, infer a trigger dependency whenever one process used an artifact
        // that was generated by another process.
        t = new Transformer();
        t.addInPattern("g", Opm.GENERATED_BY_PROCESS, "p2");
        t.addInPattern("g", Opm.GENERATED_ARTIFACT, "a");
        t.addInPattern("u", Opm.USED_ARTIFACT, "a");
        t.addInPattern("u", Opm.USED_BY_PROCESS, "p1");
        t.addOutPattern("p1", PC3Utilities.ns("InferredTrigger"), "p2");
        context.perform(t);
        context.addTriples(t.getResult());

This InferredTrigger relation is an arc from one process to another process on which the first causally depends. Once that relation was constructed, we used another Tupelo tool called TransitiveClosure, which requires two parameters: an entity called the seed, and the relation on which to close. With those two properties set, we closed the InferredTrigger relation using LoadCSVFileIntoTableProcess1 as the seed.

        // Transitively close the InferredTrigger relation starting from the seed process,
        // then assert an InferredTrigger arc from the seed to everything reachable.
        TransitiveClosure tc;
        tc = new TransitiveClosure();
        tc.setSeed(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"));
        tc.setPredicate(PC3Utilities.ns("InferredTrigger"));
        Set<Resource> result = tc.close(context);
        for(Resource r : result) {
            context.addTriple(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"), PC3Utilities.ns("InferredTrigger"), r);
        }
Finally, we returned all the processes related to LoadCSVFileIntoTableProcess1 by an InferredTrigger arc.

        Unifier u = new Unifier();
        u.setColumnNames("p");
        u.addPattern(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"), PC3Utilities.ns("InferredTrigger"), "p");
        context.perform(u);
        System.out.println("Suggested Query Five Results:");
        for(Tuple<Resource> r : u.getResult()) {
            System.out.println("\t"+r);
        }

These processes are the answer to the query.

This is our output:

    [http://pc3#IsCSVReadyFileExistsProcess]
    [http://pc3#LoadCSVFileIntoTableConditional0]
    [http://pc3#IsExistsCSVFileConditional1]
    [http://pc3#LoadCSVFileIntoTableProcess0]
    [http://pc3#IsExistsCSVFileProcess1]
    [http://pc3#UpdateComputedColumnsConditional0]
    [http://pc3#Iterator1]
    [http://pc3#ReadCSVFileColumnNamesProcess0]
    [http://pc3#IsMatchCSVFileTablesConditional]
    [http://pc3#IsCSVReadyFileExistsConditional]
    [http://pc3#UpdateComputedColumnsProcess0]
    [http://pc3#IsMatchTableRowCountProcess0]
    [http://pc3#IsExistsCSVFileProcess0]
    [http://pc3#IsMatchCSVFileTablesProcess]
    [http://pc3#ReadCSVFileColumnNamesProcess1]
    [http://pc3#IsMatchCSVFileColumnNamesProcess1]
    [http://pc3#IsMatchTableColumnRangesProcess0]
    [http://pc3#IsMatchTableColumnRangesConditional0]
    [http://pc3#IsMatchCSVFileColumnNamesProcess0]
    [http://pc3#CreateEmptyLoadDBProcess]
    [http://pc3#Iterator0]
    [http://pc3#IsMatchCSVFileColumnNamesConditional0]
    [http://pc3#ReadCSVReadyFileProcess]
    [http://pc3#IsMatchCSVFileColumnNamesConditional1]
    [http://pc3#IsExistsCSVFileConditional0]
    [http://pc3#IsMatchTableRowCountConditional0]

Optional Query 1
We answered this query by first simulating an exception in the workflow execution and then recursively walking the graph, looking for the "dead end" iteration process where the exception was thrown. To accomplish this, we call _findIterators (below) with findOne set to true; the findOne flag tells _findIterators to stop once the first entity of RDF type loop_iteration has been found. We then call _findIterators again, this time with findOne set to false; once this returns we have all entities of RDF type loop_iteration. Finally, the answer to the query is the result of the first call of _findIterators removed from the result of the second call (a driver sketch follows the method below).

    // Walk the graph from node n, collecting every reachable entity of RDF type loop_iteration.
    // If findOne is true, stop as soon as the first such entity has been found.
    void _findIterators(GraphNode n, Set<Resource> beenThere, Set<Resource> result, boolean findOne) throws Exception {
        Resource subject = n.getSubject();
        if(beenThere.contains(subject) || (findOne && !result.isEmpty())) {
            return;
        }
        beenThere.add(subject);
        for(GraphEdge edge : n.getOutgoingEdges()) {
            if(session.fetchThing(edge.getSink().getSubject()).getTypes().contains(PC3Utilities.ns("loop_iteration"))) {
                result.add(edge.getSink().getSubject());
            }
            _findIterators(edge.getSink(), beenThere, result, findOne);
        }
    }
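
The driver logic described above is not shown on this page; the following is a minimal sketch of it, assuming root is the GraphNode from which we start walking the halted run's graph.

        // First pass: stop at the first loop_iteration reached (the "dead end" iteration).
        Set<Resource> deadEnd = new HashSet<Resource>();
        _findIterators(root, new HashSet<Resource>(), deadEnd, true);

        // Second pass: collect every loop_iteration reachable in the graph.
        Set<Resource> allIterations = new HashSet<Resource>();
        _findIterators(root, new HashSet<Resource>(), allIterations, false);

        // The answer is the second result with the first removed.
        allIterations.removeAll(deadEnd);
        for(Resource r : allIterations) {
            System.out.println(r);
        }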

Note that one cannot simply select every entity of RDF type loop_iteration from the Tupelo context and take that count minus one as the answer to the query, since it is possible to create many entities of RDF type loop_iteration long before they are used in the provenance trail.

Optional Query 3
To answer this query we used the time-annotation support for provenance arcs provided by Tupelo. We first isolated the two used arcs, where the first arc represents the check that IsMatchCSVFileTablesOutput is true and the second represents the point at which an IsExistsCSVFileConditional failed.

Once we had those two arcs, answering this query was a matter of extracting the intervals of time both events were known to have occurred within and returning the upper and lower bound of time that may have elapsed between the two events.

        ProvenanceContextFacade pcf = new ProvenanceContextFacade();
        pcf.setContext(context);
        ProvenanceArcSeeker s = new ProvenanceArcSeeker(pcf);
        // The used arc for the successful IsMatchCSVFileTables check.
        Collection<ProvenanceUsedArc> matchArcs = s.findUsedEdges(
                pcf.getArtifact(PC3Utilities.ns("IsMatchCSVFileTablesArtifact")),
                pcf.getProcess(PC3Utilities.ns("IsMatchCSVFileTablesConditional")),
                null, pcf.getAccount(PC3Utilities.ns("account")));
        assertTrue(matchArcs.size() == 1);
        ObservedTime matchTime = ((ProvenanceRoleArc) matchArcs.toArray()[0]).getInterval();
        // The used arc for the failed IsExistsCSVFile check.
        Collection<ProvenanceUsedArc> existsArcs = s.findUsedEdges(
                pcf.getArtifact(PC3Utilities.ns("IsExistsCSVFileArtifact1")),
                pcf.getProcess(PC3Utilities.ns("IsExistsCSVFileConditional1")),
                null, pcf.getAccount(PC3Utilities.ns("account")));
        assertTrue(existsArcs.size() == 1);
        ObservedTime existTime = ((ProvenanceRoleArc) existsArcs.toArray()[0]).getInterval();
        // Each event is only known to lie within an observed interval, so the elapsed time
        // between the two events is itself an interval (clamped below at zero).
        long lowerBoundElapsed;
        long upperBoundElapsed;
        lowerBoundElapsed = existTime.getNoEarlier().getTime() - matchTime.getNoLater().getTime();
        if(lowerBoundElapsed < 0 ) {
            lowerBoundElapsed = 0;
        }
        upperBoundElapsed = existTime.getNoLater().getTime() - matchTime.getNoEarlier().getTime();
        System.out.println("Somewhere between ["+ lowerBoundElapsed+", "+upperBoundElapsed+"] milliseconds elapsed" +
              " between the successful IsMatchCSVFileTables and the unsuccessful IsExistsCSVFile.");
On a particular run of the workflow and query we generated the following output:

Somewhere between [9988, 10007] milliseconds elapsed between the successful IsMatchCSVFileTables and the unsuccessful IsExistsCSVFile.

Optional Query 9
In light of some of the discussion on the mailing list about this question, we answer the following question instead: suppose one run of the workflow halts and another runs to completion; is an account contained in the halted run's graph contained in an account contained in the successful run's graph?

To answer this, we wrote a utility that takes as parameters two graphs, two accounts and two nodes. The utility returns true if the account contained in the first graph is contained in the account contained in the second. More precisely, beginning at the given nodes, one in the first graph and one in the second, we continue if the account-specific arcs on the first node are a subset of the account-specific arcs on the second node. Here arc equality is defined as having the same source and sink and, where applicable, the same role; two nodes are equal if they have the same name. To continue, we recursively follow each of the account-specific arcs and verify that the set of account-specific arcs of each node in the first graph is contained in the set of account-specific arcs of the corresponding node in the second graph. If we have visited every node in the first graph and the arcs on each node satisfy this containment requirement, then the account contained in the first graph is a subgraph of the account contained in the second graph.

The code that accomplishes this can be found here.
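
For illustration only (the attached code is authoritative), here is a minimal sketch of the recursive containment check described above. It assumes the same GraphNode/GraphEdge accessors used elsewhere on this page, that the two graphs passed in already contain only the account-specific arcs, and a hypothetical sameArc helper implementing the source/sink/role equality described above.

    // Returns true if, starting from these two nodes, every account-specific arc reachable in
    // the halted run's graph has a matching arc in the successful run's graph.
    boolean containedIn(GraphNode halted, GraphNode complete, Set<Resource> visited) {
        if(visited.contains(halted.getSubject())) {
            return true;
        }
        visited.add(halted.getSubject());
        for(GraphEdge arc : halted.getOutgoingEdges()) {
            GraphEdge match = null;
            for(GraphEdge candidate : complete.getOutgoingEdges()) {
                // sameArc is a hypothetical helper: same source and sink names and the same role, if any.
                if(sameArc(arc, candidate)) {
                    match = candidate;
                    break;
                }
            }
            if(match == null) {
                return false;
            }
            // Follow the matched arcs in both graphs in parallel.
            if(!containedIn(arc.getSink(), match.getSink(), visited)) {
                return false;
            }
        }
        return true;
    }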

As a consequence of what we record when the workflow halts, and as we expected, it turns out that the provenance account given in a halted run of the workflow is indeed contained in the provenance account of a successful run of the workflow.

Optional Query 10
We have re-evaluated our approach to this query, because our initial answer (which we leave posted below for reference) did not take the open world assumption into consideration. By assuming that an artifact represents a user's input whenever it does not causally depend on anything within the given graph, we implicitly assume knowledge about that artifact that is not contained in the graph.

Our new approach to this query is as follows. We assert that each user input is of RDF type User_Input and return only provenance elements of this type.

        Set<Resource> result = new HashSet<Resource>();
        Unifier u = new Unifier();
        u.setColumnNames("artifact");
        u.addPattern("artifact", Rdf.TYPE, PC3Utilities.ns("User_Input"));
        context.perform(u);
        for(Tuple<Resource> tuple : u.getResult()) {
            System.out.println(tuple);
            result.add(tuple.get(0));
        }
        return result;

Result:

    http://pc3#CSVRootPathResource
    http://pc3#JobIDresource

Our original approach to the query follows below.

To answer this query, we walk the graph down from the final entity produced by the workflow execution (in this case the DisplayPlotProcess) and keep any entities that do not causally depend on any other entity in the graph (i.e. those graph nodes with no outgoing edges).

    // Walk the graph from node n, collecting every reachable node that has no outgoing edges
    // (i.e. that does not causally depend on anything else in the graph).
    void findInputs(GraphNode n, Set<Resource> beenThere, Set<Resource> result) throws ModelException, IOException, OperatorException {
        Resource subject = n.getSubject();
        if(beenThere.contains(subject)) {
            return;
        }
        beenThere.add(subject);
        for(GraphEdge edge : n.getOutgoingEdges()) {
            if(edge.getSink().getOutgoingEdges().isEmpty()) {
                result.add(edge.getSink().getSubject());
            }
            findInputs(edge.getSink(), beenThere, result);
        }
    }

Result:

    http://pc3#CSVRootPathResource
    http://pc3#JobIDresource

Optional Query 11
We approached this query by first answering Optional Query 10; then, for each entity in the result of that query, we retrieved all used arcs terminating at that entity.
        ProvenanceContextFacade pcf = new ProvenanceContextFacade();
        Set<Resource> result = new HashSet<Resource>();
        pcf.setContext(context);
        Collection<ProvenanceUsedArc> arcs = new HashSet<ProvenanceUsedArc>();
        for(Resource input : userInputs) {
            arcs = pcf.getUsedBy(pcf.getArtifact(input));
            for(ProvenanceUsedArc arc : arcs) {
                result.add(((RdfProvenanceElement)arc.getProcess()).getSubject());
            }
        }
        for(Resource r : result) {
            System.out.println(r);
        }
Result:
    http://pc3#IsCSVReadyFileExistsProcess
    http://pc3#CreateEmptyLoadDBProcess
    http://pc3#ReadCSVReadyFileProcess

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

Conclusions

-- RobertClark - 11 May 2009

-- RobertClark - 13 Apr 2009

-- RobertClark - 04 Mar 2009

Attachment         | Size    | Date                | Who
dot.gif            | 519.1 K | 10 Apr 2009 - 19:59 | RobertClark
opmExample.txt     |   2.3 K | 27 Apr 2009 - 15:32 | RobertClark
loopIteration.txt  |   6.0 K | 27 Apr 2009 - 16:28 | RobertClark
J062941_output.rdf |  71.5 K | 11 May 2009 - 14:05 | RobertClark
J609241_output.xml |  49.2 K | 11 May 2009 - 14:06 | RobertClark
J062942_output.rdf |  71.0 K | 11 May 2009 - 14:06 | RobertClark
J609242_output.xml |  48.9 K | 11 May 2009 - 14:06 | RobertClark
J062943_output.rdf |  70.9 K | 11 May 2009 - 14:07 | RobertClark
J609243_output.xml |  48.9 K | 11 May 2009 - 14:07 | RobertClark
J062944_output.rdf |  70.9 K | 11 May 2009 - 14:07 | RobertClark
J609244_output.xml |  48.9 K | 11 May 2009 - 14:07 | RobertClark
J062945_output.rdf |  71.0 K | 11 May 2009 - 14:08 | RobertClark
J069245_output.xml |  48.9 K | 11 May 2009 - 14:10 | RobertClark
PC3Utilities.java  |   4.3 K | 11 May 2009 - 18:41 | RobertClark
