Skip to topic | Skip to bottom

Provenance Challenge

Challenge
Challenge.SwiftPc3

Start of topic | Skip to actions

Provenance Challenge: Swift

Participating Team

Team and Project Details

  • Short team name: Swift
  • Participant names: Ben Clifford (ci.uchicago.edu); Luiz M. R. Gadelha Jr. (cos.ufrj.br)
  • Project URL: http://www.ci.uchicago.edu/swift/
  • Project Overview: Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.
  • Relevant Publications:

Workflow Representation

There is a directory in the Swift repository at http://trac.ci.uchicago.edu/swift/browser/SwiftApps/pc3 where the Swift representation of the challenge workflow lives.

Open Provenance Model Output

The latest published OPM output for the challenge workflow is at http://www.ci.uchicago.edu/~benc/opm-20090519.xml

Older OPM output for the challenge workflow is at http://www.ci.uchicago.edu/~benc/opm-20090428.xml

Older more broken and less informative output: http://www.ci.uchicago.edu/~benc/opm-20090419.xml corresponding to the version of the workflow in SVN r2724

Query Results

  • core query 1: For a given detection, which CSV files contributed to it?

    Swift does not record database provenance data, so we used annotations in the application database to record which process inserted or modified each row.

    Suppose we want to determine the provenance of the detection that has the identifier 261887481030000003, the first query can be answered in the following manner:

    • obtain from the annotations in the application db the process responsible for inserting the detection in the application DB:

      select
        provenanceid
      from
        ipaw.p2detectionprov
      where
        detectid = 261887481030000003;
      
      tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1008-q4dpcm28:ps_load_executable_db_app-b2bclgaj
      

    • obtain the respective execute_id:

      select
        execute_id
      from
        execute2s
      where 
        id = 'tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1140-z7ebbrz0:ps_load_executable_db_app-8d52pgaj';
      
      tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0
      

    • select, among the ancestors (both artifacts and processes) of the execution, the datasets that contain CSV inputs:

      select
        filename
      from
        trans, dataset_filenames
      where
        after='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0' and
        before=dataset_id and filename like '%split%';
      
      file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-3
      file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-2
      file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-1
      
    • the filenames returned contain references to the CSV files that will result in the detection id row insertion:
      P2_J062941_B001_P2fits0_20081115_P2Detection.csv,
      P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv,
      P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv
      

  • core query 2: The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges?) performed for this table?

    q2.sh in the SVN runs this query.

    The guts of the query is this SQL:

    select 
      dataset_values.value
    from
      processes, invocation_procedure_names,  dataset_usage, dataset_values
    where
      type='compound' and
      procedure_name='is_match_table_column_ranges' and
      dataset_usage.direction='O' and
      dataset_usage.param_name='inputcontent' and
      processes.id = invocation_procedure_names.execute_id and
      dataset_usage.process_id = processes.id and
      dataset_usage.dataset_id = dataset_values.dataset_id;
    

    This returns the input parameter XML for all is_match_table_column_ranges calls. These are XML values, and it is necessary to examine the resulting XML to determine if it was invoked for the specific table. There is unpleasant cross-format joining necessary here to get an actual yes/no result properly, although probably could use a LIKE clause to peek inside the value.

  • query 3: Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

    This uses the additional annotations made for query 1. These annotations only store which process originally inserted a row, not which processes have modified a row. So to some extent, rows are regarded a bit like artifacts (though not first order artifacts in the provenance database); and we can only answer questions about the provenance of rows, not the individual fields within those rows. That is sufficient for this query, though.

    First find the row that contains the interesting value and extract its IMAGEID. Then find the process that created the image ID by querying the derby database table P2IMAGEPROV:

    ij> select * from ipaw.p2imageprov where imageid=6294301;
    IMAGEID             |PROVENANCEID                                                
    -----------------------------------------------------------------------------------------------------------------------------------------------------
    6294301             |tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090519-2057d8dyi9o9:ps_load_executable_db_app-dpc8q1bj
    

    Now we have a process ID for the process that created the row.

    Now query the transitive closure table for all predecessors for that process (as in q1). This will produce all processes and artifacts that preceeded this row creation.

    Our answer differs from the sample answer because we have sequenced access to the db, rather than regarding each row as a proper first-order artifact. The entire DB state at a particular time is a successor to all previous database accessing operations, so any process which led to any database access before the row in question is regarded as a necessary operations. This is undesirable in some respects, but desirable in others. For example, a row insert only works because previous database operations which inserted other rows did not insert a conflicting primary key - so there is data dependency between the different operations even though they operate on different rows.

  • Optional query 1: The workflow halts due to failing an IsMatchTableColumnRanges? check. How many tables successfully loaded before the workflow halted due to a failed check?

    This counts how many load processes are known to the database (over all recorded workflows)

    select count(*) from invocation_procedure_names where procedure_name='load_csv_file_into_table';

    This can be restricted to a particular workflow run like this:

    sqlite> select count(process_id) from invocation_procedure_names,processes_in_workflows where procedure_name='load_csv_file_into_table' and workflow_id='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090519-1659-jqc5od2f:run' and invocation_procedure_names.execute_id = processes_in_workflows.process_id; 3

  • Optional Query 2: Which pairs of procedures in the workflow could be swapped and the same result still be obtained (given the particular data input)?

    In our Swift representation of the workflow, we control flow dataflow dependencies. So many of the activities that could be commuted are in our implementation run in parallel. One significant thing can't describe in SwiftScript? (and so cannot answer from the provenance database using this method) is commuting operations on the database. From a Swift perspective, this is a limitation of our SwiftScript? language rather than in the provenance implementation, benc thinks.

    The query lists which pairs unix process executions (of which there are 50x50) have no data dependencies on each other. There are 2082 rows. The base SQL query is this:

    select L.id, R.id from processes as L, processes as R where L.type='execute' and R.type='execute' and NOT EXISTS (select * from trans where before=L.id and after=R.id);

    This answer is deficient in a few ways. We do not take into account non-execute procedures (such as compound procedures, function invocations, and operator executions) - there are 253 processes in total, 50 being executes and the remaineder being the other kinds of process. If we did that naively, we would not take into account compound procedures which contain other procedures (due to lack of decent support for nested processes - something like OPM accounts) and would come up with commutations which do not make sense.

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

To v1.01.a.xml schema definition, make time be represented by dataTime XSD type. We already do this in our OPM above, meaning that it should not validate with v1.01.a.xml.

Better support for hierarchies - of collections of artifacts; and of processes. Swift makes heavy use of both in its input language, SwiftScript?, though this information is not recorded explicitly in the provenance database.

Conclusions


to top

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback