Skip to topic | Skip to bottom

Provenance Challenge

Challenge
Challenge.ProvenanceQuestionsPc3

Start of topic | Skip to actions

Provenance Challenge 3 Questions

There are two sets of queries: core queries and optional queries. The core queries are those that all teams should try to answer. Optional queries are additional provenance questions that teams may answer. If a team has any new queries, please add these to the optional queries section.

Core Queries

Query 1

For a given detection, which CSV files contributed to it?

Basic sample answer: The CSV file containing the Detection table.

Advanced sample answer: The CSV file containing the Detection table, CSV file containing the Image table (as the image is an attribute of the detection), and CSV file containing the FrameMetadata table (as the frame metadata is an attribute of the image).

Query 1 Detailed

This is a more specific detailing of Query 1 what the query is, what results are to be expected and what evidence exists that provenance was determined.

Query:

For a given detection described in the database and identified by a detection id, which CSV files contributed to that description? By description, we mean the rows in the database related to a detection with a given detection id.

1. Inputs

The inputs to this query are a detection id and a database. A detection id maps to the DETECTID column in the P2DETECTION table inside a given database.

detection id: 261887437030025144

database: J062941_LoadDB

2a. Answer

The file name of the P2Detection CSV file that contained the detection id.

2b. Advanced Answer

- The file name of the P2Detection CSV file containing the detection id, - the file name of the ImageMeta? CSV file containing the image id (IMAGEID) associated to the detection. - the file name of the FrameMeta? CSV file containing the frame id (FRAME ID) that is associated with the image id that is in turn associated with the detection.

The following are just examples of how queries could be answered via the traversal of a provenance graph. Systems may answer these queries using a different approach.

3a. Evidence of provenance traversal

We would want to see the following:

  1. the detection with the given id is described by a row, R, in the P2DETECTION table in the database.
  2. R was caused by modifications of the P2DETECTION table by the UpdateComputedColumn? component.
  3. the invocation of UpdateComputedColumn? was caused by the existence of a CSVFileEntry?
  4. Additionally, R was caused by LoadCSVFileIntoTable? component
  5. the invocation of LoadCSVFileIntoTable? was cause by the existence of a CSVFileEntry?, the same one in step 2.
  6. The CVSFileEntry? was caused by the existence of a file name

3b. Evidence of provenance traversal

  1. the detection with the given id is described by a row, R, in the P2DETECTION table in the database.
  2. row R contains a foreign key IMAGE ID that points to a row, I, in the P2IMAGEMETA table.
  3. row I in turn contains a foreign key FRAME ID that points to a row, F, in the P2FRAMEMETA table.
  4. then for each of the rows, R, I, F we would want to see the same evidence as in the answer to 3a.

4a and 4b below are a particular OPM example that describe evidence of a provenance traversal.

4a. Evidence of provenance traversal

  1. the detection with the given id is described by a row, artifact R, in the P2DETECTION table in the database.
  2. R wasGeneratedBy modifications of the P2DETECTION table by the UpdateComputedColumn? process.
  3. R wasDerivedFrom R^. R wasDerivedFrom a CVSFileEntry? artifact, F
  4. the invocation of UpdateComputedColumn? Used artifact F and artifact R^
  5. Additionally, artifact R^ wasGeneratedBy LoadCSVFileIntoTable? process 6. the invocation of LoadCSVFileIntoTable? process Used artifact F
  6. R^ wasDerivedFrom F
  7. F wasDerivedFrom a file name artifact

4b. Evidence of provenance traversal

  1. the detection with the given id is described by a row, artifact R, in the P2DETECTION table in the database.
  2. R wasDerivedFrom a foreign key IMAGE ID, which wasDerivedFrom a row, artifact I, in the P2IMAGEMETA table.
  3. row I in turn wasDerivedFrom a foreign key FRAME ID that wasDerivedFrom a row, artifact F, in the P2FRAMEMETA table.
  4. then for each of the artifacts, R, I, F we would want to see the same evidence as in the answer to 4a.

Query 2

The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?

Sample answer: Yes

Query 3

Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

Sample answer: call of ReadCSVReadyFile, call of CreateEmptyLoadDB, 2nd call of ReadCSVFileColumnNames, 2nd call of LoadCSVFileIntoTable (2nd calls because Image is loaded in the 2nd iteration of the for loop, excluded checks because they do not change anything, excluded UpdatedComputedColumns because it is non-computed, excluded CompactDatabase because it does not affect the value).

Query 4: "Where Provenance" See Query 1

_This entry in the database seems incorrect. Identify the csv file (and location) where the data occurs._

This query is similar to Query 1. So we have removed it as a core query.

Query 5 (Based on new Plotting Workflow)

For a given zoneGroup the user selects in the plot image in the DisplayPlot activity, find the CSV files that contains that data?

The P2Detection CSV file used to load the P2Detection table on which the query for the plot data was run. E.g. If the user selects zoneGroup 1108 in the plot for Job ID J062942, the query returns P2_J062942_B001_P2fits0_20081115_P2Detection.csv.

Optional Queries

Optional Query 1

The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?

Sample answer: 2

Optional Query 2

Which pairs of procedures in the workflow could be swapped and the same result still be obtained (given the particular data input)?

Sample answer: (I won't enumerate them all, but I think some can be swapped as the checks in particular are not causally dependent, but we cannot swap those inside the loop with those outside).

Optional Query 3

A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile? test (when the file had been deleted)?

Sample answer: 3ms

For testing the above query, we it may be simplest to edit the workflow to include deletion of the CSV file as a step.

Optional Query 4: "Why Provenance"

Why is this entry in the database?

Optional Query 5

A user executes the workflow many times (say 5 times) over different sets of data (j062941, j062942, ... j062945). He wants to determine, which of the execution halted?

Sample answer: workflow_execution_id along with input_data_set_name (say j062941, j062942)

Optional Query 6

Determine the step where halt occured?

Sample answer: 2nd call of ReadCSVFileColumnNames or 3rd call of LoadCSVFileIntoTable, etc.

Optional Query 7

Determine data and associated granularities of the data being processed, when halt occured?

Sample answer: data_id and associated granularities(column, row, table, file)

Optional Query 8

Which steps were completed successfully before the halt occurred?

Sample answer: steps name along with call (or count) number, if steps were executed multiple times

Optional Query 9

Which steps were not executed because of halt?

Sample answer: steps name along with call (or count) number, that should have been executed in case workflow did not halt

Optional Query 10

For a workflow execution, determine the user inputs?

Sample answer: jobID, CVSRootPath?

Optional Query 11

For a workflow execution, determine steps that required user inputs?

Sample answer: IsCVSReadyFileExists?, CreateEmptyLoadDB?

Optional Query 12

For a workflow execution that halted, which files where processed successfully?

Sample answer: P2Detection, P2FrameMeta

Optional Query 13

_For a workflow execution, display the following provenance views: data dependency views, step dependency view?

Sample answer: either display these views in form of a directed graph, or as a pair of dependency relations between data or steps.

Optional Query 14 (Based on new Plotting Workflow)

For a zoneGroup the user selects in the plot image, find the ImageMeta? rows that correspond to the data used in the zoneGroup.

Sample answer: If the user selects zoneGroup 1108 in the plot for Job ID J062942, the query will return the result of ==SELECT distinct i.imageID FROM P2Detection AS d, P2ImageMeta AS i WHERE d.imageID = i.imageID and CEILING(zoneId/10) = 1108==

Optional Query 15 (Based on new Plotting Workflow)

A user plots the results of both _DetectionsHistogram and DetectionsHistogramHiQuality for a database and upon comparing the plots, selects a zoneGroup. Find all rows in the P2Detection table that was present in the result of DetectionsHistogram? but not in the result of DetectionsHistogramHiQuality?._

Sample answer: If the user selects zoneGroup 1108 in the plot for Job ID J062942, the query will return the set of rows that are present in the result of query (A) but not in the result of query (B):

(A) SELECT * FROM P2Detection  WHERE ceiling(zoneId/10) = 1108

(B) SELECT * FROM P2Detection  WHERE raErr < 0.1 and decErr < 0.05  and ceiling(zoneId/10) = 1108

-- LucMoreau - 19 Mar 2009

-- PaulGroth - 02 Apr 2009

-- ManishAnand - 03 Apr 2009

-- YogeshSimmhan - 09 Apr 2009
to top


You are here: Challenge > ThirdProvenanceChallenge > ProvenanceQuestionsPc3

to top

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback