Skip to topic | Skip to bottom

Provenance Challenge

Challenge
Challenge.UPenn

Start of topic | Skip to actions

Provenance Challenge Template

Participating Team

  • Short team name: ZOOM, Database Group, University of Pennsylvania
  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Olivier Biton and Thunyarat (Bam) Amornpetchkul
  • Project URL: http://db.cis.upenn.edu/research/provwf.html
  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is both simple and general (i.e. can be used with existing workflow systems, such as Ptolemy/Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, the proposed model not only takes into account the chained and complex structures of scientific workflows, but also allows for reasoning about provenance at different levels of abstraction through user views.
  • Provenance-specific Overview: See above.
  • Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences, Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.

Workflow Representation

Our aim is to provide a framework to represent minimal information necessary such that we can reason about provenance in scientific workflows. With this in mind, we distinguish between the specification of the workflow and the execution of the workflow. On the one hand, the workflow specification is composed of step-classes (alternatively called procedures, tasks, actors, processes, boxes) such as reslice in the challenge workflow. On the other hand, an execution of a workflow generates a partial order of steps, each of which has a set of input and output data objects. Each step is an instance of a step-class, and the input-output flow of data and class associated with each step must conform to the workflow specification. An example of step in the challenge workflow is “8.reslice” which is an instance of the step-class “reslice”. The proposed model of provenance for workflows can represent the relationship between a step-class (specification) and a step (execution). This is necessary for keeping track of the data and parameters used by each step of a workflow execution as well as to trace the data produced. The model is represented within a relational framework extended with transitive closure. It is implemented under Oracle 10.g and Java. An object layer together with a user interface (JDBC component) is provided. More details on our model and implementation of the Challenge workflow are provided in the following section.

Provenance Trace

Our model is composed of base tables which provide minimal information about the workflow execution, the workflow specification, and the relationships between the specification and execution. The model is represented though the E/R diagram below. The tables IMMCONTAINS, IMMSTEPCONTAINS and USERVIEW will be defined in the section Worflow Variants. The other tables and their schemas are listed below. We list the type of each column (VARCHAR are strings) and the primary keys (in bold). For each table, we give an example based on the Challenge workflow.

er.jpg

Figure 1 – E/R schema for the model of Provenance

Note: The content of the tables is provided at the end of this page (see images tablesData-1.jpg to tablesData-4.jpg).

  • Data
Data (ID int, name string, type string) relates the unique identifier of a data item to its name and its type.

As an example, the data whose ID is 1 has the name “Anatomy Image1” and is of type Anatomy Image.

       DATAID               NAME                 TYPE   
-----------------------------------------------------              
1                    Anatomy Image1       Anatomy Image

  • dataAttributes
dataAttributes (dataId string, attribute string, value string) contains the annotations of each data item. There can be several pairs of keys and values annotated with each data item. As an example, Anatomy Image1, whose ID is 1, is annotated with a key word “center” and value “UChicago.” In other words, center= UChicago associated with DATAID 1, is represented in our model in the following way:
      DATAID               ATTRIBUTE            VALUE   
-----------------------------------------------------  
1                    center               UChicago

  • stepParam
stepParam(step int, attribute string, value string) stores information about the parameters used by a step. step is the id of the step, attribute represents the category of the step, and value represents the description of the category.

As an example, the table below expresses that the step whose id is 1 takes a nonlinear (linear=false), 12th order (order=12) parameter, and whose model is 1365 (model=1365). This parameter can be invoked by using “-m 12 –q” (description=-m 12 –q).

      STEP                 ATTRIBUTE            VALUE    
-----------------------------------------------------             
1                    linear               false                
1                    order                12                   
1                    model                1365                 
1                    description          -m 12 -q     

  • input
input(step int, dataId int, ts Date) stores information about an input to a step and the timestamp it occurred.

As an example, Step 2 (2.align_warp) takes as inputs data whose ID is 3 (Anatomy Image2), 4 (Anatomy Header2), 9 (Reference Image), and 10 (Reference Header). Each input in the example was taken on the same date, 8/8/2006.

STEP                 DATAID               TS   
-----------------------------------------------------                
2                    3                    8/8/2006            
2                    4                    8/8/2006            
2                    9                    8/8/2006            
2                    10                   8/8/2006

  • output
output(step int, dataId int, ts Date) stores information about an output from a step and the timestamp it occurred.

As an example, Step 10 (10. slicer) produced one output with a DATAID of 25 (Atlas X Slice) on 08/17/2006.

STEP                 DATAID               TS                  
-----------------------------------------------------  
10                   25                   8/17/2006   

  • instanceOf
instanceOf (step int, step-class string, ts) defines the relationship between a step-class and a step of which it is an execution, and contains the execution time of the step.

As an example, steps 1, 2, 3, and 4 are executions of the step-class “align_warp.”

    STEP                 STEPCLASS            TS    
-----------------------------------------------------               
1                    align_warp           8/7/2006            
2                    align_warp           8/8/2006            
3                    align_warp           8/10/2006           
4                    align_warp           8/11/2006           

  • stageInstance
Note: The notion of stage was not initially in our model and we think it needs to be better defined. We will go back to this point. However, we added it to be able to answer some of the queries in the challenge.

stageInstance(step string, stage string) stores the relationship between a stage and a step. As an example, steps 1 through 4 are in stage 1.

STEP                 STAGE
-----------------------------------------------------       
1                    1          
2                    1          
3                    1          
4                    1       
  • Process
This table is automatically generated from the previous tables. It represents the details of each execution including the step, step-class, input, output, parameter, and starting time.
CREATE VIEW Process 
AS 
    SELECT DISTINCT 
          instanceOf.step      step, 
          instanceOf.stepClass stepClass, 
          input.dataId         input, 
          dataInput.name       inputName,
          output.dataId        output, 
          dataOutput.name      outputName,
          input.ts             time 
     FROM 
           instanceOf, 
           input, 
           data dataInput,
           output,
           data dataOutput
     WHERE instanceOf.step = input.step 
       AND input.dataId = dataInput.dataId
       AND input.step = output.step 
       AND output.dataId = dataOutput.dataId
       AND input.ts <= output.ts
As an example, step 13 is an execution of the step-class convert. It takes an input of DATAID 25 (Atlas X Slice) and was executed on 08/20, 2006. It does not take any parameters. It produces the output of DATAID 28 (Atlas X Graphic).
   STEP STEPCLASS  INPUT  INPUTNAME    OUTPUT OUTPUTNAME   TIME 
---------------------------------------------------------------- 
  13   convert    25     Atlas X Slice 28  Atlas X Graphic 8/20/2006 

Provenance Queries

Teams Queries
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
UPenn team thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up smile thumbs up thumbs up

For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our Java user interface).

  • Query1
    • Query: Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName 
     FROM process 
     START WITH output = 
              (
               SELECT dataId FROM data WHERE name = 'Atlas X Graphic'
               ) 
     CONNECT BY PRIOR input = output 
     ORDER BY step
    • Note: This query exploits the Connect by prior operator of Oracle which implements the transitive closure function. Transitive closure is necessary for returning all the data (recursively) that have been used to compute Atlas X Graphic. Basically, we use the Process table and search reccursively for the data used (output produced and input used, by mean of "connect by prior input=output").
    • Result (SQL):
STEP STEPCLASS   INPUT    INPUTNAME            OUTPUT OUTPUTNAME         
----------------------------------------------------------------  
1    align_warp  1        Anatomy Image1       11     Warp Parameters1    
1    align_warp  10       Reference Header     11     Warp Parameters1    
1    align_warp  2        Anatomy Header1      11     Warp Parameters1    
1    align_warp  9        Reference Image      11     Warp Parameters1    
10   slicer      23       Atlas Image          25     Atlas X Slice       
10   slicer      24       Atlas Header         25     Atlas X Slice       
13   convert     25       Atlas X Slice        28     Atlas X Graphic     
2    align_warp  10       Reference Header     12     Warp Parameters2    
2    align_warp  3        Anatomy Image2       12     Warp Parameters2    
2    align_warp  4        Anatomy Header2      12     Warp Parameters2    
2    align_warp  9        Reference Image      12     Warp Parameters2    
3    align_warp  10       Reference Header     13     Warp Parameters3    
3    align_warp  5        Anatomy Image3       13     Warp Parameters3    
3    align_warp  6        Anatomy Header3      13     Warp Parameters3    
3    align_warp  9        Reference Image      13     Warp Parameters3    
4    align_warp  10       Reference Header     14     Warp Parameters4    
4    align_warp  7        Anatomy Image4       14     Warp Parameters4    
4    align_warp  8        Anatomy Header4      14     Warp Parameters4    
4    align_warp  9        Reference Image      14     Warp Parameters4    
5    reslice     11       Warp Parameters1     15     Resliced Image1     
5    reslice     11       Warp Parameters1     16     Resliced Header1    
6    reslice     12       Warp Parameters2     17     Resliced Image2     
6    reslice     12       Warp Parameters2     18     Resliced Header2    
7    reslice     13       Warp Parameters3     19     Resliced Image3     
7    reslice     13       Warp Parameters3     20     Resliced Header3    
8    reslice     14       Warp Parameters4     21     Resliced Image4     
8    reslice     14       Warp Parameters4     22     Resliced Header4    
9    softmean    15       Resliced Image1      23     Atlas Image         
9    softmean    15       Resliced Image1      24     Atlas Header        
9    softmean    16       Resliced Header1     23     Atlas Image         
9    softmean    16       Resliced Header1     24     Atlas Header        
9    softmean    17       Resliced Image2      23     Atlas Image         
9    softmean    17       Resliced Image2      24     Atlas Header        
9    softmean    18       Resliced Header2     23     Atlas Image         
9    softmean    18       Resliced Header2     24     Atlas Header        
9    softmean    19       Resliced Image3      23     Atlas Image         
9    softmean    19       Resliced Image3      24     Atlas Header        
9    softmean    20       Resliced Header3     23     Atlas Image         
9    softmean    20       Resliced Header3     24     Atlas Header        
9    softmean    21       Resliced Image4      23     Atlas Image         
9    softmean    21       Resliced Image4      24     Atlas Header        
9    softmean    22       Resliced Header4     23     Atlas Image         
9    softmean    22       Resliced Header4     24     Atlas Header    
The results can also be visualized within the user-friendly interface. On the top, the query is given by the user, and on the bottom the result (provenance) is represented through a graph. The user can access to the characteristics of a given step or data item by clicking on it (visualized on the right hand side).
    • Result (user friendly view): query1.jpg

* Query2

    • Query: Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName            
     FROM Process                                                                
     START WITH output = (                                                       
               SELECT dataId FROM data WHERE name = 'Atlas X Graphic'            
          )                                                                      
     CONNECT BY PRIOR input = output                                             
 MINUS (                                                                         
     SELECT DISTINCT step, stepClass, input, inputName, output, outputName       
          FROM Process                                                           
               START WITH output IN (                                            
                    SELECT input FROM Process WHERE stepClass = 'softmean'       
               )                                                                 
     CONNECT BY PRIOR input = output                                             
 )                                                                              
    • Note: We follow the same process as previously but we remove (Minus operator) all the data used by "softmean".
    • Result (SQL):
STEP    STEPCLASS   INPUT INPUTNAME            OUTPUT OUTPUTNAME   
----------------------------------------------------------------     
10      slicer      23    Atlas Image          25     Atlas X Slice   
10      slicer      24    Atlas Header         25     Atlas X Slice   
13      convert     25    Atlas X Slice        28     Atlas X Graphic 
9       softmean    15    Resliced Image1      23     Atlas Image     
9       softmean    15    Resliced Image1      24     Atlas Header    
9       softmean    16    Resliced Header1     23     Atlas Image     
9       softmean    16    Resliced Header1     24     Atlas Header    
9       softmean    17    Resliced Image2      23     Atlas Image     
9       softmean    17    Resliced Image2      24     Atlas Header    
9       softmean    18    Resliced Header2     23     Atlas Image     
9       softmean    18    Resliced Header2     24     Atlas Header    
9       softmean    19    Resliced Image3      23     Atlas Image     
9       softmean    19    Resliced Image3      24     Atlas Header    
9       softmean    20    Resliced Header3     23     Atlas Image     
9       softmean    20    Resliced Header3     24     Atlas Header    
9       softmean    21    Resliced Image4      23     Atlas Image     
9       softmean    21    Resliced Image4      24     Atlas Header    
9       softmean    22    Resliced Header4     23     Atlas Image     
9       softmean    22    Resliced Header4     24     Atlas Header    
    • Result (User friendly view): query2.jpg

  • Query3
    • Query: Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName 
     FROM process                                                     
     WHERE step IN (                                                  
          SELECT step FROM StageInstance                              
          WHERE (                                                     
               stage in (3,4,5)                                       
          )                                                           
     )                                                                
     START WITH output = (                                            
           SELECT dataId FROM data WHERE name = 'Atlas X Graphic'     
     )                                                                
     CONNECT BY PRIOR input = output                                 
    • Note: We follow the same idea than in query1 while focusing on stages 3, 4, and 5.
    • Result (SQL):
STEP  STEPCLASS INPUT INPUTNAME         OUTPUT  OUTPUTNAME     
----------------------------------------------------------------       
9     softmean  19    Resliced Image3   23      Atlas Image          
9     softmean  21    Resliced Image4   23      Atlas Image          
9     softmean  17    Resliced Image2   24      Atlas Header         
9     softmean  20    Resliced Header3  24      Atlas Header         
9     softmean  15    Resliced Image1   24      Atlas Header         
9     softmean  19    Resliced Image3   24      Atlas Header         
13    convert   25    Atlas X Slice     28      Atlas X Graphic      
9     softmean  15    Resliced Image1   23      Atlas Image          
9     softmean  18    Resliced Header2  23      Atlas Image          
9     softmean  16    Resliced Header1  23      Atlas Image          
9     softmean  18    Resliced Header2  24      Atlas Header         
9     softmean  17    Resliced Image2   23      Atlas Image          
10    slicer    23    Atlas Image       25      Atlas X Slice        
9     softmean  22    Resliced Header4  23      Atlas Image          
9     softmean  16    Resliced Header1  24      Atlas Header         
9     softmean  22    Resliced Header4  24      Atlas Header         
9     softmean  20    Resliced Header3  23      Atlas Image          
10    slicer    24    Atlas Header      25      Atlas X Slice        
9     softmean  21    Resliced Image4   24      Atlas Header         
    • Result (User friendly view):
Query3.jpg
    • Note: The same results as query2 have been found. Query 2 and 3 are synonyms!

  • Query4
    • Query: Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
SELECT step, stepClass, input, inputName, output, outputName, time    
     FROM process                                                     
     WHERE stepClass = 'align_warp'                                   
     AND EXISTS                                                       
        (                                                             
       SELECT value                                                   
         FROM stepParams                                              
       WHERE stepParams.step = process.step                           
         AND attribute='order'                                        
         AND value='12'                                               
       )                                                              
     AND EXISTS                                                       
        (                                                             
       SELECT value                                                   
         FROM stepParams                                              
       WHERE stepParams.step = process.step                           
         AND attribute='model'                                        
         AND value='1365'                                             
       )                                                              
    AND rtrim(to_char(time, 'DAY')) = 'MONDAY'                       
    • Note: We search for information about the step-class "align-warp" (first "Where" clause) such that there is a step whose parameter is "order=12" and "model=1365".
    • Result (SQL):
STEP  STEPCLASS    INPUT  INPUTNAME         OUTPUT  OUTPUTNAME        TIME 
----------------------------------------------------------------      
1     align_warp   1      Anatomy Image1    11      Warp Parameters1  8/7/2006  
1     align_warp   2      Anatomy Header1   11      Warp Parameters1  8/7/2006  
1     align_warp   10     Reference Header  11      Warp Parameters1  8/7/2006  
1     align_warp   9      Reference Image   11      Warp Parameters1  8/7/2006  
    • Result (User friendly view):

query4.jpg

  • Query5
    • Query: Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.
SELECT name                                                             
  FROM data                                                             
  WHERE dataId IN                                                       
          (                                                             
        SELECT output                                                   
          FROM process                                                  
          START WITH input IN (                                         
                    SELECT data.dataId                                  
                      FROM data, dataAttributes                         
                      WHERE data.dataId = dataAttributes.dataId         
                        AND data.type = 'Anatomy Header'                
                      AND dataAttributes.attribute = 'global maximum'   
                      AND dataAttributes.value = '4095'                 
                   )                                                    
          CONNECT BY PRIOR output=input                                 
         )                                                              
  AND type = 'Atlas Graphic'            
  • Note: The dataAttributes table is used to find all Anatomy header data whose entry 'global maxiamum' is equal to 4095. Then, the same precess as previously is used (i.e. looking recursively for data).
  • Result (SQL):
NAME 
---------------------             
Atlas X Graphic      
Atlas Y Graphic      
Atlas Z Graphic

  • Query6
    • Query: _Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align-warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align-warp procedure with argument -m 12." _
SELECT distinct pred.orig,  data.name               
  FROM                                              
       (                                            
    SELECT distinct CONNECT_BY_ROOT step orig, step 
      FROM process                                  
      START WITH stepClass='softmean'               
      CONNECT BY PRIOR input=output                 
     ) pred,                                        
     output,                                        
     data                                           
  WHERE                                             
    EXISTS (                                        
              SELECT *                              
          FROM stepParams                           
                WHERE attribute = 'order'           
                AND value = 12                      
              AND stepParams.step = pred.step       
           )                                        
     AND EXISTS                                     
         (                                          
              SELECT *                              
          FROM stepParams                           
                WHERE attribute = 'model'           
          AND value = 1365                          
          AND stepParams.step = pred.step           
           )                                        
  AND pred.orig = output.step                       
  AND output.dataId = data.dataId                  
Note that we understand "-m12" as a search for parameter whose order is equal to 12. We could also have chosen to search for a parameter whose description was like "-m 12".

    • Result (SQL):
ORIG                 NAME  
---------------------------------               
9                    Atlas Header         
9                    Atlas Image          

  • Query7
    • Query: A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
    • Note: Computing the difference between two workflows or two workflow executions will be possible in our next implementation. Query 7 is a very simple example of the problem. However, when the difference between the data used by a step or produced by a step is considered, the problem becomes much harder. We would be very interested in talking about this problem with the workshop attendees.

  • Query8
    • Query: A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align-warp where the inputs are annotated with center=UChicago.
SELECT distinct data.name                      
  FROM process, dataAttributes, data           
  WHERE process.stepClass='align_warp'         
    AND process.input = dataAttributes.dataId  
  AND dataAttributes.attribute = 'center'      
  AND dataAttributes.value = 'UChicago'        
  AND process.output = data.dataId            
    • Note: This query basically consists in a join between dataAttributes (which contains information about annotations), Process, and Data.
    • Result:
NAME     
------------------            
Warp Parameters2     
Warp Parameters1  

  • Query9
    • Query: A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
SELECT distinct data.dataId, data.name, attribute, value                      
  FROM data, dataAttributes, attribute, value                                 
  WHERE                                                     
    data.type='Atlas Graphic'                               
  AND data.dataId = dataAttributes.dataId                   
  AND dataAttributes.attribute = 'studyModality'            
  AND dataAttributes.value IN ('speech', 'visual', 'audio')
    • Note: Once again, this query consists in finding annotations in the dataAttributes table and to make joins to get other information.
    • Result:
DATAID               NAME                 ATTRIBUTE            VALUE   
---------------------------------------------------------------------             
29                   Atlas Y Graphic      studyModality        audio                
29                   Atlas Y Graphic      studyModality        visual               
30                   Atlas Z Graphic      studyModality        speech  

Suggested Workflow Variants

The model we propose takes into account the capabilities of many of the current scientific workflow systems; one such capability is the idea of composite step-classes, that is, step-classes which are composed of step-classes. There are several reasons why composite step classes are useful in workflows. First, they help to manage the complexity of a workflow; users may wish to focus on a certain level of abstraction and ignore the lower levels of detail. Second, composite step-classes can also represent levels of ``authorization''; users without the appropriate clearance level would not be allowed to see the lower level executions of a step-class.

Typically, an input to a composite step-class is also an input to one or more of its substep classes while an output of a substep class is either an input to another substep class or becomes the output of a composite step class. Also, composite step-classes define a partition of the step-classes of the workflow (there is no step-class which belongs to multiple composite step-classes).

We have therefore defined the following notion of user views:

Given a workflow specification, the user view of a user (or class of users) U, userView(U), is the set of lowest-level step classes that U is entitled to see.

Note that a user view cannot contain two step classes such that one is contained in the other. We also assume that the user view is valid, i.e. that each of the highest-level step classes in the workflow specification is either in the view, or that at some lower-level all of its contained substeps are in the user view.

To answer questions of provenance, we must take the user view into account and reason about the input and output to steps which are instances of step-classes that are in the user view. We must know the containment relationship between step-classes and the relationship between each user and the step-classes he is entitled to see.

In the following, we introduce the part of our model which deals with user views and we will give examples based on a variation of the provenance workflow in the challenge. In the following discussion, we refer to composite steps as “boxes”.

wf.png

Figure 2 – Workflow for the challenge with user views (boxes 1 to 3)

UserView Application

In this section, we introduce some additional tables which are needed for reasoning about provenance in the context of user views:

  • immContains
immContains(compoStepClass string, stepClass string) indicates the immediate containment between step-classes.

As an example, box 1 covers all the step-class align_warp and reslice, and box3 contains box1.

COMPOSTEPCLASS       STEPCLASS 
---------------------------------           
box1                 align_warp           
box1                 reslice
box3                 box1

  • contains
For each step class, contains gives all the step classes it transitively contains.
CREATE VIEW
AS  
   SELECT DISTINCT CONNECT_BY_ROOT compoStepClass as compoStepClass, stepClass
     FROM immcontains
     CONNECT BY PRIOR stepClass=compoStepClass
     ORDER BY compoStepClass

COMPOSTEPCLASS       STEPCLASS 
---------------------------------  
box1                 align_warp           
box1                 reslice
box3                 box1         
box3                 align_warp           
box3                 reslice

  • instanceOf
The table instanceOf contains information for composite steps too. For example stepBox1-1 is the first instance of the box box1; it will contain the steps 1 and 5 (cf. table immStepContains and view containsStep).

  • immStepContains
This table defines direct steps containment, accordingly with immContains but for instances.

  • containsStep
For each step class, contains gives all the step classes it transitively contains.
CREATE VIEW
AS  
   SELECT DISTINCT CONNECT_BY_ROOT compoStep compoStep, step
     FROM immStepContains
     CONNECT BY PRIOR step=compoStep
     ORDER BY compoStep

COMPOSTEP            STEP                 
---------------------------------
stepBox1-1           1                    
stepBox1-1           5                    
stepBox1-2           2                    
stepBox1-2           6                    
stepBox1-3           3                    
stepBox1-3           7                    
stepBox1-4           4                    
stepBox1-4           8                    
stepBox2-1           10                   
stepBox2-1           13                   
stepBox2-2           11                   
stepBox2-2           14                   
stepBox2-3           12                   
stepBox2-3           15                   
stepBox3-1           1                    
stepBox3-1           10                   
stepBox3-1           13                   
stepBox3-1           5                    
stepBox3-1           9                    
stepBox3-1           stepBox1-1           
stepBox3-1           stepBox2-1           

  • userView
userView(usr string, stepClass string) indicates the lowest step-class level that a user can see. Remember that the union of all the step-classes a user can see must cover the whole workflow (user views are valid).

As an example, the user ‘uBlackBox’ sees the whole workflow as one step. The user ‘uBio’ cannot see the detail of step-class align_warp and reslice separately (he sees box1 instead). He cannot see the details of step-class slicer or convert (he sees box2 instead). However, he can see the step-class softmean. UAdmin can see all step-classes.

USR                  STEPCLASS  
--------------------------------          
uBlackBox            box3                 
uBio                 box1                 
uBio                 softmean             
uBio                 box2                 
uAdmin               align_warp           
uAdmin               reslice              
uAdmin               softmean             
uAdmin               slicer               
uAdmin               convert    

The following two tables, Cinput and Coutput are input and output extended to composite steps.

  • Cinput
Similarly to input, Cinput stores information about inputs to a composite step and the time of the input to the box within the context of a user view. Specifically, an input to a box is an input to one of the steps the box contains, but must not be an output of any step contained in the box.
CREATE VIEW cInput 
AS
    -- First union operand: "For each containment, the input is an input of 
    -- the higher level step class if and only if this input was not an output
    -- from another of its sub steps
    SELECT containsStep.compoStep step, input.dataId, min(input.ts) ts
      FROM input,
           containsStep
      WHERE
           input.step = containsStep.step
           AND NOT EXISTS 
               (
                SELECT o.step
                  FROM output o, containsStep c2
                  WHERE o.dataId = input.dataId
                    AND o.step = c2.step
                    AND c2.compoStep = containsStep.compoStep
               )
      GROUP BY containsStep.compoStep, input.dataId
    -- Second union operand
    -- We want to keep the direct input relations
    UNION 
    SELECT step, dataId, ts
      FROM input;
As an example, the step stepBox2-1 takes DATAID 23 (Atlas Image) and DATAID 24 (Atlas Header) as inputs.
STEP       DATAID  TS    
----------------------------
stepBox2-1 23      8/17/2006           
stepBox2-1 24      8/17/2006           

  • Coutput
Similarly to output, Coutput stores information of outputs produced from each composite step and the time of the output to the box within the context of a user view. An output from a box is an output from one of the steps the box contains, but must not be an input to any step contained in the box.
CREATE VIEW coutput                     
AS                          
  -- First union operand: "For each containment, the output is an output of 
  -- the higher level step class if and only if this output is used as an
  -- input for another step which is not contained in the same step 
  -- or if this is a final output of the workflow 
  SELECT containsStep.compoStep step, output.dataId, output.ts
    FROM output,
         containsStep
    WHERE
          output.step = containsStep.step
    AND EXISTS
        (
         /* Step class is used as an entry of at least one step 
          *  which is not a substep of this step class
          */ 
         SELECT i.step
           FROM input i
           WHERE i.dataId = output.dataId
           AND NOT EXISTS 
             (
              SELECT cont2.compoStep
                FROM containsStep cont2 
              WHERE cont2.step = i.step
                AND cont2.compoStep = containsStep.compoStep
             )
         UNION
         -- Step class is a final output?
         (
            SELECT output.dataId
              FROM dual
            MINUS
            SELECT i.dataId
              FROM input i
         )
        )
  UNION 
  -- Second union operand
  -- We want to keep the direct output relations  
  SELECT step, dataId, ts
    FROM output
As an example, step stepBox2-1 produced data ID 28 (Atlas Graphic X) on 08/20/2006.
STEP         DATAID  TS  
---------------------------- 
stepBox2-1   28      8/20/2006           

  • UProcess
Similarly to Process, UProcess represents the detail of each execution which a user can see.
CREATE VIEW UProcess                                    
AS                                                      
    SELECT DISTINCT 
        userView.usr         usr,
        instanceOf.step      step, 
        instanceOf.stepClass stepClass, 
        input.dataId         input, 
        dataInput.name       inputName,
        output.dataId        output, 
        dataOutput.name      outputName,
        input.ts             time 
     FROM 
        userView,
        instanceOf, 
        cinput input, 
        data dataInput,
        coutput output,
        data dataOutput
     WHERE 
        userView.stepClass = instanceOf.stepClass
       AND instanceOf.step = input.step 
       AND input.dataId = dataInput.dataId
       AND input.step = output.step 
       AND output.dataId = dataOutput.dataId
       AND input.ts <= output.ts
As an example, let us consider three user views and their perspectives on stage 1 of the workflow (align_warp).

The user ‘uAdmin’ can see all the steps of the workflow; and thus can see all data.

The user ‘uBio’ sees align_warp and reslice calls as single boxes (step class box1), thus he cannot see data whose ids are 11 to 14 (Warp Params 1 to 4). He also sees slicer and convert as single boxes (step class box2), thus he cannot see data with ids 25 to 27 (Atlas X Slice, Atlas Y Slice and Atlas Z Slice).

The user ‘uBlackBox’ cannot see any detail of the workflow. He sees everything as a big box which entries are data id 1 to 10 and outputs are data id 28 to 30.

The implications of user views when it comes to query answering, are discussed in the following section.

Suggested Queries

In this section, we answer query 1 considering user views and we introduce another query to emphasize the benefit of our approach.

  • Query1 for each user
    • Query for user “uBlackbox”
      • Query:
SELECT DISTINCT step, stepClass, input, inputName, output, outputName     
     FROM uProcess                                                        
   WHERE usr='uBlackBox'                                                     
     START WITH output = (                                                
               SELECT dataId FROM data WHERE name = 'Atlas X Graphic'     
               )                                                          
     CONNECT BY PRIOR input = output                                      
     ORDER BY step;                                                        
      • Result (SQL):
STEP        STEPCLASS  INPUT   INPUTNAME            OUTPUT  OUTPUTNAME      
stepBox3-1  box3       1       Anatomy Image1       23      Atlas Image     
stepBox3-1  box3       1       Anatomy Image1       24      Atlas Header    
stepBox3-1  box3       1       Anatomy Image1       28      Atlas X Graphic 
stepBox3-1  box3       10      Reference Header     23      Atlas Image     
stepBox3-1  box3       10      Reference Header     24      Atlas Header    
stepBox3-1  box3       10      Reference Header     28      Atlas X Graphic 
stepBox3-1  box3       17      Resliced Image2      23      Atlas Image     
stepBox3-1  box3       17      Resliced Image2      24      Atlas Header    
stepBox3-1  box3       17      Resliced Image2      28      Atlas X Graphic 
stepBox3-1  box3       18      Resliced Header2     23      Atlas Image     
stepBox3-1  box3       18      Resliced Header2     24      Atlas Header    
stepBox3-1  box3       18      Resliced Header2     28      Atlas X Graphic 
stepBox3-1  box3       19      Resliced Image3      23      Atlas Image     
stepBox3-1  box3       19      Resliced Image3      24      Atlas Header    
stepBox3-1  box3       19      Resliced Image3      28      Atlas X Graphic 
stepBox3-1  box3       2       Anatomy Header1      23      Atlas Image     
stepBox3-1  box3       2       Anatomy Header1      24      Atlas Header    
stepBox3-1  box3       2       Anatomy Header1      28      Atlas X Graphic 
stepBox3-1  box3       20      Resliced Header3     23      Atlas Image     
stepBox3-1  box3       20      Resliced Header3     24      Atlas Header    
stepBox3-1  box3       20      Resliced Header3     28      Atlas X Graphic 
stepBox3-1  box3       21      Resliced Image4      23      Atlas Image     
stepBox3-1  box3       21      Resliced Image4      24      Atlas Header    
stepBox3-1  box3       21      Resliced Image4      28      Atlas X Graphic 
stepBox3-1  box3       22      Resliced Header4     23      Atlas Image     
stepBox3-1  box3       22      Resliced Header4     24      Atlas Header    
stepBox3-1  box3       22      Resliced Header4     28      Atlas X Graphic 
stepBox3-1  box3       9       Reference Image      23      Atlas Image     
stepBox3-1  box3       9       Reference Image      24      Atlas Header    
stepBox3-1  box3       9       Reference Image      28      Atlas X Graphic 
  • Results obtained in a user friendly format:
Query1_UBlack.jpg

  • Note: Only the first (DATAIDs 1 to 10) inputs of the workflow are seen by uBlackBox.
    • Query for user “uBio”
      • Query: As previously with WHERE users = ‘uBio’
      • Result:
STEP          STEPCLASS    INPUT   INPUTNAME            OUTPUT   OUTPUTNAME           
9             softmean     15      Resliced Image1      23       Atlas Image          
9             softmean     15      Resliced Image1      24       Atlas Header         
9             softmean     16      Resliced Header1     23       Atlas Image          
9             softmean     16      Resliced Header1     24       Atlas Header         
9             softmean     17      Resliced Image2      23       Atlas Image          
9             softmean     17      Resliced Image2      24       Atlas Header         
9             softmean     18      Resliced Header2     23       Atlas Image          
9             softmean     18      Resliced Header2     24       Atlas Header         
9             softmean     19      Resliced Image3      23       Atlas Image          
9             softmean     19      Resliced Image3      24       Atlas Header         
9             softmean     20      Resliced Header3     23       Atlas Image          
9             softmean     20      Resliced Header3     24       Atlas Header         
9             softmean     21      Resliced Image4      23       Atlas Image          
9             softmean     21      Resliced Image4      24       Atlas Header         
9             softmean     22      Resliced Header4     23       Atlas Image          
9             softmean     22      Resliced Header4     24       Atlas Header         
stepBox1-1    box1         1       Anatomy Image1       15       Resliced Image1      
stepBox1-1    box1         1       Anatomy Image1       16       Resliced Header1     
stepBox1-1    box1         10      Reference Header     15       Resliced Image1      
stepBox1-1    box1         10      Reference Header     16       Resliced Header1     
stepBox1-1    box1         2       Anatomy Header1      15       Resliced Image1      
stepBox1-1    box1         2       Anatomy Header1      16       Resliced Header1     
stepBox1-1    box1         9       Reference Image      15       Resliced Image1      
stepBox1-1    box1         9       Reference Image      16       Resliced Header1     
stepBox1-2    box1         10      Reference Header     17       Resliced Image2      
stepBox1-2    box1         10      Reference Header     18       Resliced Header2     
stepBox1-2    box1         3       Anatomy Image2       17       Resliced Image2      
stepBox1-2    box1         3       Anatomy Image2       18       Resliced Header2     
stepBox1-2    box1         4       Anatomy Header2      17       Resliced Image2      
stepBox1-2    box1         4       Anatomy Header2      18       Resliced Header2     
stepBox1-2    box1         9       Reference Image      17       Resliced Image2      
stepBox1-2    box1         9       Reference Image      18       Resliced Header2     
stepBox1-3    box1         10      Reference Header     19       Resliced Image3      
stepBox1-3    box1         10      Reference Header     20       Resliced Header3     
stepBox1-3    box1         5       Anatomy Image3       19       Resliced Image3      
stepBox1-3    box1         5       Anatomy Image3       20       Resliced Header3     
stepBox1-3    box1         6       Anatomy Header3      19       Resliced Image3      
stepBox1-3    box1         6       Anatomy Header3      20       Resliced Header3     
stepBox1-3    box1         9       Reference Image      19       Resliced Image3      
stepBox1-3    box1         9       Reference Image      20       Resliced Header3     
stepBox1-4    box1         10      Reference Header     21       Resliced Image4      
stepBox1-4    box1         10      Reference Header     22       Resliced Header4     
stepBox1-4    box1         7       Anatomy Image4       21       Resliced Image4      
stepBox1-4    box1         7       Anatomy Image4       22       Resliced Header4     
stepBox1-4    box1         8       Anatomy Header4      21       Resliced Image4      
stepBox1-4    box1         8       Anatomy Header4      22       Resliced Header4     
stepBox1-4    box1         9       Reference Image      21       Resliced Image4      
stepBox1-4    box1         9       Reference Image      22       Resliced Header4     
stepBox2-1    box2         23      Atlas Image          28       Atlas X Graphic      
stepBox2-1    box2         24      Atlas Header         28       Atlas X Graphic      

      • Results obtained in a user friendly format:
Query1_UBio.jpg
      • Note: For security reasons, uBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_warp.
He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.
    • Query for user “uAdmin”
      • Query: As previously with WHERE users = ‘uAdmin’
      • Result:
STEP   STEPCLASS      INPUT  INPUTNAME            OUTPUT  OUTPUTNAME        
--------------------------------------------------------------------------
1      align_warp     1      Anatomy Image1       11      Warp Parameters1  
1      align_warp     10     Reference Header     11      Warp Parameters1  
1      align_warp     2      Anatomy Header1      11      Warp Parameters1  
1      align_warp     9      Reference Image      11      Warp Parameters1  
10     slicer         23     Atlas Image          25      Atlas X Slice     
10     slicer         24     Atlas Header         25      Atlas X Slice     
13     convert        25     Atlas X Slice        28      Atlas X Graphic   
2      align_warp     10     Reference Header     12      Warp Parameters2  
2      align_warp     3      Anatomy Image2       12      Warp Parameters2  
2      align_warp     4      Anatomy Header2      12      Warp Parameters2  
2      align_warp     9      Reference Image      12      Warp Parameters2  
3      align_warp     10     Reference Header     13      Warp Parameters3  
3      align_warp     5      Anatomy Image3       13      Warp Parameters3  
3      align_warp     6      Anatomy Header3      13      Warp Parameters3  
3      align_warp     9      Reference Image      13      Warp Parameters3  
4      align_warp     10     Reference Header     14      Warp Parameters4  
4      align_warp     7      Anatomy Image4       14      Warp Parameters4  
4      align_warp     8      Anatomy Header4      14      Warp Parameters4  
4      align_warp     9      Reference Image      14      Warp Parameters4  
5      reslice        11     Warp Parameters1     15      Resliced Image1   
5      reslice        11     Warp Parameters1     16      Resliced Header1  
6      reslice        12     Warp Parameters2     17      Resliced Image2   
6      reslice        12     Warp Parameters2     18      Resliced Header2  
7      reslice        13     Warp Parameters3     19      Resliced Image3   
7      reslice        13     Warp Parameters3     20      Resliced Header3  
8      reslice        14     Warp Parameters4     21      Resliced Image4   
8      reslice        14     Warp Parameters4     22      Resliced Header4  
9      softmean       15     Resliced Image1      23      Atlas Image       
9      softmean       15     Resliced Image1      24      Atlas Header      
9      softmean       16     Resliced Header1     23      Atlas Image       
9      softmean       16     Resliced Header1     24      Atlas Header      
9      softmean       17     Resliced Image2      23      Atlas Image       
9      softmean       17     Resliced Image2      24      Atlas Header      
9      softmean       18     Resliced Header2     23      Atlas Image       
9      softmean       18     Resliced Header2     24      Atlas Header      
9      softmean       19     Resliced Image3      23      Atlas Image       
9      softmean       19     Resliced Image3      24      Atlas Header      
9      softmean       20     Resliced Header3     23      Atlas Image       
9      softmean       20     Resliced Header3     24      Atlas Header      
9      softmean       21     Resliced Image4      23      Atlas Image       
9      softmean       21     Resliced Image4      24      Atlas Header      
9      softmean       22     Resliced Header4     23      Atlas Image       
9      softmean       22     Resliced Header4     24      Atlas Header      
      • Note: All the data are obtained (same results than for the very first version of query1).

  • New Query: Find everything that caused “Resliced Image1” (ID 15) to be as it is?

    • Query for ‘uBlackBox’
      • Query:
SELECT *
  FROM uProcess  upc 
     WHERE usr = 'uBlackBox'
     START WITH outputName = 'Resliced Image1'
     CONNECT BY PRIOR upc.input = upc.output;  
      • Result: no rows selected (uBlackBox cannot even see ‘Resliced Image1’)

    • Query for ‘uBio’
      • Query: Idem with ‘uBio’
      • Result:
USR   STEP        STEPCLASS  INPUT  INPUTNAME         OUTPUT  OUTPUTNAME        TIME     
uBio  stepBox1-1  box1       1      Anatomy Image1    15      Resliced Image1   8/7/2006 
uBio  stepBox1-1  box1       10     Reference Header  15      Resliced Image1   8/7/2006 
uBio  stepBox1-1  box1       2      Anatomy Header1   15      Resliced Image1   8/7/2006 
uBio  stepBox1-1  box1       9      Reference Image   15      Resliced Image1   8/7/2006
      • Results obtained in a user friendly format:
extraQuery_UBio.jpg

    • Query for ‘uAdmin’
      • Query: Idem with ‘uAdmin’
      • Result:
USR     STEP  STEPCLASS   INPUT  INPUTNAME         OUTPUT  OUTPUTNAME        TIME      
uAdmin  5     reslice     11     Warp Parameters1  15      Resliced Image1   8/12/2006 
uAdmin  1     align_warp  1      Anatomy Image1    11      Warp Parameters1  8/7/2006  
uAdmin  1     align_warp  10     Reference Header  11      Warp Parameters1  8/7/2006  
uAdmin  1     align_warp  2      Anatomy Header1   11      Warp Parameters1  8/7/2006  
uAdmin  1     align_warp  9      Reference Image   11      Warp Parameters1  8/7/2006
      • Results obtained in a user friendly format:
extraQuery_UAdmin.jpg

Note that uAdmin can see that data id 11 (Warp Params 1) is part of the provenance while uBio cannot see it.

Categorisation of queries

Generally speaking, our provenance queries can be categorized as such:

  • step vs data provenance: depending on whether the user is interested in knowing the step(s) or the data that have been used to produce a given data item.
  • immediate vs deep provenance: depending on whether the user is interested in knowing only the previous step (and possibly the data used by this step) or the whole sequence of steps (and possibly the data used by these steps) used to produced a given data item.

According to our classification, all the queries proposed in this challenge are deep provenance queries, they are both step and data provenance queries.

Live systems

Our implementation is not presently available to the public at large. However, we will be happy to share the code with those who are interested in seeing it. Please send us an email.

Further Comments

Conclusions

We have shown how our model is able to represent the challenge workflow and to answer the proposed queries. We have also shown how the notion of a user view can allow the user to manage the complexity of a workflow through higher levels of abstraction and layering.

We would be interested in discussing with the workshop attendees about three points: (i) the meaning of a stage in a workflow (and in particular, how to interpret this notion with respect to user views), (ii) the biological significance of the procedures that can compose a workflow (in the workflow provided, what are the “significant” steps for a scientist?) and (iii) the (general) problem of computing a difference between two workflows (cf. query 7).

Aknowledgments

This work is supported by NSF grants 0513778, 0415810, and 0612177*.

(* Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.)

-- SarahCohenBoulakia - 11 Sep 2006

  • tablesData-1.jpg:
    tablesData-1.jpg

  • tablesData-2.jpg:
    tablesData-2.jpg

  • tablesData-3.jpg:
    tablesData-3.jpg

to top

I Attachment sort Action Size Date Who Comment
wf.png manage 82.3 K 10 Sep 2006 - 17:19 SarahCohenBoulakia  
er.jpg manage 156.2 K 11 Sep 2006 - 20:42 SarahCohenBoulakia  
tables2.png manage 534.5 K 10 Sep 2006 - 17:58 SarahCohenBoulakia  
tables3.png manage 1150.5 K 10 Sep 2006 - 17:59 SarahCohenBoulakia  
tables4.png manage 1062.1 K 10 Sep 2006 - 18:00 SarahCohenBoulakia  
query1.jpg manage 216.0 K 11 Sep 2006 - 01:26 SarahCohenBoulakia  
Query1_UBio.jpg manage 194.3 K 11 Sep 2006 - 21:55 SarahCohenBoulakia  
Query1_UBlack.jpg manage 141.8 K 11 Sep 2006 - 21:55 SarahCohenBoulakia  
query2.jpg manage 142.3 K 11 Sep 2006 - 01:30 SarahCohenBoulakia  
query4.jpg manage 98.9 K 11 Sep 2006 - 01:30 SarahCohenBoulakia  
query3.jpg manage 140.7 K 11 Sep 2006 - 01:31 SarahCohenBoulakia  
extraQuery_UBio.jpg manage 77.5 K 11 Sep 2006 - 22:11 SarahCohenBoulakia  
extraQuery_UAdmin.jpg manage 95.6 K 11 Sep 2006 - 22:12 SarahCohenBoulakia  
tablesData-1.jpg manage 208.5 K 11 Sep 2006 - 22:31 SarahCohenBoulakia  
tablesData-2.jpg manage 199.6 K 11 Sep 2006 - 22:31 SarahCohenBoulakia  
tablesData-3.jpg manage 155.4 K 11 Sep 2006 - 22:31 SarahCohenBoulakia  

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback