Open Provenance Model


Discussions On Draft v1.1


Comments (by Yolanda Gil)

I have four major topics for suggested improvements. Of course, I may have misunderstood the model; if so, let me know. They are:

1) ATTRIBUTION OF PROVENANCE. There is no mechanism in OPM to assert attribution of a provenance graph or portion thereof. That is, you can state in OPM that "the OPM document was first drafted by Luc and then went through a series of edits by Paul and Luc", but you cannot say that I (Yolanda) stated that. This seems to me pretty important and I envision requirements along these lines for provenance on the web.

2) COMPOSITE ARTIFACTS. Artifacts are not decomposable, and one cannot refer to the provenance of their constituents. That is, one can express in OPM that a process takes an artifact and creates a new one, but one cannot express that there is a delta between the result and the original and state that the delta's provenance is that process. I see this as a major issue in terms of expressing the provenance of large graphs that are assembled and manipulated by various processes. Same issue if you have a document (or a resource) that is edited (changed) and evolves over time with minor deltas after each modification.

3) IDENTITIES IN OCCURRENCES. There is no clean representation of process occurrences and artifact/agent state. The OPM document states clearly that "The OPM is a model of artifacts in the past, explaining how they were derived." So at the moment unique IDs are assigned to each occurrence of an artifact, agent, and process in the provenance graph. But I do not find anything in OPM to state that two IDs in the graph correspond to the same object (e.g., that the agent in process P34562456 and the agent in process P45234563456 are the same agent).

4) ENTITIES vs AGENTS. Agents are a particular kind of entity that one can associate with a process, but not the only kind. I think that a more general notion such as "Entity" (which can be anything that is relevant in some way to the process and is worth recording) would be more appropriate. For example, in the current OPM, if I had to represent the execution of the workflow by our system and wanted to state that a "clustering" process was executed on a specific host, submitted by me under my grid certificate, and that it used up 10 units from my TeraGrid allocation, I would either have to make all those agents or represent all that outside of the model. I am not sure they should be considered "agents": in some cases their participation in the process is rather passive, or they are simply a resource that is needed but not the driver of the process. So I think that in capturing provenance one would want the model to be able to capture all those as "Entity", so they would all be included under the model. In some process models those would be called resources, but I prefer entity, since the term "resource" already has a clear meaning on the web.

I hope this helps. I'd be happy to circulate these to the OPM list if that would bring in another voice/perspective.

From Yolanda Gil


Response from Luc:

1) That's something that we did in PASOA and we don't do (yet?) in OPM. A place for attribution would be to attach attribution to accounts or to the graph itself (as a form of annotation). I can also see downsides to this ... accounts may be coarser grained than the actual assertions of OPM (though in OPM, we have not specified what the notion of assertions is, and what its granularity is).

In fact, we also have to accept that an OPM graph may be extracted from a set of assertions (e.g. PASOA model, or RDF triples) with proper attribution.

2) We now have an embryonic collection profile, which allows us to talk about collections of artifacts, and hence would address your former requirement. For the latter, Simon Miles has also begun to map Dublin Core to OPM, and in particular, the notion of stateful resource. So again, a profile should be able to address this.

3) In OPM 1.1, there is now a property "pname" for persistent name. So in your case, I would annotate the two agents with the same pname.

4) The concept of agent in OPM is defined weakly. Its intent is that it is an "entity" that catalyses/controls an execution. To some extent, I don't see the distinction between agent and entity. However, we need better guidance on how to use agents, and your example is one that we should use as a use case.
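
As an illustration of point 3, here is a minimal sketch of how occurrences annotated with the same pname could be recognised as the same object. The record layout, the field names other than "pname", and the example URNs are assumptions made for this sketch; OPM 1.1 does not prescribe them.

```python
from collections import defaultdict

# Hypothetical node records: each occurrence has its own graph ID,
# plus an optional "pname" (persistent name) annotation.
agents = [
    {"id": "ag1", "controls": "P34562456", "pname": "urn:example:agent:yolanda"},
    {"id": "ag2", "controls": "P45234563456", "pname": "urn:example:agent:yolanda"},
    {"id": "ag3", "controls": "P999", "pname": "urn:example:agent:luc"},
]


def group_by_pname(nodes):
    """Map each pname to the IDs of the occurrences that carry it."""
    groups = defaultdict(list)
    for node in nodes:
        pname = node.get("pname")
        if pname is not None:  # nodes without a pname cannot be matched
            groups[pname].append(node["id"])
    return dict(groups)


same_object = group_by_pname(agents)
```

In this sketch, the two agents attached to processes P34562456 and P45234563456 end up in the same group because they share a pname, while their graph IDs stay distinct.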


Comments (by Simon Miles)

I put my comments on the OPM 1.1 spec below. I'm happy to also put them up on the Wiki discussion page if that would be helpful to you, but most are minor fixes I think.

In general, fine, and the document looks good. I believe the change proposals have all been addressed correctly. I'll give the rest of my comments in the order of the document.

p2, line 4 from bottom: suggest changing "to discuss the version of the specification" to "to discuss that version of the specification"

p6, line 9 from bottom: typo "distinghish"

I find Section 3.2 quite complex, with definitions interspersed with somewhat tangential discussion justifying why we chose those exact definitions. I wonder whether we could instead put the discussions in a later section (e.g. a new section between 3.4 and 3.5, or in section 10, Discussion). We could then refer to these later clarifications by saying what questions they answer, e.g. ("see Section X for why OPM requires only necessary, not sufficient causation,", "see Section X for how OPM deals with used edges denoting artifacts required for a process to start", "see Section X for why OPM does not require process completion before artifacts are generated" etc.) Just a minor suggestion.

p7, discussion under Defn 5: this is the first (and only?) time subtyping of edges is mentioned, I think. I suggest stating earlier (e.g. first paragraph of 3.2) that edges can be subtyped from the basic categories.

p7, last sentence of discussion under Defn 5: typos "compositionaly", "soes", "by a"

Definition 8: "to be generated." should be "to have been generated." as everything in OPM happens in the past.

p9, line 7: "not recommended, for roles to be the same within a context" - can we justify why not? It is not clear to me from the text (or in general).

Figures 4 and 5: it is not clear whether the edge labels denote (sub)types or roles here. I guess subtypes, because we have previously put roles in brackets, but it would be good to state explicitly.

p12, paragraph under Figure 5: "On the contrary, ..." should be "In comparison, ..." (or, less strongly, "On the other hand, ..."). "On the contrary" means that the previous sentence contained an assertion which is untrue (which it doesn't), as in "It may be thought that 0=1+1. On the contrary, 0=1-1."

p12, paragraph under Figure 5: I'm not sure that the analogy to AND/OR graphs provides clarity. As I understand, an OR would mean that the artifact was generated by only one of the processes for which there is a "was generated by". Instead, don't we want to suggest that the artifact was generated by all those processes, but that all the processes are the same thing seen at different granularities?

p15, line 4 from the bottom: typo: ".." at end of sentence

Section 5: the temporal constraints (T1<T3...) seem to assume a single clock, as mentioned in the text. Can we justify this? I'm unclear why this would be reasonable in an OPM graph - are we saying that whenever an OPM graph is produced all clock times should be adjusted to synchronize with one clock?

Section 5/Figure 9: I believe the temporal constraints assume a definition of "causal dependency" (as used in Defn 4) that we have not given. I assume that we would prefer not to give a definition beyond saying "necessary but not necessarily sufficient". If so, I suggest we change < to <= in the temporal constraints to allow for "weaker" notions of causal dependency. For example, in the collections profile, I believe we use "contains" relation between collection and element. I don't think this dependency is a distinction in time: the element does not need to exist AFTER the collection. Weakening to <= allows for dependencies such as "contains" to express causal relations between simultaneously existing artifacts.

Section 6, rules 1 to 6: there seems some inconsistency in that we say what artifacts and processes represent in actuality, but not accounts, agents or edges (though we say what the latter represent in Section 3.2).

Section 6, rule 7: "Roles are mandatory..." appears to contradict "It is recommended to give roles whenever possible" on page 9.

Section 6, rule 14: "Processes without "was controlled by" edge" -> in a whole graph or in one account view?

Section 6, rule 15: "not not" -> "not"

Section 6, rule 18: This is not intuitive to understand. By this definition, two accounts with no inferred dependencies would be refinements of each other, I think.

Section 7.2: It would be helpful to say what the starred relationships represent in actuality, as it otherwise may be hard for the receiver of an OPM graph to correctly interpret them. Just saying what they mean in terms of expansion to their non-starred versions does not seem, to me, to be enough to correctly interpret them (I find WasDerivedFrom* and WasTriggeredBy* particularly difficult to distinguish in meaning from WasDerivedFrom and WasTriggeredBy).

Section 8: It would be very useful (especially for my Dublin Core profile!) to have a specification on how annotations should be depicted graphically, as with the time metadata.

Section 8.1, rule 1: typo: "distincts"

Section 8.1, rule 2: inconsistent capitalisation (Graph, node, Role, annotation)

Section 8.1, rule 5: I think it is vital to require that the accounts of an annotation should be a subset of those of the annotated entity. Otherwise, it is unclear what the annotation is annotating and allows OPM to be used as general RDF. For example, imagine the effective accounts of an artifact X are {A,B} and it has been given value annotations with the values encoded differently by different asserters in accounts A and B. Later, someone comes along and adds an encoding annotation of "XML" in account C. It would be unclear which of A or B this encoding referred to: X has not been referred to at all in account C (i.e. C is not one of X's accounts) and so has no meaning in the OPM graph. If the annotater wishes to give an alternative account of the artifact, then they can add C to X's accounts before providing an annotation, thus declaring an account of X's provenance independent from A and B.

Section 9, sentence 1: "toplevel" -> "top level" or "top-level"

Section 9, end of paragraph 1: "process graph" -> "process the graph"

Section 9, element 1: "Such profile" -> "Such a profile"

Section 9, element 2: "Such controlled" -> "Such a controlled"

Section 9, final paragraph: "off-the-shelve" -> "off-the-shelf"

Section 11, sentence 1: "open provenance model" -> "Open Provenance Model"

-- SimonMiles - 24 November 2009


Response from Luc:

  > Luc,
  > 
  > Sorry it's taken a while to get to this.  I put my comments on the OPM
  > 1.1 spec below.  I'm happy to also put them up on the Wiki discussion
  > page if that would be helpful to you, but most are minor fixes I
  > think.

Thanks Simon. Some responses interleaved.

  > 
  > In general, fine, and the document looks good.  I believe the change
  > proposals have all been addressed correctly.  I'll give the rest of my
  > comments in the order of the document.
  > 
  > p2, line 4 from bottom: suggest changing "to discuss the version of
  > the specification" to "to discuss that version of the specification"

Done.

  > 
  > p6, line 9 from bottom: typo "distinghish"

Done.

  > 
  > I find Section 3.2 quite complex, with definitions interspersed with
  > somewhat tangential discussion justifying why we chose those exact
  > definitions. I wonder whether we could instead put the discussions in
  > a later section (e.g. a new section between 3.4 and 3.5, or in section
  > 10, Discussion).  We could then refer to these later clarifications by
  > saying what questions they answer, e.g. ("see Section X for why OPM
  > requires only necessary, not sufficient causation,", "see Section X
  > for how OPM deals with used edges denoting artifacts required for a
  > process to start", "see Section X for why OPM does not require process
  > completion before artifacts are generated" etc.)  Just a minor
  > suggestion.

No action

  > 
  > p7, discussion under Defn 5: this is the first (and only?) time
  > subtyping of edges is mentioned, I think. I suggest stating earlier
  > (e.g. first paragraph of 3.2) that edges can be subtyped from the
  > basic categories.

It was the first time it was mentioned. Suggestion is followed (added a sentence after dfn 4). Note that subtyping is mentioned at several places subsequently.

  > 
  > p7, last sentence of discussion under Defn 5: typos "compositionaly",
  > "soes", "by a"

Done

  > 
  > Definition 8: "to be generated." should be "to have been generated."
  > as everything in OPM happens in the past.
  > 

Yes.

  > p9, line 7: "not recommended, for roles to be the same within a
  > context" - can we justify why not?  It is not clear to me from the
  > text (or in general).

Text changed and hopefully clarified. Roles do not have to be unique. Roles are mandatory. There is a reserved value "undefined".

  > 
  > Figures 4 and 5: it is not clear whether the edge labels denote
  > (sub)types or roles here.  I guess subtypes, because we have
  > previously put roles in brackets, but it would be good to state
  > explicitly.

Added: " In these figures, edges of the type ``was derived from'' are subtyped, and their subtype made explicit as a label to the edge."

  > 
  > p12, paragraph under Figure 5: "On the contrary, ..." should be "In
  > comparison, ..." (or, less strongly, "On the other hand, ...").  "On
  > the contrary" means that the previous sentence contained an assertion
  > which is untrue (which it doesn't), as in "It may be thought that
  > 0=1+1. On the contrary, 0=1-1."

-> on the other hand

  > 
  > p12, paragraph under Figure 5: I'm not sure that the analogy to AND/OR
  > graphs provides clarity. As I understand, an OR would mean that the
  > artifact was generated by *only one* of the processes for which there
  > is a "was generated by". Instead, don't we want to suggest that the
  > artifact was generated by all those processes, but that all the
  > processes are the same thing seen at different granularities?

I quite liked this analogy. But like all analogies, it has limitations. You identified one. I don't know how to fix it, except by removing it.

  > 
  > p15, line 4 from the bottom: typo: ".." at end of sentence

Done

  > 
  > Section 5: the temporal constraints (T1<T3...) seem to assume a single
  > clock, as mentioned in the text. Can we justify this? I'm unclear why
  > this would be reasonable in an OPM graph - are we saying that whenever
  > an OPM graph is produced all clock times should be adjusted to
  > synchronize with one clock?

Good point, it needs clarification.

So, the relation "happened before" is now used (from Lamport; it is defined independently of clocks). When clocks are the same or synchronised, we can do the actual comparison.

So, I propose the following changes.

  1. If effect e is caused by cause c, we say that c "happens before" e, with "happens before" being a partial order (as defined by Lamport).
  2. If c happens before e, and clocks for c and e are the same or synchronised: then T_c <= T_e
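
The two proposed rules can be sketched as follows. This is only an illustration under stated assumptions: the event table, the clock identifiers, and the `synchronised` set are hypothetical, and the sketch simply skips the time comparison when the clocks are not known to be the same or synchronised.

```python
# Each event carries the identifier of the clock that timestamped it and a time.
# All values are made up for illustration.
events = {
    "c": ("clockA", 10),
    "e": ("clockA", 10),
    "f": ("clockB", 3),
}

# Pairs of clocks assumed to be synchronised (hypothetical; empty here).
synchronised = set()


def comparable(clock1, clock2):
    """Times may be compared only for the same or synchronised clocks."""
    return clock1 == clock2 or frozenset((clock1, clock2)) in synchronised


def violates(cause, effect):
    """True only if the times are comparable and contradict T_c <= T_e."""
    clock_c, t_c = events[cause]
    clock_e, t_e = events[effect]
    if not comparable(clock_c, clock_e):
        return False  # happens-before still holds, but no time check applies
    return t_c > t_e


# (cause, effect) pairs: the cause happens before the effect.
happens_before = [("c", "e"), ("c", "f")]
violations = [(c, e) for c, e in happens_before if violates(c, e)]
```

Note that "c" and "e" carry equal times on the same clock, which is accepted because the constraint is `<=` rather than `<`; the pair ("c", "f") is not checked at all because their clocks are unrelated in this sketch.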

  > Section 5/Figure 9: I believe the temporal constraints assume a
  > definition of "causal dependency" (as used in Defn 4) that we have not
  > given. I assume that we would prefer not to give a definition beyond
  > saying "necessary but not necessarily sufficient". If so, I suggest we
  > change < to <= in the temporal constraints to allow for "weaker"
  > notions of causal dependency. For example, in the collections profile,
  > I believe we use "contains" relation between collection and element. I
  > don't think this dependency is a distinction in time: the element does
  > not need to exist AFTER the collection. Weakening to <= allows for
  > dependencies such as "contains" to express causal relations between
  > simultaneously existing artifacts.

I am not sure this scenario does exist.

However, the point above indicates that we use <=.

I am told it does not break some theoretical result.

  > 
  > Section 6, rules 1 to 6: there seems some inconsistency in that we say
  > what artifacts and processes represent in actuality, but not accounts,
  > agents or edges (though we say what the latter represent in Section
  > 3.2).

Note: the text "(irrespective of their placeholder contents)" has now been removed from process and agent.

I have indicated now what they all represent. What do you think of what I wrote for account?

  > 
  > Section 6, rule 7: "Roles are mandatory..." appears to contradict "It
  > is recommended to give roles whenever possible" on page 9.

Yes, messy. Mandatory it should be, with the possibility of stating "unknown" or "unspecified".

  > 
  > Section 6, rule 14: "Processes without "was controlled by" edge" -> in
  > a whole graph or in one account view?

In an account.

  > 
  > Section 6, rule 15: "not not" -> "not"

OK

  > 
  > Section 6, rule 18: This is not intuitive to understand. By this
  > definition, two accounts with no inferred dependencies would be
  > refinements of each other, I think.

TODO ?????

  > 
  > Section 7.2: It would be helpful to say what the starred relationships
  > represent in actuality, as it otherwise may be hard for the receiver
  > of an OPM graph to correctly interpret them. Just saying what they
  > mean in terms of expansion to their non-starred versions does not
  > seem, to me, to be enough to correctly interpret them (I find
  > WasDerivedFrom* and WasTriggeredBy* particularly difficult to
  > distinguish in meaning from WasDerivedFrom and WasTriggeredBy).

Added two definitions, and figure, and comment on figure.

  > 
  > Section 8: It would be very useful (especially for my Dublin Core
  > profile!) to have a specification on how annotations should be
  > depicted graphically, as with the time metadata.

Yes, suggestion?

  > 
  > Section 8.1, rule 1: typo: "distincts"

OK

  > 
  > Section 8.1, rule 2: inconsistent capitalisation (Graph, node, Role, annotation)

OK.

  > 
  > Section 8.1, rule 5: I think it is vital to require that the accounts
  > of an annotation should be a subset of those of the annotated entity.
  > Otherwise, it is unclear what the annotation is annotating and allows
  > OPM to be used as general RDF.  For example, imagine the effective
  > accounts of an artifact X are {A,B} and it has been given value
  > annotations with the values encoded differently by different asserters
  > in accounts A and B. Later, someone comes along and adds an encoding
  > annotation of "XML" in account C. It would be unclear which of A or B
  > this encoding referred to: X has not been referred to at all in
  > account C (i.e. C is not one of X's accounts) and so has no meaning in
  > the OPM graph.  If the annotater wishes to give an alternative account
  > of the artifact, then they can add C to X's accounts before providing
  > an annotation, thus declaring an account of X's provenance independent
  > from A and B.

TODO I suggest the effective account of a node, includes those of the annotation.

  > 
  > Section 9, sentence 1: "toplevel" -> "top level" or "top-level"

OK

  > 
  > Section 9, end of paragraph 1: "process graph" -> "process the graph"

OK

  > 
  > Section 9, element 1: "Such profile" -> "Such a profile"
  > 
  > Section 9, element 2: "Such controlled" -> "Such a controlled"
  > 
  > Section 9, final paragraph: "off-the-shelve" -> "off-the-shelf"
  > 
  > Section 11, sentence 1: "open provenance model" -> "Open Provenance Model"

OK

-- LucMoreau - 09 Dec 2009


Response to Response from Simon:

  > I have indicated now what they all represent. What do you think of
  > what I wrote for account?

Yes, the definition of account seems fine. I'm not sure that for agent is clear enough for unambiguous use, but as you replied to Yolanda, this is a wider problem to be solved through use cases.

  >  > Section 7.2: It would be helpful to say what the starred relationships
  >  > represent in actuality, as it otherwise may be hard for the receiver
  >  > of an OPM graph to correctly interpret them. Just saying what they
  >  > mean in terms of expansion to their non-starred versions does not
  >  > seem, to me, to be enough to correctly interpret them (I find
  >  > WasDerivedFrom* and WasTriggeredBy* particularly difficult to
  >  > distinguish in meaning from WasDerivedFrom and WasTriggeredBy).
  >
  > Added two definitions, and figure, and comment on figure.

I think there's a variable mix-up in the definition of WDF*: should swap a1 and a2 in "It expresses that artifact a2 had an influence on artifact a1." to match the figure.

I'm still a bit uneasy about the definition of WDF*. WasDerivedFrom is defined as indicating "that artifact A1 needs to have been generated for A2 to be generated", while WasDerivedFrom* indicates "that artifact A1 had an influence on artifact A2". But surely, as WDF* is just multiple WDFs, we cannot assume A1 had any more influence on A2 than with a single WDF, and A1 would have had to be generated for A2 to be generated in both WDF and WDF*? So is there any difference?

Another thought is that a single WDF between artifacts in the graph A1 -> P -> A2, appears to be a WDF* in a refined version of the process (e.g. A1 -> Pa -> A3 -> Pb -> A2; A2 -> A3 -> A1). Isn't that plausible? If so, would we lose anything by just removing WDF* and saying multiple WDFs are equivalent to one WDF?

  > I suggest the effective account of a node, includes those of the annotation.

Unless I'm misunderstanding, I don't think that would solve the problem I suggested. My concern is that an annotation should annotate some existing account of what happened, rather than being information about an entity's provenance without anything being expressed in OPM nodes and edges (as I think having an annotation with its own separate account implies).

-- SimonMiles - 11 December 2009


OK, the definition of WasDerivedFrom needs to be strengthened.

The two notions WasDerivedFrom and WasDerivedFrom* are related: simply, the latter is the transitive closure of the former. The difference between the two is seen in the completion rule: the single-step WasDerivedFrom can be "completed" by process introduction with used/wasGeneratedBy edges. WasDerivedFrom* would be completed by potentially introducing multiple processes.
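
To make the transitive-closure reading concrete, here is a minimal sketch that computes the multi-step relation from a set of single-step derivation edges. The edge representation as (effect, cause) pairs and the artifact names are assumptions for illustration only.

```python
def transitive_closure(edges):
    """Return all (effect, cause) pairs reachable through one or more edges."""
    direct = {}
    for effect, cause in edges:
        direct.setdefault(effect, set()).add(cause)

    closure = set()
    for start in direct:
        stack = list(direct[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # already reached from this start node
            seen.add(node)
            closure.add((start, node))
            stack.extend(direct.get(node, ()))
    return closure


# Hypothetical single-step edges: a3 wasDerivedFrom a2, a2 wasDerivedFrom a1.
was_derived_from = {("a3", "a2"), ("a2", "a1")}
was_derived_from_star = transitive_closure(was_derived_from)
```

Here the closure contains the two single-step pairs plus the additional pair ("a3", "a1"), which no single-step edge asserts directly.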

Your example is good but shows that A1 is generated by two different processes, so these processes must belong to different accounts. The same goes for WasDerivedFrom and WasDerivedFrom*: what can be seen as a single step in one account can be multi-step in another.

Regarding accounts above: sorry, I didn't do that in the end. Instead, I said that the list of accounts on an annotation must be a subset of the accounts of the annotated entity.
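
The subset rule on annotation accounts can be checked mechanically. The sketch below uses Simon's earlier example (artifact X with accounts {A, B} and an annotation asserted in account C); the function name and the use of plain sets of account names are assumptions, not part of the specification.

```python
def annotation_accounts_valid(annotation_accounts, entity_accounts):
    """An annotation's accounts must be a subset of the annotated entity's."""
    return set(annotation_accounts) <= set(entity_accounts)


# Artifact X belongs to accounts A and B (example from the discussion above).
x_accounts = {"A", "B"}

ok = annotation_accounts_valid({"A"}, x_accounts)        # annotation in account A
rejected = annotation_accounts_valid({"C"}, x_accounts)  # annotation in account C
```

Under this rule the annotation in account C is rejected unless C is first added to X's accounts.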

-- LucMoreau - 11 Dec 2009
