
Notes on PC4 Scoping Workshop

Attendance

We had roughly 25 attendees.

Discussion about challenge options

We began by discussing the Challenge Proposals and how broadly to organize the challenge. Two options were discussed.

1) Chop up a single process and interoperate provenance information across systems.

2) Each team generates its own provenance information from its chosen domain, following a general problem pattern. Other teams must then read another team's provenance and answer a set of general provenance questions.

Comments during the discussion of these options

  • We need APIs that go beyond just a common model (Luc Moreau, seconded by James Frew).
  • James Frew thought that provenance queries should be answered on the fly.
  • Some stock pieces (i.e. general processes) came up, including paper publication and human decisions.
  • Jim Myers noted that we need to get rid of the back channel between teams and use only the data.
  • Paulo Pinheiro da Silva discussed whether OPM was complete enough to allow for interoperability.

Types/Patterns of Processes to Consider

There was general agreement that all the proposed scenarios involved several types of process (or "process patterns") that were viewed as important for PC4 to address. The group listed these patterns as follows:

  • User decision points
  • Why the user made a decision (this includes the assumptions of the user)
  • Publish data at a URL (to some long term storage)
  • Discovering data by queries
  • Computational workflows
  • Citing data in a paper (not talking about peer review)
  • manipulating collections of data and the collection itself (e.g. sending a zip file and its contents)
  • collaborative editing spaces (e.g. wiki)
  • exchange of data between services, for example a web service request (note services can be workflow systems)
  • social collaboration (e.g. twitter, email, instant messaging)
  • Publish data to a third party (e.g. publishing to the cloud)
  • People running services provided by others
  • People running services over data provided by others
  • Services that use different credentials

Teams then identified which patterns they would be interested in supporting/implementing. The following is a tentative list of the teams that said they would be interested.

Interested Teams

  • User decision points
    • Kings, Indiana, Rio, Abdn
  • Why the user made a decision (this includes the assumptions of the user)
    • RPI, Kings, Indiana, Rio, Abdn
  • Publish data at a URL (to some long term storage)
    • RPI, Indiana, Rio, SDSC, UTEP, MSR, UCSB, Soton
  • Discovering data by queries
    • RPI, Indiana, UTEP, MSR, UCSB
  • Computational workflows
    • Kings, Indiana, Rio, Swift, SDSC, Abdn, MSR, UCSB
  • Citing data in a paper (not talking about peer review)
    • UCSB, Soton, VU
  • manipulating collections of data and the collection itself (e.g. sending a zip file and its contents)
    • RPI, Rio, UCSB
  • collaborative editing spaces (e.g. wiki)
  • exchange of data between services (where services can be workflow systems)
    • UTEP, MSR, SDSC, Soton
  • social collaboration (e.g. twitter, email, instant messaging)
    • Abdn, UTEP, MSR, RPI, Soton, VU
  • Publish data to a third party
    • Rio, MSR, UCSB, VU
  • People running services provided by others
  • People running services over data provided by others
  • Services that use different credentials
    • RPI

For each pattern, we came up with example provenance queries that illustrate the need for that sort of process; a rough sketch of how one such query might be answered over an OPM-style graph follows the list.

Provenance Queries

  • What are the tweets related to this publication?
  • How many times did the user decide to reperform part of the process and on the basis of what assumptions?
  • Provide the full provenance of a publication?
  • Provide the original data sources used in the production of a publication?
  • What was the impact of a data item on other data items and specifically, which were derived from it?
  • Who accessed this data?
  • Give me two provenance accounts that used the same artifacts?
  • Who found this data through a query?
  • What data items were added to a collection by two different workflows?
  • Who edited this page the most? Who alternated in editing a page?
  • Who edited my content on a page?
  • What activity triggered an activity in another system?
  • Where did the input data come from for an execution of a service (e.g. which user?)?
  • Who controlled a service and when?
  • Who deleted a data item and how many copies exist?
  • Who changed a data item?
  • Where did a third party send my data?
  • Did anybody make money off my data?
  • Why didn't my workflow reproduce?
  • Did anyone plagiarise my data?
  • Am I using the most up-to-date services and data?
  • What version of software was used to produce a data item?
  • What credentials were used to produce and/or access a data item?
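
As a rough illustration (not part of the workshop output), consider the query "which data items were derived from this one?". A minimal sketch of answering it over an in-memory graph of OPM-style wasDerivedFrom edges is given below; the artifact names and the edge representation are hypothetical, not an agreed OPM API.

  # Minimal sketch: answering "which data items were derived from this one?"
  # over a set of OPM-style wasDerivedFrom edges held in memory.
  # Artifact names and the edge representation are hypothetical.
  from collections import defaultdict

  # wasDerivedFrom edges: (derived_artifact, source_artifact)
  edges = [
      ("results.csv", "raw_data.dat"),
      ("paper.pdf", "results.csv"),
      ("figure1.png", "results.csv"),
  ]

  derived_by_source = defaultdict(list)
  for derived, source in edges:
      derived_by_source[source].append(derived)

  def derived_from(artifact):
      """Return all artifacts transitively derived from `artifact`."""
      seen, stack = set(), [artifact]
      while stack:
          for d in derived_by_source[stack.pop()]:
              if d not in seen:
                  seen.add(d)
                  stack.append(d)
      return seen

  print(derived_from("raw_data.dat"))  # {'results.csv', 'paper.pdf', 'figure1.png'}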

Scenario Selection

There was a strong debate about whether we should adopt one scenario (e.g. the crystallography scenario) as with past challenges or allow for multiple different scenarios. The two key issues were:
  1. If one scenario was used, some teams would not have the bandwidth to participate.
  2. If we allowed multiple scenarios, interoperation across many teams would most likely suffer; teams would probably choose to work only with scenarios where interoperation was easy or where teams already worked closely together.

The resolution of this problem was the following compromise:

  • One abstract scenario consisting of a pipeline of the patterns identified above. The abstract scenario identifies the connection points between patterns. Each connection point identifies the type of data that would result after the execution of that pattern (e.g. a PDF file, a text file, etc.).
  • One "suggested" scenario (namely, the crystallography workflow) that follows the abstract scenario, where the executables, as well as the intermediate data sets, are provided to all. Again, as in PC3, dummy services are allowed.
  • However, teams do not have to run the suggested scenario. Instead, they can implement part of the abstract scenario in their own domain. They just need to ensure that their process fits the abstract scenario and that they provide data of the type specified by the connection point. For example, where the crystallography workflow may output a crystallography paper PDF, a corresponding bioinformatics workflow would output a bioinformatics paper PDF. (A rough sketch of this connection-point idea follows the list.)
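
To make the compromise concrete, here is a minimal sketch of how the abstract scenario and its typed connection points could be written down. The pattern names, connection-point names, and data types below are illustrative assumptions, not the agreed vocabulary.

  # Sketch: the abstract scenario as a pipeline of patterns joined by typed
  # connection points. All names and types here are illustrative only.
  from dataclasses import dataclass

  @dataclass
  class ConnectionPoint:
      name: str
      data_type: str   # e.g. "pdf", "csv", "text"

  @dataclass
  class Pattern:
      name: str
      produces: ConnectionPoint

  abstract_scenario = [
      Pattern("computational workflow", ConnectionPoint("results", "csv")),
      Pattern("citing data in a paper", ConnectionPoint("publication", "pdf")),
      Pattern("publish data to a third party", ConnectionPoint("published copy", "pdf")),
  ]

  def fits(team_output_type: str, point: ConnectionPoint) -> bool:
      """A team's own-domain implementation fits if it yields data of the type
      the connection point specifies (any paper PDF, for example, whether
      crystallography or bioinformatics)."""
      return team_output_type == point.data_type

  print(fits("pdf", abstract_scenario[1].produces))  # True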

Organization of the Challenge

Organization of Teams

  • Teams specify the input and output for the portions of the process they implement
  • They upload those files to a wiki.
  • Have an automated way of getting from a data file to its provenance (a possible convention is sketched after this list)
  • We need to define the common vocabulary for the challenge
  • Try to do this asynchronously on a per-pattern basis
  • It would be nice to have expected answers to the provenance queries
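
One possible convention for the "automated way of getting from a data file to its provenance" point is sketched below. The URL layout and the ".opm.xml" suffix are assumptions for illustration, not something decided at the workshop.

  # Sketch: derive the provenance URL from the data file's URL on the wiki.
  # The ".opm.xml" suffix and the example URL are assumptions.
  def provenance_url(data_url: str) -> str:
      """Where a data file's OPM provenance would live under this convention."""
      return data_url + ".opm.xml"

  print(provenance_url("http://example.org/pc4/team-x/results.csv"))
  # -> http://example.org/pc4/team-x/results.csv.opm.xml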

Stages

  1. Teams generate OPM for the part of the process they are responsible for, in either XML or OWL. These are uploaded to the wiki; we need a convention for grabbing a bunch of OPM graphs (tar ball or wget?).
  2. Teams load all OPM into their provenance systems and perform queries (for one run, and then see what happens). A rough sketch of this step follows the list.
  3. Decide if/how we go about implementing distributed provenance query across systems
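
A rough sketch of stage 2, assuming each team's OPM graph has already been fetched and simplified to a set of (cause, relation, effect) triples; real OPM XML/OWL parsing is omitted and the file format below is hypothetical.

  # Sketch of stage 2: load every team's (simplified) OPM graph and merge
  # them into one store before running the provenance queries.
  def load_graph(path):
      """Hypothetical loader: one tab-separated (cause, relation, effect) triple per line."""
      triples = set()
      with open(path) as f:
          for line in f:
              cause, relation, effect = line.strip().split("\t")
              triples.add((cause, relation, effect))
      return triples

  def merge(graphs):
      """Union of all teams' graphs; real OPM merging would also reconcile accounts."""
      merged = set()
      for g in graphs:
          merged |= g
      return merged

  # store = merge(load_graph(p) for p in ["team_a.tsv", "team_b.tsv"])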

Timeline

  1. Abstract Scenario
  2. Identify all the data flowing in the system with respect to the crystallography scenario (this can be mocked up); where possible we have example data (August 30)
  3. For each pattern of the process, produce a mock-up of the OPM graph with respect to the data in step 2 and make sure they stitch together (Nov 30)
  4. Finalize queries with respect to scenario (Dec 15)
  5. Import and implement queries over the mockup (Feb 28)
  6. Generate and publish Provenance for each pattern (Feb 28)
  7. Import and Implement Queries over the generated provenance (Mar 30)
  8. Decide whether to do API compatibility
  9. Prepare slides for challenge [Jun 1 - Jun 8]
  10. PC4 Workshop June 10

It was suggested that we try to co-locate with SIGMOD on June 12.

-- PaulGroth - 07 Jul 2010