
Second Provenance Challenge: PASS

Participating Team

  • Short team name: PASS
  • Participant names: Uri Braun, Kiran-Kumar Muniswamy-Reddy, David A. Holland, Margo I. Seltzer
  • Project URL: http://www.eecs.harvard.edu/syrah/pass
  • Reference to first challenge results (if participated): PASS

Differences from First Challenge

We did the first challenge using our first prototype, PASSv1; however, PASSv1 has reached the end of its useful lifespan. We are doing the second challenge with our new system, PASSv2, which has a smarter data model, a new query system, and various other changes.

The workflow representation is still the same - unmodified user tasks, shell scripts, or whatever. The provenance representation has been extended; the biggest semantic change is that we now distinguish identity information from ancestry information. Identity information describes an object; ancestry information specifies an object's connections to other objects. Identity information is furthermore shared by all versions of an object, whereas ancestry information is tied to specific versions.

A more complete description is posted along with the data.

Important note: the kernel-level provenance collector in PASSv2 was not ready in time for the February data deadline, and as things turned out still wasn't quite ready a week before the June deadline either. So our posted data is from a different provenance collection tool: a user-level system we developed for prototyping and debugging. This tool, which we call "pesto", uses the BSD ktrace facility to record system calls during workload execution, and then applies a series of transforms to produce queryable provenance data, using the same backend representation and tools as the kernel system. We believe pesto probably works much like the ES^3 group's tool based on Linux strace, but we haven't compared them directly.

The provenance data collected with pesto is supposed to be the same as that collected by the PASSv2 kernel; in practice some details are slightly different. Both improve on the PASSv1 data, with better organization and less noise. (PASSv1 was written assuming that extra versions of executing processes would not be visible on disk, and thus created them with wild abandon; unfortunately, the assumption proved false. See the discussions of Q1 and Q7 on the PASS first challenge page.)

Another important difference is that PASSv2 has an all-new query engine. We are still defining the query language, which is based on regular expression matching of paths through the ancestry graph; but already it far outperforms the PASSv1 tool in expressivity, clarity of results, and execution speed.

Provenance Data for Workflow Parts

All the data is here: http://www.eecs.harvard.edu/syrah/pass/ipaw-challenge2/

Model Integration Results

We wrote converters for the PASOA and MINDSWAP data sets.

We have successfully been able to import and query the PASOA data, and to splice the PASOA data together with our own data.

We have been able to import the MINDSWAP data and run some queries, but cannot splice it to our own or to the PASOA data. Discussion below.

The specific combinations we have run are:

  • PASS-PASS-PASS
  • PASOA-PASOA-PASOA
  • PASS-PASOA-PASS
  • PASOA-PASS-PASOA
  • MINDSWAP-MINDSWAP-MINDSWAP (partially)

Translation Details

Our "pesto" system is based on the idea of traces: it defines a text-based "provenance trace" format that represents the execution of a workload in terms of provenance data. These traces can then be loaded into our backend database and queried over. We chose to import data by translating to this format rather than directly loading it into the database, primarily to ease processing.

We wrote somewhat ad hoc conversion tools in Java. These tools are (or will be) posted here: http://www.eecs.harvard.edu/syrah/pass/ipaw-challenge2/

Our system does not run workloads as distinct identifiable units, so we have no built-in notion of workload composition; we therefore wrote a tool to concatenate traces in our provenance trace format. It must do two things: avoid naming conflicts, which is straightforward, and connect up ("splice") instances of objects that appear in more than one of the input sub-traces, which is not. Since we felt that creating lists of equivalent objects by hand would be cheating, we built the tool to work by looking for objects that had equal values for a particular attribute. (This is effectively a form of relational join; a rough sketch appears below.)
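
To make the join concrete, here is a minimal Java sketch of the idea, under the assumption that each trace object carries a map of attributes; the TraceObject class and the splice method are illustrative stand-ins, not our actual tool or trace API.

    import java.util.*;

    // Sketch of splicing-by-join: objects from different sub-traces that share
    // the same value for a chosen attribute (e.g. NAME or PATH) get linked.
    // TraceObject and its attribute map are hypothetical stand-ins.
    class SpliceSketch {
        static class TraceObject {
            final String id;                  // unique id within its sub-trace
            final Map<String, String> attrs;  // e.g. {"NAME": "atlas.img", "PATH": "/data/atlas.img"}
            TraceObject(String id, Map<String, String> attrs) {
                this.id = id;
                this.attrs = attrs;
            }
        }

        // Returns pairs of object ids that should be spliced, joining on joinAttr.
        static List<String[]> splice(List<TraceObject> objects, String joinAttr) {
            Map<String, List<TraceObject>> byValue = new HashMap<>();
            for (TraceObject o : objects) {
                String v = o.attrs.get(joinAttr);
                if (v != null) {
                    byValue.computeIfAbsent(v, k -> new ArrayList<>()).add(o);
                }
            }
            List<String[]> links = new ArrayList<>();
            for (List<TraceObject> group : byValue.values()) {
                for (int i = 1; i < group.size(); i++) {
                    links.add(new String[] { group.get(0).id, group.get(i).id });
                }
            }
            return links;
        }
    }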

Names and splicing

This splicing thus requires an identifier (or name) to join on.

Our system, when running and fully online, uses serial numbers ("pnode numbers", see PassTerminology) as the ultimately unique identifiers, because no other names are necessarily unique over time. However, the challenge workload neither overwrites nor renames files, so we could and did use file and path names instead.

We wrote our converters to extract the full unique identifier for an object as its "path" name, and the shorter identifier corresponding to something in the workload definition as the "file" name. In the case of the PASOA data and our own data, the file names are the file names used to provide the sample data files in the challenge definition. The MINDSWAP data, however, uses the abstract names as found in the workload description (like "AtlasHeader"), and furthermore jumbles them up with other information in a way that's a hassle to extract cleanly. Some of the names are a little weak, too; the output images are all, if stripped of their UUIDs and other non-readable bits, simply called "Graphic".

This means that while we can splice the PASOA data to our own data, we cannot splice the MINDSWAP data in without manually providing a list of corresponding equivalent names. This feels like cheating, so we haven't done it. We can splice the MINDSWAP data to itself, however, and have done so.

The PASOA data uses URIs to name the objects it works with; we extracted these as the path names. Unfortunately, as it turns out, they do not differ between the two workload executions, so the two executions get spliced together in an improper way. This issue is presumably easily resolved in our converter, but so far we don't know exactly what the correct fix is. However, we don't consider it a serious problem.

The specific splicing mechanism we use is first to concatenate the three phases of each workload, splicing by NAME (file name), and then to concatenate the two workload runs, splicing by PATH (path name). With our own data, this allows objects that are shared between the workloads, such as the system shared libraries and the original input files, to be spliced together without conflating the objects that are per-workload.
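
In terms of the sketch above, the two-step procedure looks roughly like the following; spliceWorkloads and concatAll are hypothetical driver code, not the real concatenation tool.

    // Hypothetical driver for the two-step splice: join phases by NAME within
    // each workload run, then join the two runs by PATH.
    static List<String[]> spliceWorkloads(List<List<SpliceSketch.TraceObject>> run1Phases,
                                          List<List<SpliceSketch.TraceObject>> run2Phases) {
        List<SpliceSketch.TraceObject> run1 = concatAll(run1Phases);
        List<SpliceSketch.TraceObject> run2 = concatAll(run2Phases);

        List<String[]> links = new ArrayList<>();
        links.addAll(SpliceSketch.splice(run1, "NAME"));   // step 1: phases, by file name
        links.addAll(SpliceSketch.splice(run2, "NAME"));

        List<SpliceSketch.TraceObject> both = new ArrayList<>(run1);
        both.addAll(run2);
        links.addAll(SpliceSketch.splice(both, "PATH"));    // step 2: runs, by path name
        return links;
    }

    static List<SpliceSketch.TraceObject> concatAll(List<List<SpliceSketch.TraceObject>> phases) {
        List<SpliceSketch.TraceObject> all = new ArrayList<>();
        for (List<SpliceSketch.TraceObject> p : phases) {
            all.addAll(p);
        }
        return all;
    }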

Names and querying

Naming is also important for querying. All the queries except Q7 inherently rely on being able to name objects (whether files or processes/executions/services) as a starting point or to restrict the output, and the way we handle Q7 does too.

As already mentioned, the PASOA data uses the same names for files that we do. It does not use the same names for processes, but because process names do not need to be spliced in the challenge workload, this does not pose a serious problem. We were unwilling to customize the queries by hand for all possible combinations of data sets; however, by prepending short per-system-per-phase declarations of names to the queries, we were able to abstract this issue away.

The MINDSWAP data uses completely different names; we abstracted out the cases where the name of a specific object was wanted (such as query 1) but gave up on queries 5 and 6 because these require pattern-matching of names. Our query language does support pattern matching and it should be possible to implement these queries on the MINDSWAP data in our system, but in the framework we set up for handling the challenge and the various possible splicing combinations it would have been too messy.

In theory one could identify the objects one wants for queries by inspecting the shape of the ancestry graph rather than by any name attribute (that is, use the provenance as the name), but this is likely to be both expensive and not very robust once the database contains a lot of information. It is also not clear how one would distinguish certain pairs of files (e.g., atlas.img and atlas.hdr) based purely on their relationship to other objects in the system.

In summary, naming is a serious problem. Any attempt to merge, import, or combine provenance data will have to tackle this head-on, as will any attempt to standardize an interchange format.

PASOA conversion

In PASOA, interactions between services are recorded in interaction records. An interaction record consists of a key and P-assertions from both the sending service and the receiving service. There are three kinds of P-assertions: Interaction P-assertions, Relationship P-assertions, and Actor State P-assertions. We extracted all the information from the key and the Interaction P-assertions.

Services in PASOA are equivalent to processes in our model. From the interaction record key, we extracted the processes being "executed". From the interaction P-assertions, we extracted the arguments and inputs to the processes, the exectime of the process, and the result of the process. The arguments, inputs, and exectime were extracted from the invocationbean and the resulting file from the resultbean.
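
As a very rough illustration of this extraction, the following DOM-based Java sketch shows the general shape of the converter. The element names ("interactionKey", "invocationBean", "resultBean") are approximations of the structures we read from, not an exact rendering of the PASOA schema, and the real converter does considerably more bookkeeping.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import java.io.File;

    // Rough sketch of reading a PASOA interaction record.  Element names are
    // approximate; the real converter handles the schema and namespaces properly.
    class PasoaSketch {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(new File(args[0]));

            // The interaction record key identifies the process (service) executed.
            String process = firstText(doc, "interactionKey");

            // The Interaction P-assertions carry the invocation (arguments, inputs,
            // execution time) and the result (the output file).
            String invocation = firstText(doc, "invocationBean");
            String result     = firstText(doc, "resultBean");

            System.out.printf("process=%s invocation=%s result=%s%n",
                              process, invocation, result);
        }

        static String firstText(Document doc, String tag) {
            NodeList nodes = doc.getElementsByTagNameNS("*", tag);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
        }
    }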

We could have extracted the inputs to a program from the Relationship P-assertion's DataAssertion field, but we chose to extract all information from the Interaction P-assertions for simplicity. We did not use PASOA's Actor State P-assertions.

PASOA has a lot of redundant data: the same data is recorded in multiple views/P-assertions. For the translation, we used only the Interaction P-assertions and did not use the Relationship and Actor State P-assertions. We also had to filter the data through our analyzer to eliminate duplicate dependencies. However, we do understand that the multiple recording of data is fundamental to the PASOA architecture/approach.

MINDSWAP conversion

In MINDSWAP, ServiceExecutions correspond to process executions in PASS. The input files to a process are extracted from hasInputImage, hasInputHeader, and hasInputParameters. The output files from a process are extracted from the hasOutput* fields. The arguments to the process are extracted from hasTextInputParameters, and the execution time is extracted from the timeRun field.
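
For illustration, a minimal Jena-based sketch of pulling these fields out of the MINDSWAP RDF might look like the following; our actual converter is an ad hoc tool, and the property handling here is simplified.

    import org.apache.jena.rdf.model.*;

    // Illustrative sketch: walk the MINDSWAP RDF and emit one record per
    // (service execution, attribute, value) triple of interest.  Simplified;
    // the real converter writes our provenance trace format instead.
    class MindswapSketch {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.read(args[0]);   // path or URL of the MINDSWAP challenge RDF file

            StmtIterator it = m.listStatements();
            while (it.hasNext()) {
                Statement s = it.next();
                String exec = s.getSubject().toString();       // the ServiceExecution
                String p    = s.getPredicate().getLocalName();
                String v    = s.getObject().toString();

                if (p.startsWith("hasInput"))                 emit(exec, "INPUT", v);
                else if (p.startsWith("hasOutput"))           emit(exec, "OUTPUT", v);
                else if (p.equals("hasTextInputParameters"))  emit(exec, "ARGV", v);
                else if (p.equals("timeRun"))                 emit(exec, "TIME", v);
            }
        }

        static void emit(String exec, String attr, String value) {
            System.out.printf("%s %s %s%n", exec, attr, value);
        }
    }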

The MINDSWAP data also explicitly assigns a "stage" field to each service execution. We could have imported this data and used it to write a Q3 that was fundamentally different in nature from Q2. (Our system does not normally have explicit workloads or stages. This is discussed on our first challenge page: PASS.) However, we didn't get around to this, and in fact only got the MINDSWAP conversion working at all at the last minute.

Other conversion issues

Other data conversion issues we encountered:

  • The PASOA data includes duplicate records that, converted straightforwardly, made our database unhappy. Fortunately, we have tools for dealing with duplicate records, because our collection technique generates them with wild abandon, so we just ran the converted data through our "analyzer" tool. (See PassTerminology.)

  • Our environment, being Unix-based, expects process arguments to be vectors of strings, and our query engine includes logic for matching against these. Both the PASOA and MINDSWAP data, however, provide arguments as single strings. In order to support the queries that require argument matching (Q4 and Q6) we had to split these on whitespace; see the sketch after this list. This is not necessarily robust, as in the wild one might encounter program arguments with embedded spaces.

  • The MINDSWAP data also had apparently incorrect argument information, containing -m -12 instead of -m 12.

  • The PASOA modified workload apparently takes an additional step to convert the output JPEG files to GIF files. This breaks our querying logic, which was based on the assumption that the two runs can be distinguished based on the names of the output files. Our query language is sufficiently expressive to allow drawing this distinction based on ancestry relationships (that is, for Q1, instead of starting from all files matching atlas-x.gif, exclude any that were created from JPEG files), but we did not take this step because we did not want to have to write different queries by hand for every combination of data files. So the Q1 output returns the ancestry of the Atlas X Graphic from both workloads rather than, as we intended, from the original workload only.

  • Both the PASOA and MINDSWAP data sets are lacking the annotations used to drive Q8 and Q9, so these queries return no results when the first phase data comes from these sources. (Also, the annotations specified in the challenge definition are such that Q9 returns no data in any event; we added a variant query Q9a as a workaround.)

  • The PASOA timestamps appear to be in milliseconds, rather than the more usual seconds, since the traditional Unix epoch. This was easily converted (see the sketch after this list)... once we noticed.
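
For reference, the argument and timestamp handling mentioned in the list above amounts to something like the following sketch (hypothetical helper methods, not the actual converter code):

    // Split a single argument string into a Unix-style argv.  As noted above,
    // this is not robust against arguments containing embedded spaces.
    static String[] toArgv(String singleArgString) {
        return singleArgString.trim().split("\\s+");
    }

    // Convert a PASOA timestamp (milliseconds since the Unix epoch) to the
    // seconds-since-epoch convention our system uses.
    static long toUnixSeconds(long pasoaTimestampMillis) {
        return pasoaTimestampMillis / 1000;
    }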

Query text and output

The query text and output (and more pictures) are posted on our web site.

Pictures

Graphical representation of Query 1, on PASS-PASOA-PASS. The dotted lines are splices between different appearances of the same object.

And on PASOA-PASS-PASOA:

You will note that, as discussed above, with the stage 3 PASOA data, both the first and second workloads show up together, and that the entities from the first workload are spliced to the entities from the second workload, even though they should not be.

On MINDSWAP-MINDSWAP-MINDSWAP:

These were done with graphviz. The pictures for PASS-PASS-PASS and PASOA-PASOA-PASOA, and larger versions of the above, are also posted on the results page of our web site. On the PASS-PASS-PASS image it is very easy to see where the splices between the phases are.

Commentary on conversions

In general, the conversions were fairly easy; most of the problems involved understanding the representation of the foreign data rather than trying to apply complex semantic conversions.

The query results on the foreign data are not exactly the same as the results on our own data, especially since there are still some glitches in the conversions, but they are effectively equivalent.

Our query results on our own data contain considerably less noise than the first challenge results, due to technical improvements in PASSv2's representation and query handling, but they're broadly comparable.

Benchmarks

We haven't really proposed any benchmark queries as such.

However, we think one important class of benchmarks involves scalability. Many of the workloads we test on are compiles of software packages. Those of you who have looked at our results from the FirstProvenanceChallenge, or at the data we posted for this challenge, may remember that we included the compilation of the AIR suite. The amount of provenance from this compilation dwarfs the challenge workload by nearly two orders of magnitude, and by the standards of software packages it is fairly small. One of our standard workloads is the compile of GNU awk; the provenance for this is ten times the size of the AIR compile. Other packages (Mozilla, for example) are vastly larger yet.

While such workloads are not the core mission of most of the projects involved in the challenge, it is not clear a priori that workloads of this size will not arise in other application domains. The large and complex builds of packages like Mozilla or the Linux kernel are made possible by workflow management tools. As people in other domains start to take advantage of the workflow management abilities of the Grid projects, and learn to use them effectively, their workflows will probably become large too, especially once portions of the workflow definitions start getting automatically generated. (The most evil workload we know of is the configure script from the am-utils package, which is emitted wholesale by a macro processor.)

Our Usenix paper from last year includes some scalability results for PASSv1. Our work so far on PASSv2 suggests that it is, as intended, at least as scalable as PASSv1 and better in a number of respects. Details to follow, hopefully, in a future paper.

Another related property is aging: as the system accumulates more and more workloads in its database, how much does performance, particularly query performance, degrade? Because in most cases we would like to retain information about old runs indefinitely, the database will grow over time and searches will tend to become slower and slower.

It has occurred to us recently that while the path expressions in our query language result in linear sequences of result objects, one might want to paste these paths together. (For example, "find all objects descended from something matching predicate P that are also descended from something matching predicate Q.") This essentially amounts to join operations on paths, and will likely be phenomenally expensive. A reference set of really complex search queries might thus be a useful benchmark.
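
To make concrete what such a join amounts to, here is a deliberately naive sketch: compute each descendant set by traversing the ancestry graph, then intersect the results. The graph representation and the predicate sets are hypothetical, and a real engine would need to be far cleverer for this to be affordable.

    import java.util.*;

    // Naive sketch of "descended from something matching P and also from
    // something matching Q".  children maps each object to its immediate
    // descendants; matchesP/matchesQ are the objects satisfying each predicate.
    class PathJoinSketch {
        static Set<String> descendantsOf(Map<String, List<String>> children, Set<String> roots) {
            Set<String> seen = new HashSet<>();
            Deque<String> work = new ArrayDeque<>(roots);
            while (!work.isEmpty()) {
                String node = work.pop();
                for (String child : children.getOrDefault(node, Collections.emptyList())) {
                    if (seen.add(child)) {
                        work.push(child);
                    }
                }
            }
            return seen;
        }

        static Set<String> joinPaths(Map<String, List<String>> children,
                                     Set<String> matchesP, Set<String> matchesQ) {
            Set<String> fromP = descendantsOf(children, matchesP);
            Set<String> fromQ = descendantsOf(children, matchesQ);
            fromP.retainAll(fromQ);   // intersection: descended from both
            return fromP;
        }
    }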

And finally, while it's difficult to benchmark it, it would be nice to have some kind of framework or mechanism for discussing the expressivity of provenance query languages.

Further Comments

While this challenge has offered us an unparalleled opportunity to debug our tools, the splicing operation that connects up the three parts of the workload is, in our view, fairly unnatural.

In the long run, the interoperability needs of the community will probably be better expressed in terms of two other operations:

  • import: take an output file from another project, and its provenance, integrate these into one's own system, run an additional workload step natively on the imported file, and then query over both the imported and locally generated provenance.
  • merge: given the provenance collected by two different systems from the same workload step, synthesize it into a single set of provenance that reflects the union of the information collected by both systems.

While an import, as thus defined, is quite similar to the splicing in this challenge, it interacts with names and naming in a different way, and based on our experience with this challenge, that difference could prove very significant.

Conclusions

We believe that, at a high level, most or all of our data models are interoperable. They are all ultimately recording operations with inputs and outputs. Most of the issues are like the issue with the PASOA timestamps: routine matters of converting data representations that cause problems only when the representations are not fully agreed upon by all parties.

Naming, however, is fundamental, and a serious problem, and some standardization of naming practices will be necessary for long-term interoperability.

-- PassProject - 25 Jun 2007