Understanding the Scientist’s Intent associated with a Workflow Experiment

Scenario Authors:

Edoardo Pignotti, Peter Edwards (University of Aberdeen); Gary Polhill, Nick Gotts (Macaulay Research Institute)

Brief Summary:

Recent activities in the field of social simulation (Polhill et al., 2007) indicate that there is a need to improve the scientific rigour of agent-based modelling. Results gathered from possibly hundreds of thousands of simulation runs cannot conveniently be reproduced in a journal publication; equally, the source code of the simulation model and full details of the model parameters used are not journal publication material. We have identified the following activities as relevant in such situations:

  1. Being able to access the results, to check that the authors’ claims based on those results are justifiable;
  2. Being able to re-run the experiments to check that they produce broadly the same results;
  3. Being able to manipulate the simulation model parameters and re-run the experiments to check that there is no undue sensitivity of the results to certain parameter settings;
  4. Being able to understand the conditions in which the experiment was carried out.
Workflow technologies have been used in this context to facilitate the design, execution, analysis and interpretation of simulation experiments and exploratory studies. However, workflow technologies alone can only support activities 1, 2 and 3; in order to support activity 4, a richer provenance record is required, capturing the conditions in which an experiment was carried out. We focus our attention on this specific aspect of provenance.

Scenario Diagram:

Diagram

The diagram above presents an example using a virus model developed in NetLogo, an agent-based model that simulates the transmission and perpetuation of a virus in a human population. An experiment using this model might involve studying the differences between different types of virus in a specific environment. A researcher wishing to test the hypothesis “Smallpox is more infectious than bird flu in environment A” might run a set of simulations using different random seeds. If, in this set of simulations, the smallpox virus demonstrated greater transmissibility than bird flu in a significant number of simulation runs, the experimental results could be used to support the hypothesis.

Representing this experiment with workflow technologies alone has some limitations, as they are unable to capture the scientist’s goals and constraints (the scientist’s intent) associated with the experiment. To illustrate, consider the following scenario: the goal of the experiment is to obtain significant simulation results that support the hypothesis. Imagine that the researcher knows the simulation model can generate out-of-bounds results (i.e. results that fall outside the acceptable range), and that these results cannot be used in the significance test. For this reason, we do not know a priori how many simulation runs per comparison are needed. Too few runs will mean that the experiment returns inconclusive data, while too many runs will waste computing resources executing unnecessary simulations.

There may also be constraints associated with the workflow (or specific activities within the workflow) depending upon the intent of the scientist. For example, a researcher may be concerned about floating-point support on different operating systems; if the significance test activity runs on a platform that does not conform to the IEEE 754 specification, the results of the simulation could be compromised. A researcher might also be interested in detecting and recording special conditions (e.g. a particularly transmissible virus) during the execution of the workflow to support the analysis of the results.
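The run-until-enough-valid-results strategy described above can be sketched as follows. This is a minimal illustration, not the actual NetLogo experiment: the acceptable range, the `run_simulation` stand-in, and all scores are invented for the example.

```python
import random

# Hypothetical acceptable range for the model's output metric (an assumption).
ACCEPTABLE_RANGE = (0.0, 1.0)

def run_simulation(virus, seed):
    """Stand-in for a NetLogo model run; returns a made-up transmissibility score."""
    rng = random.Random(f"{virus}-{seed}")  # deterministic per (virus, seed)
    base = 0.6 if virus == "smallpox" else 0.4
    return rng.gauss(base, 0.3)

def collect_valid_runs(virus, needed, max_runs=1000):
    """Launch runs with fresh seeds until `needed` in-bounds results are
    collected, discarding out-of-bounds ones; the total number of runs
    is not known a priori, only bounded by `max_runs`."""
    results, seed = [], 0
    while len(results) < needed and seed < max_runs:
        score = run_simulation(virus, seed)
        if ACCEPTABLE_RANGE[0] <= score <= ACCEPTABLE_RANGE[1]:
            results.append(score)
        seed += 1
    return results

smallpox = collect_valid_runs("smallpox", needed=30)
bird_flu = collect_valid_runs("bird_flu", needed=30)
# A significance test would then compare the two samples.
```

Note that the number of seeds consumed by `collect_valid_runs` varies with how many runs fall out of bounds, which is exactly why the run count cannot be fixed in advance.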

Users:

In this particular scenario, the users are simulation scientists performing and sharing simulation experiments using workflow technologies. More generally, they are any scientists using such technologies to perform and share workflow experiments.

Requirement for provenance:

In this scenario provenance is required in order to make simulation experiments more transparent by providing documentation about scientific analyses and processes. Provenance documentation is particularly important here in order to understand and reproduce experimental processes described in publications. Provenance should enable users to understand, verify, reproduce, and ascertain the quality of data products generated by processes; as a consequence, we argue that intent information should be captured as part of the provenance documentation.

Diagram

To better understand this concept, let us consider the “toy” provenance example presented in the diagram above, using the OPM representation. Assume the agent “John” is driven by a goal (“bake a cake of an acceptable quality”) and subject to a constraint (“if the dough is too runny, you need to add more flour”); the resulting provenance graph may look like the one presented in the figure above. An additional 20g of flour was used by “John” during the baking process as a result of that constraint being violated. From this representation of provenance alone, it is not possible to understand why the additional 20g of flour was used during the baking process.
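One way to see the gap is to write the example down as data. The sketch below pairs OPM-style edges (the relation names `wasGeneratedBy`, `used` and `wasControlledBy` come from OPM; everything else, including the `justification` mapping, is an illustrative assumption) with the intent that explains them:

```python
# OPM-style provenance edges for the cake example: (effect, relation, cause).
provenance = [
    ("cake", "wasGeneratedBy", "baking"),
    ("baking", "wasControlledBy", "John"),
    ("baking", "used", "flour_20g"),
]

# The intent record — this is what plain OPM does not capture.
intent = {
    "goal": "bake a cake of an acceptable quality",
    "constraints": ["if the dough is too runny, you need to add more flour"],
}

# Hypothetical extension: link a process edge to the constraint whose
# violation caused it, so the 'why' becomes recoverable from the record.
justification = {
    ("baking", "used", "flour_20g"):
        "violated constraint: 'if the dough is too runny, you need to add more flour'",
}
```

With only the `provenance` list, the extra 20g of flour is an unexplained input; the `justification` entry is what turns the graph into an answer to "why?".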

Provenance Questions:

The following provenance questions related to the scientist’s intent have been identified:

  1. What was the intent of the scientist while running this experiment?
  2. Have the goals of this experiment been achieved?
  3. Did the experiment violate any of the constraints defined by the scientist?
  4. What decisions were made while executing this experiment?
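If intent were recorded alongside provenance, each of the four questions above could be answered by a simple query over the record. The sketch below assumes a hypothetical intent-enriched log format (all field names are invented for illustration):

```python
# Hypothetical intent-enriched provenance log; field names are assumptions.
log = {
    "intent": {
        "goal": "obtain significant results supporting the hypothesis",
        "constraints": ["run the significance test on an IEEE 754-compliant platform"],
    },
    "events": [
        {"type": "constraint_check",
         "constraint": "run the significance test on an IEEE 754-compliant platform",
         "violated": False},
        {"type": "decision", "description": "discarded out-of-bounds run 17"},
        {"type": "goal_status", "achieved": True},
    ],
}

def intent_of(log):        # Q1: what was the scientist's intent?
    return log["intent"]

def goals_achieved(log):   # Q2: have the goals been achieved?
    return any(e.get("achieved") for e in log["events"] if e["type"] == "goal_status")

def violated_constraints(log):  # Q3: were any constraints violated?
    return [e["constraint"] for e in log["events"]
            if e["type"] == "constraint_check" and e["violated"]]

def decisions(log):        # Q4: what decisions were made during execution?
    return [e["description"] for e in log["events"] if e["type"] == "decision"]
```

The point is not this particular schema but that each question maps onto a distinct part of the record: intent metadata for Q1, goal-status events for Q2, constraint checks for Q3, and decision events for Q4.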

Technologies Used:

A scientific workflow system is required in this scenario in order to design and execute a simulation experiment using a pool of available local and Grid services.

Various Grid services are required in order to perform specific simulation tasks, e.g. executing a model run or a significance test. In addition, Grid services are required to provide metadata about their characteristics and about their execution.

A repository is required to store the metadata generated by the workflow (and its associated grid services) during the execution of an experiment; this should also include provenance metadata.

This scenario requires ontologies to describe different aspects of a workflow experiment: data/process provenance, service metadata, service run-time metadata, workflow activities.

References:

Polhill, J. G. and Gotts, N. M. (2007). Evaluating a prototype self-description feature in an agent-based model of land use change. In Amblard, F., editor, Proceedings of the Fourth Conference of the European Social Simulation Association, September 10-14, 2007, Toulouse, France, pages 711–718.

-- EdoardoPignotti - 15 May 2010