Second Provenance Challenge: VisTrails

Participating Team

Short team name: VisTrails
Participant names: Erik Anderson, Steven Callahan, Tommy Ellkvist, Juliana Freire, David Koop, Emanuele Santos, Carlos Scheidegger, Claudio Silva, Nathan Smith and Huy Vo
Project URL: http://www.vistrails.org/
First challenge results: VisTrails
Presentation

Differences from First Challenge

We have changed the structure of our provenance representation to generalize and better organize our data, but the data stored is roughly equivalent to our previous representation. The schemas and data are provided below. Recall that we store workflow evolution in a vistrail, a tree of actions in which each node represents a (possibly partial) workflow. To allow easier integration with other systems, we have also materialized the individual workflow specifications for the three parts.

We split our original workflow into three individual workflows to better reflect the independence of the parts. In addition, because the AIR tools depend on a (.hdr, .img) pair of files, the workflows are slightly restructured so that module inputs and outputs are also paired using a FileSet module.

Provenance Data for Workflow Parts

The provenance data is split into three layers (workflow evolution, workflows, and execution). The schemas for these layers are available:

vistrail.xsd - workflow evolution actions
workflow.xsd - workflow specification
log.xsd - workflow execution information

The data corresponding to these layers:

pc_vt.xml stores the workflow evolution (you can materialize workflows from this data)
pc_part1.xml is the materialized workflow for part 1
pc_part2.xml is the materialized workflow for part 2
pc_part3a.xml is the materialized workflow for part 3 (first version)
pc_part3b.xml is the materialized workflow for part 3 (second version)
pc_log.xml is the execution information

Note that teams may decide to use the vistrail data or the four materialized workflows for the challenge; the four workflows constitute a subset of the workflows contained in the vistrail. Please refer to the previous challenge for documentation on the system design.

Model Integration Results

We have successfully performed most queries using data from VisTrails, MyGrid, and Southampton. We have included our own system because our new query API is general and not native to VisTrails.

Model comparison

The VisTrails and MyGrid models were easy to use because of their simple data formats. The generalized model of Southampton presented a greater challenge because of its many levels of nesting and abstraction. VisTrails required both the execution log and the workflow definition for the provenance queries, whereas MyGrid and Southampton needed only the execution log. Finally, VisTrails supports a third level of provenance, the workflow evolution layer. Although we have not used it for this API, it has many benefits when answering queries about differences between workflows.

VisTrails

MyGrid

Southampton

The answers obtained varied depending on which information was available. For example, using the VisTrails format, it was not possible to obtain intermediate data items because they are not recorded; in this case the closest answer was the set of module executions. The queries required the data to contain at least the module executions, the connections between them, and the required annotations. These were all present in the models, except for a few missing annotations in Southampton and MyGrid.

VisTrails uses a normalized data model and needs both the execution log and the workflow definition. MyGrid's execution log can be used without the workflow definition and contains derivation relationships between data items, which means the data contains redundant information. Southampton models some security features that may be useful but make the data larger and more complex.

Concepts

The concept of a data item varies between systems. It can be represented as the data exchanged between modules, the inputs or outputs of a workflow, or a file reference passed between modules. The concept of parameters, which are used in VisTrails to modify modules, does not exist in the other models. MyGrid uses something similar to edit the parameters of modules (such as setting the file name to save to). This concept is not clearly defined. Southampton has the concept of an assertion, where every module or service records its own view of the process. This concept does not exist in the other systems and is not used in our provenance queries, but it might be important for validating results.

Other concepts, such as modules, connections, and executions, are the same across systems, although most of them have different names.

Method

Our method uses wrappers to translate queries between a common data model and the source data. First, we defined a high-level general model that captures the basic concepts of workflows and their executions, making it possible to express queries over the different models. Second, we defined API functions for the wrappers that use this model. Finally, we implemented the wrappers and constructed the queries.

This challenge sought to address how provenance from different systems can be connected. However, there was no requirement for data products to be consistently identified. Thus, in order to connect provenance across different systems, we had to manually identify the mapping between output data from one workflow and input data for the next. This naming is an important consideration when coordinating workflows across different systems. One solution is to use more general identifiers such as LSIDs or some other standard identifier.
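
The manual mapping described above can be pictured as a small lookup table between namespaced identifiers. The following is only a sketch: the namespace prefixes follow those that appear in our query results (myg1, pas2, vt3), but the data item names, the SAME_AS table, and the helper functions are hypothetical and are not taken from the challenge data or from our API.

```python
# Hypothetical hand-written mapping between data items that are the same
# file but carry different identifiers in different provenance stores.
SAME_AS = {
    ("myg1", "reslice1_out"): ("pas2", "softmean_in1"),
    ("pas2", "softmean_out"): ("vt3", "slicer_in"),
}


def resolve(ns, local_id):
    """Follow SAME_AS links until the identifier has no further mapping."""
    seen = set()
    key = (ns, local_id)
    while key in SAME_AS and key not in seen:
        seen.add(key)
        key = SAME_AS[key]
    return key


def qualified(ns, local_id):
    """Render a namespaced identifier in the 'ns:id' form."""
    return f"{ns}:{local_id}"


print(qualified(*resolve("pas2", "softmean_out")))
```

With shared identifiers such as LSIDs, a table like this would not need to be written by hand.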

Translation Details

Scientific Workflow Provenance Data Model (SWPDM)

The SWPDM (shown above) is a general provenance model that aims to capture entities and relationships that are relevant to both the definition and execution of workflows. The goal is to define a general model that is able to represent provenance information obtained by different workflow systems.

The API

Our model is instantiated as a query API that operates on the concepts in the model. Vertices are modeled as objects and edges as operations on these objects. There are also more complex operations that traverse more than one edge; these are used to model common provenance query operations.
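
To illustrate the idea of vertices as objects and edges as operations, here is a minimal sketch. It is not the code in api.zip: the Execution class, add_input, and predecessors are hypothetical names, and only the name upstream is borrowed from our API. A single-edge operation returns direct neighbours, while upstream follows edges repeatedly.

```python
class Execution:
    """A vertex in the model: one module execution (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self._inputs = []  # executions whose output this one consumed

    def add_input(self, other):
        self._inputs.append(other)

    def predecessors(self):
        """Single-edge operation: the direct upstream executions."""
        return list(self._inputs)

    def upstream(self):
        """Multi-edge operation: every (producer, consumer) edge reachable upstream."""
        edges, seen, stack = [], set(), [self]
        while stack:
            node = stack.pop()
            if node.name in seen:
                continue
            seen.add(node.name)
            for pred in node.predecessors():
                edges.append((pred.name, node.name))
                stack.append(pred)
        return edges


align = Execution("align_warp1")
reslice = Execution("reslice1")
softmean = Execution("softmean")
reslice.add_input(align)
softmean.add_input(reslice)

for producer, consumer in softmean.upstream():
    print(f"{producer} --> {consumer}")
```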

Implementation

This API is implemented as wrappers on top of the different data models. The wrapper functions translate each query into a native query on the source. Currently, VisTrails and Southampton use XML with XPath as the access method, so the queries are translated into XPath expressions. MyGrid uses RDF/XML on a SPARQL server, so the queries are translated into SPARQL expressions.
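
As a rough illustration of this translation step, the sketch below turns a condition list of the form used in our queries, such as [('outputName', 'eq', 'atlas-x.gif')], into an XPath expression and into a SPARQL string. Everything else is an assumption made for the example: the XML layout does not follow log.xsd, the urn:example: predicate prefix is invented, and to_xpath and to_sparql are hypothetical helpers rather than functions from our wrappers. The SPARQL string is only built, not sent to a server.

```python
import xml.etree.ElementTree as ET

# Hypothetical execution log; the real log.xsd layout is not reproduced here.
LOG = ET.fromstring("""
<log>
  <exec id="7" module="convert" outputName="atlas-x.gif"/>
  <exec id="8" module="convert" outputName="atlas-y.gif"/>
</log>
""")


def to_xpath(element, conditions):
    """Translate [(attr, op, value), ...] into an ElementTree XPath string."""
    parts = []
    for attr, op, value in conditions:
        if op != "eq":  # only equality is sketched here
            raise ValueError(f"unsupported operator: {op}")
        parts.append(f"[@{attr}='{value}']")
    return ".//" + element + "".join(parts)


def to_sparql(conditions):
    """Translate the same conditions into a SPARQL SELECT string."""
    patterns = " ".join(
        f'?x <urn:example:{attr}> "{value}" .' for attr, op, value in conditions
    )
    return f"SELECT ?x WHERE {{ {patterns} }}"


conditions = [("outputName", "eq", "atlas-x.gif")]
query = to_xpath("exec", conditions)
matches = LOG.findall(query)
print(query, [m.get("id") for m in matches])
print(to_sparql(conditions))
```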

Using a combination of data sources (MyGrid->Southampton->VisTrails), we can now query the data using the API:

  r2 = pqf.getAllAnnotated(pModuleInstance,[('outputName', 'eq', 'atlas-x.gif')])
  prov = r2[0].getExecutionFromInstance()[0].upstream()

We then get the result:

  vt3:4 --> vt3:7
  vt3:1 --> vt3:4
  vt3:0 --> vt3:1
  pas2:http://relation.org/softmean --> vt3:0
  myg1:urn:www.mygrid.org.uk/process#reslice1 --> pas2:http://relation.org/softmean
  myg1:urn:www.mygrid.org.uk/process#reslice2 --> pas2:http://relation.org/softmean
  myg1:urn:www.mygrid.org.uk/process#reslice3 --> pas2:http://relation.org/softmean
  myg1:urn:www.mygrid.org.uk/process#reslice4 --> pas2:http://relation.org/softmean
  myg1:urn:www.mygrid.org.uk/process#align_warp1 --> myg1:urn:www.mygrid.org.uk/process#reslice1
  myg1:urn:www.mygrid.org.uk/process#align_warp2 --> myg1:urn:www.mygrid.org.uk/process#reslice2
  myg1:urn:www.mygrid.org.uk/process#align_warp3 --> myg1:urn:www.mygrid.org.uk/process#reslice3
  myg1:urn:www.mygrid.org.uk/process#align_warp4 --> myg1:urn:www.mygrid.org.uk/process#reslice4

This is the execution provenance trace of the file atlas-x.gif.

Benchmarks

The benchmark uses Query 1 (upstream of AtlasXGraphic), a good general upstream query that returns the upstream module executions. The data files are too small for a meaningful benchmark, but we have timed the query on each of the systems.

MyGrid

opn = 'urn:www.mygrid.org.uk/process#convert1_out_AtlasXGraphic'
rl = pqf.getNode(pOutputPort, opn, store3.ns).getDataFromOutPort()[0].getExecutionFromOutData()[0].upstream()

1 sec

VisTrails

ar = [('outputName', 'eq', 'atlas-x.gif')]
r1 = pqf.getAllAnnotated(pModule,ar)[0].upstream()

0.1 sec

Southampton

odn = 'challenge/atlas-x.gif'
rl = pqf.getNode(pDataItem, odn, store3.ns).getExecutionFromOutData().upstream()

1 sec

Benchmark results

Although these times are very short, two main factors seem to influence the result: the query engine used and the size of the data. VisTrails is fastest, using an XPath processor on a small amount of data. The MyGrid data file is small, but it is queried through a SPARQL server, which is slower than XPath. Southampton uses XPath but has large data files. These results include initialization of the wrapper and, for Southampton, some extra pre-processing to calculate the data links, but these have biased the results by at most a factor of 2.
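
Because the reported times mix wrapper initialization with query execution, one way to separate the two would be a small timing harness like the sketch below. This is not how the numbers above were measured; time_query, timed_phases, and the stand-in wrapper are hypothetical, and the harness uses only the standard time module.

```python
import time


def time_query(run_query, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        best = min(best, time.perf_counter() - start)
    return best


def timed_phases(init_wrapper, run_query):
    """Time wrapper initialization and query execution separately."""
    start = time.perf_counter()
    wrapper = init_wrapper()
    init_seconds = time.perf_counter() - start

    start = time.perf_counter()
    result = run_query(wrapper)
    query_seconds = time.perf_counter() - start
    return init_seconds, query_seconds, result


# Stand-ins for a real wrapper and query, for demonstration only.
init_s, query_s, result = timed_phases(
    lambda: {"atlas-x.gif": ["convert", "slicer"]},
    lambda wrapper: wrapper["atlas-x.gif"],
)
print(f"init {init_s:.6f} s, query {query_s:.6f} s, result {result}")
```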

Conclusions

In the general case, tracking provenance through different systems is a data integration problem. However, by defining a common model (SWPDM) on a restricted domain (scientific workflows), the difficulty is reduced to efficiency and entity resolution problems. We believe that it should be possible for the scientific workflow community to support a model similar to the SWPDM to enable provenance to be tracked through their systems. We have shown that an API for querying this model can be built and that it is compatible with three of the current systems.

Problems for discussion:

How can these systems be connected? The data needs to support references to other models, for example when a data item is stored externally and tracked through another provenance store. Common identifiers such as LSIDs might be part of the solution. External data items should also be given a namespace to indicate where they came from.

Is there a way to come up with common concepts for data items? They are used in many layers and have different meanings.

How can a user easily express these kinds of queries?

Query complexity - Relational algebra cannot express these kinds of provenance queries because they require transitive closure.
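
To make the last point concrete: plain relational algebra has no recursion operator, but the recursive common table expressions introduced in SQL:1999 do provide one, and SQLite supports them. The sketch below is not part of our API. The edge table and its rows are hypothetical, loosely following the module names in the trace above, and the query computes everything upstream of one module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (producer TEXT, consumer TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [
        ("align_warp1", "reslice1"),
        ("reslice1", "softmean"),
        ("softmean", "slicer"),
        ("slicer", "convert"),
    ],
)

# Recursive CTE: start from the direct producers of the target, then keep
# adding producers of anything already found until nothing new appears.
UPSTREAM = """
WITH RECURSIVE up(name) AS (
    SELECT producer FROM edge WHERE consumer = ?
    UNION
    SELECT e.producer FROM edge AS e JOIN up ON e.consumer = up.name
)
SELECT name FROM up
"""

ancestors = sorted(row[0] for row in conn.execute(UPSTREAM, ("convert",)))
print(ancestors)
```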

-- TommyEllkvist - 21 Jun 2007

api.zip:

Attachment	Action	Size	Date	Who	Comment
pc_vt.xml	manage	76.6 K	23 Feb 2007 - 00:20	JulianaFreire
pc_part1.xml	manage	12.8 K	23 Feb 2007 - 01:05	JulianaFreire
pc_part2.xml	manage	4.0 K	23 Feb 2007 - 01:05	JulianaFreire
pc_part3a.xml	manage	5.1 K	23 Feb 2007 - 01:05	JulianaFreire
pc_part3b.xml	manage	5.7 K	23 Feb 2007 - 01:06	JulianaFreire
pc_log.xml	manage	11.3 K	23 Feb 2007 - 00:22	JulianaFreire
vistrail.xsd	manage	6.4 K	23 Feb 2007 - 00:23	JulianaFreire
workflow.xsd	manage	3.5 K	23 Feb 2007 - 00:24	JulianaFreire
log.xsd	manage	2.7 K	23 Feb 2007 - 00:24	JulianaFreire
model.png	manage	28.3 K	21 Jun 2007 - 08:50	JulianaFreire
model_mygrid.png	manage	22.7 K	21 Jun 2007 - 08:56	JulianaFreire
model_southampton.png	manage	13.2 K	21 Jun 2007 - 08:56	JulianaFreire
model_vistrails.png	manage	22.5 K	21 Jun 2007 - 08:57	JulianaFreire
vt_prov_challenge_present.ppt	manage	711.0 K	02 Jul 2007 - 16:03	JulianaFreire	VisTrails Second Provenance Challenge Presentation
api.zip	manage	21.0 K	14 Aug 2008 - 16:39	JulianaFreire	API source files