Provenance Challenge: KCL
Participating Team
Team and Project Details
- Short team name: KCL
- Participant names: Simon Miles
- Project URL:
- Project Overview:
- Relevant Publications:
Quick Overview
Here we provide a quick overview of the system used to generate the provenance, to make the contents of the OPM model clearer for other participants.
The system, SourceSource, aims to record provenance from the execution of Java programs, and from their use of non-Java services, in a way that requires no changes to the source code and minimal set-up, while keeping apparent what will be recorded.
To this end, the source code is transformed into a self-documenting form in a pre-compilation step. Plug-ins can be inserted to handle services that are to be treated as black boxes, parsing their inputs and outputs to add to the provenance graph, e.g. a database with SQL statements as inputs. SourceSource is built upon a commonly used program transformation tool, TXL.
The SourceSource executables will be uploaded and the internal provenance model described soon.
Workflow Representation
The workflow is the Java program exactly as supplied by Yogesh. It is pre-compiled, then run using the JVM as normal.
Open Provenance Model Output
The OPM output is available here. It is formatted following the example XML in schema v1.01.a provided by Paul Groth and Luc Moreau at the OpenProvenance website.
A note about the serialisation of artifact values: In the current model, each artifact is either a variable with a given value at a given time, or a database entry with a given value at a given time. One particular class of variables is critical to answering the queries, but has no default serialisation: the CSVFileEntry. This object contains a CSV file path and database table name (see the LoadWorkflow and LoadAppLogic classes in the workflow for its use). In the OPM output, we have serialised the variable value in the form file-path#table-name, e.g.
D:\Personal\challenge\pc3\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv#P2IMAGEMETA
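The file-path#table-name form can be sketched in Java as follows. This is a minimal illustration, not the SourceSource code: the class name CsvFileEntryValue and its methods are hypothetical, and the sketch assumes that a table name never contains '#', so parsing splits on the last '#' in the string.

```java
/** Hypothetical holder for a serialised CSVFileEntry value: file-path#table-name. */
final class CsvFileEntryValue {
    final String filePath;
    final String tableName;

    CsvFileEntryValue(String filePath, String tableName) {
        this.filePath = filePath;
        this.tableName = tableName;
    }

    /** Joins the two parts with '#', matching the form used in the OPM output. */
    String serialise() {
        return filePath + "#" + tableName;
    }

    /** Splits on the last '#' (assumption: table names contain no '#'). */
    static CsvFileEntryValue parse(String serialised) {
        int cut = serialised.lastIndexOf('#');
        if (cut < 0) {
            throw new IllegalArgumentException("No '#' separator in: " + serialised);
        }
        return new CsvFileEntryValue(
                serialised.substring(0, cut),
                serialised.substring(cut + 1));
    }
}
```

Splitting on the last separator rather than the first keeps the sketch safe for file paths that themselves contain '#'.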
Query Results
For now we provide the pseudo-code and output for the queries, to guide others in how the OPM is to be interpreted. The query code (in Java) will be uploaded shortly.
A few points about the internal model of SourceSource need to be made clear to understand the query implementations:
Naming Artifacts: Each source code variable is given a name scoped by its class and method, e.g. LoadAppLogic_IsMatchTableRowCount_FileEntry (in class LoadAppLogic, in method IsMatchTableRowCount, the local variable named FileEntry). Each database entry is given a name composed of the table name and the first two fields of the entry, e.g. P2DETECTION_113191992826421637,261887437010025729. These names can be used to query for the provenance of the last values of the variable/entry (OPM artifacts).
Naming Processes: Each Java statement is given a name scoped by its class and method, e.g. LoadWorkflow_main_Declaration12 (in class LoadWorkflow, in method main, the 12th declaration). These names can be used to query for the provenance of the iterations of executing the statements. A tool that shows which names statements are given is provided to aid those building queries (to be uploaded shortly).
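As an illustration of the two naming schemes, the following Java sketch rebuilds the example names from their parts. The class OccurrenceNames and its method signatures are hypothetical, and the "Declaration"/"Statement" kind labels are inferred from the example names on this page rather than taken from the SourceSource implementation.

```java
/** Hypothetical helpers that assemble occurrence names in the formats shown above. */
final class OccurrenceNames {
    private OccurrenceNames() {}

    /** Variable (artifact) name: class_method_variable. */
    static String variableName(String className, String methodName, String variable) {
        return className + "_" + methodName + "_" + variable;
    }

    /** Database entry (artifact) name: table_field1,field2 using the first two fields. */
    static String databaseEntryName(String table, String firstField, String secondField) {
        return table + "_" + firstField + "," + secondField;
    }

    /** Statement (process) name: class_method_kindN, e.g. kind "Declaration", index 12. */
    static String statementName(String className, String methodName, String kind, int index) {
        return className + "_" + methodName + "_" + kind + index;
    }
}
```

For example, OccurrenceNames.statementName("LoadWorkflow", "main", "Declaration", 12) yields LoadWorkflow_main_Declaration12.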
Occurrences: As most querying concerns apply to artifacts and processes in the same way, we generalise and call them both kinds of occurrence.
Features: The OPM value of each occurrence is a set of features, each comprising a type and a Java value. A subset of these are the defining features, which distinguish this occurrence from others, i.e. they give its identity.
Provenance and Future: The provenance of an occurrence is the sub-tree of the provenance graph obtained by starting from that occurrence and recursively following relations backwards, from effects to causes. The future of an occurrence is the sub-tree of the provenance graph obtained by starting from that occurrence and recursively following relations forwards, from causes to effects.
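The two definitions are mirror-image graph traversals, which the following Java sketch tries to make concrete. It is a simplification under stated assumptions: occurrences are identified by plain strings, the class ProvenanceGraph and its methods are hypothetical, and each traversal returns the set of reachable occurrences (excluding the starting one) rather than a sub-tree structure.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory provenance graph over occurrence names. */
final class ProvenanceGraph {
    private final Map<String, Set<String>> causesOf = new HashMap<>();
    private final Map<String, Set<String>> effectsOf = new HashMap<>();

    /** Records that 'cause' causally precedes 'effect'. */
    void addCausalRelation(String cause, String effect) {
        causesOf.computeIfAbsent(effect, k -> new LinkedHashSet<>()).add(cause);
        effectsOf.computeIfAbsent(cause, k -> new LinkedHashSet<>()).add(effect);
    }

    /** Everything reachable backwards, from effects to causes. */
    Set<String> provenance(String occurrence) {
        return reachable(occurrence, causesOf);
    }

    /** Everything reachable forwards, from causes to effects. */
    Set<String> future(String occurrence) {
        return reachable(occurrence, effectsOf);
    }

    private static Set<String> reachable(String start, Map<String, Set<String>> edges) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            for (String next : edges.getOrDefault(current, Collections.emptySet())) {
                if (seen.add(next)) { // visit each occurrence once
                    todo.push(next);
                }
            }
        }
        return seen;
    }
}
```

Keeping both a cause index and an effect index means either direction can be followed without scanning every relation.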
Query 1
Pseudo-code:
- Get the last occurrence of the database entry named P2DETECTION_112051986299712706,261887437040025450
- Find, within the provenance of this occurrence, occurrences of variables with a value of type CSVFileEntry
- Get the file paths from the CSVFileEntry objects
Output:
[D:\Personal\challenge\pc3\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2Detection.csv]
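The Query 1 pseudo-code could be realised along the following lines. This is a hedged sketch, not the query code referred to above: the classes Query1Sketch, Feature and Occurrence are hypothetical stand-ins for the internal model, feature values are simplified to strings, and a CSVFileEntry value is assumed to be serialised as file-path#table-name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class Query1Sketch {
    /** A feature: a type name plus a value (simplified here to a string). */
    static final class Feature {
        final String type;
        final String value;
        Feature(String type, String value) { this.type = type; this.value = value; }
    }

    /** An occurrence (artifact or process) with its features and direct causes. */
    static final class Occurrence {
        final String name;
        final List<Feature> features = new ArrayList<>();
        final List<Occurrence> causes = new ArrayList<>();
        Occurrence(String name) { this.name = name; }
    }

    /** Walks the provenance of 'entry' and collects CSVFileEntry file paths. */
    static List<String> sourceFilePaths(Occurrence entry) {
        List<String> paths = new ArrayList<>();
        Set<Occurrence> seen = new HashSet<>();
        Deque<Occurrence> todo = new ArrayDeque<>(entry.causes);
        while (!todo.isEmpty()) {
            Occurrence current = todo.pop();
            if (!seen.add(current)) {
                continue; // already visited
            }
            for (Feature f : current.features) {
                if (f.type.equals("CSVFileEntry")) {
                    int cut = f.value.lastIndexOf('#');
                    String path = cut < 0 ? f.value : f.value.substring(0, cut);
                    if (!paths.contains(path)) {
                        paths.add(path);
                    }
                }
            }
            todo.addAll(current.causes);
        }
        return paths;
    }
}
```

The starting occurrence would be the last occurrence of the named database entry; how that lookup is done depends on the store and is left out of the sketch.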
Query 2
Pseudo-code:
- Get the last occurrence of the database entry named P2DETECTION_112051986299712706,261887437040025450
- Find, within the provenance of this occurrence, occurrences of variables with a value of type CSVFileEntry
- For each CSVFileEntry occurrence:
  - Find, within the future of the CSVFileEntry occurrence, causal relations of the form:
    - The effect is the process point named LoadWorkflow_main_Declaration12 (the IsMatchTableColumnRanges check)
    - The relationship (OPM: role of the cause artifact) is Used In Expression
    - The cause is the same variable as the CSVFileEntry
  - Where such a relation exists, the table referred to by the CSVFileEntry has been checked
Output:
P2DETECTION was checked
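The relation test at the heart of Query 2 can be sketched in Java as below. The class Query2Sketch, the Relation type and the method wasChecked are hypothetical; the process name and role string are the ones quoted in the pseudo-code, and the causal relations found in the future of a CSVFileEntry occurrence are assumed to be available as a flat list.

```java
import java.util.List;

final class Query2Sketch {
    /** Process name of the IsMatchTableColumnRanges check, as given in the pseudo-code. */
    static final String CHECK_PROCESS = "LoadWorkflow_main_Declaration12";
    /** Role of the cause artifact in the relation of interest. */
    static final String ROLE = "Used In Expression";

    /** A causal relation: 'cause' was used by 'effect' in the given role. */
    static final class Relation {
        final String cause;
        final String effect;
        final String role;
        Relation(String cause, String effect, String role) {
            this.cause = cause;
            this.effect = effect;
            this.role = role;
        }
    }

    /**
     * True if some relation in the future of the CSVFileEntry variable shows it
     * being used in an expression by the check process.
     */
    static boolean wasChecked(String csvFileEntryVariable, List<Relation> futureRelations) {
        for (Relation r : futureRelations) {
            if (r.effect.equals(CHECK_PROCESS)
                    && r.role.equals(ROLE)
                    && r.cause.equals(csvFileEntryVariable)) {
                return true;
            }
        }
        return false;
    }
}
```

When wasChecked returns true for a CSVFileEntry, the table it refers to would be reported as checked.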
Query 3
Pseudo-code:
- Get the last occurrence of the database entry named P2IMAGEMETA_6294101,62941
- For each occurrence in the provenance of that entry:
  - If the occurrence is a variable having a given value at a specific statement in the program, get the name of that statement; or if the occurrence is a statement being executed, get the name of that statement
  - If the occurrence is part of a method's execution (OPM: fine-grained account), get the occurrence representing the method call (OPM: overlapping coarse-grained account) and get the name of that statement.
- The collection of statement names gathered are those which affected the database entry.
Output:
[LoadAppLogic_LoadCSVFileIntoTable_Declaration1, LoadAppLogic_LoadCSVFileIntoTable_Declaration2, LoadAppLogic_LoadCSVFileIntoTable_Statement10, LoadAppLogic_LoadCSVFileIntoTable_Statement3, LoadAppLogic_LoadCSVFileIntoTable_Statement4, LoadAppLogic_LoadCSVFileIntoTable_Statement5, LoadAppLogic_LoadCSVFileIntoTable_Statement6, LoadAppLogic_LoadCSVFileIntoTable_Statement7, LoadAppLogic_LoadCSVFileIntoTable_Statement8, LoadAppLogic_LoadCSVFileIntoTable_Statement9, LoadCSVFileIntoTable, LoadWorkflow_main_Declaration1, LoadWorkflow_main_Declaration5, LoadWorkflow_main_Declaration7, LoadWorkflow_main_Declaration9, LoadWorkflow_main_Statement5, main]
The statements in the main workflow (LoadWorkflow_main...) named above correspond to the following statements in the source code:
- LoadWorkflow_main_Declaration1: String JobID = args [0], CSVRootPath = args [1];
- LoadWorkflow_main_Declaration5: LoadAppLogic.DatabaseEntry CreateEmptyLoadDBOutput = LoadAppLogic.CreateEmptyLoadDB (JobID);
- LoadWorkflow_main_Statement5: for (LoadAppLogic.CSVFileEntry FileEntry : ReadCSVReadyFileOutput)
- LoadWorkflow_main_Declaration7: LoadAppLogic.CSVFileEntry ReadCSVFileColumnNamesOutput = LoadAppLogic.ReadCSVFileColumnNames (FileEntry);
- LoadWorkflow_main_Declaration9: boolean LoadCSVFileIntoTableOutput = LoadAppLogic.LoadCSVFileIntoTable (CreateEmptyLoadDBOutput, ReadCSVFileColumnNamesOutput);
The implication is that every other statement (including, for example, the validation checks) can be removed without affecting the result.
Suggested Workflow Variants
None as yet.
Suggested Queries
See query page.
Suggestions for Modification of the Open Provenance Model
To be completed soon.
Conclusions
-- SimonMiles - 03 Apr 2009