Skip to topic | Skip to bottom

Provenance Challenge

Challenge
Challenge.ThirdPCWorkflow

Start of topic | Skip to actions

PC3 Workflow

Workflow

Load Workflow (Control Flow) diagram | png format
[PNG] | [PDF] | [PPTX]

Activities

Activity Input Output
* Pre-Load Section *
IsCSVReadyFileExists ... Checks for existence of CSV Batch root directory and csv_ready.csv manifest file. * string CSVRootPathInput : Path to root directory of CSV Batch * bool IsCSVReadyFileExistsOutput : Returns true if the given CSV Batch root directory and the csv_ready.csv manifest file within it exist in file system. False otherwise.
ReadCSVReadyFile ... Reads the contents of the csv_ready.csv manifest file and creates a CSVFileEntry to hold metadata for each CSV file listed in manifest. * string CSVRootPathInput : Path to root directory of CSV Batch * List<CSVFileEntry> ReadCSVReadyFileOutput : List of CSVFileEntry 's read from the csv_ready.csv manifest file in the root directory. Each CSVFileEntry contains the FilePath of a CSV file to be loaded, the HeaderPath to a header file with list of data columns, the RowCount of number of rows in the file, the TargetTable name in the database, and the MD5 Checksum for the file. The ColumnsNames are not populated by this activity.
IsMatchCSVFileTables ... Checks if all tables to be loaded have corresponding CSV data files listed in the manifest. * List<CSVFileEntry> FileEntriesInput : List of CSVFileEntry 's read from the manifest file. * bool IsMatchCSVFileTablesOutput : Returns true if all tables have matching CSV files. False otherwise.
IsExistsCSVFile ... Checks for existence of CSV data file and Header file listed in the manifest. * CSVFileEntry FileEntryInput : A CSVFileEntry read from the manifest file. * bool IsExistsCSVFileOutput : Returns true if the CSV data file and Header files exist in file system. False otherwise.
ReadCSVFileColumnNames ... Reads the list of column names present in the CSV data file from the Header file. * CSVFileEntry FileEntryInput : A CSVFileEntry read from the manifest file. * CSVFileEntry FileEntryOutput : The input CSVFileEntry updated with the ColumnsNames field populated from the Header file.
IsMatchCSVFileColumnNames ... Checks if all columns expected for a target table are present in the corresponding CSV data file. * CSVFileEntry FileEntryInput : A CSVFileEntry read from the manifest file with columns names populated. * bool IsMatchCSVFileColumnNamesOutput : Returns true if column names present in CSV data files match expected column names for the target table to load the CSV file into. False otherwise.
* Load Section *
CreateEmptyLoadDB ... Creates an empty 'Load' database with static list of tables to load the CSV Batch into. * string JobID : Unique job (i.e. batch) identifier for the CSV Batch being loaded. * DatabaseEntry CreateEmptyLoadDBOutput : A DatabaseEntry with the DBName, unique DBGuid and ConnectionString for the newly created database instance.
LoadCSVFileIntoTable ... Load a single CSV data file into corresponding table in the Load database. * DatabaseEntry DBEntry : A DatabaseEntry with the target table to load the CSV data file into. * bool LoadCSVFileIntoTableOutput : Returns true if the CSV data file was successfully loaded into the target table. False otherwise.
* CSVFileEntry FileEntry : A CSVFileEntry to load into the corresponding target table in the database.
UpdateComputedColumns ... Updates the computed columns in the target table that was loaded. These derived columns would have been "empty" (-999) in the CSV data file. * DatabaseEntry DBEntry : A DatabaseEntry with the target table already loaded from the CSV data file. * bool UpdateComputedColumnsOutput : Returns true if the derived columns were successfully updated from existing columns. False otherwise.
* CSVFileEntry FileEntry : A CSVFileEntry containing the name of target table in the database to update.
* Post-Load Section *
IsMatchTableRowCount ... Checks if number of rows loaded into table matches expected number of rows in CSv data file. * DatabaseEntry DBEntry : A DatabaseEntry with the target table loaded and updated. * bool IsMatchTableRowCountOutput : Returns true if the number of rows in the target table equals the expected number of rows in the CSV data file.
* CSVFileEntry FileEntry : A CSVFileEntry containing the expected number of rows in the CSV data file and the target table name.
IsMatchTableColumnRanges ... Checks if the data loaded into table columns fall within the range of values expected for the column. * DatabaseEntry DBEntry : A DatabaseEntry with the target table loaded and updated. * bool IsMatchTableColumnRangesOutput : Returns true if the data values of columns in the target table fall within expected range. False otherwise.
* CSVFileEntry FileEntry : A CSVFileEntry containing the name of target table in the database to validate columns ranges.
CompactDatabase ... Shrinks the database after all write operations complete. * DatabaseEntry DBEntry : A DatabaseEntry with all tables loaded and validated. void

PC3 Plotting Workflow

The following activities are for extensions to the default Load workflow and not provided in the code. These activities deal with collecting and querying provenance across database queries, external applications for plotting, and manual user input. Teams are encouraged to implement these activities as they choose, export provenance for these and run the additional provenance queries over this. The workflow looks like follows: Plotting Workflow (Control Flow) diagram | png format
[PNG] | [PDF] | [PPTX]

* Plotting Workflow Activities (Not available in code) *
Activity Input Output
DetectionsHistogram ... Queries and builds a histogram over the P2Detections for number of detections grouped by zoneId scaled up by 10. The SQL query for this looks like: SELECT ceiling(zoneId/10) AS zoneGroup, count(*) AS  detectionCount  FROM P2Detection  GROUP BY ceiling(zoneId/10)  ORDER BY ceiling(zoneId/10) * DatabaseEntry DBEntry : A DatabaseEntry with all tables loaded and updated. * string HistogramOutput : Returns a list of tuples that match the query. Format of this string can be arbitrarily chosen. For example, as comma separated values.
Example input: Database entry for the Load DB with Job ID J062942 Example output: zoneGroup, detectionCount; 1105, 2; 1106, 10; 1107, 23; 1108, 25; 1109, 33; 1110, 7; Note: You can view all the query output data for the give 5 sample datasets online or as an Excel spreadsheet.
DetectionsHistogramHiQuality ... Queries and builds a histogram over the P2Detections for number of detections grouped by zoneId scaled up by 10. The detections counted are restricted to those rows with raErr less than 0.1 and decErr < 0.05. The SQL query for this looks like: SELECT ceiling(zoneId/10) AS zoneGroup, count(*) AS  hqDetectionCount  FROM P2Detection  WHERE raErr < 0.1 and decErr < 0.05  GROUP BY ceiling(zoneId/10)  ORDER BY ceiling(zoneId/10) * DatabaseEntry DBEntry : A DatabaseEntry with all tables loaded and updated. * string HistogramOutput : Returns a list of tuples that match the query. Format of this string can be arbitrarily chosen. For example, as comma separated values.
Example input: Database entry for the Load DB with Job ID J062942 Example output: zoneGroup, hqDetectionCount; 1105, 2; 1106, 9; 1107, 18; 1108, 16; 1109, 28; 1110, 5; Note: You can view all the query output data for the give 5 sample datasets online or as an Excel spreadsheet.
PlotHistogram ... Plots a histogram for the input data and generates and image file. The input data is of the same format as the output from DetectionsHistogram and DetectionsHistogramHiQuality above. The output can be any image file that plots the zoneGroup in the X axis and detectionCount in the Y axis. Any plotting tool such as GNU Plot, Matlab, Excel can be used. * string HistogramData : A list of tuples in the same format as HistogramOutput from DetectionsHistogram and DetectionsHistogramHiQuality activities. * string HistogramPlotOutput : Returns a file path (or any suitable file reference) to an image file that has the histogram plot for the input. For simplicity, any equivalent plot (bar, line, etc.) that shows the first column of input data on X Axis and second column of input data on Y axis can be generated.
Example input: zoneGroup, hqDetectionCount; 1105, 2; 1106, 9; 1107, 18; 1108, 16; 1109, 28; 1110, 5; Example output: Path to image file with the plot. Note: You can view all the plots for the give 5 sample datasets online or as an Excel spreadsheet.
DisplayPlot ... Displays the plot from PlotHistogram activity and lets the user interact with the plot such as select a point for further investigation. This activity requires user interaction that may be implemented as the teams see fit. * string HistogramPlot : File path (or any suitable file reference) to an image file that has the histogram plot. * Arbitrary as the teams see fit.

-- YogeshSimmhan - 08 Apr 2009

Software Setup

C# .NET + SQL Server Flavor (Windows Platform)

Pre-requisites

  • .Net Framework v3.5 SP1
  • SQL Server 2008 (Express or better)
  • Windows XP/Vista/2003/2008
  • (Optional) Visual Studio 2008

Java + Derby Flavor (All Platforms)

Pre-requisites

  • Java Development Kit 1.6
  • Apache Derby 10.4.2.0 (included)
  • Apache Ant 1.7.1

Downloads

  • PC3.tar.gz: Pan-STARRS Load Workflow Code & Sample Data | tar.gz Format | MD5 38bf4848d591cbac4048b3034c313443
  • PC3.zip: Pan-STARRS Load Workflow Code & Sample Data | ZIP format | MD5 38bf4848d591cbac4048b3034c313443

-- PaulGroth - 31 Mar 2009
to top

I Attachment sort Action Size Date Who Comment
PC3.tar.gz manage 5617.8 K 04 Feb 2009 - 03:38 YogeshSimmhan Pan-STARRS Load Workflow Code & Data | tar.gz format | MD5 38bf4848d591cbac4048b3034c313443
PC3.zip manage 5673.8 K 04 Feb 2009 - 03:41 YogeshSimmhan Pan-STARRS Load Workflow Code & Data | ZIP format | MD5 c4eecc8639070a61e9d73993ffb76f97
LoadWorkflow.pdf manage 402.6 K 17 Feb 2009 - 17:23 YogeshSimmhan Load Workflow (Control Flow) diagram | pdf format
LoadWorkflow.png manage 101.0 K 17 Feb 2009 - 17:31 YogeshSimmhan Load Workflow (Control Flow) diagram | png format
LoadWorkflow.pptx manage 55.7 K 17 Feb 2009 - 17:24 YogeshSimmhan Load Workflow (Control Flow) diagram | pptx format
Histograms.htm.zip manage 343.3 K 08 Apr 2009 - 21:29 YogeshSimmhan All histogram image files and HTML version of spreadsheet for advanced activities
Histograms.xls manage 65.5 K 08 Apr 2009 - 21:30 YogeshSimmhan All histogram data and image files as a spreadsheet for advanced activities
AdvWorkflow.pptx manage 61.6 K 09 Apr 2009 - 06:10 YogeshSimmhan Advanced workflow with database query and user interaction
AdvWorkflow.pdf manage 102.9 K 09 Apr 2009 - 06:11 YogeshSimmhan Advanced workflow with database query and user interaction
AdvWorkflow.png manage 61.8 K 09 Apr 2009 - 06:11 YogeshSimmhan Advanced workflow with database query and user interaction

You are here: Challenge > ThirdProvenanceChallenge > ThirdPCWorkflow

to top

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback