Background
Mantid makes extensive use of workflow algorithms to combine individual algorithms into a reduction. From a user's perspective this hides complexity, allowing someone to run their technique and only expand the history when required.
For developers this is an alternative to writing out scripts, whilst providing other features such as progress bars, history and a single named algorithm as an entry point.
The Problem
Our system tests work well as go/no-go indicators, highlighting when the data has changed but offering little beyond that.
In cases where data should validate identically, developers are faced with three choices: revert their changes, examine the diff and manually trace through the workflow algorithm, or manually step through the reduction (either through the history or with breakpoints to probe the output).
In cases where data is expected to change, developers are forced to either validate by hand that the changes have the correct scope, treat the reduction as a black box, or avoid any improvements because of the time-cost involved. This can be another source of bugs, e.g. a change that affects two algorithm steps instead of just one because of an implicitly shared data structure.
Ultimately, developers quickly end up in drawn-out sessions trying to replay the history line-by-line, using print statements or stepping through the debugger, all whilst the Mantid Framework provides little in the way of tooling to help. It also makes users nervous about accepting changes to their data, as they cannot know with 100% certainty that any change has the correct scope and that no "spooky action at a distance" has happened.
Hashing Suggestion
N.B. This is simply a suggestion to gauge interest / feedback before a proposal. I’m completely open to other suggestions or methods to solve this.
Background / Why
This concept is inspired by the R package drake, which allows users to write something analogous to a workflow algorithm and "compile" or "make" it; the tooling identifies changed steps and re-runs only those instead of the entire workflow.
This makes it trivial to see where any change to a data pipeline, be it a bug or intentional, has taken place.
The net effect (beyond the performance implications) is that developers simply jump to the section/algorithm in question and start work instead of tracing.
What
Currently, our history (whilst sometimes incomplete) makes it trivial to see "native" properties such as int/float/string, but Workspace types are effectively black boxes.
I propose a new history field for any algorithm property that derives from the Workspace type (i.e. MatrixWorkspace, MDWorkspace, etc.): the raw data of such properties would be hashed and the hashes stored before/after execution.
Because of the performance cost this would have to be manually enabled through our config manager (perhaps with a tick box in settings) and would emit a warning each time, to prevent users from accidentally leaving it enabled day-to-day. It would not be hidden behind a debug #ifdef, as end-users could examine the hashes and use them as a diagnostic tool.
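For example (the property key name below is purely hypothetical; the real name would be decided as part of any proposal), enabling it could look something like:

from mantid.kernel import config
# Hypothetical key: opt in to workspace hashing, warned about on every run.
config['algorithms.hashWorkspaceProperties'] = '1'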
How
A new class (e.g. WorkspaceHasher) would accept a Workspace pointer and unpack the data into a raw pointer and size (which is what most libraries expect). An adaptor class would then wrap an external, non-cryptographically-secure hashing library.
A (likely biased) comparison of some possible options exists on the xxHash repo, but this can be discussed at a later date.
Hashing is trivially parallelisable at a spectrum level. These hashes can be appended into a single string and re-hashed into a single value that will change dramatically for any small data change.
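To make the idea concrete, here is a rough, purely illustrative Python sketch of what such a hasher could do (the real WorkspaceHasher would live in the C++ Framework and wrap a fast external library such as xxHash; hashlib and the helper name hash_workspace below are stand-ins, not part of the proposal):

import hashlib
from mantid.simpleapi import CreateSampleWorkspace

def hash_workspace(ws):
    """Hash the X/Y/E data of every spectrum, then combine into one short digest."""
    spectrum_hashes = []
    for i in range(ws.getNumberHistograms()):  # this loop is trivially parallelisable
        h = hashlib.sha1()  # stand-in for a fast non-cryptographic hash
        for block in (ws.readX(i), ws.readY(i), ws.readE(i)):
            h.update(block.tobytes())  # raw bytes of the spectrum data
        spectrum_hashes.append(h.hexdigest())
    # Concatenate the per-spectrum hashes and re-hash into a single value.
    return hashlib.sha1("".join(spectrum_hashes).encode()).hexdigest()[:5]

ws1 = CreateSampleWorkspace(NumEvents=1000)
print(hash_workspace(ws1))  # a short hash comparable to the examples below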
Algorithm::exec would hash any input Workspace properties and store these values in the history. After execution completes successfully we would hash the output Workspace properties again, which also handles the case where an algorithm does an in-place operation.
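Illustratively (reusing the hypothetical hash_workspace helper from the sketch above, with Scale purely as an example algorithm), the before/after pattern that exec would record automatically looks roughly like this:

from mantid.simpleapi import CreateSampleWorkspace, Scale

ws = CreateSampleWorkspace(NumEvents=1000)
input_hash = hash_workspace(ws)       # hashed before execution
scaled = Scale(InputWorkspace=ws, Factor=2.0)
output_hash = hash_workspace(scaled)  # hashed after successful execution
print(input_hash, output_hash)        # the hashes differ, so this step changed the data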
Example
Imagine this block of history spread among 20 other steps, where CreateSampleWorkspace is a stand-in for a child algorithm.
CreateSampleWorkspace(OutputWorkspace='ws1', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3', NumEvents=1000)
Plus(LHSWorkspace='ws1', RHSWorkspace='ws2', OutputWorkspace='summed')
Plus(LHSWorkspace='summed', RHSWorkspace='ws3', OutputWorkspace='summed')
If a parameter of one child changes subtly, it can quickly blur into all the other fields:
CreateSampleWorkspace(OutputWorkspace='ws1', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3', NumEvents=100)
Plus(LHSWorkspace='ws1', RHSWorkspace='ws2', OutputWorkspace='summed')
Plus(LHSWorkspace='summed', RHSWorkspace='ws3', OutputWorkspace='summed')
Switching on hashing would make this trivial to spot either by eye or using diff tools:
“Good”:
CreateSampleWorkspace(OutputWorkspace='ws1' : UyuxS, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2' : FrAtz, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3' : AllLm, NumEvents=1000)
Plus(LHSWorkspace='ws1' : UyuxS, RHSWorkspace='ws2' : FrAtz, OutputWorkspace='summed' : 9j5XH)
Plus(LHSWorkspace='summed' : 9j5XH, RHSWorkspace='ws3' : AllLm, OutputWorkspace='summed' : K1c3j)
“Bad”:
CreateSampleWorkspace(OutputWorkspace='ws1' : UyuxS, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2' : FrAtz, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3' : 7kzuc, NumEvents=100)
Plus(LHSWorkspace='ws1' : UyuxS, RHSWorkspace='ws2' : FrAtz, OutputWorkspace='summed' : 9j5XH)
Plus(LHSWorkspace='summed' : 9j5XH, RHSWorkspace='ws3' : 7kzuc, OutputWorkspace='summed' : A8fUf)
We can see that ws1, ws2 and the first Plus all have matching hashes for their output and input properties. The second Plus has a different output hash, which we can trace back through the differing input to the root cause (the missing 0 in the third CreateSampleWorkspace).
Pros
- Reduces the "black-box" problem of debugging workflow algorithms
- Helps identify steps performed outside Mantid algorithms, as the hash will change between one algorithm's output and the next one's input (helps spot and remove these for project recovery)
- Gives end-users confidence that only the expected algorithms are/were affected by a documented change, with no side-effects elsewhere
- Helps with confidence in reproducibility: a published history with the same hashes is almost certainly the same result
- Highlights bugs in algorithms, such as cross-compatibility problems, where the same input hashes produce a different output hash
- Possibility of linking it into CI to give better diagnostics?
Cons
- Requires another external library (Boost's hashing is in the detail namespace)
- Additional complexity in Algorithm / something else to go wrong
- Handlers would have to be developed for each unique Workspace type for it to "just work"
- Unknown runtime cost on very large data sets (20 GB+)
- Adds noise to the history when enabled / how do we best display it?
Final thoughts
Long-term, the user base (or at least SANS) is pointing to a growing demand for a complete history when publishing to journals.
Any suggestion, like the above or an alternative, which improves the history would allow us to get ahead of the curve.
Could you let me know your thoughts on the above? Mostly:
- Do you have similar problems with Workflow Algorithms or not?
- Would something to help debugging them (not necessarily the above) help you?
- What are your thoughts/critiques on the suggestion above?
Also, thanks for reading.