Suggestion: Making Workflow Algorithms more transparent

Background

Mantid makes extensive use of workflow algorithms to combine individual algorithms into a reduction. From a user's perspective this hides complexity, allowing someone to run their technique and only expand the history when required.

For developers this is an alternative to writing out scripts, whilst providing other features such as progress bars, history and a single named algorithm as an entrypoint.

The Problem

Our system tests work well as go/no-go indicators, highlighting when the data has changed but little beyond this.

In cases where data should validate identically, developers are faced with three choices: revert their changes, examine the diff and manually trace through the workflow algorithm, or manually step through the reduction (using either the history or breakpoints) to probe the output.

In cases where data is expected to change, developers are forced either to validate the scope of the change by hand, to treat the reduction as a black box, or to avoid improvements altogether because of the time cost involved. This can itself be a source of bugs, e.g. a change to an implicitly shared data structure that affects two algorithm steps instead of just one.

Ultimately, developers quickly end up in drawn-out sessions replaying the history line by line, using print statements or stepping through the debugger, all whilst the Mantid Framework provides little tooling to help. It also makes users nervous about accepting changes to their data, as they can't know with 100% certainty that any change has the correct scope and that no “spooky action at a distance” has happened.

Hashing Suggestion

N.B. This is simply a suggestion to gauge interest / feedback before a proposal. I’m completely open to other suggestions or methods to solve this.

Background / Why

This concept is inspired by the R package drake, which lets users write something analogous to a Workflow algorithm and “compile” or “make” it; the tooling identifies the steps that have changed and re-runs only those instead of the entire workflow.

This makes it trivial to see where any change to a data pipeline, be it a bug or intended, has taken place.

The net effect (beyond the performance implications) is that developers simply jump to the section/algorithm in question and start work, instead of tracing.

What

Currently, our history (whilst sometimes incomplete) makes it trivial to see “native” properties such as int/float/string, but Workspace types are effectively black boxes.

I propose a new history field for all algorithm properties that derive from the Workspace type (i.e. MatrixWorkspace, MDWorkspace, etc.). The raw data of each such property would be hashed and the hash stored before and after execution.

Because of the performance cost this would have to be manually enabled through our config manager (perhaps with a tick box in settings?) and would emit a warning each time, to prevent users from accidentally leaving it enabled day-to-day. It would not be hidden behind an #ifdef debug build, so end-users could examine it and use it as a diagnostic tool.

How

A new class (e.g. WorkspaceHasher) would accept a Workspace pointer and unpack the data into a raw pointer and size (which is what most libraries expect). An adaptor class would then wrap an external, non-cryptographic hashing library.
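
Something along these lines is what I have in mind (a rough sketch only; XXH64 is xxHash's real C API, but HashAdaptor and WorkspaceHasher are hypothetical names):

#include <cstddef>
#include <cstdint>
#include <xxhash.h>

class HashAdaptor {
public:
  // Most hashing libraries ultimately want (pointer, length), so everything
  // else gets marshalled down to this single call.
  static std::uint64_t hashBytes(const void *data, std::size_t numBytes) {
    return XXH64(data, numBytes, /*seed=*/0);
  }
};

// WorkspaceHasher would then unpack each Workspace type (bins, spectra,
// event lists, ...) into contiguous buffers and forward them to the adaptor.

Swapping the library later would then only mean touching the adaptor.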

A (likely biased) comparison of some possible options exists on the xxHash repo, but this can be discussed at a later date.

Hashing is trivially parallelisable at the spectrum level. The per-spectrum hashes can then be concatenated and re-hashed into a single value that will change dramatically for any small change to the data.
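
As a sketch (reusing the hypothetical HashAdaptor above, with a plain vector-of-vectors standing in for the workspace data):

#include <cstdint>
#include <vector>

std::uint64_t hashAllSpectra(const std::vector<std::vector<double>> &spectra) {
  std::vector<std::uint64_t> perSpectrum(spectra.size());
  // Each spectrum is independent, so this loop parallelises trivially (e.g. OpenMP).
  #pragma omp parallel for
  for (std::int64_t i = 0; i < static_cast<std::int64_t>(spectra.size()); ++i) {
    const auto &y = spectra[i];
    perSpectrum[i] = HashAdaptor::hashBytes(y.data(), y.size() * sizeof(double));
  }
  // Re-hashing the concatenated per-spectrum hashes means any single change
  // cascades into a completely different final value.
  return HashAdaptor::hashBytes(perSpectrum.data(),
                                perSpectrum.size() * sizeof(std::uint64_t));
}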

Algorithm::exec would hash any input Workspace properties and store these values in the history. After execution completes successfully we would hash the output Workspace properties in the same way. Recording both allows us to handle the case where an algorithm operates in-place.
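
Schematically, the hook could look something like this (every type and name here is a simplified stand-in for illustration, not the real Framework API):

#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>
#include <xxhash.h>

using WorkspaceData = std::vector<double>;

struct HashedHistoryEntry {
  std::map<std::string, std::uint64_t> hashesBefore;  // workspace properties going in
  std::map<std::string, std::uint64_t> hashesAfter;   // the same properties after exec()
};

std::uint64_t hashWorkspace(const WorkspaceData &ws) {
  return XXH64(ws.data(), ws.size() * sizeof(double), 0);
}

HashedHistoryEntry runWithHashes(std::map<std::string, WorkspaceData> &workspaces,
                                 const std::function<void()> &exec) {
  HashedHistoryEntry entry;
  for (const auto &[name, ws] : workspaces)
    entry.hashesBefore[name] = hashWorkspace(ws);
  exec();  // the algorithm's own exec(); may modify workspaces in place
  for (const auto &[name, ws] : workspaces)
    entry.hashesAfter[name] = hashWorkspace(ws);  // catches in-place changes
  return entry;
}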

Example

Imagine this block of history buried among 20 other steps, where CreateSampleWorkspace is a stand-in for a child algorithm.


CreateSampleWorkspace(OutputWorkspace='ws1', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3', NumEvents=1000)
Plus(LHSWorkspace='ws1', RHSWorkspace='ws2', OutputWorkspace='summed')
Plus(LHSWorkspace='summed', RHSWorkspace='ws3', OutputWorkspace='summed')

If a parameter of a child algorithm changes subtly, it can quickly blur into all the other fields:


CreateSampleWorkspace(OutputWorkspace='ws1', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2', NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3', NumEvents=100)
Plus(LHSWorkspace='ws1', RHSWorkspace='ws2', OutputWorkspace='summed')
Plus(LHSWorkspace='summed', RHSWorkspace='ws3', OutputWorkspace='summed')

Switching on hashing would make this trivial to spot either by eye or using diff tools:

“Good”:


CreateSampleWorkspace(OutputWorkspace='ws1' : UyuxS, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2' : FrAtz, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3' : AllLm, NumEvents=1000)
Plus(LHSWorkspace='ws1' : UyuxS, RHSWorkspace='ws2' : FrAtz, OutputWorkspace='summed' : 9j5XH)
Plus(LHSWorkspace='summed' : 9j5XH, RHSWorkspace='ws3' : AllLm, OutputWorkspace='summed' : K1c3j)

“Bad”:


CreateSampleWorkspace(OutputWorkspace='ws1' : UyuxS, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws2' : FrAtz, NumEvents=1000)
CreateSampleWorkspace(OutputWorkspace='ws3' : 7kzuc, NumEvents=100)
Plus(LHSWorkspace='ws1' : UyuxS, RHSWorkspace='ws2' : FrAtz, OutputWorkspace='summed' : 9j5XH)
Plus(LHSWorkspace='summed' : 9j5XH, RHSWorkspace='ws3' : 7kzuc, OutputWorkspace='summed' : A8fUf)

We can see that ws1, ws2 and the first Plus all have matching hashes for their output and input properties. The second Plus has a different output hash, which we can easily trace back to a different input and on to the root cause (the missing 0 in the third CreateSampleWorkspace).

Pros

  • Reduces the “black-box” problem of debugging with Workflow algorithms

  • Helps identify steps being completed outside Mantid algorithms, as the hash will change between one algorithm's output and the next algorithm's input. (Helps spot and remove these for project recovery)

  • Allows end-users to have confidence that only the expected algorithms are/were affected by a documented change and no side-effects

  • Helps with confidence in reproducibility; a published history with the same hashes is almost certainly the same result

  • Highlights bugs in algorithms, such as cross-compat problems, where the same input hashes result in a different output hash

  • Possibility of linking it into CI to give better diagnostics?

Cons

  • Requires another external lib (Boost hashing is in the detail namespace)

  • Additional complexity in Algorithm / something else to go wrong

  • Have to develop handlers for each unique Workspace type for it to “just work”

  • Unknown runtime cost on very large data sets (20GB+)

  • Adds noise to the history when enabled / how do we best display it?

Final thoughts

Long-term, the user base (or at least SANS) is showing a growing demand for a complete history when publishing to journals.

Any suggestion which improves the history, like the above or an alternative, would allow us to get ahead of the curve.

Could you let me know your thoughts on the above? Mostly:

  • Do you have similar problems with Workflow Algorithms or not?
  • Would something to help debugging them (not necessarily the above) help you?
  • What are your thoughts/critiques on the suggestion above?

Also, thanks for reading.

I like this idea, as we already have something along these lines in CreateCacheFilename. Its main shortcoming is that it converts all the values to a string, which runs into the exact same issue with workspaces that you are describing. Having this capability for “long” properties (e.g. arrays of numbers) would be equally helpful.

This would likely be most useful for developers debugging workflows, not just workflow algorithms.

For the hashing, I would suggest that we do it in two places: PropertyWithValue and Workspace (the top-level one), which could probably be abstracted into a single “Hashable” abstract base class that is mixed in via multiple inheritance. The individual object that implements the method(s) could decide whether it wanted to do something simple or fancy. It doesn't matter as long as it is mostly unique. All hashing algorithms have collisions at some point; one should just acknowledge it and carry on.
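
Roughly what I have in mind (a sketch only, assuming a 64-bit hash value; the method name is hypothetical):

#include <cstdint>

class Hashable {
public:
  virtual ~Hashable() = default;
  // Implementers choose how simple or fancy to be; the result only needs
  // to be "mostly unique" for history purposes.
  virtual std::uint64_t valueHash() const = 0;
};

// e.g. the top-level Workspace and PropertyWithValue<T> would then inherit
// Hashable alongside their existing base classes.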

I have mixed feelings as to whether this should be a run-time or compile-time configurable option. For many things (integer properties, single-value workspaces, etc.) the information is likely fast enough to calculate all the time. Using strategies to delegate some of the work, member items could provide their own unique identifier (e.g. an instrument geometry that is fully specified by the IDF and has no parameters only needs to be hashed when it is first loaded). The Run object (probably PropertyManager is better) is another case where delegation would simplify the implementation.
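
For example, something like this caching pattern (a sketch assuming the Hashable interface above; the class here is a stand-in, not the real Instrument):

#include <cstdint>
#include <optional>

class CachedGeometryHash : public Hashable {
public:
  std::uint64_t valueHash() const override {
    // Geometry fully specified by the IDF never changes after load, so the
    // expensive hash is computed at most once and then reused.
    if (!m_cached)
      m_cached = computeGeometryHash();
    return *m_cached;
  }
private:
  std::uint64_t computeGeometryHash() const { return 0; /* placeholder for the real work */ }
  mutable std::optional<std::uint64_t> m_cached;
};

// A Run / PropertyManager-style container could delegate similarly by
// combining the valueHash() of each property it owns.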

To close on a less positive note: I think this would be helpful, but I’m not sure if the amount of effort is correctly proportional to the benefit. In other words, which cases of sorting out broken system tests would have directly benefited from this feature?

Anything I’ve not commented on I agree with completely, but didn’t want to just have a post full of ACKs.

I agree completely. Originally, I was concerned about “simple” properties such as bool or a single int, but these can be easily handled by…

…simple scalar types simply skipping hashing altogether, as it adds very little. E.g. having Random:False 345jbn6mi, where False hashes to 345jbn6mi, simply adds noise.

I’d strongly push for a unified hashing approach using an external lib to perform the actual hash. It’s statistically difficult to make a uniformly distributed hash function with minimal collisions, and it’s something we really shouldn’t be trying to home-make ourselves.
I was envisaging something like an adaptor class which takes “standard types” (e.g. cbegin<double>() and cend<double>() or std::string) and marshals them into the external lib, so it becomes trivial to add hashing to new items.

Using a vetted hashing mechanism (assuming a 64-bit hash) we should only approach a 40% chance of collision at around 4 billion items hashed. If we ever run into this being a problem (somehow) we can easily swap the lib behind the adaptor for a more robust algorithm at the cost of performance. I wouldn’t want to use something quick and simple like CRC32, as it will collide frequently with >4 GB hashed.
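
For reference, the 40% figure matches the standard birthday-problem approximation for an ideal 64-bit hash (my own back-of-the-envelope check):

p(n) \approx 1 - e^{-n^{2}/(2 \cdot 2^{64})}, \qquad p(2^{32} \approx 4.3 \times 10^{9}) \approx 1 - e^{-1/2} \approx 0.39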

The problem is arrays of data, e.g. a Workspace. xxHash 3 (assuming we pick it for performance) claims to hash at ~31 GB/s, so a roughly 2-second delay for a 64 GB Workspace actually isn’t too unreasonable if those speeds can be achieved.

I’d like it to be something that we can also offer users, to help them see whether or not the data results are identical from one version of Mantid to another, and to clearly show where any changes come in.
If we do provide it, some users would want to turn it off or would never use it, hence the run-time option, as it just adds noise to the workspace history. Alternatively, we could store this information somewhere else…?

This is something we need to ultimately measure the performance impact of, then come back and decide, but it depends on the cost-benefit as discussed below…

This is a good point to raise, and will be the most likely death-knell for this idea, so I’m happy to take as much critique as early on as possible.

I want to clarify this is more to help people who mainly work on Workflow algorithms, rather than for helping people get system tests working. This is more anecdotal to myself and the ISIS LSS team, so we need to check whether it is universal:

So the system tests are more a symptom than the root of the problem. When they fail for a single algorithm change it’s usually obvious the problem is encapsulated in a single .cpp or .py file, which narrows down the search area.

In Workflow algorithms there are two cases where this comes up: during refactoring (where we don’t intend any user-visible changes) and for scientific improvements (e.g. bug fixes or methodology improvements) which affect the data.

The refactoring case tends to be someone changing an attribute in a data structure which has unintended consequences later in the workflow. This leaves many devs with a choice: back out of the refactoring, manually hunt down the shared attribute, or duplicate the data structure to simply avoid the problem.
IMO many developers will simply back out of any refactoring effort, revert their changes and leave the technical debt as-is. Others will absorb the development cost into their estimates for changing any workflow algorithm, reducing what we deliver to users.

The worst case is when we change data. Users are immediately concerned that something else might have unintentionally changed, as they simply have our word that we verified only “x changed” rather than “x and y”. Across multiple groups I’ve seen this tend in two directions: users avoid the changes, either by sticking with an old Mantid version or by ignoring the feature until they get round to testing it (read: never); or, when we try to be proactive and chase our users to perform their own scientific validation, any bug quickly spirals back into drawn-out tracing of the problem, or a loss of confidence in those changes.

Even assuming a best case of users who absolutely trust what we develop, who keenly adopt any data changes knowing they are all correct, and who regularly upgrade, we still have to have a developer sit down and step through either the history or a debugger to verify that only the expected step changed.

These changes don’t help where we need to change something right at the start of a workflow, as all hashes downstream will change. But the benefit increases as we approach the end of the workflow algorithm, e.g. where the final step changes only a single hash should change.

This is something I know for a fact would have helped both ISIS SANS and ISIS Reflectometry in past development and will likely help in the future. But it is something I want to gauge project-wide, since I’ve personally seen the benefits of another academic project adopting R drake data pipelines:

Do you think it’s worth asking the development team:

  • Do you work with Workflow algorithms?
  • Have you ever had to manually trace (either through debugging, print statements, saving workspaces to compare…etc.) through a Workflow algorithm to determine a difference in data?
  • If so, could you estimate how many development days were spent in the last year doing this?
  • Would an indicator in the history (e.g. hash) that two properties or workspaces contain the same value when calling a child algorithm help you?

I can only speak for SANS initially: in the last 2 years I would estimate around 1-1.5 sprints of time has been lost to manually tracing changes, reverting a change and implementing a workaround to avoid the overhead, or manually re-verifying our workflow at the group’s request.

In addition, an entire SANS back-end, which was written by a previous developer to drastically improve performance, has been abandoned (likely 2+ months of work). It was never adopted because it produced subtly different results (likely a re-binning error), and the group ultimately never committed to the development cost of figuring out this difference. This was directly related to the time required for a developer to sit down and trace through / manually compare each step, new → old.

I like the idea in general, and think it applies to lots of things beyond workflow algorithms. It would help answer questions at runtime like “are these two workspaces the same with different names?” The “workspacename : hash” display you describe would fit well into the instrument tree as well as the log window and would communicate the information pretty well.

The suggestion of scalars/strings not being hashed is quite valid. I’m still scratching my head over how to delegate the calculation so that only the parts of the workspace that have changed need to be recalculated. Instrument, sample logs and counts all change with very different frequencies across algorithms.

I appreciate that performance will be a factor, but if the feature is able to be disabled (via property or compile flag) it becomes less of an issue.
