For a histogram data set, fitting currently evaluates the function at the centre of each bin. If the bins are large and the function has significant curvature, the point value can differ from the average over the bin, which is what the data probably represents. Fast “wiggles” in the fit function can seriously confuse the fit if a maximum or minimum happens to line up with a bin centre and has the right sign to appear to improve the fit.
Often it’s easy to analytically integrate a fit function over a range of X values, so the correct average and its derivatives could be calculated with little extra effort.
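As a concrete illustration (a plain-Python sketch, not Mantid code; the Lorentzian parameters are arbitrary), an analytic antiderivative turns the bin average into a single subtraction, and for a bin wider than the FWHM the centre-point value can be more than twice the true average:

```python
import math

def lorentzian(x, amp=1.0, x0=0.0, fwhm=0.2):
    """Point value of a Lorentzian; amp is the integrated area."""
    g = fwhm / 2.0
    return amp / math.pi * g / ((x - x0) ** 2 + g ** 2)

def lorentzian_integral(x, amp=1.0, x0=0.0, fwhm=0.2):
    """Analytic antiderivative of the Lorentzian."""
    g = fwhm / 2.0
    return amp / math.pi * math.atan((x - x0) / g)

def bin_average(lo, hi, **pars):
    """Average over one bin, computed from the antiderivative."""
    return (lorentzian_integral(hi, **pars)
            - lorentzian_integral(lo, **pars)) / (hi - lo)

# A bin of width 0.5 centred on a peak of FWHM 0.2:
centre_value = lorentzian(0.0)            # what centre-point fitting uses
true_average = bin_average(-0.25, 0.25)   # what the histogram represents
```

Here the centre-point value is roughly twice the bin average, which is the kind of amplitude distortion the attached script demonstrates.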
Could a fit function declare that it knows about this, and is then passed an X array of bin edges to work with, and a flag to say so?
Alternatively, a function could assume it is always working with a histogram, reconstruct the bin edges for itself (as ConvertToHistogram does) and then do the averaging. That assumes equal-sized bins, which is usually the case.
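A minimal sketch of that edge reconstruction (plain Python, in the spirit of ConvertToHistogram rather than its actual implementation): interior edges are midpoints between neighbouring centres, and the outer edges mirror the first and last centre, which is exact only when the bins really are equal-sized.

```python
def edges_from_centres(centres):
    """Reconstruct bin boundaries from point X values: interior edges are
    midpoints between neighbouring centres; the outer edges mirror the
    first/last centre. Exact only for equal-width bins."""
    mids = [0.5 * (a + b) for a, b in zip(centres, centres[1:])]
    return [2 * centres[0] - mids[0]] + mids + [2 * centres[-1] - mids[-1]]
```

For example, `edges_from_centres([1.5, 2.5, 3.5])` recovers the boundaries `[1.0, 2.0, 3.0, 4.0]`.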
The attached script demonstrates the problem. It fits a Lorentzian to a generated Lorentzian plus a small amount of noise, with various degrees of binning. I’ve defined the bin-size-aware fit function for comparison with the built-in one.
HistoAwareFits.py (3.4 KB)
The point-data function slightly over-estimates the amplitude if the bins are too big, significantly over-estimates the FWHM, and the fitted peak centre can scatter depending on how the peak lines up with the bin boundaries.
I note the new issue:
The “raw” muon data, perhaps after binning, grouping, and exponential lifetime correction, is a Histogram (counts per unit time-of-flight). The asymmetry data is still binned data with the same X boundaries, but it’s meaningless to normalise Y to the bin size: there’s no concept of “asymmetry per microsecond” or “total asymmetry per bin”. It’s analogous to dividing a neutron spectrum by a monitor to get a scattering probability. It’s still best fitted with “histogram-aware” functions if there are any sharp features such as rapid oscillation or strong relaxation. Muon ALC data, however, is usually true Point data, in that it’s measured at fixed values of some sample-environment setting such as magnetic field.
So we probably have three classes of data here - Histogram of counts, Point data, and Something Else…
Thanks for the clarification on this, it fits with what was seen in the issue. At the moment AsymmetryCalc produces a points workspace from input histogram data; should it instead keep the X data as bins?
```python
y = [1, 2, 3] + [3, 1, 12]
x = [1, 2, 3, 4] * 2
e = [1, 1, 1] * 2
ws = CreateWorkspace(x, y, e, NSpec=2)
asymmetry = AsymmetryCalc(ws, Alpha=0.5)
print(ws.readX(0))         # [1, 2, 3, 4]
print(asymmetry.readX(0))  # [1.5, 2.5, 3.5]
```
Yes, I think AsymmetryCalc should keep the X bin boundaries and perhaps set a new flag to say this data isn’t now “normalisable”. RemoveExpDecay should do likewise. It probably doesn’t make sense to overlay raw data and asymmetry as the Y units are different, but asymmetry and corrected counts from RemoveExpDecay could quite reasonably be compared.
We could argue that Rebin or Rebunch on an axis which was Point data should convert it back to Histogram with suitable boundaries, if many source X values contribute to each output bin. The likely use case is if I’d measured an ALC or similar scan with a very large number of discrete field points and I then wanted to “smooth” the data to pick out a weak but broad feature - but in a way that would still let me fit the data.
Getting a bit further off the original topic, MD workspaces may need more options here too. Each axis should be individually selected as “histogram” or “point” and then there should be an overall flag saying if Y is “counts” to be normalised to the cell volume, or is an already-normalised quantity.
Thanks, I’ve created an issue to keep the X bin boundaries for AsymmetryCalc (AsymmetryCalc should keep X bin boundaries · Issue #16858 · mantidproject/mantid · GitHub) - RemoveExpDecay already retains whatever X binning the input workspace had.
Currently if you tried to compare asymmetry and corrected counts from RemoveExpDecay on the same axes, you would get a warning that one is point data and the other is histogram data, so this change would enable that comparison to be made without seeing the warning.
At the moment Rebin, if given an input workspace with point data, automatically converts to histogram data, rebins and then converts back to point data. In the use case you mention, it might be better to convert to histogram first and then run Rebin on the histogram data workspace, as then the output would be histogram with the required boundaries.
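That point-data smoothing use case could be sketched like this (plain Python, not the Mantid API; the n-point grouping and the mirrored end edges are illustrative assumptions):

```python
def rebunch_to_histogram(x, y, n):
    """Sketch of Rebunch-style conversion of Point data to Histogram data:
    each output bin averages n input points and gets explicit boundaries.
    Assumes len(x) is a multiple of n and roughly even point spacing."""
    xc = [sum(x[i:i + n]) / n for i in range(0, len(x), n)]   # group centres
    yb = [sum(y[i:i + n]) / n for i in range(0, len(y), n)]   # averaged Y
    # boundaries halfway between group centres, mirrored at the ends
    mids = [0.5 * (a + b) for a, b in zip(xc, xc[1:])]
    edges = [2 * xc[0] - mids[0]] + mids + [2 * xc[-1] - mids[-1]]
    return edges, yb
```

With six evenly spaced points `x = [0..5]` and `n = 2`, this returns boundaries `[-0.5, 1.5, 3.5, 5.5]` around three averaged bins, which would then be fittable as a histogram.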
I have implemented this kind of fit for the functions Lorentzian, Gaussian, FlatBackground and LinearBackground. It is controlled by a new Fit property EvaluationType, which takes the values “CentrePoint” for a normal fit and “Histogram” for a histogram fit. It should be in the nightly build by now.
As “Histogram” fit should be the default for fitting in the Muon Analysis interface (but not Muon ALC), we should consider implementing it for the common muon time-domain functions such as DynamicKuboToyabe, ExpDecayOsc, etc.
Is this change exposed to Python, so either I define two versions of the function and Fit chooses the one which matches its EvaluationType property, or is EvaluationType passed through to the fit function which must then act on it?
EvaluationType determines how the function is evaluated, but the function must be able to support it (the function’s class must implement certain virtual methods). If the function cannot handle histogram evaluation, Fit fails.
At the moment only functions written in C++ can be made histogram aware. Is it important to extend this to python functions?
I can implement more functions if you need.
It would be good if Python could do this. Presumably, by analogy with C++, we would define methods histogram1D and histogramDerivative1D.
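By way of illustration, a histogram-aware Python function might pair the two evaluation modes like this. The histogram1D name is only a guess by analogy with the C++ interface, and this is a plain-Python sketch rather than a real Mantid IFunction1D subclass:

```python
import math

class HistogramAwareExpDecay:
    """Hypothetical sketch of a Python fit function offering both
    evaluation modes for f(x) = height * exp(-x / lifetime)."""

    def __init__(self, height=1.0, lifetime=1.0):
        self.height = height
        self.lifetime = lifetime

    def function1D(self, xvals):
        # centre-point evaluation: one Y value per X value
        return [self.height * math.exp(-x / self.lifetime) for x in xvals]

    def histogram1D(self, xedges):
        # bin-average evaluation from the analytic integral:
        # mean over [a, b] = h*tau*(exp(-a/tau) - exp(-b/tau)) / (b - a)
        h, tau = self.height, self.lifetime
        return [
            h * tau * (math.exp(-a / tau) - math.exp(-b / tau)) / (b - a)
            for a, b in zip(xedges, xedges[1:])
        ]
```

Fit would pass the usual point X array to function1D, or the array of bin edges (one longer than Y) to histogram1D, depending on its EvaluationType property.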
My “Quantum” package (in Python) currently has a work-around when fitting time domain muon spectra, where it reconstructs bin boundaries before going on to simulate integrated data over the bins. It has to guess whether this is necessary based on the context. It would be better to use the bins directly.