All medical image segmentation algorithms need to be validated and compared, and yet no evaluation framework
is widely accepted within the imaging community. Collections of segmentation results often need to be compared
and ranked by their effectiveness. The evaluation measures popular in the literature are based on either region
overlap or boundary distance. None of them ranks segmentation results consistently: each tends to be sensitive
to one type of segmentation error (size, location, or shape), and no single measure
covers all error types. We introduce a new family of measures with hybrid characteristics. These measures
quantify the similarity or difference of segmented regions by considering their overlap around the region boundaries.
This family is more sensitive than other measures in the literature to combinations of segmentation error types.
We compare measure performance on collections of segmentation results sourced from carefully compiled 2D
synthetic data, and also on 3D medical image volumes. We show that our new measure: (1) penalises errors
successfully, especially those around region boundaries; (2) gives a low similarity score when existing measures
disagree, thus avoiding inflated scores; and (3) scores segmentation results over a wider range of values. We also
examine a representative measure from this family and the effect of its single free parameter on error sensitivity,
typical value range, and running time.
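
To make the contrast between region overlap and overlap near the boundaries concrete, the following is a minimal sketch under stated assumptions. The Dice and Jaccard functions are the standard overlap coefficients. The function `band_jaccard`, its `width` parameter, and the helpers `boundary` and `dilate` are hypothetical illustrations of the general idea of restricting overlap to a band around the region boundaries; they are not the definition of the measures introduced here. Regions are represented as sets of (row, column) pixel coordinates.

```python
# Illustrative only: regions are sets of (row, col) pixel coordinates.
NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connectivity


def dice(a, b):
    """Standard Dice coefficient between two pixel sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Standard Jaccard index between two pixel sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def boundary(region):
    """Pixels of the region with at least one 4-neighbour outside it."""
    return {
        (r, c)
        for (r, c) in region
        if any((r + dr, c + dc) not in region for dr, dc in NEIGHBOURS)
    }


def dilate(pixels, steps):
    """Grow a pixel set by `steps` 4-connected steps."""
    grown = set(pixels)
    frontier = set(pixels)
    for _ in range(steps):
        frontier = {
            (r + dr, c + dc) for (r, c) in frontier for dr, dc in NEIGHBOURS
        } - grown
        grown |= frontier
    return grown


def band_jaccard(a, b, width=1):
    """Hypothetical boundary-band overlap (an assumption, not the paper's measure):
    Jaccard index restricted to pixels within `width` steps of either boundary."""
    band = dilate(boundary(a) | boundary(b), width)
    union = (a | b) & band
    if not union:
        return 1.0
    return len(a & b & band) / len(union)


# A 10x10 square and the same square shifted right by one pixel.
a = {(r, c) for r in range(10) for c in range(10)}
b = {(r, c + 1) for (r, c) in a}

print(dice(a, b))             # 0.9
print(jaccard(a, b))          # 90/110, about 0.818
print(band_jaccard(a, b, 1))  # 0.75
```

In this toy case the band-restricted score is lower than both whole-region scores, because the large agreeing interior no longer dilutes the disagreement at the edges. The band width plays the role of a free parameter: a wider band includes more of the interior and moves the score back towards the plain Jaccard index.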