Improving offline HTR in small datasets by purging unreliable labels

JC Aradillas, JJ Murillo-Fuentes… - 2020 17th International …, 2020 - ieeexplore.ieee.org
2020 17th International Conference on Frontiers in Handwriting …, 2020
This paper focuses on the offline handwriting text recognition (HTR) problem with small training data sets. Techniques such as transfer learning and data augmentation have recently been applied to this problem, improving recognition performance. In these scenarios, we found that errors in the labelling of the training samples, present in some databases, have a great impact on the character error rate (CER). Accordingly, we propose a novel cross-validation technique to remove incorrectly labelled lines. In this approach, after a first training stage, transcript lines with a CER above a threshold are discarded, where the threshold is a function of the available data. Less available data favours larger CER, even for healthy lines, suggesting higher thresholds for fewer lines. This new technique and the validation of the threshold are analyzed over the ICFHR 2018 competition on automated HTR and other well-known databases such as Washington and Parzival. For the Ricordi database in the ICFHR 2018 competition, which contains transcription errors, we report a reduction of CER by 2%.
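The purging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `purge_unreliable`, `train_fn`, and `threshold` are hypothetical, the interleaved k-fold split is an assumption, and the threshold is passed in as a plain number because the abstract does not give the function that maps the amount of available data to a threshold. CER is computed here as the Levenshtein distance between the label and the model output, divided by the label length.

```python
from typing import Callable, List, Sequence, Tuple

# A sample is (image, label); the image type is left open in this sketch.
Sample = Tuple[object, str]


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref, normalised by len(ref)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def purge_unreliable(
    samples: Sequence[Sample],
    train_fn: Callable[[List[Sample]], Callable[[object], str]],
    threshold: float,
    k: int = 5,
) -> List[Sample]:
    """Keep only lines whose held-out CER does not exceed the threshold.

    train_fn is a hypothetical callback: it trains a recogniser on the given
    samples and returns a function mapping an image to a predicted transcript.
    """
    # Assumed split: fold i takes every k-th sample starting at index i.
    folds = [list(samples[i::k]) for i in range(k)]
    kept: List[Sample] = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        transcribe = train_fn(train)
        for image, label in held_out:
            # A large CER on a held-out line suggests an unreliable label.
            if cer(label, transcribe(image)) <= threshold:
                kept.append((image, label))
    return kept
```

Following the abstract, a caller would pick a larger `threshold` when fewer lines are available, since even correctly labelled lines then tend to show a higher CER. The returned list is in fold order, not in the original order.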