End-to-end video text detection with online tracking
Text in videos often acts as an important semantic cue for video analysis. Video text detection is considered one of the most difficult tasks in document analysis due to two challenges: 1) difficulties caused by video scenes, i.e., motion blur, illumination changes, and occlusion; 2) the properties of text, including variations in fonts, languages, orientations, and shapes. Most existing methods try to improve video text detection through video text tracking, but treat the two tasks separately, which significantly increases the amount of computation and fails to take full advantage of the supervisory information of both tasks. In this work, we introduce an explainable descriptor that combines appearance, geometry, and PHOC features to establish a bridge between detection and tracking, and we build an end-to-end video text detection model with online tracking to address these challenges together. By integrating the two branches into one trainable framework, they promote each other and the computational cost is significantly reduced. Besides, the introduced explainable descriptor gives our end-to-end model inherent interpretability. Experiments on existing video text benchmarks, including ICDAR 2013 Video, DOST, Minetto, and YVT, verify the role of explainable descriptors in improving model expressiveness and show that the proposed method significantly outperforms state-of-the-art methods. Our method improves F-score by more than 2% on all datasets and achieves 81.52% MOTA on the Minetto dataset.
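The abstract names a descriptor that combines appearance, geometry, and PHOC (Pyramidal Histogram of Characters) features to associate detections across frames. The paper's exact construction is not given here, so the sketch below is a hypothetical illustration of the general idea: build a PHOC vector for a text transcription, concatenate L2-normalised appearance, geometry, and PHOC parts into one descriptor, and link detections across frames by cosine similarity. All function names, the pyramid levels, and the alphabet are assumptions, not the authors' specification.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(1, 2, 3), alphabet=ALPHABET):
    """Pyramidal Histogram of Characters: at each pyramid level, split the
    word into equal regions and mark which characters occur in each region.
    Levels and alphabet here are illustrative choices."""
    word = word.lower()
    vec = []
    for level in levels:
        for region in range(level):
            lo, hi = region / level, (region + 1) / level
            present = set()
            for i, ch in enumerate(word):
                # the i-th character spans [i/len, (i+1)/len) of the word
                c0, c1 = i / len(word), (i + 1) / len(word)
                overlap = min(hi, c1) - max(lo, c0)
                # assign the character to a region if at least half of
                # its span falls inside that region
                if overlap / (c1 - c0) >= 0.5 and ch in alphabet:
                    present.add(ch)
            vec.extend(1.0 if ch in present else 0.0 for ch in alphabet)
    return np.array(vec)

def descriptor(appearance, box, word):
    """Concatenate L2-normalised appearance, geometry, and PHOC parts into
    one descriptor; each part is normalised so no modality dominates."""
    parts = []
    for p in (np.asarray(appearance, float), np.asarray(box, float), phoc(word)):
        n = np.linalg.norm(p)
        parts.append(p / n if n > 0 else p)
    return np.concatenate(parts)

def match(desc_a, desc_b):
    """Cosine similarity between two descriptors; online tracking would
    link the pair with the highest similarity above a threshold."""
    return float(desc_a @ desc_b /
                 (np.linalg.norm(desc_a) * np.linalg.norm(desc_b)))
```

Because each component of the descriptor has a direct meaning (what the region looks like, where it is, and what it reads), a match score can be decomposed per part, which is one plausible reading of the "inherent interpretability" the abstract claims.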
Elsevier