PDF-to-Tree: Parsing PDF Text Blocks into a Tree

Y Zhang, Z Zhang, W Lai, C Zhang, T Gui, Q Zhang, XJ Huang
Findings of the Association for Computational Linguistics: EMNLP 2024, 2024 (aclanthology.org)
Abstract
In many PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document’s content. Existing work tries to extract a single universal reading order for a PDF file. However, applications such as Retrieval Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far exceeding the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. Unlike parsers for plain text, the proposed parser also uses multi-modal features to encode the parser state. Experiments show that the approach achieves an accuracy of 93.93%, an improvement of 6.72% over baseline methods.
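As a rough illustration of what greedy, transition-based tree building over text blocks can look like, the sketch below processes blocks in order and keeps a stack holding the current path from the root. It is not the paper's transition system: the `Node` class, the `level` feature, and the `level_policy` rule are hypothetical stand-ins, and the rule replaces the learned multi-modal scorer that the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    # Hypothetical feature: 0 = root, 1 = section heading, 2 = subsection
    # heading, larger values = body text. A real parser would not be given this.
    level: int
    children: List["Node"] = field(default_factory=list)


def greedy_parse(blocks: List[Node],
                 should_pop: Callable[[Node, Node], bool]) -> Node:
    """Build a tree with one greedy pass over the blocks.

    The stack holds the path from the root to the most recently attached
    block. For each new block we commit to the first acceptable parent
    and never revisit the decision.
    """
    root = Node("ROOT", 0)
    stack = [root]
    for block in blocks:
        # POP action: discard candidates the policy rejects (never the root).
        while len(stack) > 1 and should_pop(stack[-1], block):
            stack.pop()
        # ATTACH action: the stack top becomes the parent of the new block.
        stack[-1].children.append(block)
        stack.append(block)
    return root


def level_policy(top: Node, block: Node) -> bool:
    """Assumed stand-in for a learned classifier: a block may only attach
    under a node with a strictly smaller level."""
    return top.level >= block.level


blocks = [
    Node("1 Introduction", 1),
    Node("Paragraph A", 9),
    Node("1.1 Background", 2),
    Node("Paragraph B", 9),
    Node("2 Method", 1),
]
tree = greedy_parse(blocks, level_policy)
```

Each block is pushed once and popped at most once, so this sketch runs in time linear in the number of blocks, which is the kind of property that matters when a document contains thousands of them.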