Progress Report: PDF To HTML Conversion
Progress Report
Summary
As a result of an extensive investigation into the existing solutions to this problem, it has been
decided to modify the aims of this project to remove the emphasis on preserving the original page
layout.
The current solutions to the problem have been found to convert PDF files to HTML with fairly
high levels of success, accurately preserving the page layout in most cases. There is little sense in
simply repeating this work. Furthermore, many of the features and benefits of HTML are lost
with these methods of conversion. It has therefore been decided to aim this project at extracting
the content from a wide variety of PDF files and presenting it in a clean HTML format,
using HTML's features for styles, formatting, bullet points and any other features that are
deemed appropriate. Such files can reflow to fit the size of the browser window, are easier to
edit, and look more professional when published to the Web.
The original aim of preserving page layout has not been completely disregarded, and will be
investigated if there is time remaining at the end of the project.
The first solution, PDF to HTML Recastor, was the only commercial solution on the Internet
that had a free trial version available for download. This version was tested in detail with the
following PDF files:
Title                                         Type of layout              Location
Boston Sunday Globe, Today, October 20, 2002  Complex newspaper; columns  www.boston.com/globe/acrobat/today.pdf
White Paper: Is the Network Slow Today?       Word-processed document     www.netscout.com/files/artmb_wp.pdf
Connex South Eastern Rail Timetable #5        Tabular                     www.connex.co.uk/upload/timetable/PTT05.pdf
All of the files were downloaded on October 21, 2002. Availability, content and location of
these files may have changed since this date.
A different issue was highlighted by the conversion of the train timetable. In Figure 2, the
numbers should all appear aligned under each other. However, the converter has mistakenly detected
the circled side-by-side numbers as words in a line of text. Rather than placing them separately, it
has placed them together as a line of text, with each number separated by a space. The result is
that the numbers are not in the right place: 1717 should actually appear underneath 1713 in the
line above. This is another special case that highlights a useful feature of the converter, designed
to improve the appearance of poorly created documents in which each character or word may be
placed separately. Unfortunately, there was no way to turn off this feature, as this is a
general-purpose converter designed for a wide variety of documents.
[Figure: scale of approaches, from closer resemblance to original to more benefits from HTML format]
Each successive approach produces an HTML file that resembles the original PDF file less closely
than the previous approach. However, the resulting file benefits from more of the features of the
HTML format, such as editability and reflowability.
As approaches 1 and 2 have already been successfully implemented in the programs studied, it
has been decided to concentrate on approach 3. This will involve the following:
• reconstructing complete paragraphs from the lines of text in the PDF file
• dealing with hyphenation, bullet points, numbered lists, indentations and line spaces
• detecting headings and sub-headings and converting them to appropriate HTML styles
• for more complex layouts, attempting to detect columns and boxed sections in a page
• parsing the page elements in the correct order (i.e. for a column layout, starting at the top of
  the first column, moving to the second column, etc.)
• in multi-page documents, dealing with headers, footers, footnotes and hyphenation, and
  joining two pages together to create a seamless flow of text
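The first two tasks in the list above can be illustrated with a minimal Java sketch. It assumes the lines of text have already been extracted from the PDF file as strings, that a blank line marks a paragraph break, and that a trailing hyphen always indicates a word split across a line break. The class and method names (ParagraphBuilder, buildParagraphs) are hypothetical and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: rebuilds paragraphs from extracted lines of text. */
final class ParagraphBuilder {

    /**
     * Joins consecutive lines into paragraphs. A blank line ends the current
     * paragraph. A line ending in '-' is assumed to be a word hyphenated
     * across the line break, so the hyphen is dropped and the next line is
     * appended without a space.
     */
    static List<String> buildParagraphs(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean joinWithoutSpace = false;

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                // Blank line: close the paragraph in progress, if any.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
                joinWithoutSpace = false;
                continue;
            }
            // Separate this line from the previous one unless a word was split.
            if (current.length() > 0 && !joinWithoutSpace) {
                current.append(' ');
            }
            if (line.endsWith("-")) {
                // Assumed hyphenation: keep the text, drop the hyphen.
                current.append(line, 0, line.length() - 1);
                joinWithoutSpace = true;
            } else {
                current.append(line);
                joinWithoutSpace = false;
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

A known weakness of this simple rule is that a genuine compound broken at its hyphen (for example "multi-" followed by "platform") would be joined incorrectly, so a real implementation would probably need a dictionary check or similar test.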
All these tasks are required simply to convert the text from PDF files into a clean HTML
format suitable for on-screen viewing. A further possibility, if there is time remaining, is to
attempt to recreate the page layout of the PDF file by using HTML tables. The advantage of
tables is that they can alter their shape according to the size of the browser window. However,
because of their inherent limitations, it is unlikely that they will work for all page layouts.
Another possible extension of the project is to implement and improve approach 2 by offering
more options to control the size of the output page, and to rasterize text larger than a certain
size.
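The table idea mentioned above might be sketched as follows. This is an illustration only, assuming the text of each column has already been extracted as a string. The names TableLayout, toTable and escape are hypothetical. Percentage widths are used so that the browser can resize the cells with the window.

```java
import java.util.List;

/** Hypothetical helper: lays out column texts side by side in an HTML table. */
final class TableLayout {

    /** Escapes the characters that have special meaning in HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Builds a one-row table with one cell per column of text. Each cell is
     * given an equal percentage width (integer division, so three columns
     * receive 33% each).
     */
    static String toTable(List<String> columnTexts) {
        int n = columnTexts.size();
        StringBuilder html =
            new StringBuilder("<table width=\"100%\">\n<tr valign=\"top\">\n");
        for (String text : columnTexts) {
            html.append("<td width=\"").append(100 / n).append("%\">")
                .append(escape(text)).append("</td>\n");
        }
        return html.append("</tr>\n</table>\n").toString();
    }
}
```

Equal widths are the simplest choice; widths proportional to the column widths measured in the PDF file would probably give a closer match to the original page.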
The author is already familiar with Java, having been taught it at university in previous years.
It is inherently a multi-platform language, making it as portable as the PDF and HTML
formats it will be used to convert.
Libraries are available, such as the third-party JPedal* for PDF processing and Swing*, which is
part of the standard Java platform, for GUIs.
Although Java is not known for its speed, it performs adequately on a relatively modern PC.
* These libraries are described in the following section; web links are given in the References
section.
Design
As this project is focused on creating a single routine, it is not possible to produce any design
work at this stage. Objectives 7 and 8, as listed in the following section, Revised objectives and
timetable, are more a task of research than of software engineering, and will involve much trial
and improvement in order to discover which techniques work. Therefore, any designs created at
this stage would be subject to considerable change.
A thorough design of the main program and GUI will be created after Objectives 7 and 8 are
complete.
A basic overview of how Objective 7 will be tackled is shown in the pseudocode below.
The above pseudocode was written to parse the text of a single- or multi-column page and
identify and extract the text from each column, which will be held in the vector, together with its
horizontal position. Further processing will be required to identify headings and subheadings,
different articles, headers and footers, etc, and may also entail examining other features of the
page such as lines, boxes and images. These will be worked on at a later stage.
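As an illustration of the approach described, a minimal Java sketch might group lines of text into columns by their left edge. This is not the pseudocode referred to above. The class names TextLine, Column and ColumnDetector, the tolerance parameter and the use of a y coordinate measured down from the top of the page are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical container for one line of text and its position on the page. */
final class TextLine {
    final double x;   // left edge, in page units
    final double y;   // distance from the top of the page (assumed convention)
    final String text;

    TextLine(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

/** Hypothetical container for one detected column. */
final class Column {
    final double x;   // horizontal position: left edge of the leftmost line
    final List<TextLine> lines = new ArrayList<>();

    Column(double x) {
        this.x = x;
    }

    /** Returns the column's text in stored order, one line per row. */
    String text() {
        StringBuilder sb = new StringBuilder();
        for (TextLine line : lines) {
            if (sb.length() > 0) sb.append('\n');
            sb.append(line.text);
        }
        return sb.toString();
    }
}

final class ColumnDetector {

    /**
     * Groups lines into columns by left edge. A line joins the current column
     * if its left edge is within 'tolerance' of that column's position;
     * otherwise it starts a new column. Columns are returned left to right,
     * and the lines within each column run top to bottom.
     */
    static List<Column> detect(List<TextLine> pageLines, double tolerance) {
        List<TextLine> sorted = new ArrayList<>(pageLines);
        sorted.sort(Comparator.comparingDouble((TextLine l) -> l.x));

        List<Column> columns = new ArrayList<>();
        for (TextLine line : sorted) {
            Column last = columns.isEmpty() ? null : columns.get(columns.size() - 1);
            if (last == null || line.x - last.x > tolerance) {
                last = new Column(line.x);
                columns.add(last);
            }
            last.lines.add(line);
        }
        for (Column column : columns) {
            column.lines.sort(Comparator.comparingDouble((TextLine l) -> l.y));
        }
        return columns;
    }
}
```

The tolerance allows indented lines, such as the first line of a paragraph, to stay in the same column. Headings that span several columns are not handled by this sketch and would need the further processing mentioned above.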
[Table: Revised objectives and timetable. Schedule entries, in order: Completed (7 entries);
Vacation; Vacation; Weeks 11-12; Weeks 11-12; Weeks 12-13; Weeks 12-13; Vacation; Weeks 13-14;
Weeks 14-16; Remaining time]
References
Web page addresses of the converters investigated in this project are given in the section entitled
Investigation of existing solutions, on the first page.