
Technical Documentation

ELO Textreader

[Date: 2020-09-18 | Program version: 20.01.000.002]

ELO Textreader is a servlet that can extract text from a variety of document types and then save it
to a text file.

Contents

1 General
  1.1 Program task
  1.2 Version information
  1.3 File formats
  1.4 Requirements
  1.5 Program installation
    1.5.1 Updating the ELO Textreader configuration to the current version (ELO Textreader 10 and higher)
    1.5.2 New Textreader installation
  1.6 Logging outputs

2 Configuration via 'config.xml' (up to ELO Textreader 9)
  2.1 Structure of "config.xml" (up to Textreader 9)
  2.2 Sample entries to "config.xml"
    2.2.1 Explanations for configuring document size
    2.2.2 Explanations for configuring PDF documents
    2.2.3 Explanations for configuring MMF documents
    2.2.4 Explanations for configuring WMF documents
    2.2.5 Explanations for configuring TIF documents
    2.2.6 Explanations for configuring 7-Zip documents
    2.2.7 Explanations for configuring the minimum width and height
    2.2.8 Explanations for configuring behavior in case of an error

3 Configuration via the ELO Administration Console (from Textreader 10, described for version 10.01.000)
  3.1 Structure of 'config.xml' starting with ELO Textreader 10
  3.2 'Return directory' area
  3.3 'Output folders and type-specific configuration' area
    3.3.1 File types with identical configuration (up to and including version 12.00.004.000)
    3.3.2 Option 'Convert externally'
    3.3.3 Options for the 'PDF' file type
    3.3.4 Options for the 'TIF(F)' file type
    3.3.5 Conversion without text files
    3.3.6 Add a new directory
    3.3.7 Remove output folders
  3.4 'Textreader settings' area
    3.4.1 "Minimum width or minimum height" option
    3.4.2 'Maximum error count' option
    3.4.3 'Maximum number of files exported per minute' option
    3.4.4 'Maximum number of files imported per minute' option
    3.4.5 'Maximum error count per day' option
    3.4.6 'Maximum file size in MB' option
    3.4.7 'Timeout per page for the OCR service' option
    3.4.8 'Timeout per OCR document for the Textreader' option
    3.4.9 'Minimum free disk space in percent' option
    3.4.10 'Minimum absolute free disk space' option
    3.4.11 'Maximum number of files in folder' option (up to ELO Textreader version 11.01)
    3.4.12 'Maximum number of characters per line in MSG files' option
    3.4.13 'Maximum size of a full text file' option
  3.5 "Additional Textreader configuration" area
    3.5.1 'Extended trace' option
    3.5.2 "Default folder" option
    3.5.3 'Running times for export and import' option
    3.5.4 'General behavior in the case of an error' option
  3.6 'OCR options' area
    3.6.1 'OCR read mode' option
    3.6.2 Option 'OCR recognition - Detailed'
    3.6.3 'OCR languages' option
    3.6.4 'Number of OCR workers' option

4 Default values of the configuration options after a new installation of the current Textreader version (version 10 and higher)
    4.1.1 'Output folders and type-specific configuration' section
    4.1.2 'Textreader settings' section
    4.1.3 'Additional Textreader configuration' section
    4.1.4 'OCR options' section
    4.1.5 'Behavior on errors' section

5 Changing default values in the database (from version 10 onward)
  5.1 'Output folders and type-specific configuration' section
  5.2 'Settings' section
  5.3 'Additional Textreader configuration' section
  5.4 Overview of options in the database

ELO Digital Office GmbH 3



1 General

Information: Unless specified otherwise, this manual applies to all ELO Textreader
versions from ELO Textreader 9 and up.

1.1 Program task


ELO Textreader is a servlet that can extract text from a variety of document types and then save it
to a text file. The documents to be extracted can also be located in a ZIP or 7-Zip archive or saved
as an e-mail attachment (MSG, EML). Depending on the file format, additional functions may be
supported. For example, image files can also be extracted from files in PDF format. The ELO Fulltext
module (ELOft), previously a separate module, is integrated in Textreader as of version 9.17.040.000
and now runs with Indexserver version 9.00.060 or higher (previously only from Indexserver version
10). On the ELO Textreader status page, you can start and stop the exporter, importer, and the three
threads for PDF, OCR, and other converters.

Information: ELOft is integrated from Textreader version 9.17.040.000 on and runs
with Indexserver version 9.00.060 or higher. This means you need to disable or
uninstall the old ELOft module.

Information: Starting with ELO Textreader version 12, you can start and stop the
exporter, importer, and converter on the status page.

1.2 Version information


ELO Textreader can now run with ELO Indexserver 9 (starting with version 9.00.060) as well as ELO
Indexserver 10, 11, 12, and 20. Starting with version 11, the version number follows the former
scheme again: <major release>.<minor release>.<patch number>.<build number>, for example
12.00.000.003. For Indexserver versions earlier than 10, the version number remains as it was:
<IX version number>.<year>.<month><hotfix>.<build number>. The Indexserver version number always
precedes the Textreader version number, e.g. 10.16.090.001.

The individual ELO Textreader versions function differently:

• Starting with ELO Textreader version 10, the new, integrated ELO Textreader is configured
with the ELO Administration Console, no longer with a config.xml.

• In ELO Textreader version 9, the converter is configured as before in the config.xml, the
export and import processes of the integrated ELO Fulltext module are still configured in
the database table eloftopt (for detailed documentation, refer to the Server manual version
9, chapter "ELO Fulltext").

• Starting with ELO Textreader version 10, the Apache PDFBox converter is used to convert PDF
files by default. In the ELO Administration Console, OCR can currently no longer be selected
for converting PDF files because of an error in Abbyy Finereader version 11.

• Starting with ELO Textreader version 12, it is possible to convert encrypted documents in
ELO.

• ELO Textreader version 20.00.003.001 and higher use "Logback" for logging (older versions
use "Log4j"). From this version on, a logback.xml file is required for configuring logging
(provided by the setup, only has to be created in case of manual installation).

• Starting with ELO Textreader versions 20.00.005.002, 12.00.007.000, 11.01.012.000,
and 10.20.040.000, ICEpdf is no longer available as a PDF converter because ICEpdf itself is
no longer supported. If ICEpdf is still configured, the Apache PDFBox converter will be used.

Please note: With ELO Textreader version 20.00.003.001 and higher, if you install
the module manually, you have to provide a logback.xml file yourself.
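
The version rule above can be checked mechanically. The following is a minimal sketch in Python, not part of ELO Textreader: the function names parse_version and logging_config_file are hypothetical, and the only fact taken from this manual is that version 20.00.003.001 and higher use Logback while older versions use Log4j.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted version string such as '20.00.003.001' into integers."""
    return tuple(int(part) for part in version.split("."))


# First Textreader version that uses Logback, as stated in this manual.
LOGBACK_SINCE = parse_version("20.00.003.001")


def logging_config_file(version: str) -> str:
    """Return the logging configuration file a given version expects.

    Hypothetical helper for illustration only.
    """
    if parse_version(version) >= LOGBACK_SINCE:
        return "logback.xml"
    return "log4j.properties"


if __name__ == "__main__":
    for v in ("12.00.000.003", "20.00.003.001", "20.01.000.002"):
        print(v, "->", logging_config_file(v))
```

Comparing the parsed tuples rather than the raw strings avoids surprises with leading zeros and differing digit counts.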

Information for older Textreader versions: PDFBox used in conjunction with older
Textreader versions could cause Tomcat to crash. When installing Textreader versions
before 9.03.006.001, you should therefore use ELO OCR Service 9 with Abbyy Finereader
version 10 to convert PDF files. PDFBox remains available, e.g. in the event of poor
OCR quality. ELO Textreader should be installed on a separate Tomcat (within its own
address space) to prevent all other servlets from crashing in the event that PDFBox
crashes.

1.3 File formats


The documents to be extracted are exported from the ELO repository to directories whose names
correspond to the respective file type and are provided to the converter in this manner. After con-
version, the resulting full text files are imported back to the ELO repository.

The following formats are supported:

• AI – Adobe Illustrator

• BMP – Windows bitmap

• CSV – Comma-separated values

• DOC – Microsoft Word text editor

• DOCM – Microsoft Word text editor, (macro files) starting with Word 2007

• DOCX – Microsoft Word text editor, starting with Word 2007

• DOT – Microsoft Word text editor

• DXL – Domino XML (exported IBM Notes documents)

• EML – Microsoft Outlook Express e-mail message format

• GIF – Graphics Interchange Format

• GZ – gzip ("GNU zip")

• HTM, HTML – Hypertext Markup Language

• JPG, JPEG – Joint Photographic Experts Group

• LOG – log files from various programs

• MHT – Microsoft Internet Explorer web archive format

• MMF – ELO COLD (requires WMF converter)

• MSG – Microsoft Outlook e-mail format

• ODT – Open office application document format (OpenDocument)

• PDF – Portable Document Format

• PNG – Portable Network Graphics

• PPS – Microsoft PowerPoint Show (files must be moved to the PPT folder. Do not create a
PPS folder.)

• PPT – Microsoft PowerPoint presentation program

• PPTM – Microsoft PowerPoint presentation program (macro files), starting with
PowerPoint 2007

• PPTX – Microsoft PowerPoint presentation program, starting with PowerPoint 2007

• RTF – Microsoft Rich Text Format

• TIF, TIFF – Tagged Image File Format

• VCF – vCard "electronic business card"

• VSD – Microsoft Visio

• VSDX – Microsoft Visio 2013

• WMF – Windows Metafile Microsoft graphics format

• XLA – Microsoft Excel spreadsheet program

• XLS – Microsoft Excel spreadsheet program

• XLSM – Microsoft Excel spreadsheet program (macro files), starting with Excel 2007

• XLSX – Microsoft Excel spreadsheet program, starting with Excel 2007

• XLT – Microsoft Excel spreadsheet program (templates)

• XML – Extensible Markup Language

• ZIP – a common open format for compressed file archives

• 7-Zip – another common open format for compressed file archives

The following additional (header) information and embedded documents are extracted from the
following document formats:

• DOCX, XLSX, PPTX – all embedded files that are known to ELO Textreader

• EML – from, to, CC, subject, and all attachments

• MSG – from, to, CC, subject, and all attachments

• PDF – embedded graphics, JPG, JPEG, PNG, TIF

1.4 Requirements

Please note: This version of ELO Textreader requires Java 8.

Please note: This version of ELO Textreader requires Indexserver version 9.00.060. If
the Indexserver version used is older than 9.00.060, you also need to install ELOft.
ELO Textreader will also run with older Indexserver versions, but on startup it cleans
up temporary files that are created by ELOft, meaning that documents cannot be
added to the full text database.

Please note: This version of ELO Textreader requires ELO OCR Service version
10.00.000 or higher and ELO Indexserver 9.00.060 or higher.

Please note: Starting with version 9.02 of ELO Textreader, the TT_Matrix text type is
no longer automatically set as the OCR text type (which improved recognition of line
printer prints in landscape format). Instead, TT_Matrix should be entered to
ocr_languages in config.xml if documents of this type need to undergo OCR conversion.

If you use ELO Textreader and ELO OCR Service version 10.00.000 or higher, you no
longer need to set TT_Matrix.

Please note: ELO Textreader uses the ELO OCR Service exclusively in R mode when
converting image files, as well as PDF files up to Textreader version 9.03.006.000.
Please make sure that the ELO OCR Service is configured accordingly. See the ELO
OCR Service documentation for more information.

Note: You need to disable or uninstall ELOft if you are using Textreader version
9.17.040.000 in combination with Indexserver version 9.00.060 or higher.

1.5 Program installation


This program is contained in the ELO installation program.

Up to version 9, the program was configured in config.xml. Starting with version 10, ELO
Textreader is configured using the ELO Administration Console. However, this document also
describes configuration via config.xml since the guide is also relevant for older versions.

Please note: When manually updating earlier versions (i.e. without running the ELO Server Installer
program), make sure that the previous ELO OCR programs and/or services are uninstalled or
disabled.

For updates to version 9, please note the changes to config.xml with regard to the pdf file type and
additional entries for the OCR service, as well as image conversion (see below for more
information).

When updating to ELO Textreader version 10 and higher, please note the configuration in the ELO
Administration Console (see 1.5.1 Updating the ELO Textreader configuration to the current version
(ELO Textreader 10 and higher)).

The process for manually upgrading an ELO 9 installation (Textreader version 9.03.006 or higher)
to the newer Textreader and OCR versions is described in the separate document "Installing ELO TR
10 for ELO 9".

1.5.1 Updating the ELO Textreader configuration to the current version (ELO Textreader 10
and higher)

If you are updating an earlier version of ELO Textreader to the current ELO Textreader version, the
setup program reads the ELO Textreader config.xml and enters the options in the eloftopt database
table. config.xml is then backed up to config.xml.BAK, and its contents are deleted except for the
ELO Indexserver logon information: only the ELO Indexserver URL, the logon name, and the password
remain in config.xml. If no BAK file can be created, config.xml is not overwritten.

Please note: In some cases, default values are also entered to the eloftopt database
table on update. Make sure to check the ELO Textreader configuration (in the ELO
Administration Console) after updates.

1.5.2 New Textreader installation

After a new installation by the setup program, only the Indexserver logon information is written to
config.xml. The path names of the converter input directories, the output directory, and various
default values are written to the eloftopt table. You can find the values under 4 Default values of
the configuration options after a new installation of the current Textreader version (version 10 and
higher). You can adjust these options in the ELO Administration Console, see 3 Configuration via the
ELO Administration Console (from Textreader 10, described for version 10.01.000).

1.6 Logging outputs


Up to ELO Textreader version 11, all logging outputs are written to a single log file, which is
configured in the log4j.properties file. From ELO Textreader version 20.00.003.001, logging is
configured in a logback.xml file instead.

Starting with ELO Textreader version 12, the most important logging outputs are also written to
a "Report log". The system logs whether a document was exported, imported, converted, or
forwarded to another converter/application (in particular, OCR). If a document or part of a
document could not be converted, this is logged together with the reason.

Logging can be limited or extended, for example by configuring the log level in the
log4j.properties/logback.xml file:

• The "Error" level logs documents that could not be converted.

• The "Warn" level logs other problems, for example unrecognized file types attached to MSG
or EML files.

• The "Info" level also logs when a document has been converted.

• The "Debug" level finally logs when documents have been exported or imported, or sent to
another application, e.g. image files to OCR.
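
The level list above is cumulative: each level also includes the output of the more severe levels before it. The following Python sketch illustrates this. It assumes the usual Log4j/Logback severity ordering, and the event names and the function visible_events are hypothetical labels chosen for this illustration, not identifiers used by ELO Textreader.

```python
# Severity order, most severe first (standard for Log4j and Logback).
LEVELS = ["error", "warn", "info", "debug"]

# Level at which each kind of event is logged, following the list above.
# The event names are illustrative labels only.
EVENT_LEVEL = {
    "conversion failed": "error",
    "unrecognized attachment type": "warn",
    "document converted": "info",
    "document exported": "debug",
    "document imported": "debug",
    "sent to another application": "debug",
}


def visible_events(configured_level: str) -> list:
    """Return the events that appear in the log at the configured level."""
    threshold = LEVELS.index(configured_level.lower())
    return [
        event
        for event, level in EVENT_LEVEL.items()
        if LEVELS.index(level) <= threshold
    ]


if __name__ == "__main__":
    for level in LEVELS:
        print(level, "->", visible_events(level))
```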

Example of a log4j.properties file. The logging level for the Report log has been set to "Info" under
"log4j.logger.reportlog=info, reportlog":

log4j.rootLogger=info, FI
log4j.logger.reportlog=info, reportlog
log4j.additivity.reportlog=false

# uncomment the following line for debug output:
#log4j.logger.de.elo=debug
#log4j.logger.org.apache.pdfbox=fatal

# output in file:
log4j.appender.FI=org.apache.log4j.DailyRollingFileAppender
log4j.appender.FI.File=e:/temp/log/tr-elo10.log
log4j.appender.FI.DatePattern='.'yyyy-MM-dd'.txt'
log4j.appender.FI.layout=org.apache.log4j.PatternLayout
log4j.appender.FI.layout.ConversionPattern=%d{ABSOLUTE} %t %1x %-5p (%F:%L) - %m%n
log4j.appender.FI.append=true

log4j.appender.reportlog=org.apache.log4j.RollingFileAppender
log4j.appender.reportlog.File=e:/temp/log/reportlog.log
# DatePattern: one file each day:
log4j.appender.reportlog.DatePattern='.'yyyy-MM-dd'.txt'
log4j.appender.reportlog.layout=org.apache.log4j.PatternLayout
log4j.appender.reportlog.layout.ConversionPattern=%d{ABSOLUTE} %t %1x %-5p - %m%n
log4j.appender.reportlog.append=true

Please note: The appender names "FI" and "reportlog" as well as the logger name "re-
portlog" must not be changed.

The equivalent in Logback format, e.g. in a logback.xml file, would look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>c:/temp/logs/tr-elo20.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <!-- daily rollover -->
      <fileNamePattern>c:/temp/logs/tr-elo20.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <append>true</append>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-60(%thread %X{NDC} %-5level \(%logger{0}.java:%L\)) - %msg%n</pattern>
    </encoder>
  </appender>

  <appender name="REPORT" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>c:/temp/logs/report20.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <!-- daily rollover -->
      <fileNamePattern>c:/temp/logs/report20.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <append>true</append>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-60(%thread %X{NDC} %-5level \(%logger{0}.java:%L\)) - %msg%n</pattern>
    </encoder>
  </appender>

  <root level="info">
    <appender-ref ref="FILE" />
  </root>

  <logger name="reportlog" level="info" additivity="false">
    <appender-ref ref="REPORT"/>
  </logger>
</configuration>

Please note: The appender names "FILE" and "REPORT" as well as the logger name
"reportlog" must not be changed.

It is important to configure each appender as a RollingFileAppender with a rollingPolicy.

You can change the log level not only in the log4j.properties or logback.xml file, but also on the ELO
Textreader status page (in "Edit" mode). You can also request new, empty log files on the status
page.

Fig. 1: Changing the log level and creating new log files on the ELO Textreader status page

However, here the log level of the Report log can only be changed together with the log level of the
default logger.

When creating new log files, Logback and Log4j behave differently. In the Logback variant, clicking
"Start new log file" or "Start new report log file" saves the contents of the current log file to a
file whose name carries the current date and time (e.g. tr-elo20.log#20200508-110444.051). The
active log file is then emptied. The name of the active log file (which can be configured in
logback.xml) no longer changes.

This is different with the Log4j variant: Here, clicking "Start new log file" or "Start new report log
file" creates a new, empty log file labeled with the date and time, and logging continues in this
new file. The names of the new log files are shown on the status page.

Entries in the Report log file follow a defined format, explained here using the example file
"00000003.xls" (the file name of exported ELO documents always consists of the document ID in
hexadecimal form, followed by the file type):

INFO - HexId >00000003< DocId >3< Filename >00000003.xls< Converted >C:\ELO11\data\tr-elo110\xls\00000003.xls<

The ELO document ID of the document to be converted is output in both hexadecimal and decimal
form, followed by the file name and file path, both in angle brackets (for better log file filtering).

For elements of container documents, the document ID of the ELO document is always output as
the HexId and DocId. For example, if the file "0000005E.zip" contains an HTML file, it is
extracted with the name "0000005E_000.html". After this HTML file has been converted, the log
file contains the entry:

INFO - HexId >0000005E< DocId >94< Filename >0000005E_000.html< Converted >C:\ELO11\data\tr-elo110\htm\0000005E_000.html<
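
Because the values are enclosed in angle brackets, Report log entries are easy to filter with a script. The following Python sketch parses an entry of the form shown above and checks that the hexadecimal and decimal IDs agree. It is an illustration only: the regular expression is derived from the two example lines in this section, the names REPORT_LINE and parse_report_line are hypothetical, and the pattern assumes that other actions are written as a single word in the same position as "Converted".

```python
import re

# Pattern derived from the example entries above (an assumption, not an
# official format specification).
REPORT_LINE = re.compile(
    r"(?P<level>[A-Z]+)\s+-\s+HexId >(?P<hex_id>[0-9A-Fa-f]+)< "
    r"DocId >(?P<doc_id>\d+)< Filename >(?P<filename>[^<]*)< "
    r"(?P<action>\w+)\s+>(?P<path>[^<]*)<"
)


def parse_report_line(line: str):
    """Return the fields of a Report log entry, or None if it does not match."""
    match = REPORT_LINE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["doc_id"] = int(entry["doc_id"])
    # HexId is the document ID in hexadecimal form, so both must agree.
    entry["ids_match"] = int(entry["hex_id"], 16) == entry["doc_id"]
    return entry


if __name__ == "__main__":
    example = (
        r"INFO - HexId >0000005E< DocId >94< Filename >0000005E_000.html< "
        r"Converted >C:\ELO11\data\tr-elo110\htm\0000005E_000.html<"
    )
    print(parse_report_line(example))
```

Using search instead of a full match lets the sketch tolerate the timestamp and thread name that the configured log pattern places in front of the level.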

Information: You may miss relevant log outputs among the many Byps and PDFBox
messages. However, you can suppress these by adding the following two lines to the
log4j.properties file:

log4j.logger.org.apache.pdfbox=fatal
log4j.logger.byps=fatal

or in logback.xml:

<logger name="org.apache.pdfbox" level="off">
  <appender-ref ref="FILE" />
</logger>
and
<logger name="byps" level="off">
  <appender-ref ref="FILE" />
</logger>

(This disables logging for Byps and PDFBox. However, you can also enter "error"
instead of "off" to show at least the error messages.)

2 Configuration via 'config.xml' (up to ELO Textreader 9)

Please note: If Textreader version 9 is running, that is, connected to Indexserver ver-
sion 9, the configuration is read from config.xml as before.

2.1 Structure of "config.xml" (up to Textreader 9)

Excerpt from config.xml:

<comment>parameters for this web application</comment>
<entry key="dirs_delete">C:\ELOenterprise\data\ft-elo\delete</entry>
<entry key="dirs_eml">
C:\ELOenterprise\data\ft-elo\eml|C:\ELOenterprise\data\ft-elo\txt
</entry>
<entry key="dirs_pdf">
C:\ELOenterprise\data\ft-elo\pdf|C:\ELOenterprise\data\ft-elo\txt|ocr
</entry>
<entry key="dirs_tif">C:\ELOenterprise\data\ft-elo\tif</entry>
<entry key="file_max_size_MB">3</entry>
<entry key="file_max_error">50</entry>
<entry key="convert not possible value">notconv|notconv_sub</entry>

The Textreader parameters are saved in the <entry> tags in the config.xml file. A parameter consists
of a name that is entered in the key attribute of the <entry> tag and the associated value. The value
for the parameter can consist of multiple parts, each separated by a | (see dirs_pdf above for an
example).

dirs_pdf means that ELO Textreader can process PDF type documents (see overview above).

In this example, only one value is entered to dirs_tif. This means that ELO Textreader stores all TIF
documents there without processing them further, even those that, for example, have been
extracted from ZIP archives or PDF or MSG documents. With this function, ELO Textreader can make
documents available to other applications as well.

If two values are entered (see dirs_eml above), ELO Textreader extracts the text from the
documents in the first directory (output directory) and stores a TXT file of the same name,
containing the extracted text, in the second directory (return directory).
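
The rules above for one, two, or more values can be sketched in a few lines of Python. This is an illustration only and not how ELO Textreader itself reads the file: the function name parse_dirs_value and the dictionary keys are hypothetical, while the "|" separator and the meaning of the first two values are taken from the description above.

```python
def parse_dirs_value(value: str) -> dict:
    """Split the value of a dirs_* entry at '|' and label the parts."""
    # Values that span several lines in config.xml carry surrounding
    # whitespace, so each part is stripped and empty parts are dropped.
    parts = [part.strip() for part in value.split("|")]
    parts = [part for part in parts if part]
    if not parts:
        raise ValueError("a dirs_* entry needs at least one directory")
    return {
        "output_dir": parts[0],                              # first value
        "return_dir": parts[1] if len(parts) > 1 else None,  # second value
        "extra_values": parts[2:],                           # e.g. 'ocr'
        "store_only": len(parts) == 1,                       # no conversion
    }


if __name__ == "__main__":
    print(parse_dirs_value(r"C:\ELOenterprise\data\ft-elo\tif"))
    print(parse_dirs_value(
        r"C:\ELOenterprise\data\ft-elo\pdf|C:\ELOenterprise\data\ft-elo\txt|ocr"
    ))
```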

Please note (up to and including version 12.00.004.000): Some of the converters in
Textreader are programmed to process multiple data types, but each converter can only
be configured once (specifically, it is only possible to configure one output directory).
If the second data type has a different configuration, the first configuration, and
therefore its output directory, is overwritten and is no longer scanned. The
following data types must have the exact same configuration in the config.xml:

• doc and dot

• docx and docm

• html and htm

• jpg and jpeg

• pdf and ai

• ppt and pps

• pptx and pptm

• tif and tiff

• log and csv

• xls, xla, and xlt

• xlsx and xlsm

• 7zip and 7z
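
A configuration can be checked against this list before it is deployed. The following Python sketch is a hypothetical helper, not part of ELO Textreader: it assumes the entries have already been read into a dictionary that maps keys such as dirs_doc to their raw values, and it reports every group from the list above whose configured members differ.

```python
# Groups of data types that must share one configuration (from the list above).
SHARED_GROUPS = [
    ("doc", "dot"), ("docx", "docm"), ("html", "htm"), ("jpg", "jpeg"),
    ("pdf", "ai"), ("ppt", "pps"), ("pptx", "pptm"), ("tif", "tiff"),
    ("log", "csv"), ("xls", "xla", "xlt"), ("xlsx", "xlsm"), ("7zip", "7z"),
]


def find_mismatches(entries: dict) -> list:
    """Return the groups whose configured dirs_* values are not identical.

    `entries` maps keys such as 'dirs_doc' to their raw values. Types
    without an entry are skipped.
    """
    mismatches = []
    for group in SHARED_GROUPS:
        values = {
            file_type: entries["dirs_" + file_type].strip()
            for file_type in group
            if "dirs_" + file_type in entries
        }
        if len(set(values.values())) > 1:
            mismatches.append(tuple(sorted(values)))
    return mismatches


if __name__ == "__main__":
    sample = {
        "dirs_doc": r"C:\data\doc|C:\data\txt",
        "dirs_dot": r"C:\data\dot|C:\data\txt",  # differs from dirs_doc
        "dirs_tif": r"C:\data\tif",
        "dirs_tiff": r"C:\data\tif",
    }
    print(find_mismatches(sample))
```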

If more than two values are entered (see dirs_pdf above), refer to the table below for their meaning.

Entry to config.xml Description

dirs_delete Everything is deleted from this directory. This para-


meter may contain a maximum of one target direc-
tory.

ELO Digital Office GmbH 17


Technical Documentation
ELO Textreader

file_max_size_MB Parameter for the maximum document size. PDF-


Box checks the document size when exporting
from the repository to the ELO Textreader folder
and when extracting image files from PDFs. Files
exceeding the maximum size are deleted immedia-
tely following export/extraction.

Warning: A value must be entered for


this parameter; otherwise no files will
be processed.
file_max_error Maximum number of corrupt files that could not be
processed or deleted. If this number is exceeded,
Textreader stops. This keeps the directories from
overflowing in the event of an error.
file_max_error_per_day Maximum number of corrupt files per day that
could not be processed or deleted. If this value is
set, the value in file_max_error is ignored. The
counter for converted documents is reset to 0 on
the following day, and Textreader continues to run.
convert not possible value Text that is written to the TXT file in the case of an
error.
dirs_ai, dirs_bmp, dirs_csv, These parameters can contain an input directory
dirs_doc, dirs_docx, dirs_dot,
and a target directory.
dirs_dxl, dirs_eml, dirs_gif,
dirs_gz, dirs_htm, dirs_html,
dirs_jpg, dirs_jpeg, dirs_log,
dirs_mht, dirs_mmf, dirs_msg,
dirs_odt, dirs_png, dirs_ppt,
dirs_pptx, dirs_rtf, dirs_vcf,
dirs_vsd, dirs_vsdx, dirs_wmf,
dirs_xml, dirs_xla, dirs_xls,
dirs_xlsx, dirs_xlt, dirs_zip,
dirs_7zip

ELO Digital Office GmbH 18


Technical Documentation
ELO Textreader

dirs_pdf
Parameter for PDF documents. This parameter can contain up to 9 values: first the input and
target directory, then an optional ocr. With ocr, the documents are processed by an OCR engine,
which simultaneously extracts the text from the embedded graphics. An optional on converts the
documents with the Apache PDFBox converter if the OCR engine generates errors, or with the OCR
engine if the PDFBox converter generates errors. ocr is the default. If ocr is omitted,
conversion is performed by the PDFBox converter (not recommended for Textreader versions before
9.03.006.001, see above). The parameter false means that no images will be extracted. After the
failover parameter, you can use true (default) to specify that conversion is performed by OCR in
the event of an error.
Some additional parameters exist specifically for the PDFBox converter. After the password
parameter, a password can be entered if documents are encrypted. After encoding, the character
set that was used to create the document can be specified (UNICODE by default).

dirs_tif, dirs_tiff
Parameter for TIF(F) documents. This parameter can contain up to four values: the input and
output directories, a conversion program, and a timeout. If no conversion program is specified,
the ELO OCR Service performs conversion.

file_image_min_size_pixel
Minimum width and height (in pixels) for images to be processed. The default value is 64 pixels.
This check is only performed when extracting graphics from PDF, EML, and MSG files.


file_err_move_dir
Corrupt files are moved to the specified directory instead of being deleted, as long as these
are NOT documents that the full text service has transferred to Textreader (e.g. test files).
Documents that the full text service transfers to Textreader are handled differently if an error
occurs: References are created for these documents in the Administration¶ELO
Textreader¶Documents not converted folder (the ELO Textreader folder has the fixed GUID
"(E10E1000-E100-E100-E100-E10E10E10E24)", while the Documents not converted folder has the GUID
"(E10E1000-E100-E100-E100-E10E10E10E25)"). These folders are created by Textreader should they
not yet exist.

dirs_*
Default folder. If no target directory has been specified, documents from container formats
(PDF, ZIP, MSG, etc.) are stored in the specified directory.

ocr_languages
The ocr_languages parameter defines the supported languages for OCR processing (recommended: the
language used during server setup, as well as at least English). Only enter the languages that
are actually used, as the OCR Service checks the documents for all specified languages. You can
find a list of the available languages later on in this document. You can also find a list of
languages after installation in the Tomcat webapps Textreader directory within
ocr_languages.html.
You can also enter OCR text types here:

• TT_Normal (general typographic text)

• TT_Typewriter (typewriter)

• TT_Matrix (matrix printer)


• TT_OCR_A (special OCR font, monospace)

• TT_OCR_B (special OCR font)

• TT_MICR_E13B (magnetic ink character recognition; note: select the language "E13B" here as
well)

An entry for the OCR text type is not normally required, but line printer prints in landscape
format are an exception to this: For Textreader and ELO OCR Service versions under 10, please
enter TT_Matrix in addition to the language settings.
ixurl
The Indexserver URL required by the OCR service.

username
Name of the ELO user that the OCR service uses to log on to the Indexserver.

passwd
Password of the ELO user (encrypted or unencrypted). Textreader does not perform any subsequent
encryption.

ocr_pdfPages
Maximum number of pages of a document that will be processed when the OCR service is called. If
the number of pages is greater than this value and "on" is set as the parameter for "dirs_pdf"
(see above), Textreader attempts to process the document using the PDFBox library.
Recommendation for ELO Textreader versions up to 9.03.006.001: Use as high a value as possible
so that Textreader always uses the OCR service. The default value is 1000000. If the value is
lower, i.e. it is probable that PDFBox will also be used, ELO Textreader up to version
9.03.006.001 should always be deployed to a separate Tomcat (as described above).


Starting with ELO Textreader version 9.03.006.001, the issues with PDFBox have been eliminated,
so these precautions are no longer necessary. PDF files should always be converted by PDFBox.
Starting with ELO Textreader version 10, this option is no longer available, as PDF files are
always converted by PDFBox.

ocr_timeoutseconds
Timeout in seconds within the OCR service (as opposed to the ocr_threadtimeoutseconds parameter,
which is the maximum time that ELO Textreader waits for the OCR result; see below). From OCR
version 9.0.1.0, this value applies to one page that is to be scanned, whereas in earlier
versions it applied to the entire document. The default value is 30 seconds. After this time has
elapsed, OCR stops processing the current page and, as a result, the entire document.

ocr_threadtimeoutseconds
Timeout per document in seconds in Textreader. This is the maximum number of seconds that
Textreader will wait for the result of the OCR. After this time has elapsed, Textreader
continues with conversion of the next document (as opposed to the ocr_timeoutseconds parameter,
which is the timeout per page within the OCR service; see above). The default value is 900
seconds (15 minutes). This value can be configured from version 9.2.04, as it may happen in rare
cases that a conversion time of 15 minutes is not sufficient for some very large or complex
documents.


Starting with version 12.00.001.000 (version 11.01.001.000 for Textreader 11 or version
10.19.010.000 for Textreader 10), the timeout per document is calculated based on the number of
pages multiplied by the timeout per page ("ocr_timeoutseconds"). The "ocr_threadtimeoutseconds"
value is only used as default when the number of pages cannot be determined.

ocr_files
Number of OCR documents to be processed in parallel. Do not enter a value greater than the
number of OCR worker processes configured in the registry under WorkerCount (usually at
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\ELO Digital\OCR\Service). The default value is 2.

ocr_fastMode
This can be used to influence OCR precision. ocr_fastMode=true speeds up OCR but is less
precise, while ocr_fastMode=false does not use acceleration but has improved precision. "false"
is the default setting.

ocr_singleColumnMode
ocr_singleColumnMode can be used to determine whether OCR reads text in a table column by column
(ocr_singleColumnMode = false, default) or line by line (ocr_singleColumnMode = true).
You may achieve a better result if the OCR is run in column by column mode instead of in
single-column mode. However, the disadvantage of this method is that proximity searching is no
longer possible, as the words in a line of a table are not extracted in sequence but column by
column.


minimumfreediskspacepercentage
Here you specify (in percent of the overall size of the particular partition) how much disk
space should remain free at a minimum when the PDFBox converter extracts files (value "0" means
no check; no value means that default values depending on the operating system are used).

minimumfreediskspace
Here you specify in MB how much disk space should remain free at a minimum when the PDFBox
converter extracts files (value "0" means no check; no value means that default values depending
on the operating system are used).

maxNumberOfFilesInFolder
Here you specify the maximum number of files written to a directory when the PDFBox converter
extracts files (value "0" means no check; no value means that default values depending on the
operating system are used).

trace_flag
If trace_flag is set to false, log outputs from converters are suppressed if they only report
that a certain directory has been scanned. Default: false.

max_conv_time_pdfbox
The maximum time for processing a PDF file by Apache PDFBox (excluding conversion time for any
extracted image files). Default: 10 minutes.


2.2 Sample entries to "config.xml"


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>parameters for this web application</comment>
<entry key="file_max_size_MB">20</entry>
<entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt</entry>
<entry key="dirs_mmf">C:\ELOent\data\ft-elo\mmf|C:\ELOent\data\ft-elo\wmf</entry>
<entry key="dirs_wmf">C:\ELOent\data\ft-elo\wmf|C:\ELOent\data\ft-elo\txt</entry>
<entry key="dirs_tif">C:\ELOent\data\ft-elo\tif</entry>
<entry key="convert not possible value">Document not processed</entry>
<entry key="file_err_move_dir">C:\ELOent\data\ft-elo\notconvfiles</entry>
<entry key="file_image_min_size_pixel">70</entry>
</properties>

2.2.1 Explanations for configuring document size

Only documents up to 20 MB in size can be processed.

2.2.2 Explanations for configuring PDF documents

The text is extracted from the PDF document and then saved to a text file in the directory
C:\ELOent\data\ft-elo\txt. If the PDF document contains a TIF document, this will be saved under
C:\ELOent\data\ft-elo\tif. All other image files (JPG, PNG) are ignored, as Textreader does not
recognize them. To process other image formats, corresponding entries should be made to
config.xml (such as <entry key="dirs_jpg">C:\ELOent\data\ft-elo\jpg</entry>).

The OCR PDF converter also requires the ai file type to be configured with the pdf input folder
(not, for example, ai). The ocr parameter must be added here as well.

If you know in advance that you have documents that cannot be processed by the OCR service
(such as documents with file protection settings) or that cannot be processed by PDFBox because
the format differs from what PDFBox expects (PDFBox would show an error message), it may be
useful to set the on parameter in the PDF document configuration. If errors occur with one of
the two converters (OCR or PDFBox), on causes the respective other converter to attempt
conversion again.

2.2.2.1 Combination of possible settings for PDF documents

The following combinations of possible settings exist for the PDF document conversion process:


a) <entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt|ocr</entry>

The default case. Processing occurs via OCR. PDFBox is not used, even if errors occur.

b) <entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt|ocr|off</entry>

Processing occurs via OCR. PDFBox is not used, even if errors occur.

c) <entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt|ocr|on</entry>

A check is run to see if the document is very large (number of pages is greater than the value
of ocr_pdfPages). If so, conversion is performed by the PDFBox converter. If not, the
document is processed with OCR. The default value for ocr_pdfPages is 1000000, which
means that the OCR service is normally always used. If you need to deviate from the default
value and switch between the OCR Service and the PDFBox converter depending on the
document size, Textreader should be installed to a separate Tomcat for Textreader versions
before 9.03.006.001 due to a risk of the PDFBox converter crashing. In addition, on causes
conversion to start again when OCR errors occur, using the PDFBox converter as a replacement
(without checking the number of pages).

d) <entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt</entry>

Without adding "|ocr", the PDFBox converter is used. No conversion by OCR is attempted
in the event of an error.

e) <entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt|on</entry>

Without adding "|ocr", the PDFBox converter is used. Conversion by OCR is attempted in
the event of an error.

f) <entry key="dirs_pdf">C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt|off</entry>

Without adding "|ocr", the PDFBox converter is used. No conversion by OCR is attempted
in the event of an error.
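
The six combinations can be summarized in a short Python sketch. It is only an illustration of the rules described in cases a) to f): the function names and the returned dictionary are hypothetical and not part of Textreader, and only the first four pipe-separated values (input directory, target directory, ocr, on/off) are interpreted.

```python
def parse_dirs_pdf(value):
    """Split a dirs_pdf value and interpret the optional ocr/on/off tokens."""
    parts = [part.strip() for part in value.split("|")]
    flags = [part.lower() for part in parts[2:]]
    # "ocr" selects the OCR service; without it the PDFBox converter is used.
    primary = "ocr" if "ocr" in flags else "pdfbox"
    # Failover is only active when "on" is given explicitly (cases c and e).
    fallback = None
    if "on" in flags:
        fallback = "pdfbox" if primary == "ocr" else "ocr"
    return {
        "input_dir": parts[0],
        "target_dir": parts[1] if len(parts) > 1 else None,
        "primary": primary,
        "fallback": fallback,
    }


def first_converter(config, page_count, ocr_pdf_pages=1000000):
    """Case c): with ocr|on, documents above ocr_pdfPages go straight to PDFBox."""
    if config["primary"] == "ocr" and config["fallback"] == "pdfbox":
        if page_count > ocr_pdf_pages:
            return "pdfbox"
    return config["primary"]


base = r"C:\ELOent\data\ft-elo\pdf|C:\ELOent\data\ft-elo\txt"
for suffix in ("|ocr", "|ocr|off", "|ocr|on", "", "|on", "|off"):
    cfg = parse_dirs_pdf(base + suffix)
    print(suffix or "(none)", cfg["primary"], cfg["fallback"])
```

The loop prints the primary converter and the fallback converter for each of the six cases, which makes it easy to compare a planned entry against the descriptions above.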


2.2.3 Explanations for configuring MMF documents

An MMF document (COLD) is converted to WMF format and saved to the input directory of the WMF
converter. To process MMF documents, first you have to configure a WMF converter.

2.2.4 Explanations for configuring WMF documents

Text is extracted from WMF documents and saved to a text file in C:\ELOent\data\ft-elo\txt.

2.2.5 Explanations for configuring TIF documents

TIFF image files, such as those in a PDF document, are saved under C:\ELOent\data\ft-elo\tif,
but Textreader does not process these files/documents further. This applies to all formats when
only one directory is provided, meaning the target directory is missing.

2.2.6 Explanations for configuring 7-Zip documents

The following compression methods are supported:

• AES256

• SHA256

• BZIP2

• BCJ_IA64_FILTER

• BCJ_PPC_FILTER

• BCJ_X86_FILTER

• BCJ_ARM_FILTER

• BCJ_ARM_THUMB_FILTER

• BCJ_SPARC_FILTER

• DEFLATE

• DEFLATE64

• DELTA_FILTER

• LZMA

• LZMA2

7-Zip documents with other compression methods are treated as faulty.


2.2.7 Explanations for configuring the minimum width and height

The minimum width and height (in pixels) for images to be processed is 70. This is important when
extracting icons from Office or PDF files, for example; the icons should not generally be added to
the full text database.

Please note: The minimum size check is currently only performed when extracting graphics
from PDF, MSG, or EML documents.

2.2.8 Explanations for configuring behavior in case of an error

If an error occurs while extracting the text, a text file is generated with "Document not processed" as
its contents.

Textreader creates references to corrupt ELO files in the Administration¶ELO
Textreader¶Documents not converted folder (see above). Non-ELO documents will be moved to
the error directory if such a directory is named in config.xml ("file_err_move_dir").


3 Configuration via the ELO Administration Console (from Textreader 10, described for version 10.01.000)

After calling up the ELO Administration Console, you will find the Textreader configuration under
Server modules > Full text service (Textreader). Experience has shown that it is not necessary to
offer all parameters in the Administration Console that could be configured in config.xml. In
fact, offering all of them caused confusion among administrators. For this reason, compared with
previous versions, the ELO Textreader configuration options in the ELO Administration Console
have been reduced to the system parameters that make most sense (for other values, the
defaults are applied). However, it is still possible to configure parameters that are not listed
below in the database, with values that differ from the ELO Textreader defaults. You will
therefore also find the "optid" from the "eloftopt" database table in the list of default values
further below if you would like to change default values.

3.1 Structure of 'config.xml' starting with ELO Textreader 10


Starting with ELO Textreader 10, only the URL to the ELO Indexserver, the logon name, and the
password remain in config.xml, as illustrated in the following example.

Example config.xml file starting with ELO Textreader 10:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>parameters for this web application</comment>
<entry key="ixurl">http://srvtdev-elo11-1:9090/ix-elo110/ix</entry>
<entry key="passwd">xxx</entry>
<entry key="username">Administrator</entry>
</properties>

The Textreader parameters are saved in the <entry> tags in the config.xml file. A parameter
consists of a name that is entered in the key attribute of the <entry> tag and the associated
value.
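
To check which parameters a config.xml actually contains, the key/value structure can be read with a few lines of Python. This is a minimal sketch that uses only the standard library; the function name read_properties is hypothetical, and the sample data is the example file shown above.

```python
import xml.etree.ElementTree as ET

# Sample content taken from the example config.xml above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>parameters for this web application</comment>
<entry key="ixurl">http://srvtdev-elo11-1:9090/ix-elo110/ix</entry>
<entry key="passwd">xxx</entry>
<entry key="username">Administrator</entry>
</properties>
"""


def read_properties(xml_bytes):
    """Return a dict that maps each <entry> key attribute to its text value."""
    root = ET.fromstring(xml_bytes)
    return {entry.get("key"): (entry.text or "") for entry in root.findall("entry")}


for key, value in read_properties(SAMPLE).items():
    # Avoid echoing the password when listing the configuration.
    print(key, "=", "***" if key == "passwd" else value)
```

To inspect a real file, its bytes can be passed in the same way, for example the result of reading the file in binary mode.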

3.2 'Return directory' area

Enter the return directory for the Textreader converter here (in the previous config.xml, this
was the target directory that could be entered after the first "|" character for the individual
file types). The text files from this folder are uploaded as full text files to the ELO
repository.


Fig. 2: Full text service, return directory


3.3 'Output folders and type-specific configuration' area

In this case, output directories are the directories to which the documents were exported from the
ELO repository for conversion. A directory must be added for each file type to be added to the full
text database.

Fig. 3: Full text service, output folders

It is possible to specify several file types in one line if these are to be configured identically for a
directory path (see below).

If the user is able to define options for a file type, the editing icon (a gearwheel) appears in
the respective line. After clicking the gearwheel icon, the file type is opened and the user can
select or enter additional options. Clicking the gearwheel icon again closes the options
(however, they have not been saved; you need to click "Save" to do so).


3.3.1 File types with identical configuration (up to and including version 12.00.004.000)

Please note: Some of the individual converters in Textreader are programmed to process multiple
data types, but can only be configured once (specifically, it is only possible to configure one
output directory). If the second data type has a different configuration, the first
configuration, and therefore its output directory, would be overwritten and would no longer be
scanned. The following data types must have exactly the same configuration (except for the
option Convert externally).

• doc and dot

• docx and docm

• html and htm

• jpg and jpeg

• pdf and ai

• ppt and pps

• pptx and pptm

• tif and tiff

• xls, xla, and xlt

• xlsx and xlsm

• 7zip and 7z

It is therefore useful to put data types that need to be configured identically in one line, as
in the list shown above. You only need to configure these data types separately, i.e. in
separate lines, if you want some of them to be converted externally and others to be converted
by Textreader.

Please note: Up to and including version 12.00.004.000, the file types listed above should be
configured in one line, and their configuration must be identical in any case.

3.3.2 Option 'Convert externally'

The Convert externally option is available for all file types.

Fig. 4: Full text service, 'Convert externally' option for the file type "docx"


Explanation: Not all file types must be converted by the Textreader converters. ELO Business
Partners can also use their own converters that read files from the input directories described
here and write the results to the output directory. If you want to process a file type with a
converter other than the ELO converter, select the Convert externally check box. However, this
only applies to the converters: the Textreader export service still exports documents from the
repository into these directories. The name of the file in the output directory must match the
name of the file in the input directory, except that the file type must now be TXT (instead of
the original file type). Example: Following export from the repository, the names of the files
in the input directory always consist of the hexadecimal value of the ELO document ID plus the
respective file type, e.g. "23b.docx". In this case, the converted file in the output directory
would be "23b.txt".

For multipage documents, exactly one TXT file is expected in the output directory.

Files in a container format (e.g. file type ZIP, but also the newer Office formats DOCX, XLSX,
etc.) are split into individual files (images and attachments). The converted individual files
must be labeled with a three-digit suffix in the form of a counter, e.g. "23b_001.txt",
"23b_002.txt". Based on the first part of the file name (the document ID in hexadecimal format),
the Indexserver knows which ELO document the text file should be assigned to as full text.
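
For partners writing an external converter, the naming rules can be sketched in Python as follows. The helper names are hypothetical and only mirror the conventions described in this section (hexadecimal document ID, TXT extension, three-digit counter for parts of container files); they are not part of any ELO API.

```python
def exported_name(doc_id, extension):
    """Name of a file in the input directory: hex document ID plus file type."""
    return f"{doc_id:x}.{extension}"


def converted_name(exported, part=None):
    """Name of the resulting text file; 'part' is the counter for container files."""
    stem = exported.rsplit(".", 1)[0]
    if part is None:
        return f"{stem}.txt"
    return f"{stem}_{part:03d}.txt"


def doc_id_from_text_file(name):
    """Recover the decimal document ID from the first part of a text file name."""
    stem = name.rsplit(".", 1)[0]
    return int(stem.split("_", 1)[0], 16)


name = exported_name(571, "docx")
print(name)                                   # 23b.docx
print(converted_name(name))                   # 23b.txt
print(converted_name(name, part=1))           # 23b_001.txt
print(doc_id_from_text_file("23b_002.txt"))   # 571
```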

3.3.3 Options for the 'PDF' file type

Fig. 5: Full text directory, options for the 'PDF' file type

3.3.3.1 'Extract images with the PDF converter' option

Check the box here if you want the PDF converter to extract images.

3.3.3.2 'Create password' option

Enter a password for a password-protected PDF document (used by the Apache PDFBox converter
when opening the PDF file).


3.3.3.3 'PDF failover mode' option

Check the box for failover mode here. PDF documents are only converted by PDFBox by default.
Starting with ELO Textreader version 20 (or starting with version 12.00.005.000 for TR12, version
11.01.006.000 for TR11, version 10.19.100.000 for TR10) the documents are forwarded to OCR
after conversion errors. Default: enabled.

3.3.4 Options for the 'TIF(F)' file type

Fig. 6: Full text directory, options for the 'TIF(F)' file type

The options in this section are useful when no ELO OCR Service is installed.

3.3.4.1 'Only export from container files' option

If ELO OCR is not installed, you can extract the TIF(F) files from container files (such as ZIP
files) to the TIFF directory and process them with a separate conversion program. Select the
relevant check box. If the check box is enabled, Textreader does not go through this directory,
but copies TIFF files to this directory when converting e.g. ZIP files. Difference to Convert
externally: If the option Convert externally is selected, ELO Textreader does not move files to
the TIFF directory during export.

3.3.4.2 'External TIFF converter' option

Enter an external conversion program that Textreader should use for the individual TIFF files.
Otherwise, conversion will take place via OCR. Difference to the separate conversion programs
described above: This external TIFF converter does not have to search directories for files to be
converted on its own in intervals, but is called up by Textreader and only converts one document at
a time.

3.3.4.3 'Timeout for the external TIFF converter' option

Enter a timeout value for the external TIFF converter.


3.3.5 Conversion without text files

TXT type text files are exported straight to the return directory and imported back to the repository
as full text files, without requiring any additional configuration.

It is also possible to add text files of other file types unknown to ELO Textreader (e.g. file
type SQL) to the full text database without having to provide an external converter. These text
files can be used as full text files right away without being converted. For this, enter the
return directory, normally the TXT directory, as the directory path in the configuration within
the ELO Administration Console.

Fig. 7: Full text service, configuration for text files with file type 'vbs', 'js', or 'sql'

3.3.6 Add a new directory

To add an additional file type, enter one or more file extensions and a directory path in the
Add a further directory line and click Add.

Fig. 8: Full text service, Add further directory

3.3.7 Remove output folders

If you want to remove file extensions and their output directories, click the Delete icon. If
the output directory is being used for multiple file types, you will be requested to confirm
deletion. Otherwise, the line turns pink and an Undo icon will appear instead of the Delete
icon, as shown in this example:


Fig. 9: Full text service, Delete directory

Click the Undo icon to restore the line.

3.4 'Textreader settings' area

3.4.1 "Minimum width or minimum height" option

Minimum width and height (in pixels) for images to be processed. The default value is 64 pixels.

Please note: The minimum size check is currently only performed when extracting graphics
from PDF, MSG, or EML documents.

3.4.2 'Maximum error count' option

Maximum number of corrupt files that could not be processed or deleted. If this number is
exceeded, Textreader stops. Default: 10,000.

3.4.3 'Maximum number of files exported per minute' option

The maximum number of files downloaded from the repository to the Textreader input directories
for conversion. If you have a larger number of documents to export, it may make sense to enter a
value here, e.g. 100, to prevent system overload and Tomcat/ELO server standstill. Default: 0 (no
upper limit). When this limit is reached, the export thread in Textreader pauses for one minute.


Please note: Textreader sends a search query to the Indexserver to export the documents and
receives a package of 100 documents to be exported in return. If Textreader frequently has to
pause for one minute because the maximum number of files per minute has been reached,
processing these 100 documents may take longer than the maximum lifetime of a search ticket
(default: 10 minutes). When Textreader then goes to retrieve the next 100 documents, the search
has timed out and has to be run again, so documents may be exported multiple times. You should
therefore enter a realistic value for Maximum number of files exported per minute so that 100
documents can be exported within the ticket lifetime (for a lifetime of 10 minutes, the value
should be 10 or higher).
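
The arithmetic behind this recommendation can be written down as a small Python sketch. The function names are illustrative only; the package size of 100 documents and the ticket lifetime of 10 minutes are the values stated in the note, and 0 is treated as "no upper limit" as described for this option.

```python
import math


def min_files_per_minute(package_size=100, ticket_lifetime_minutes=10):
    """Smallest per-minute limit that lets one package finish within the lifetime."""
    return math.ceil(package_size / ticket_lifetime_minutes)


def export_limit_is_safe(limit, package_size=100, ticket_lifetime_minutes=10):
    """A limit of 0 means 'no upper limit' and can never cause a ticket timeout."""
    if limit == 0:
        return True
    return limit >= min_files_per_minute(package_size, ticket_lifetime_minutes)


print(min_files_per_minute())      # 10
print(export_limit_is_safe(5))     # False
print(export_limit_is_safe(100))   # True
```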

3.4.4 'Maximum number of files imported per minute' option

The maximum number of converted files that are imported back into the repository from the return
directory. If you have a larger number of documents to import, it may make sense to enter a value
here, e.g. 100, to prevent system overload and Tomcat/ELO server standstill. Default: 0 (no upper
limit).

3.4.5 'Maximum error count per day' option

Maximum number of corrupt files per day that could not be processed or deleted. If this value is
set, the value of the option 'Maximum error count' is ignored. The counter for converted
documents is reset to 0 on the following day, and Textreader continues to run. Default: not set.
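
How the two error limits interact can be illustrated with a short Python sketch. The class ErrorBudget is hypothetical and only models the behavior described for 'Maximum error count' and 'Maximum error count per day'; it makes no claim about how Textreader implements the counters internally.

```python
import datetime


class ErrorBudget:
    """Tracks failed files against a total limit or, if set, a daily limit."""

    def __init__(self, max_errors=10000, max_errors_per_day=None):
        self.max_errors = max_errors
        self.max_errors_per_day = max_errors_per_day
        self.count = 0
        self.day = None

    def record_error(self, today):
        """Count one failed file; return True if processing may continue."""
        if self.max_errors_per_day is not None:
            # The daily limit replaces the total limit and resets on a new day.
            if today != self.day:
                self.day = today
                self.count = 0
            self.count += 1
            return self.count <= self.max_errors_per_day
        self.count += 1
        return self.count <= self.max_errors


budget = ErrorBudget(max_errors=1, max_errors_per_day=2)
monday = datetime.date(2020, 9, 14)
tuesday = datetime.date(2020, 9, 15)
print([budget.record_error(monday) for _ in range(3)])  # [True, True, False]
print(budget.record_error(tuesday))                      # True
```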

3.4.6 'Maximum file size in MB' option

Maximum size of the files to be converted in MB. Default: 40. PDFBox checks the document size
when exporting from the repository to the ELO Textreader folder and when extracting image files
from PDFs. Files exceeding the maximum size are deleted immediately following export/extraction.

3.4.7 'Timeout per page for the OCR service' option

Time in seconds until a timeout occurs within the OCR service (as opposed to the Timeout per OCR
document for the Textreader parameter, which is the maximum time that Textreader waits for the
OCR result; see below). From OCR version 9.0.1.0, this value applies to one page that is to be
scanned, whereas in earlier versions it applied to the entire document. The default value is 30
seconds. After this time has elapsed, OCR stops processing the current page and, as a result,
the entire document.


3.4.8 'Timeout per OCR document for the Textreader' option

Time in seconds per document until timeout in Textreader. This is the maximum number of
seconds that Textreader will wait for the result of the OCR. After this time has elapsed,
Textreader continues with conversion of the next document (as opposed to the Timeout per page
for the OCR service parameter, which is the timeout per page within the OCR service; see above).
The default value is 900 seconds (15 minutes). This value can be configured from version 9.2.04,
as it may happen in rare cases that a conversion time of 15 minutes is not sufficient for some
very large or complex documents. In general, 15 minutes could also be too long.

Please note: Starting with version 12.00.001.000 (starting with version 11.01.001.000 for
Textreader 11, or 10.19.010.000 for Textreader 10), the timeout per document is calculated from
the number of pages of the document multiplied by the timeout per page. The timeout per document
configured here is used only as default if the number of pages cannot be determined.
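
The calculation described in this note amounts to the following Python sketch. The function name is hypothetical, and a page count of None stands for "number of pages cannot be determined"; the defaults of 30 and 900 seconds are the ones documented for the two timeout options.

```python
def document_timeout_seconds(page_count, timeout_per_page=30, timeout_per_document=900):
    """Timeout Textreader waits for one document, per the rule in the note above."""
    if page_count is None or page_count <= 0:
        # Page count unknown: fall back to the configured per-document timeout.
        return timeout_per_document
    return page_count * timeout_per_page


print(document_timeout_seconds(40))    # 1200
print(document_timeout_seconds(None))  # 900
```

With the default values, a 40-page document is therefore given 20 minutes, while a 2-page document is given only 60 seconds.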

3.4.9 'Minimum free disk space in percent' option

Here, you can specify how much disk space (in percent of the overall disk space) must remain
free for Textreader export and conversion processes (restriction for conversion processes: in
the current version, only the PDF converter with PDFBox checks the disk space when extracting
images and attachments). If you set this to Automatic operating system specification, the
minimum Textreader disk space is determined depending on the operating system. If you set
Disable check, no check is performed. You can also enter a value of your own (custom entry). In
this case, an input option appears on the right where you can enter how much disk space must
remain free in percent.

After reaching the set threshold, the export process (and up to version 11 of ELO Textreader the
conversion process) pauses for several minutes. It then checks whether sufficient space is free
again.


Starting with ELO Textreader version 12, the converters do not pause. Instead, when the free
disk space falls below the minimum, conversion of the current PDF document is canceled, the
image or attachment causing the disk to be full is deleted, and conversion continues with the
next PDF. ELO Textreader notes this PDF document and checks at regular intervals whether the
disk now has enough space. If this is the case, it attempts to extract the images and
attachments from this PDF that have not yet been processed. This prevents ELO Textreader from
being deadlocked by extracting many very large TIFF files from a PDF file.

Fig 10: Full text service, minimum free disk space in percent

3.4.10 'Minimum absolute free disk space' option

Here, you can specify how much disk space (in MB) must remain free for Textreader export and
conversion processes (restriction for conversion processes: in the current version, only the PDF
converter with PDFBox checks the disk space). If you set this to Automatic operating system
specification, the minimum Textreader disk space is determined depending on the operating
system. If you set Disable check, no check is performed. You can also enter a value of your own
(custom entry). In this case, an input option appears on the right where you can enter how much
disk space must remain free in MB.

After reaching the set threshold, the export process (and up to version 11 of ELO Textreader the
conversion process) pauses for several minutes. It then checks whether sufficient space is free
again.

Starting with ELO Textreader version 12, the converters do not pause, as described above.
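
To estimate whether a given volume would currently pass the two disk space checks, the thresholds can be evaluated with a small Python sketch. The function names are hypothetical, and a threshold of None or 0 is treated here as "check disabled"; the operating-system-specific automatic values are not documented in this section and are therefore not modeled.

```python
import shutil


def has_enough_free_space(total_bytes, free_bytes, min_percent=None, min_mb=None):
    """Return True if free space satisfies both the percent and the MB threshold."""
    if min_percent and free_bytes * 100 < min_percent * total_bytes:
        return False
    if min_mb and free_bytes < min_mb * 1024 * 1024:
        return False
    return True


def path_has_enough_free_space(path, min_percent=None, min_mb=None):
    """Apply the same check to the volume that contains 'path'."""
    usage = shutil.disk_usage(path)
    return has_enough_free_space(usage.total, usage.free, min_percent, min_mb)


# Example: a 10 GB volume with 50 MB free fails a 100 MB minimum.
print(has_enough_free_space(10 * 1024**3, 50 * 1024**2, min_mb=100))  # False
```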

Fig 11: Full text service, minimum absolute free disk space


3.4.11 'Maximum number of files in folder' option (up to ELO Textreader version 11.01)

Here, you can have the maximum number of files for a single output folder determined
automatically, disable the check, or configure a custom number. After reaching this threshold, the
export process (and, up to version 11 of ELO Textreader, the PDF conversion process) pauses for
several minutes. It then checks whether the number of files has dropped below the threshold again.

Starting with ELO Textreader version 11.01, the maximum number of files in a folder is no longer
checked by default for new installations. This setting has been removed from the ELO
Administration Console. However, it is still possible to configure this option in the database
(see 5.2 'Settings' section).

Starting with ELO Textreader version 12, the converters do not pause when the maximum number
of files in a folder is exceeded. Instead, they proceed as when the minimum disk space is exceeded
(see above).

Fig. 12: Full text service, maximum number of files in folder

Please note: Starting with ELO Textreader version 11.01, the maximum number of
files in a folder is no longer checked by default for new installations.

Please note: Starting with ELO Textreader version 12, the converters no longer stop
when the minimum disk space or maximum number of files in a folder is exceeded, but
only when the import process is no longer running. The converters do not stop if the
import process was deactivated.

3.4.12 'Maximum number of characters per line in MSG files' option

Enter the maximum line length in MSG files. Lines that exceed this value are skipped during
conversion, because very long lines in MSG files can cause the converter to crash. You should only
enter a larger number if you have good reason to do so. Default: 100000.
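
The effect of this limit can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the actual MSG converter may split and count lines differently.

```python
def drop_long_lines(text: str, max_chars: int = 100_000) -> str:
    """Keep only the lines that do not exceed max_chars characters."""
    kept = [line for line in text.splitlines() if len(line) <= max_chars]
    return "\n".join(kept)
```

A line that is exactly max_chars characters long is kept; only longer lines are skipped.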

3.4.13 'Maximum size of a full text file' option

Here, you can specify the maximum size of the resulting text file to be imported back to the
repository as a full text file. Default: 100 MB.

3.5 "Additional Textreader configuration" area

3.5.1 'Extended trace' option

If the option Extended trace is not selected, log outputs from converters are suppressed if they only
report that a certain directory has been scanned. Default: not enabled.

3.5.2 "Default folder" option

When processing document container files (PDF, ZIP, MSG, etc.), the files are extracted from these
documents and copied to the corresponding target directory according to the type of file. If there is
no target directory for specific file types, these files are copied to the default folder. If you do not
specify a folder, these files will not be extracted. Default: not set.

Every 24 hours or when the Textreader is stopped, a list of the number of documents for which no
target directory is configured is output as a warning. In addition, starting with version
20.00.001.002, the number of documents of a specific type that are not extracted is also output.
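
The routing rule for extracted files can be summarized as follows. This is a minimal sketch under stated assumptions: the function name, the folder paths, and the idea of a dictionary mapping file extensions to target directories are hypothetical and only mirror the description above.

```python
from pathlib import Path
from typing import Dict, Optional


def target_for(filename: str,
               type_folders: Dict[str, str],
               default_folder: Optional[str] = None) -> Optional[str]:
    """Return the directory an extracted file is copied to.

    None means the file is not extracted, because there is neither a
    type-specific target directory nor a default folder.
    """
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension in type_folders:
        return type_folders[extension]
    return default_folder
```

With type_folders = {"pdf": "in/pdf", "tif": "in/tif"}, the file "scan.TIF" goes to "in/tif", while "notes.xyz" is only extracted if a default folder is configured.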

Fig. 13: Example list of unknown file extensions every 24 hours

3.5.3 'Running times for export and import' option

Enter (as previously for the FT service) at what hours the export and import service between the
ELO repository and the Textreader input/output directories should be active ("00" to "23").
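
In the database, the running times are stored under optid 1134 as a string of 24 characters (default: 111111111111111111111111, see 5.4 'Overview of options in the database'). The following sketch shows how such a value could be evaluated. It assumes that the first character stands for hour "00" and the last for hour "23", and that "1" means active; this ordering is an assumption and is not stated in this document. The function name is hypothetical.

```python
def is_active(hour_mask: str, hour: int) -> bool:
    """Return True if export/import may run in the given hour (0 to 23).

    Assumption: character n of the mask belongs to hour n, "1" = active.
    """
    if len(hour_mask) != 24 or set(hour_mask) - {"0", "1"}:
        raise ValueError("expected 24 characters consisting of 0 and 1")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return hour_mask[hour] == "1"
```

Under this assumption, a mask of 22 zeros followed by "11" would restrict export and import to the hours "22" and "23".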

Fig. 14: Full text service, running times for export and import

3.5.4 'General behavior in the case of an error' option

Fig. 15: Full text service, general behavior in the case of an error

Here, specify whether the ELO document should be referenced in a special error folder in the event
of conversion errors, or whether no action should take place (the invalid document is then just
deleted from the folder). For documents that were exported from the ELO repository to the
Textreader incoming folders, references are created in the
Administration¶ELO-Textreader¶Documents not converted ELO folder if an error occurs. The ELO
Textreader folder has the fixed GUID "(E10E1000-E100-E100-E100-E10E10E10E24)", while the
Documents not converted folder has the GUID "(E10E1000-E100-E100-E100-E10E10E10E25)". These
folders are created by Textreader should they not yet exist. The references in the error folders are
deleted automatically if these documents can be converted in later Textreader runs after all. These
ELO folders are named according to the set language, currently German, English or French, or
English as the default language. If you want to use a different default language, you can set it in
the "messages_dflt.properties" file. After installing ELO Textreader, open the Textreader .jar file
"webapps\...\WEB-INF\lib\tr.jar" in the ELO server directory and then the properties file
"de\elo\tr\server\messages_dflt.properties".

3.5.4.1 Remarks on further configuration options in the database

• If an error folder has been configured in the database (see below), invalid documents that
were NOT exported from ELO (e.g. test files) are moved to this folder in the file system. If
no error folder has been configured, these files are deleted in the event of an error (default:
no error folder configured).

• In the case of an error, notconv is written to the TXT file by default, or notconv_sub in
the case of partial files of container files (e.g. from ZIP files). Other error texts can be
configured in the database.

• As an alternative to the behavior described under 3.5.4 'General behavior in the case of an
error' option, you can also specify that the error should only be recorded in the output file,
and that no other action should be performed.

• For container files (e.g. ZIP or MSG files that contain other files), it may not make sense to
reference the entire container file as corrupt, e.g. in the repository. For this reason, a
behavior that deviates from the general behavior selected above can be configured in the
database for container files.

(The configuration options in the database are described further below).
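
The default error marker is stored as the single value notconv|notconv_sub. The following sketch shows how the two parts relate to complete documents and to partial files of container files. The function name is hypothetical, and treating "|" as the separator between the two markers is an assumption based on the default value.

```python
def error_marker(option_value: str, is_partial_file: bool) -> str:
    """Pick the marker that is written to the TXT file in the case of an error.

    Assumption: the first part applies to documents, the second part to
    partial files extracted from container files (e.g. ZIP files).
    """
    document_marker, partial_marker = option_value.split("|", 1)
    return partial_marker if is_partial_file else document_marker
```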

3.6 'OCR options' area

3.6.1 'OCR read mode' option

The option OCR read mode can be used to determine whether OCR reads text in a table column by
column (multiple column mode, OCR attempts to recognize blocks, default), or line by line (single
column mode, OCR no longer attempts to recognize blocks). You may achieve a better result if the
OCR is run column by column in multiple column mode instead of in single column mode. However,
the disadvantage of this method is that proximity searching is no longer possible, as the words in a
line of a table are not extracted in sequence but column by column.

3.6.2 Option 'OCR recognition - Detailed'

This can be used to affect the OCR precision. Selecting fast speeds up OCR, but is less precise, while
selecting detailed does not use acceleration, but has improved precision. Detailed is the default
setting.

3.6.3 'OCR languages' option

The OCR languages parameter defines the supported languages for OCR processing (recommended:
the language used during server setup, as well as at least English). Only enter the languages that
are actually used, as the OCR Service checks the documents for all specified languages.
On the left side, you will find a list of languages supported by the installed OCR. In addition, here
you can select OCR text types, which normally is not necessary (from Textreader 10 it is no longer
necessary to specify TT_Matrix in addition to the language settings if Lineprinter printouts are to be
converted in landscape format). For the sake of completeness, here is a list of the OCR text types:

• TT_Normal (general typographic text)

• TT_Typewriter (typewriter)

• TT_Matrix (matrix printer)

• TT_OCR_A (special OCR font, monospace)

• TT_OCR_B (special OCR font)

• TT_MICR_E13B (magnetic ink character recognition, note: select the language "E13B" here
as well)

Fig. 16: Full text service, configuration of the OCR languages

Information: Before version 10.00.010 of the ELO Administration Console, there is
no option to configure OCR text types. If you want to configure OCR text types anyway,
please enter these in the database table eloftopt in the line with optid 1139.

3.6.4 'Number of OCR workers' option

Background: In OCR, a number of worker processes are configured that run in parallel and actually
convert the images. This number was defined in the registry under "WorkerCount" for OCR 9, e.g.
under HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\ELO Digital\OCR\Service, and
in the OCR config.xml by the "workercount" key for OCR 10. Normally, the number of cores * 2 is
entered as the number of OCR workers. OCR then generates the corresponding number of worker
processes. In the Administration Console, you can configure the maximum number of workers
Textreader can use (the remaining workers are used by the ELO Java Client). Textreader creates this
number of threads, which send the image files to the workers for conversion in parallel.

If you enter more workers than OCR makes available, OCR may report to Textreader that there are
no free workers. In this case, Textreader will wait and try to send the document to OCR again. It will
keep trying until it finds a free worker; however, we do not recommend configuring more workers
than are available for performance reasons.

Under Number of used OCR connections, enter the maximum number of workers that can be used
by the Textreader but do not enter a value that exceeds the "WorkerCount" specified in the registry
or "workercount" in the OCR config.xml. If necessary, subtract a specific number of workers for the
ELO Java Client.

The default value is 2.
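
The rule of thumb above (number of cores * 2 for OCR, minus the workers reserved for the ELO Java Client) can be written down as a small calculation. This is only a sketch: the function and parameter names are hypothetical and not part of ELO Textreader or ELO OCR.

```python
def default_worker_count(cores: int) -> int:
    """Usual OCR setting: number of cores * 2."""
    return cores * 2


def textreader_connections(worker_count: int,
                           reserved_for_client: int = 0,
                           requested: int = 2) -> int:
    """Largest sensible value for 'Number of used OCR connections'.

    The result never exceeds the workers that OCR provides after the
    workers reserved for the ELO Java Client have been subtracted.
    """
    available = max(worker_count - reserved_for_client, 0)
    return min(requested, available)
```

On a server with 4 cores, default_worker_count(4) yields 8 OCR workers. If 2 of them are kept free for the ELO Java Client, a requested value of 10 is reduced to 6.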

Information: The other previous FT options can be omitted since the integrated ELO
Textreader now calls up the database via the ELO Indexserver or uploads the full text
files to the ELO repository and only the ELO iSearch is supported.

4 Default values of the configuration options after a new installation of the current Textreader
version (version 10 and higher)

After a new installation, as described above, the Textreader input and output folders are
configured. Otherwise, ELO Textreader uses the default values listed here, which you may have to
adapt to your specific requirements after an installation via the ELO Administration Console (the
number of OCR workers in particular must be adapted to the number of workers configured for the
ELO OCR Service):

Information: The ELO Administration Console adds default values (in the database
table "eloftopt") to missing configuration settings. This feature can also be used to fill
empty "eloftopt" tables, e.g. after a setup or update during which the table was created
but left empty.

4.1.1 'Output folders and type-specific configuration' section

• The following data types are configured:

ai, bmp, doc, docx, dot, eml, htm, html, jpg, jpeg, mht, mmf, msg, odt, pdf, png, pps, ppt, pptx,
rtf, tif, tiff, vcf, vsd, vsdx, wmf, xla, xls, xlsx, xml, zip, 7zip, 7z

• All file types are processed.

• PDF file type

o Extract images with the PDF converter option is selected

o PDF converter option: Apache PDFBox. This option has been removed as of ELO
Textreader version 20.00.005.002, as Apache PDFBox is the only available PDF
converter. You can no longer make a selection.

o Character set option: UNICODE (can only be changed in the database)

4.1.2 'Textreader settings' section

• Minimum width or minimum height option: 64 (pixels)

• Maximum error count option: 10,000

• Maximum number of files exported per minute option: 0 (unlimited)

• Maximum number of files imported per minute option: 0 (unlimited)

• Maximum error count per day option: not set

• Maximum file size in MB option: 40

• Timeout per page for the OCR service option: 30

• Timeout per OCR document for the Textreader option: 900 seconds (15 minutes)

• Minimum free disk space in percent, Minimum absolute free disk space options: The values
are determined based on the system.

• Maximum number of files in folder: Deactivated starting with Textreader 11.01

• Maximum number of characters per line in MSG files option: 100000

4.1.3 'Additional Textreader configuration' section

• Character set option: UNICODE (can only be changed in the database)

• Extended trace option: not selected

• Use copy and delete to move files option: not set

• Default folder option: not set

• Running times for export and import option: Hourly ("00"-"23")

• General behavior in the case of an error option: Create reference in the repository

4.1.4 'OCR options' section

• OCR read mode option: Multiple column mode

• OCR Mode option: Detailed

• OCR languages option: German, English, French

• Number of OCR workers option: 2

4.1.5 'Behavior on errors' section

• Error marker in the output file option: notconv|notconv_sub

• General behavior in the case of an error option: Create reference in the repository

• Deviations for container files option: Create reference in the repository

5 Changing default values in the database (from version 10 onward)

As described above, some default values are set starting with version 10 that you should not have
to change. For this reason, it is no longer possible to configure these values in the ELO
Administration Console. However, we will still explain how to change these values in the eloftopt
database table.

Please note: Changes to the database are always critical and could result in system
failure. Only change values in the database after careful consideration and if you
understand the following.
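
To illustrate what changing an option in the eloftopt table amounts to, the following sketch updates one optval for a given optid. It is deliberately run against an in-memory SQLite stand-in: the column names optid and optval are taken from this document, but the real table may contain further columns, lives in the ELO repository database, and is accessed with that database's own tools. The function name set_option is hypothetical.

```python
import sqlite3


def set_option(conn: sqlite3.Connection, optid: int, optval: str) -> None:
    """Change exactly one existing option row; refuse to touch anything else."""
    cursor = conn.execute(
        "UPDATE eloftopt SET optval = ? WHERE optid = ?", (optval, optid)
    )
    if cursor.rowcount != 1:
        conn.rollback()
        raise LookupError(f"optid {optid} not found, nothing changed")
    conn.commit()


# Stand-in table for demonstration only (not the real ELO schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eloftopt (optid INTEGER PRIMARY KEY, optval TEXT)")
conn.execute("INSERT INTO eloftopt VALUES (1146, 'false')")
conn.commit()

# Example: enable 'Use copy and delete to move files' (optid 1146).
set_option(conn, 1146, "true")
```

Checking that exactly one row was changed before committing is a simple safeguard against a mistyped optid.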

5.1 'Output folders and type-specific configuration' section

• PDF file type:

o Character set option: The default is UNICODE. Can be changed in the eloftopt database
table under optid 131415, for example. 1314 is the (variable) optid for the file
type pdf; the label for the option character set is 15. (The optid values are assigned
sequentially for the data types and are therefore variable; the optid values of the
options for individual data types are calculated from the optid of the data type with
an identifier associated with the respective option, here "15").

o Max. PDF conversion time option: Entry in minutes; default: 10 minutes. Can be
changed in the eloftopt database table under optid 131417, if 1314 is the (modifiable)
optid for the file type pdf. The ID for the option Max. PDF conversion time is 17.

o Maximum number of graphics extracted per document option: By default, a maximum
of 20,000 graphics are extracted per document. Conversion of the PDF
document continues after this, but without extracting images. Can be changed in
the eloftopt database table under optid 131419, if 1314 is the (modifiable) optid for
the file type pdf. The ID for the option Max. number of graphics extracted per document
is 19.

o Conversion by OCR in case of error (failover) option: PDF documents are only
converted by PDFBox by default. Starting with ELO Textreader version 20 (or starting
with version 12.00.005.000 for TR12, version 11.01.006.000 for TR11, version
10.19.100.000 for TR10), the documents are forwarded to OCR after conversion
errors. This feature can be disabled by entering "false" in the eloftopt database
table under optid 131418, if 1314 is the (modifiable) optid for the file type pdf. The
default value is "false", meaning the feature is enabled. Starting with version
20.01.000.002, you can also enable or disable failover mode in the ELO
Administration Console.

Please note: After conversion errors by PDFBox, PDF files are
forwarded to OCR by default. Starting with version 20.01.000.002, you
can also enable or disable failover mode in the ELO Administration
Console.

o Smooth images in PDF files option: Fonts in images extracted from PDF files and
forwarded to OCR may be frayed, so that OCR would not be able to do its job
properly. This issue can be solved by reducing the image size, referred to as
"smoothing". A factor of 0.4 has proven to be a good value for smoothing frayed
images. This reduction factor can be configured by entering a value under optid 131420
in the eloftopt database table, if 1314 is the (modifiable) optid for file type pdf. The
ID for Smooth images in PDF files is 20. The default factor for smoothing is "0.0";
this disables the feature, meaning no smoothing takes place. Smoothing is also not
applied when the complete PDF is forwarded to OCR, when using ICEpdf, or in
Textreader version 9.
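
The optid values for type-specific options used in this section follow one pattern: the optid of the file type is followed by the identifier of the option. A minimal sketch of this calculation, assuming 1314 is the optid for the file type pdf as in the examples above (the function and dictionary names are hypothetical):

```python
# Option identifiers for the file type pdf as named in this section.
PDF_OPTION_IDS = {
    "character_set": 15,
    "max_conversion_time": 17,
    "ocr_failover": 18,
    "max_graphics_per_document": 19,
    "smooth_images": 20,
}


def type_option_optid(type_optid: int, option_id: int) -> int:
    """Append the option identifier to the optid of the file type."""
    return int(f"{type_optid}{option_id}")
```

For example, type_option_optid(1314, 15) gives 131415, the optid of the character set option. Since the optid of a file type is variable, look it up in the eloftopt table first.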

5.2 'Settings' section

• Maximum number of files in the folder option: Here, you can have the maximum number
of files for a single output folder determined automatically, disable the check, or configure
a custom number. After reaching this threshold, the Textreader export/converter process
pauses for several minutes and then checks whether the number of files has dropped below
the threshold again (restriction for conversion processes: in the current version, only the
PDF converter with PDFBox checks the number of files). This setting is deactivated by
default. To activate it, go to optid 1062 and enter the maximum number of files in the folder.
It is deactivated by entering "0", and the automatic operating system setting is "-1".

Starting with ELO Textreader version 12, the converters do not pause (as described under
3.4.9 'Minimum free disk space in percent' option).

5.3 'Additional Textreader configuration' section

• Character set option: The default is UNICODE. Can be changed in the eloftopt database
table under optid 1145.

Please note: For the file type pdf, enter the character set under optid 131415
(if 1314 is the optid for the file type pdf).

• Copy and delete to move files option: To enable this option, enter true under optid 1146 in
the eloftopt database table. Default: false.

• Delete folder option: This folder is not set as default. Can be changed in the eloftopt
database table under optid 1148.

• General behavior in the case of an error option: In addition to the alternatives offered in the
Administration Console, Create reference in the repository and No action, additional
possible actions can be configured in the eloftopt database table under optid 1149. Below you
will find a complete list of the possible values:

o Create reference in the repository: Set optval to 1.

o Move to the error folder: Set optval to 2.

o Only mark errors in the output file: Set optval to 4.

o No action: Set optval to 8.

• Deviations for container files option: No action deviating from General behavior in the case
of an error is set as default. This can also be changed in the eloftopt database table under
optid 1149. To be able to join the options General behavior in the case of an error and
Deviations for container files under one optid, the actions to be performed are configured by
setting bits in a bit pattern; the decimal value of this bit pattern is stored in the database
under optid 1149. Below you will find a list of the possible actions as well as the resulting
values for optid 1149:

o General behavior in the case of an error option set for Create reference in the repository

▪ However, Move to the error folder should be set as an option for container
files: Set optval to 513.

▪ No action should be set as an option for container files: Set optval to 1025.

o Move to the error folder set for General behavior in the case of an error and No action
should be set as an option for container files: Set optval to 1026.

o Only mark errors in the output file set for General behavior in the case of an error and
No action should be set as an option for container files: Set optval to 1028.

• Error folder option: This folder is not set as default. Can be changed in the eloftopt
database table under optid 1144.

• Error marker in the output file option: notconv|notconv_sub entered as default. Can be
changed in the eloftopt database table under optid 1138.
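
The bit pattern stored under optid 1149 can be read as two parts. The sketch below decodes the values listed in this section. The split is inferred from those values (513 = 512 + 1, 1025 = 1024 + 1, 1026 = 1024 + 2, 1028 = 1024 + 4) and is not an official specification; the function and dictionary names are hypothetical.

```python
from typing import Tuple

# Low bits: general behavior in the case of an error.
GENERAL_ACTIONS = {
    1: "Create reference in the repository",
    2: "Move to the error folder",
    4: "Only mark errors in the output file",
    8: "No action",
}

# High bits (inferred from the documented values): deviation for container files.
CONTAINER_ACTIONS = {
    512: "Move to the error folder",
    1024: "No action",
}


def decode_error_behavior(optval: int) -> Tuple[str, str]:
    """Return (general action, action for container files) for an optval."""
    general_bits = optval & 0xFF
    container_bits = optval & ~0xFF
    if general_bits not in GENERAL_ACTIONS:
        raise ValueError(f"unknown general action in optval {optval}")
    general = GENERAL_ACTIONS[general_bits]
    if container_bits == 0:
        # No deviation set: container files are handled like all other files.
        return general, general
    if container_bits not in CONTAINER_ACTIONS:
        raise ValueError(f"unknown container action in optval {optval}")
    return general, CONTAINER_ACTIONS[container_bits]
```

For example, decode_error_behavior(1025) reports Create reference in the repository as the general behavior and No action for container files.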

5.4 Overview of options in the database

Option                                            optid        Value
Maximum size of the full text file (.txt)         1016         Default: 100 MB
Max. number of exported files per min.            1017         Default: 0 (unlimited)
Max. number of imported files per min.            1018         Default: 0 (unlimited)
Maximum size for graphics                         1052         Default: 64 pixels
Max. number of errors                             1053         Default: 10,000
Max. number of errors per day                     1054         Default option: not set
Max. size of the file to be converted             1055         Default: 40 MB
Number of OCR workers                             1056         Default: 2
Max. number of pages in PDF for OCR conversion    1057         Default: 0 (always PDFBox conversion)
OCR timeout per page                              1058         Default: 30 seconds
OCR timeout per document                          1059         Default: 900 seconds
Minimum disk space (in percent)                   1060         Default: depends on operating system
Minimum disk space, absolute                      1061         Default: depends on operating system
Max. number of files in one folder                1062         Default: disabled (-1)
Maximum number of characters in an MSG file line  1063         Default: 100,000
Start times for export and import                 1134         Default: 111111111111111111111111 (24 hours)
Error marker in the output file                   1138         Default: notconv|notconv_sub
OCR languages                                     1139         Default: German, English, French
OCR fast mode                                     1140         Default: false
OCR-SingleColumnMode                              1141         Default: false
Extended trace                                    1143         Default: false
Error folder                                      1144         Default option: not set

Character set for other converters                1145         Default: UNICODE
Use copy and delete to move files                 1146         true/false, Default: false
Folder for unknown file types                     1147         Default option: not set
Delete folder                                     1148         Default option: not set
General behavior in the case of an error          1149         1 (Create reference in the repository, include container files)
                                                               513 (Create reference in the repository, but skip container
                                                               files; these are moved to error folders)
                                                               1025 (Create reference in the repository, but skip container
                                                               files; no action)
                                                               2 (Move to the error folder, include container files)
                                                               1026 (Move to error folder, but skip container files; no action)
                                                               4 (Mark errors in the output file, include container files)
                                                               1028 (Mark errors in the output file, but skip container files;
                                                               no action)
                                                               8 (No action, include container files)
Character set for PDFBox converter                e.g. 131415  Default: UNICODE
Max. conversion time for PDFBox converter         e.g. 131417  Default: 10 minutes
Smoothing of PDF images                           e.g. 131420  Default: 0.0 (disabled)

The example optid values 131415, 131417 and 131420 apply if 1314 is the optid for the file type pdf.
