EN ELO Textreader
ELO Textreader
[Date: 2020-09-18 | Program version: 20.01.000.002]
ELO Textreader is a servlet that can extract text from a variety of document types and then save it
to a text file.
Contents
1 General
1.5.1 Updating the ELO Textreader configuration to the current version (ELO Textreader 10 and higher)
2.2.7 Explanations for configuring the minimum width and height
3 Configuration via the ELO Administration Console (from Textreader 10, described for version 10.01.000)
3.3.1 File types with identical configuration (up to and including version 12.00.004.000)
3.4.7 'Timeout per page for the OCR service' option
3.4.8 'Timeout per OCR document for the Textreader' option
3.4.11 'Maximum number of files in folder' option (up to ELO Textreader version 11.01)
3.4.12 'Maximum number of characters per line in MSG files' option
4 Default values of the configuration options after a new installation of the current Textreader version (version 10 and higher)
1 General
Information: Unless specified otherwise, this manual applies to all ELO Textreader
versions from ELO Textreader 9 and up.
Information: Starting with ELO Textreader version 12, you can start and stop the
exporter, importer, and converter on the status page.
• Starting with ELO Textreader version 10, the new, integrated ELO Textreader is configured
with the ELO Administration Console, no longer with a config.xml.
• In ELO Textreader version 9, the converter is configured as before in the config.xml, the
export and import processes of the integrated ELO Fulltext module are still configured in
the database table eloftopt (for detailed documentation, refer to the Server manual version
9, chapter "ELO Fulltext").
• Starting with ELO Textreader version 10, the Apache PDFBox converter is used to convert PDF
files by default. In the ELO Administration Console, OCR can currently no longer be selected
for converting PDF files because of an error in Abbyy Finereader version 11.
• Starting with ELO Textreader version 12, it is possible to convert encrypted documents in
ELO.
• ELO Textreader version 20.00.003.001 and higher use "Logback" for logging (older versions
use "Log4j"). From this version on, a logback.xml file is required for configuring logging
(provided by the setup, only has to be created in case of manual installation).
Please note: With ELO Textreader version 20.00.003.001 and higher, if you install
the module manually, you have to provide a logback.xml file.
Information for older Textreader versions: PDFBox used in conjunction with older
Textreader versions could cause Tomcat to crash. When installing Textreader versions
before 9.03.006.001, you should therefore use ELO OCR Service 9 with Abbyy Finereader
version 10 to convert PDF files. PDFBox remains available, e.g. in the event of poor
OCR quality. ELO Textreader should be installed on a separate Tomcat (within its own
address space) to prevent all other servlets from crashing in the event that PDFBox
crashes.
• AI – Adobe Illustrator
• DOCM – Microsoft Word text editor, (macro files) starting with Word 2007
• PPS – Microsoft PowerPoint Show (files must be moved to the PPT folder. Do not create a
PPS folder.)
• PPTM – Microsoft PowerPoint presentation program (macro files), starting with
PowerPoint 2007
• XLSM – Microsoft Excel spreadsheet program (macro files), starting with Excel 2007
The following additional (header) information and embedded documents are extracted from the
following document formats:
• DOCX, XLSX, PPTX – all embedded files that are known to ELO Textreader
1.4 Requirements
Please note: This version of ELO Textreader requires Indexserver version 9.00.060. If
the Indexserver version used is older than 9.00.060, you also need to install ELOft.
ELO Textreader will also run with older Indexserver versions, but on startup it cleans
up temporary files that are created by ELOft, meaning that documents cannot be
added to the full text database.
Please note: This version of ELO Textreader requires ELO OCR Service version
10.00.000 or higher and ELO Indexserver 9.00.060 or higher.
Please note: Starting with version 9.02 of ELO Textreader, the TT_Matrix text type is
no longer automatically set as the OCR text type (which improved recognition of line
printer prints in landscape format). Instead, TT_Matrix should be entered to ocr_lan-
If you use ELO Textreader and ELO OCR Service version 10.00.000 or higher, you no
longer need to set TT_Matrix.
Please note: ELO Textreader uses the ELO OCR Service exclusively in R mode when
converting image files, as well as PDF files up to Textreader version 9.03.006.000.
Please make sure that the ELO OCR Service is configured accordingly. See the ELO
OCR Service documentation for more information.
Note: You need to disable or uninstall ELOft if you are using Textreader version
9.17.040.000 in combination with Indexserver version 9.00.060 or higher.
Up to version 9, the program was configured in config.xml. Starting with version 10, ELO
Textreader is configured using the ELO Administration Console. However, this document also
describes configuration via config.xml since the guide is also relevant for older versions.
Please note: When manually updating earlier versions (i.e. without running the ELO Server Installer
program), make sure that the previous ELO OCR programs and/or services are uninstalled or
disabled.
For updates to version 9, please note the changes to config.xml with regard to the pdf file type and
additional entries for the OCR service, as well as image conversion (see below for more
information).
When updating to ELO Textreader version 10 and higher, please note the configuration in the ELO
Administration Console (see 1.5.1 Updating the ELO Textreader configuration to the current version
(ELO Textreader 10 and higher)).
The process for manually upgrading an ELO 9 installation (Textreader version 9.03.006 or higher)
to the newer Textreader and OCR versions is described in a separate document Installing ELO TR
10 for ELO 9.
1.5.1 Updating the ELO Textreader configuration to the current version (ELO Textreader 10
and higher)
If you are updating an earlier version of ELO Textreader to the current ELO Textreader version, the
setup program reads the ELO Textreader config.xml and enters the options to the eloftopt database
table. Then, config.xml is backed up to config.xml.BAK and its entries are deleted, except for the
ELO Indexserver logon information: only the ELO Indexserver URL as well as the logon name and
password remain in config.xml. config.xml is not overwritten if no BAK file can be created.
Please note: In some cases, default values are also entered to the eloftopt database
table on update. Make sure to check the ELO Textreader configuration (in the ELO
Administration Console) after updates.
After a new installation by the setup program, only the Indexserver logon information is written to
config.xml. The path names of the converter input directories, the output directory, and different
default values are written to the eloftopt table. You can find the values under 4 Default values of the
configuration options after a new installation of the current Textreader version (version 10 and
higher). You can adjust these options in the ELO Administration Console, see 3 Configuration via the
ELO Administration Console (from Textreader 10, described for version 10.01.000).
Starting with ELO Textreader version 12, the most important logging outputs are also written to
a "Report log". The system logs whether a document was exported, imported, converted, or
forwarded to another converter/application (in particular, OCR). If a document or part of a
document could not be converted, the reason is logged as well.
Logging can be limited or extended, for example by configuring the log level in the
log4j.properties/logback.xml file:
• The "Warn" level logs other problems, for example unrecognized file types attached to MSG
or EML files.
• The "Info" level also logs when a document has been converted.
• The "Debug" level finally logs when documents have been exported or imported, or sent to
another application, e.g. image files to OCR.
Example of a log4j.properties file. The logging level for the Report log has been set to "Info" under
"log4j.logger.reportlog=info, reportlog":
log4j.rootLogger=info, FI
log4j.logger.reportlog=info, reportlog
log4j.additivity.reportlog=false
# output in file:
log4j.appender.FI=org.apache.log4j.DailyRollingFileAppender
log4j.appender.FI.File=e:/temp/log/tr-elo10.log
log4j.appender.FI.DatePattern='.'yyyy-MM-dd'.txt'
log4j.appender.FI.layout=org.apache.log4j.PatternLayout
log4j.appender.FI.layout.ConversionPattern=%d{ABSOLUTE} %t %1x %-5p (%F:%L) - %m%n
log4j.appender.FI.append=true
log4j.appender.reportlog=org.apache.log4j.RollingFileAppender
log4j.appender.reportlog.File=e:/temp/log/reportlog.log
# DatePattern: one file each day:
log4j.appender.reportlog.DatePattern='.'yyyy-MM-dd'.txt'
log4j.appender.reportlog.layout=org.apache.log4j.PatternLayout
log4j.appender.reportlog.layout.ConversionPattern=%d{ABSOLUTE} %t %1x %-5p - %m%n
log4j.appender.reportlog.append=true
Please note: The appender names "FI" and "reportlog" as well as the logger name
"reportlog" must not be changed.
The equivalent in Logback format, e.g. in a logback.xml file, would look as follows:
<root level="info">
<appender-ref ref="FILE" />
</root>
Please note: The appender names "FILE" and "REPORT" as well as the logger name
"reportlog" must not be changed.
You can change the log level not only in the log4j.properties or logback.xml file, but also on the ELO
Textreader status page (in "Edit" mode). You can also request new, empty log files on the status
page.
Fig. 1: Changing the log level and creating new log files on the ELO Textreader status page
However, here the log level of the Report log can only be changed together with the log level of the
default logger.
When creating new log files, Logback and Log4j behave differently. In the Logback variant, clicking
"Start new log file" or "Start new report log file" saves the previous log contents to a file labeled
with the current date and time (e.g. tr-elo20.log#20200508-110444.051). The active log file
is then emptied, and its name (which can be configured in logback.xml) does not change.
This is different with the Log4j variant: Here, clicking "Start new log file" or "Start new report log
file" creates a new, empty log file labeled with the date and time, and logging continues in this
new file. The names of the new log files are shown on the status page.
Entries in the Report log file follow a defined format, explained here using the example file
"00000003.xls" (the file name of exported ELO documents always consists of the document ID in
hexadecimal form, followed by the file type):
The ELO document ID of the document to be converted is output in both hexadecimal and decimal
form, followed by the file name and file path, both in angle brackets (for better log file filtering).
For elements of container documents, the document ID of the ELO document is always output as
the HexID and DocID. So, if the file "0000005E.zip" contains an HTML file, for example, it would
be extracted with the name "0000005E_000.html". After converting this HTML file, the log file
contains the entry:
Information: You may miss relevant log outputs among the many Byps and PDFBox
messages. However, you can suppress these by adding the following two lines to the
log4j.properties file:
log4j.logger.org.apache.pdfbox=fatal
log4j.logger.byps=fatal
or in logback.xml:
(This disables logging for Byps and PDFBox. However, you can also enter "error"
instead of "off" to show at least the error messages.)
Please note: If Textreader version 9 is running, that is, connected to Indexserver
version 9, the configuration is read from config.xml as before.
The Textreader parameters are saved in the <entry> tags in the config.xml file. A parameter consists
of a name that is entered in the key attribute of the <entry> tag and the associated value. The value
for the parameter can consist of multiple parts, each separated by a | (see dirs_pdf above for an
example).
dirs_pdf means that ELO Textreader can process PDF type documents (see overview above).
Only one value (in this example) is entered to dirs_tif. This means that ELO Textreader stores all TIF
documents there without processing them further, even those that, for example, have been
extracted from ZIP archives or PDF or MSG documents. With this function, ELO Textreader can make
documents available to other applications as well.
If two values are entered (see dirs_eml above), ELO Textreader extracts the text from the
documents in the first directory (output directory) and stores a TXT file of the same name
containing the extracted text in the second directory (return directory).
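The following minimal Python sketch illustrates how such entries could be read and interpreted. It is not part of ELO Textreader. The <entry key="..."> form and the "|" separator are taken from this manual; the <properties> root element, the sample paths, and the function names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical config.xml excerpt. Only the <entry key="..."> form and the "|"
# separator come from the manual; the root element and paths are assumptions.
SAMPLE = r"""
<properties>
  <entry key="dirs_tif">C:\ELOent\data\ft-elo\tif</entry>
  <entry key="dirs_eml">C:\ELOent\data\ft-elo\eml|C:\ELOent\data\ft-elo\txt</entry>
</properties>
"""


def read_dirs_entries(xml_text):
    """Return {file type: list of '|'-separated values} for all dirs_* entries."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("entry"):
        key = entry.get("key", "")
        if key.startswith("dirs_"):
            parts = [part.strip() for part in (entry.text or "").split("|")]
            result[key[len("dirs_"):]] = parts
    return result


def describe(parts):
    """Classify an entry by its number of values, following the text above."""
    if len(parts) == 1:
        return "store only"            # files are stored, not processed further
    if len(parts) == 2:
        return "convert"               # first directory -> TXT in second directory
    return "convert with options"      # additional values are converter options


entries = read_dirs_entries(SAMPLE)
for file_type, parts in entries.items():
    print(file_type, "->", describe(parts), parts)
```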
Please note (up to and including version 12.00.004.000): The individual converters
in Textreader are partially programmed to process multiple data types, but can only
be configured once (specifically, it is only possible to configure one output directory).
If the second data type has a different configuration, the first configuration, and
therefore the output directory, would be overwritten and would no longer be scanned. The
following data types must have the exact same configuration in the config.xml:
• pdf and ai
• 7zip and 7z
If more than two values are entered (see dirs_pdf above), refer to the table below for their meaning.
• TT_Typewriter (typewriter)
CAL_MACHINE\SOFTWARE\Wow6432Node\ELO Di-
The text is extracted from the PDF document and then saved to a text file in the directory
C:\ELOent\data\ft-elo\txt. If the PDF document contains a TIF document, this will be
saved under C:\ELOent\data\ft-elo\tif. All other image files (JPG, PNG) are ignored, as
Textreader does not recognize them. To process other image formats, corresponding entries should
be made to config.xml (such as <entry key="dirs_jpg">C:\ELOent\data\ft-elo\jpg</entry>).
The OCR PDF converter also requires the ai file type to be configured with the pdf input folder (not,
for example, ai). The ocr parameter must be added here as well.
If you know in advance that you have documents that cannot be processed by the OCR service, such
as documents with file protection settings, or that cannot be processed by PDFBox because the
format differs from what PDFBox expects (PDFBox would show an error message), it may be useful to
set the on parameter in the PDF document configuration. on causes the respective other converter
to attempt the conversion again if errors occur with one of the two PDF converters.
The following combinations of possible settings exist for the PDF document conversion process:
The default case. Processing occurs via OCR. PDFBox is not used, even if errors occur.
Processing occurs via OCR. PDFBox is not used, even if errors occur.
A check is run to see if the document is very large (number of pages is greater than the value
of ocr_pdfPages). If so, conversion is performed by the PDFBox converter. If not, the
document is processed with OCR. The default value for ocr_pdfPages is 1000000, which
means that the OCR service is normally always used (if you need to deviate from the default
value and switch between the OCR Service and the PDFBox converter depending on the
document size, Textreader should be installed to a separate Tomcat for Textreader versions
before 9.03.006.001 due to a risk of the PDFBox converter crashing). on causes conversion
to start again when OCR errors occur, using the PDFBox converter as a replacement
(without checking the number of pages).
Without adding "|ocr", the PDFBox converter is used. No conversion by OCR is attempted
in the event of an error.
Without adding "|ocr", the PDFBox converter is used. Conversion by OCR is attempted in
the event of an error.
Without adding "|ocr", the PDFBox converter is used. No conversion by OCR is attempted
in the event of an error.
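The decision logic described in these combinations can be summarized in a short Python sketch. It only illustrates the rules stated above and is not the actual Textreader implementation. The function and parameter names are hypothetical, and applying the page check whenever OCR is selected is an assumption.

```python
DEFAULT_OCR_PDF_PAGES = 1_000_000  # default value of ocr_pdfPages per this manual


def plan_pdf_conversion(use_ocr, failover, pages,
                        ocr_pdf_pages=DEFAULT_OCR_PDF_PAGES):
    """Return (primary, fallback) converter names for one PDF document.

    use_ocr  -- True if the "ocr" parameter is set for the pdf file type
    failover -- True if the "on" parameter is set
    pages    -- number of pages of the PDF document
    fallback is None when no second conversion attempt is made.
    """
    if use_ocr and pages <= ocr_pdf_pages:
        primary, other = "ocr", "pdfbox"
    else:
        # Without "ocr", or for very large documents, PDFBox converts first.
        primary, other = "pdfbox", "ocr"
    # "on" lets the respective other converter retry after an error.
    return primary, (other if failover else None)


print(plan_pdf_conversion(use_ocr=True, failover=False, pages=12))
print(plan_pdf_conversion(use_ocr=False, failover=True, pages=12))
```

With the default of 1000000 pages, the page check practically never selects PDFBox, which matches the statement that the OCR service is normally always used.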
An MMF document (COLD) is converted to WMF format and saved to the input directory of the WMF
converter. To process MMF documents, first you have to configure a WMF converter.
Text is extracted from WMF documents and saved to a text file in C:\ELOent\data\ft-elo\txt.
TIFF image files, such as those in a PDF document, are saved under C:\ELOent\data\ft-elo\tif,
but Textreader does not process these files/documents further. This applies to all
formats when only one directory is provided, meaning the target directory is missing.
• AES256
• SHA256
• BZIP2
• BCJ_IA64_FILTER
• BCJ_PPC_FILTER
• BCJ_X86_FILTER
• BCJ_ARM_FILTER
• BCJ_ARM_THUMB_FILTER
• BCJ_SPARC_FILTER
• DEFLATE
• DEFLATE64
• DELTA_FILTER
• LZMA
• LZMA2
The minimum width and height (in pixels) for images to be processed is 70. This is important when
extracting icons from Office or PDF files, for example; the icons should not generally be added to
the full text database.
Please note: The minimum size check is currently only performed when extracting
graphics from PDF, MSG, or EML documents.
If an error occurs while extracting the text, a text file is generated with "Document not processed" as
its contents.
Server modules > Full text service (Textreader). Experience has shown that it is not necessary to offer
all parameters in the Administration Console that could be configured in the config.xml. In fact, the
opposite was true: This caused confusion among administrators. For this reason, compared with
previous versions, the ELO Textreader configuration options in the ELO Administration Console
have been reduced to those system parameters which make most sense (for other values, the
defaults are applied). However, in the database it is still possible to configure parameters that are
not listed below with values that ELO Textreader does not apply by default. You will therefore also
find the "optid" from the "eloftopt" database table in the list of default values further below if you
would like to change default values.
Enter the return directory for the Textreader converter here (in the previous config.xml, the target
directory that could be entered after the first "|" character for the individual file types). The text
files from this folder are uploaded as full text files to the ELO repository.
In this case, output directories are the directories to which the documents were exported from the
ELO repository for conversion. A directory must be added for each file type to be added to the full
text database.
It is possible to specify several file types in one line if these are to be configured identically for a
directory path (see below).
If the user is able to define options for a file type, the editing icon appears in the respective line.
After clicking this gearwheel icon, the file type is opened and the user can select or enter
additional options. Clicking the gearwheel icon again closes the options (however, they have not
been saved; you need to click "Save" to do so).
3.3.1 File types with identical configuration (up to and including version 12.00.004.000)
Please note: The individual converters in Textreader are partially programmed to process multiple
data types, but can only be configured once (specifically, it is only possible to configure one output
directory). If the second data type has a different configuration, the first configuration, and
therefore the output directory, would be overwritten and would no longer be scanned. The following
data types must have the exact same configuration (except for the option Convert externally).
• pdf and ai
• 7zip and 7z
It is therefore useful to put data types that need to be configured identically in one line, as in the list
shown above. You only need to configure these data types separately, i.e. in separate lines, if you
want to convert them externally as well as have Textreader convert them.
Please note: Up to and including version 12.00.004.000, the configuration for the
following file types should be done in one line, and must be identical in any case.
• pdf and ai
• 7zip and 7z
Fig. 4: Full text service, 'Convert externally' option for the file type "docx"
Explanation: Not all file types must be converted by the Textreader converters. ELO Business
Partners can also use their own converters that read the files in the input directories described
here and write the results to the output directory. If you want to process a file type with a converter
other than the ELO converter, select the Convert externally check box. However, this only applies to
the converters: the Textreader export service still continues to export documents from the
repository into such directories. The name of the file in the output directory must match the name of
the file in the input directory, except that the file type must now be TXT (instead of the original
file type).
Example: Following export from the repository, the names of the files in the input directory always
consist of the hexadecimal values of the ELO document ID plus the respective file type, e.g.
"23b.docx". In this case, the converted file in the output directory would be "23b.txt".
For multipage documents, exactly one TXT file is expected in the output directory.
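As an illustration of this contract, the following Python sketch shows what the core of an external converter could look like: it reads every file in an input directory and writes one TXT file of the same name to the output directory. The function names, the extract_text callback, and the UTF-8 encoding are assumptions; the manual does not specify them, and the sketch deliberately leaves the input files untouched.

```python
from pathlib import Path


def convert_directory(input_dir, output_dir, extract_text):
    """Write exactly one <name>.txt per input file and return the names written.

    extract_text is a hypothetical callback that receives the path of the
    input file and returns its text content as a string.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file():
            continue
        # Same name as the input file, but with the file type TXT,
        # e.g. "23b.docx" -> "23b.txt".
        target = out / (src.stem + ".txt")
        target.write_text(extract_text(src), encoding="utf-8")  # encoding assumed
        written.append(target.name)
    return written
```

A real converter would additionally have to decide when to pick up files and what to do with the input file after conversion; both are outside the scope of this sketch.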
Files in the container format (e.g. file type ZIP, but also newer Office formats DOCX, XLSX, etc.) are
split into individual files (images and attachments). The converted individual files must be labeled
with a three-digit suffix in the form of a counter, e.g. "23b_001.txt", "23b_002.txt". Based on the
first part of the file name (the document ID in hexadecimal format), the Indexserver knows what
ELO documents the text file should be assigned to as full text.
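The naming scheme can be made concrete with a small Python sketch that splits such a file name into the ELO document ID, the optional three-digit counter, and the file type. The function name and the regular expression are illustrative assumptions, not part of ELO Textreader.

```python
import re

# <hex document ID>[_<three-digit counter>].<file type>
_NAME = re.compile(r"^([0-9A-Fa-f]+)(?:_(\d{3}))?\.([^.]+)$")


def parse_export_name(file_name):
    """Return (document ID as int, counter or None, file type in lower case)."""
    match = _NAME.match(file_name)
    if match is None:
        raise ValueError(f"not an exported file name: {file_name!r}")
    hex_id, counter, file_type = match.groups()
    doc_id = int(hex_id, 16)  # the first part is the document ID in hexadecimal
    return doc_id, (int(counter) if counter is not None else None), file_type.lower()


print(parse_export_name("23b.docx"))      # (571, None, 'docx')
print(parse_export_name("23b_002.txt"))   # (571, 2, 'txt')
```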
Fig. 5: Full text directory, options for the 'PDF' file type
Check the box here if you want the PDF converter to extract images.
Enter a password for a password-protected PDF document (used by the Apache PDFBox converter
when opening the PDF file).
Check the box for failover mode here. PDF documents are only converted by PDFBox by default.
Starting with ELO Textreader version 20 (or starting with version 12.00.005.000 for TR12, version
11.01.006.000 for TR11, version 10.19.100.000 for TR10) the documents are forwarded to OCR
after conversion errors. Default: enabled.
Fig. 6: Full text directory, options for the 'TIF(F)' file type
The options in this section are useful when no ELO OCR Service is installed.
If ELO OCR is not installed, you can extract the TIF(F) files from container files (such as ZIP files) to
the TIFF directory and process them with a separate conversion program. Select the relevant check
box. If the check box is enabled, Textreader does not go through this directory, but copies TIFF files
to this directory when converting e.g. ZIP files. Difference to Convert externally: If the option Convert
externally is selected, ELO Textreader does not move files to the TIFF directory during export.
Enter an external conversion program that Textreader should use for the individual TIFF files.
Otherwise, conversion will take place via OCR. Difference to the separate conversion programs
described above: This external TIFF converter does not have to search directories for files to be
converted on its own in intervals, but is called up by Textreader and only converts one document at
a time.
TXT type text files are exported straight to the return directory and imported back to the repository
as full text files, without requiring any additional configuration.
It is also possible to add text files of other file types unknown to ELO Textreader (e.g. file type SQL)
to the full text database without having to provide an external converter. These text files can be
used as full text files right away without being converted. For this, enter the return directory,
normally the TXT directory, as the directory path in the configuration within the ELO Administration
Console.
Fig. 7: Full text service, configuration for text files with file type 'vbs', 'js', or 'sql'
To add an additional file type, enter one or more file extensions and a directory path to the Add a
If you want to remove file extensions and their output directories, click the Delete icon. If the output
directory is being used for multiple file types, you will be requested to confirm deletion. Otherwise,
the line turns pink and an Undo icon will appear instead of the Delete icon, as shown in this
example:
Minimum width and height (in pixels) for images to be processed. The default value is 64 pixels.
Please note: The minimum size check is currently only performed when extracting
graphics from PDF, MSG, or EML documents.
Maximum number of corrupt files that could not be processed or deleted. If this number is
exceeded, Textreader stops. Default: 10,000.
The maximum number of files downloaded from the repository to the Textreader input directories
for conversion. If you have a larger number of documents to export, it may make sense to enter a
value here, e.g. 100, to prevent system overload and Tomcat/ELO server standstill. Default: 0 (no
upper limit). When this limit is reached, the export thread in Textreader pauses for one minute.
Please note: Textreader sends a search query to the Indexserver to export the
documents and then receives a package of 100 documents to be exported in return. If
Textreader frequently has to pause for one minute because the maximum number of
files per minute has been reached, processing these 100 documents may take longer
than the maximum lifetime of a search ticket (default: 10 minutes). When Textreader
goes to retrieve the next 100 documents, the search has then timed out and has to be
run again, so documents may be exported multiple times. You should enter a realistic
value for Maximum number of files exported per minute so that 100 documents can be
exported in 10 minutes (for a lifetime of 10 minutes, the entry for maximum number of
files to be exported per minute should be 10 or higher).
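The arithmetic behind this recommendation can be written as a short Python sketch. The function and parameter names are illustrative assumptions; the package size of 100 documents and the ticket lifetime of 10 minutes are the values given above.

```python
import math


def min_files_per_minute(package_size=100, ticket_lifetime_minutes=10):
    """Smallest per-minute export limit that still lets one package of
    documents finish within the lifetime of the search ticket."""
    return math.ceil(package_size / ticket_lifetime_minutes)


print(min_files_per_minute())        # 10, as recommended above
print(min_files_per_minute(100, 7))  # 15 for a hypothetical 7-minute lifetime
```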
The maximum number of converted files that are imported back into the repository from the return
directory. If you have a larger number of documents to import, it may make sense to enter a value
here, e.g. 100, to prevent system overload and Tomcat/ELO server standstill. Default: 0 (no upper
limit).
Maximum number of corrupt files per day that could not be processed or deleted. If this value is set,
the value of the option 'Maximum error count' is ignored. The counter for converted documents is
reset to 0 on the following day, and Textreader continues to run. Default: not set.
Maximum size of the files to be converted in MB. Default: 40. PDFBox checks the document size
when exporting from the repository to the ELO Textreader folder and when extracting image files
from PDFs. Files exceeding the maximum size are deleted immediately following export/extraction.
Time in seconds until a timeout occurs within the OCR service (as opposed to the Timeout per OCR
document for the Textreader parameter, which is the maximum time that Textreader waits for
the result of the OCR, see below). From OCR version 9.0.1.0, this value applies to one page that is
to be scanned, whereas in earlier versions it applied to the entire document. The default value is 30
seconds. After this time has elapsed, OCR stops processing the current page, and as a result, the
entire document.
Time in seconds per document until timeout in Textreader. This is the maximum number of
seconds that Textreader will wait for the result of the OCR. After this time has elapsed, Textreader
continues with conversion of the next document (as opposed to the Timeout per page for the
OCR service parameter, which is the timeout per page within the OCR service, see above). The
default value is 900 seconds (15 minutes). This value can be configured from version 9.2.04, as it
may happen in rare cases that a conversion time of 15 minutes is not sufficient for some very large
or complex documents. In general, 15 minutes could also be too long.
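To get a feeling for how the two timeouts relate, the following Python sketch estimates how many pages fit into the document timeout if every page needs close to the full page timeout. This is only a rough worst-case estimate under that assumption, with hypothetical names, and not a statement about actual OCR speed.

```python
def pages_covered(document_timeout_s=900, page_timeout_s=30):
    """Number of pages that fit into the Textreader document timeout if each
    page takes (almost) the full OCR page timeout."""
    return document_timeout_s // page_timeout_s


# With the defaults (900 s per document, 30 s per page), about 30 very slow
# pages fit into the document timeout.
print(pages_covered())
```

For documents with many slow pages, this suggests raising the document timeout rather than the page timeout.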
Here, you can specify how much disk space (in percent of the overall disk space) must remain free
for Textreader export and conversion processes (restriction for conversion processes: in the
current version, only the PDF converter with PDFBox checks the disk space when extracting images
and attachments). If you set this to Automatic operating system specification, the minimum
Textreader disk space is determined (depending on the operating system). If you set Disable check,
no check is performed. However, you can also enter a value of your own (custom entry). In this
case, an input option appears on the right where you can enter how much disk space must remain
free in percent.
After reaching the set threshold, the export process (and up to version 11 of ELO Textreader the
conversion process) pauses for several minutes. It then checks whether sufficient space is free
again.
Starting with ELO Textreader version 12, the converters do not pause. Instead, when the free disk
space falls below the minimum, conversion of the current PDF document is canceled, the image or
attachment causing the disk to be full is deleted, and conversion continues with the next PDF. ELO
Textreader notes this PDF document and checks at regular intervals whether the disk now has
enough space. If this is the case, it attempts to extract the images and attachments from this PDF
that have not yet been processed. This prevents ELO Textreader from being deadlocked by
extracting many very large TIFF files from a PDF file.
Fig. 10: Full text service, minimum free disk space in percent
Here, you can specify how much disk space (in MB) must remain free for Textreader export and
conversion processes (restriction for conversion processes: in the current version, only the PDF
converter with PDFBox checks the disk space). If you set this to Automatic operating system
specification, the minimum Textreader disk space is determined (depending on the operating system).
If you set Disable check, no check is performed. However, you can also enter a value of your own
(custom entry). In this case, an input option appears on the right where you can enter how much disk
space must remain free in MB.
After reaching the set threshold, the export process (and up to version 11 of ELO Textreader the
conversion process) pauses for several minutes. It then checks whether sufficient space is free
again.
Starting with ELO Textreader version 12, the converters do not pause, as described above.
Fig. 11: Full text service, minimum absolute free disk space
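The two disk space options (percent and MB) can be illustrated with the following Python sketch. It is not the Textreader implementation. The function names are hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and passing no threshold corresponds to the Disable check setting.

```python
import shutil


def enough_free_space(total_bytes, free_bytes,
                      min_free_percent=None, min_free_mb=None):
    """Return True if both configured thresholds (if any) are satisfied."""
    if min_free_percent is not None:
        # Compare free/total against the percentage without using floats.
        if free_bytes * 100 < min_free_percent * total_bytes:
            return False
    if min_free_mb is not None:
        if free_bytes < min_free_mb * 1024 * 1024:
            return False
    return True


def check_path(path, min_free_percent=None, min_free_mb=None):
    """Apply the check to the disk that holds the given directory."""
    usage = shutil.disk_usage(path)
    return enough_free_space(usage.total, usage.free,
                             min_free_percent, min_free_mb)


# Example: require at least 10 percent and 500 MB free on the current disk.
print(check_path(".", min_free_percent=10, min_free_mb=500))
```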
3.4.11 'Maximum number of files in folder' option (up to ELO Textreader version 11.01)
Here, you can define the maximum number of files for a single output folder, disable the check,
or configure a custom number. After reaching this threshold, the export process (and up to version
11 of ELO Textreader the PDF conversion process) pauses for several minutes. It then checks
whether the number of files has gone below the threshold again.
Starting with ELO Textreader version 11.01, the maximum number of files in a folder is no longer
checked by default for new installations. This setting has been removed from the ELO Administra-
tion Console. However, it is still possible to configure this option in the database (see 5.2 'Settings'
section).
Starting with ELO Textreader version 12, the converters do not pause when the maximum number
of files in a folder is exceeded. Instead, they proceed as when the minimum disk space is exceeded
(see above).
Please note: Starting with ELO Textreader version 11.01, the maximum number of
files in a folder is no longer checked by default for new installations.
Please note: Starting with ELO Textreader version 12, the converters no longer stop
when the minimum disk space or maximum number of files in a folder is exceeded, but
only when the import process is no longer running. The converters do not stop if the
import process was deactivated.
Enter the maximum line length in MSG files. Lines that exceed this value are skipped during con-
version. This is due to the fact that having very long lines in MSG files can cause the converter to
crash. You should only enter a larger number if you have good reason to do so. Default: 100000.
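The effect of this limit can be pictured as a simple filter over the lines of the message text. The sketch below is an illustration in Python under that assumption. The function name `drop_long_lines` is hypothetical, and the real converter may measure line length differently.

```python
def drop_long_lines(lines, max_chars=100000):
    """Keep only lines that do not exceed max_chars; longer lines are skipped."""
    return [line for line in lines if len(line) <= max_chars]
```

With a limit of 3, for example, `drop_long_lines(["ab", "abcdef"], 3)` keeps only `"ab"`.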
Here, you can specify the maximum size of the resulting text file to be imported back to the reposi-
tory as a full text file. Default: 100 MB.
If the option Extended trace is not selected, log outputs from converters are suppressed if they only
report that a certain directory has been scanned. Default: not enabled.
When processing document container files (PDF, ZIP, MSG, etc.), the files are extracted from these
documents and copied to the corresponding target directory according to the type of file. If there is
no target directory for specific file types, these files are copied to the default folder. If you do not
specify a folder, these files will not be extracted. Default: not set. Every 24 hours or when the
Textreader is stopped, a list of the number of documents for which no target directory is configured
is output as a warning. From version 20.00.001.002, the number of documents of a specific type that
are not extracted is also output.
Enter (as previously for the FT service) at what hours the export and import service between the
ELO repository and the Textreader input/output directories should be active ("00" to "23").
Fig. 14: Full text service, running times for export and import
Fig. 15: Full text service, general behavior in the case of an error
Here, specify whether the ELO document should be referenced in a special error folder in the event
of conversion errors, or whether no action should take place (the invalid document is then just de-
leted from the folder). For documents that were exported from the ELO repository to the Textrea-
der incoming folders, references are created in the Administration¶ELO-Textreader¶Documents not
converted ELO folder if an error occurs. The ELO Textreader folder has the fixed GUID "(E10E1000-
E100-E100-E100-E10E10E10E24)", while the Documents not converted folder has the GUID
"(E10E1000-E100-E100-E100-E10E10E10E25)". These folders are created by Textreader
should they not yet exist. The references in the error folders are deleted automatically if these
documents can be converted in later Textreader runs after all. These ELO folders are named
according to the set language (currently German, English, or French), with English as the default
language. If you want to use a different default language, you can set it in the
"messages_dflt.properties" file. After installing ELO Textreader, open the Textreader .jar file
"webapps\...\WEB-INF\lib\tr.jar" in the ELO server directory and then the properties file
"de\elo\tr\server\messages_dflt.properties".
• If an error folder has been configured in the database (see below), invalid documents that
were NOT exported from ELO (e.g. test files) are moved to this folder in the file system. If
no error folder has been configured, these files are deleted in the event of an error (default:
no delete folder configured).
• In the case of an error, notconv is written to the TXT file by default, or notconv_sub in
the case of partial files of container files (e.g. from ZIP files). Other error texts can be confi-
gured in the database.
• As an alternative to the behavior described under 3.5.4 'General behavior in the case of an
error' option, you can also specify that the error should only be recorded in the output file,
and that no other action should be performed.
• For container files (e.g. ZIP or MSG files that contain other files), it may not make sense to
reference the entire container file e.g. in the repository as corrupt. Thus, a behavior devia-
ting from the behavior in the database selected above can be configured here for container
files.
The option OCR read mode can be used to determine whether OCR reads text in a table column by
column (multiple column mode, OCR attempts to recognize blocks, default), or line by line (single
column mode, OCR no longer attempts to recognize blocks). You may achieve a better result if the
OCR is run column by column in multiple column mode instead of in single column mode. However,
the disadvantage of this method is that proximity searching is no longer possible, as the words in a
line of a table are not extracted in sequence but column by column.
This can be used to affect the OCR precision. Selecting fast speeds up OCR, but is less precise, while
selecting detailed does not use acceleration, but has improved precision. Detailed is the default set-
ting.
The OCR languages parameter defines the supported languages for OCR processing (recom-
mended: the language used during server setup, as well as at least English). Only enter the langu-
ages that are actually used, as the OCR Service checks the documents for all specified languages.
On the left side, you will find a list of languages supported by the installed OCR. In addition, here
you can select OCR text types, which normally is not necessary (from Textreader 10 it is no longer
necessary to specify TT_Matrix in addition to the language settings if Lineprinter printouts are to be
converted in landscape format). For the sake of completeness, here is a list of the OCR text types:
• TT_Typewriter (typewriter)
• TT_MICR_E13B (magnetic ink character recognition, note: select the language "E13B" here
as well)
Background: In OCR, we configure a number of worker processes running in parallel that actually
convert images. This number was defined in the registry under "WorkerCount" for OCR 9, e.g. un-
der HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\ELO Digital\OCR\Service, and
in the OCR config.xml by the "workercount" key for OCR 10. Normally, twice the number of cores is
entered as the number of OCR workers. OCR then generates the corresponding number of worker
processes. In the Administration Console, you can configure the maximum number of workers that
Textreader can use (the remaining workers are used by the ELO Java Client). Textreader creates this
number of threads, which send the image files to the workers for conversion in parallel.
If you enter more workers than OCR makes available, OCR may report to Textreader that there are
no free workers. In this case, Textreader will wait and try to send the document to OCR again. It will
keep trying until it finds a free worker; however, we do not recommend configuring more workers
than are available for performance reasons.
Under Number of used OCR connections, enter the maximum number of workers that can be used
by the Textreader but do not enter a value that exceeds the "WorkerCount" specified in the registry
or "workercount" in the OCR config.xml. If necessary, subtract a specific number of workers for the
ELO Java Client.
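The sizing rule above can be written out as simple arithmetic. The following Python sketch only illustrates that rule. The function and parameter names are hypothetical, and the number of workers to reserve for the ELO Java Client is a value you choose yourself.

```python
def textreader_connections(cores, reserved_for_java_client=0, worker_count=None):
    """Suggest a value for 'Number of used OCR connections'.

    worker_count is the value configured in OCR; if it is not given, the rule
    of thumb from the text (number of cores * 2) is used instead.
    """
    if worker_count is None:
        worker_count = cores * 2
    # Never exceed the OCR worker count, and never return a negative value
    return max(0, worker_count - reserved_for_java_client)
```

For a server with 4 cores and 2 workers reserved for the ELO Java Client, this yields 6 connections.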
Information: The other previous FT options can be omitted since the integrated ELO
Textreader now calls up the database via the ELO Indexserver or uploads the full text
files to the ELO repository and only the ELO iSearch is supported.
Information: The ELO Administration Console adds default values (in the database
table "eloftopt") to missing configuration settings. This feature can also be used to fill
empty "eloftopt" tables, e.g. after a setup or update during which the table was created
but left empty.
ai, bmp, doc, docx, dot, eml, htm, html, jpg, jpeg, mht, mmf, msg, odt, pdf, png, pps, ppt, pptx,
rtf, tif, tiff, vcf, vsd, vsdx, wmf, xla, xls, xlsx, xml, zip, 7zip, 7z
o PDF converter option: Apache PDFBox. This option has been removed as of ELO
Textreader version 20.00.005.002, as Apache PDFBox is the only available PDF
converter. You can no longer make a selection.
• Timeout per OCR document for the Textreader option: 900 seconds (15 minutes)
• Minimum free disk space in percent, Minimum absolute free disk space options: The values
are determined based on the system.
• General behavior in the case of an error option: Create reference in the repository
Please note: Changes to the database are always critical and could result in system
failure. Only change values in the database after careful consideration and if you un-
derstand the following.
o Character set option: The default is UNICODE. Can be changed in the eloftopt data-
base table under optid 131415, for example. 1314 is the (variable) optid for the file
type pdf; the label for the option character set is 15. (The optid values are assigned
sequentially for the data types and are therefore variable; the optid values of the
options for individual data types are calculated from the optid of the data type with
an identifier associated with the respective option, here "15").
o Max. PDF conversion time option: Entry in minutes; default: 10 minutes. Can be
changed in the eloftopt database table under optid 131417, if 1314 is the (modifiable)
optid for the file type pdf. The ID for the option Max. PDF conversion time is 17.
file type pdf. The ID for the option Max. number of graphics extracted per document
is 19.
o Conversion by OCR in case of error (failover) option: PDF documents are only con-
verted by PDFBox by default. Starting with ELO Textreader version 20 (or starting
with version 12.00.005.000 for TR12, version 11.01.006.000 for TR11, version
10.19.100.000 for TR10) the documents are forwarded to OCR after conversion
errors. This feature can be disabled by entering "false" in the eloftopt database
table under optid 131418, if 1314 is the (modifiable) optid for the file type pdf. The
default value is "false", meaning the feature is enabled. Starting with version
20.01.000.002, you can also enable or disable failover mode in the ELO Administ-
ration Console.
Please note: After conversion errors by PDFBox, PDF files are for-
warded to OCR by default. Starting with version 20.01.000.002, you
can also enable or disable failover mode in the ELO Administration Con-
sole.
o Smooth images in PDF files option: Fonts in images extracted from PDF files and
forwarded to OCR may be frayed, so that OCR would not be able to do its job pro-
perly. This issue can be solved by reducing the image size, referred to as
"smoothing". A factor of 0.4 has proven to be a good value for smoothing frayed
images. This reduction factor can be configured by entering a value to optid 131420
in the eloftopt database table, if 1314 is the (modifiable) optid for file type pdf. The
ID for Smooth images in PDF files is 20. The default factor for smoothing is "0.0" –
this disables the feature, meaning no smoothing takes place (not when the com-
plete PDF is forwarded to OCR, not when using ICEpdf, not in Textreader version
9).
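The examples in this section (131415, 131417, 131418, 131420) are all consistent with appending the option identifier to the optid of the file type. The Python sketch below shows that composition. It is inferred from these examples only and is not a documented formula; the helper name `option_optid` is hypothetical. Always look up the actual optid of the file type in your own eloftopt table first, as it varies between installations.

```python
def option_optid(filetype_optid, option_id):
    """Compose a per-file-type optid by appending the option identifier.

    Assumption: the digits are simply concatenated, as in 1314 + 15 -> 131415.
    """
    return int(f"{filetype_optid}{option_id}")
```

If pdf has the optid 1314, `option_optid(1314, 17)` gives 131417, the optid for Max. PDF conversion time.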
• Maximum number of files in the folder option: Here, you can define the maximum number
of files for a single output folder, disable the check, or configure a custom number. After
reaching this threshold, the Textreader export/converter process pauses for several
minutes and then checks whether the number of files has dropped below the threshold
again (restriction for conversion processes: in the current version, only the PDF converter
with PDFBox checks the number of files). This setting is deactivated by default. To activate
it, enter the maximum number of files in the folder under optid 1062. Entering "0"
deactivates the check, and "-1" selects the automatic operating system setting.
Starting with ELO Textreader version 12, the converters do not pause (as described under
3.4.9 'Minimum free disk space in percent' option).
• Character set option: The default is UNICODE. Can be changed in the eloftopt database
Please note: For the file type pdf, enter the character set under optid 131415
• Copy and delete to move files option: To enable this option, enter true under optid 1146 in
• Delete folder option: This folder is not set as default. Can be changed in the eloftopt data-
• General behavior in the case of an error option: In addition to the alternatives offered in the
Administration Console, Create reference in the repository and No action, additional pos-
sible actions can be configured in the eloftopt database table under optid 1149. Below you
will find a complete list of the possible values:
• Deviations for container files option: No action deviating from General behavior in the case
of an error is set as default. This can also be changed in the eloftopt database under optid
1149. To be able to join the options General behavior in the case of an error and Deviations
for container files under one optid, the actions to be performed are configured by setting
bits in a bit pattern; the decimal value of this bit pattern is stored in the database under
optid 1149. Below you will find a list of the possible actions as well as the resulting values
o General behavior in the case of an error option set for Create reference in the reposi-
tory
However, Move to the error folder should be set as an option for container
No action should be set as an option for container files: Set optval to 1025.
o Move to the error folder set for General behavior in the case of an error and No action
o Only mark errors in the output file set for General behavior in the case of an error and
No action should be set as an option for container files: Set optval to 1028.
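The optval values documented for optid 1149 (1, 2, 4, 8, 513, 1025, 1026, 1028) are consistent with a bit pattern in which the low bits select the general behavior and the bits 512 and 1024 select the deviation for container files. The Python sketch below decodes a value on that basis. The bit assignments are inferred from the documented values and are an assumption, and the names `BASE_ACTIONS`, `CONTAINER_DEVIATIONS`, and `decode_optval` are hypothetical.

```python
# General behavior, taken from the documented values 1, 2, 4 and 8
BASE_ACTIONS = {
    1: "Create reference in the repository",
    2: "Move to the error folder",
    4: "Only mark errors in the output file",
    8: "No action",
}

# Deviation for container files, inferred from 513 = 1 + 512 and 1025 = 1 + 1024
CONTAINER_DEVIATIONS = {
    0: "Same as general behavior",
    512: "Move to the error folder",
    1024: "No action",
}

def decode_optval(optval):
    """Split an optid 1149 value into (general behavior, container deviation)."""
    base = optval & 0b1111             # low four bits: general behavior
    container = optval & (512 | 1024)  # bits for the container file deviation
    # A KeyError signals a value outside the documented combinations
    return BASE_ACTIONS[base], CONTAINER_DEVIATIONS[container]
```

For example, `decode_optval(1028)` returns the pair "Only mark errors in the output file" and "No action", which matches the documented meaning of 1028.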
• Delete folder option: This folder is not set as default. Can be changed in the eloftopt data-
• Error marker in the output file option notconv|notconv_sub entered as default. Can be chan-
Maximum size of the full text file (.txt)       1016            Default: 100 MB
Max. number of errors per day                   1054            Default option: not set
Max. number of pages in PDF for OCR conversion  1057            Default: 0 (always PDFBox conversion)
Minimum disk space (in percent)                 1060            Default: depends on operating system
Use copy and delete to move files               1146            true/false, Default: false
Folder for unknown file types                   1147            Default option: not set
General behavior in the case of an error        1149            1 (Create reference in the repository, include container files)
                                                                513 (Create reference in the repository, but skip container files;
                                                                    these are moved to error folders)
                                                                1025 (Create reference in the repository, but skip container files;
                                                                    no action)
                                                                2 (Move to the error folder, include container files)
                                                                1026 (Move to error folder, but skip container files; no action)
                                                                4 (Mark errors in the output file, include container files)
                                                                1028 (Mark errors in the output file, but skip container files;
                                                                    no action)
                                                                8 (No action, include container files)
Max. conversion time for PDFBox converter       e.g. 131417 if  Default: 10 minutes
                                                1314 is the
                                                optid for pdf