Transformation Engine Manual - en - v0.81


EKT Biblio-Transformation-Engine

User Guide v0.81

Α. Introduction

The Transformation Engine tool is a Java framework developed by the National Documentation Center / National
Hellenic Research Foundation. It consists of programmatic APIs for filtering and modifying records that
are retrieved from a data source (e.g. databases, files, legacy data sources), as well as for outputting them
in appropriate formats (e.g. database files, txt, xml, oai_dc). The framework consists of independent
abstract modules that are executed separately, in many cases offering alternative choices to the user
depending on the input data set, the transformation workflow that needs to be executed and the output
format that needs to be produced. The Transformation Engine tool can be used in the following cases:

1) Filtering records from a database
2) Cleaning and modifying records
3) Transforming records from the input format to the output format (regardless of whether any
filtering or modification of records takes place)

Β. Transformation Engine Framework

The Transformation Engine Framework includes abstract Java classes that the user needs to implement in
order to realize a procedure throughout the platform. Moreover, the framework provides users with core
functionality that determines the relations among these abstract classes and the way they function, and
holds the central control of the platform. The basic architecture of the platform, as well as the workflow
functionality of the record transformation, are shown in the figure below.
As shown, the transformation engine has the core control of the platform. The transformation engine is
initialized with the three basic elements of the platform: the DataLoader, the TransformationWorkflow and
the OutputGenerator. Additional configuration takes place on the data loader and the transformation
workflow. The latter consists of filters and modifiers, while the platform defines the Record as the basic
component that holds data across workflow steps. The data loader can be initialized with loading specs
that configure it to load only specific records, depending for example on a time period or an offset; this is
useful when data is loaded sequentially and each run needs to load the next batch of records from the
data source. Finally, both the transformation workflow and the output procedure can be interrupted by
processing and output conditions that lead the workflow back to the data loading state. This is necessary
when a specific number of output records is required: when some of them are filtered out, the engine goes
back to load more. A detailed explanation of each component follows.

Record
Class Record (gr.ekt.transformationengine.core.Record) is an abstraction of the data record that we want
to filter/clean/modify through the transformation engine. This class can represent the record in any format
the user wants; it can be an XML Element or it might be a Map with keys and objects. The platform is
record independent, meaning that it does not care about the information that is stored in each record.
However, a record is supposed to keep information about fields and their values (there may be multiple
values for a single field), and that is why the user needs to implement some methods of this abstract class
in their own extended version of Record. These methods are:

public abstract List<String> getByName(String fieldName);

public abstract void printByName(String fieldName);

public abstract void removeField(String fieldName);

public abstract void removeValueFromField(String fieldName, Object value);

public abstract void addField(String fieldName, ArrayList<Object> fieldValues);

public abstract void addValueToField(String fieldName, Object fieldValue);

Given that each record consists of fields, the user needs to implement the 6 methods mentioned previously.
The way each method is implemented depends on the way the data is held inside the record. Thus, if the
information is held as an XML Element, the getByName method can implement an XPath query or DOM
traversal in order to extract the information. Not all methods need to be fully implemented; for example,
removeField and addField are necessary only if the user needs to make changes to the records (e.g. in
data cleansing procedures), while for filtering they are not necessary.

Class SimpleRecord (gr.ekt.transformationengine.records.SimpleRecord) is also an abstract class that
provides an implementation of the “addField” method. Thus, users are encouraged to extend the
“SimpleRecord” class rather than the “Record” class.

The transformation engine library includes implementations of some Record classes that are expected to
be common in many cases.

MapRecord (gr.ekt.transformationengine.records.MapRecord):
It is a record that holds its data in a Map (key – value). Given such a record, the user can call the
getByName method with the key as an input parameter and get the value back as a result.

XMLRecord (gr.ekt.transformationengine.records.XMLRecord):

In this case, the record is not an actual implementation; rather, it is an abstract class that holds the
data as an XML Element. Given that each XML document may have its own schema, a single class
could not work for every case. However, all the aforementioned methods are implemented for the
user, who in turn has to implement one and only one method:

public abstract String mapFieldNameWithXpath(String fieldName);

where the user, given the field that needs to be edited/modified, has to return the XPath string inside
the element that is held by the record. Everything else is implemented inside the abstract class
XMLRecord.
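
For illustration only, a concrete subclass might look like the following sketch; the class name, the field
names and the XPath expressions are hypothetical, and it is assumed (per the description above) that
mapFieldNameWithXpath is the only method that needs to be implemented:

import gr.ekt.transformationengine.records.XMLRecord;

// Hypothetical record for a simple bibliographic XML schema.
public class SimpleXmlBiblioRecord extends XMLRecord {

    @Override
    public String mapFieldNameWithXpath(String fieldName) {
        // Map each logical field name to the XPath of its values
        // inside the XML Element held by the record.
        if ("title".equals(fieldName)) {
            return "titleInfo/title";
        } else if ("author".equals(fieldName)) {
            return "name/namePart";
        }
        // No XPath is known for this field name.
        return null;
    }
}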

MapDSpaceRecord (gr.ekt.transformationengine.records.MapDSpaceRecord):

It extends MapRecord and adds the ability for the record to hold its handle (for DSpace), the handle
prefix and a list of metadata prefixes such as “dc” or any other that will be used.

MapEseRecord (gr.ekt.transformationengine.records.MapEseRecord):

It extends MapRecord and adds the ability for the record to hold its identifier (for the ESE output)
and a list of metadata prefixes such as “dc” or any other that will be used.

Of course, users can create their own Records as long as they implement the abstract methods
mentioned before.
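
For illustration only, a minimal custom Record backed by an in-memory Map (much like the built-in
MapRecord) might look like the following sketch, assuming that the six methods listed above are the only
abstract ones; all names are hypothetical:

import gr.ekt.transformationengine.core.Record;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical map-backed record, similar in spirit to the built-in MapRecord.
public class MyMapRecord extends Record {

    // Each field name maps to the list of its values.
    private final Map<String, ArrayList<Object>> fields = new HashMap<String, ArrayList<Object>>();

    @Override
    public List<String> getByName(String fieldName) {
        List<String> values = new ArrayList<String>();
        ArrayList<Object> stored = fields.get(fieldName);
        if (stored != null) {
            for (Object value : stored) {
                values.add(String.valueOf(value));
            }
        }
        return values;
    }

    @Override
    public void printByName(String fieldName) {
        System.out.println(fieldName + " = " + getByName(fieldName));
    }

    @Override
    public void removeField(String fieldName) {
        fields.remove(fieldName);
    }

    @Override
    public void removeValueFromField(String fieldName, Object value) {
        ArrayList<Object> stored = fields.get(fieldName);
        if (stored != null) {
            stored.remove(value);
        }
    }

    @Override
    public void addField(String fieldName, ArrayList<Object> fieldValues) {
        fields.put(fieldName, fieldValues);
    }

    @Override
    public void addValueToField(String fieldName, Object fieldValue) {
        if (!fields.containsKey(fieldName)) {
            fields.put(fieldName, new ArrayList<Object>());
        }
        fields.get(fieldName).add(fieldValue);
    }
}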

RecordSet
Class RecordSet (gr.ekt.transformationengine.core.RecordSet) is simply a list of Records that moves within
the platform from one step to another until it reaches the output, where it can be saved in any format.

DataLoader
Class DataLoader (gr.ekt.transformationengine.core.DataLoader) is responsible for data loading in the
system. Here, the following abstract method is defined:

public abstract RecordSet loadData();

where the user can provide their own implementation. The method needs to return a RecordSet containing
the Records that the user has defined.
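
For illustration only, a custom DataLoader that builds one MapRecord per line of a plain text file could
look like the sketch below; the file name, the no-argument constructors and the RecordSet.addRecord
method are assumptions made solely for this example:

import gr.ekt.transformationengine.core.DataLoader;
import gr.ekt.transformationengine.core.RecordSet;
import gr.ekt.transformationengine.records.MapRecord;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical loader that turns every line of a text file into a record.
public class TitleFileDataLoader extends DataLoader {

    private String fileName = "titles.txt"; // hypothetical input file

    @Override
    public RecordSet loadData() {
        RecordSet recordSet = new RecordSet(); // assumed no-argument constructor
        try {
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            String line;
            while ((line = reader.readLine()) != null) {
                // One MapRecord per line, holding a single "title" field.
                MapRecord record = new MapRecord(); // assumed no-argument constructor
                record.addValueToField("title", line.trim());
                recordSet.addRecord(record); // assumed method for adding records
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return recordSet;
    }
}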

The Transformation Engine library includes the following implementations of DataLoader:

ExcelDataLoader (gr.ekt.transformation.dataLoaders.ExcelDataLoader):
ExcelDataLoader, given an xls file in which each row constitutes a record, creates a RecordSet of
MapRecords (described previously). This class accepts the following parameters:

1) The sheet index that we want to read (zero based)
2) The xls file name

ExcelDataLoader assumes that the first row contains the name of each column (and thus the key
name of the various fields used in the map record) and that the record rows start immediately
after the header row (that is, from the second row).

TSVDataLoader (gr.ekt.transformation.dataLoaders.dspace.TSVDataLoader):
It reads a tsv (tab separated values) file of the following format:

and creates a RecordSet of MapDSpaceRecords (that is a MapRecord as well). The first line of the
file is used for the keys of the various fields inside the record.

CSVDataLoader (gr.ekt.transformation.dataLoaders.dspace.CSVDataLoader)
It reads a csv (comma separated values) file of the following format:

and creates a RecordSet of MapDSpaceRecords (that is a MapRecord as well). The first line of the
file is used for the keys of the various fields inside the record.

RISDataLoader (gr.ekt.transformation.dataLoaders.dspace.RISDataLoader)
It reads a RIS file of the following format:
and creates a RecordSet of MapDSpaceRecords (that is a MapRecord as well). The two-letter
keywords at the beginning of each line are used as the keys of the fields in the record.

EndnoteDataLoader (gr.ekt.transformation.dataLoaders.dspace.EndnoteDataLoader)
It reads an EndNote file of the following format:

and creates a RecordSet of MapDSpaceRecords (that is a MapRecord as well). The two-letter
keywords at the beginning of each line are used as the keys of the fields in the record.

BibTexDataLoader (gr.ekt.transformation.dataLoaders.dspace.BibTexDataLoader)
It reads a bibtex file of the following format:
and creates a RecordSet of MapDSpaceRecords (that is a MapRecord as well). The keywords at
the beginning of each line are used as the keys of the fields in the record.

IsiHtmlDataLoader (gr.ekt.transformation.dataLoaders.dspace.IsiHtmlDataLoader)
It reads an ISI HTML file of the following format:

and creates a RecordSet of MapDSpaceRecords (that is a MapRecord as well). The keywords in
the first column of the html table are used as the keys for the various record fields.

All the aforementioned data loaders are also available in the package
gr.ekt.transformation.dataLoaders.ese with the difference that they create a record set of
MapEseRecords.

DataLoadingSpec
Class DataLoadingSpec (gr.ekt.transformationengine.dataloaders.DataLoadingSpec) is responsible for
initializing the data loader so that it knows which records to load. The only method that the user needs to
implement is the following abstract one:

public abstract DataLoadingSpec generateNextLoadingSpec();

In this method the user is responsible for returning the data loading spec for the next time the data
loader is going to run. For example, suppose that we need to load n records the first time, and then the
next n ones. In such a case, the data loading spec needs to hold the offset of the records to be read, which
initially is 0. After n records are read, the user, in the aforementioned abstract method, needs to return a
new data loading spec with the offset set to n, in order to proceed to the next n records.
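
A minimal sketch of such an offset-based spec follows; the offset and pageSize fields, along with their
accessors, are hypothetical and must match whatever the corresponding DataLoader actually reads from
the spec:

import gr.ekt.transformationengine.dataloaders.DataLoadingSpec;

// Hypothetical spec that advances a record offset by a fixed page size.
public class OffsetLoadingSpec extends DataLoadingSpec {

    private int offset;
    private int pageSize;

    public OffsetLoadingSpec(int offset, int pageSize) {
        this.offset = offset;
        this.pageSize = pageSize;
    }

    public int getOffset() {
        return offset;
    }

    public int getPageSize() {
        return pageSize;
    }

    @Override
    public DataLoadingSpec generateNextLoadingSpec() {
        // The next run should start where the current one stopped.
        return new OffsetLoadingSpec(offset + pageSize, pageSize);
    }
}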

OutputGenerator
Class OutputGenerator (gr.ekt.transformationengine.core.OutputGenerator) is responsible for the
output of the records. It consists of the following abstract method:

public abstract boolean generateOutput(RecordSet recordSet);

that the user needs to implement in their own implementation of OutputGenerator. It takes a RecordSet as
input, and the user needs to write code that extracts the information in whatever format is desired. In this
method, the getByName method of the Record class can be used in order to access the various values of
the Record fields.
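
For illustration only, a trivial OutputGenerator that writes the values of a hypothetical “title” field of every
record to a text file might look like the sketch below; the output file name and the RecordSet.getRecords
accessor are assumptions made solely for this example:

import gr.ekt.transformationengine.core.OutputGenerator;
import gr.ekt.transformationengine.core.Record;
import gr.ekt.transformationengine.core.RecordSet;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

// Hypothetical generator that writes one line per record to a text file.
public class TitleTxtOutputGenerator extends OutputGenerator {

    private String outputFileName = "titles-out.txt"; // hypothetical output file

    @Override
    public boolean generateOutput(RecordSet recordSet) {
        try {
            PrintWriter writer = new PrintWriter(new FileWriter(outputFileName));
            for (Record record : recordSet.getRecords()) { // assumed accessor
                // getByName gives access to the values of a record field.
                List<String> titles = record.getByName("title");
                writer.println(titles);
            }
            writer.close();
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }
}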

The Transformation Engine library includes the following implementations of OutputGenerator:

DSpaceOutputGenerator
(gr.ekt.transformationengine.outputGenerators.DSpaceOutputGenerator):

This OutputGenerator is responsible for outputting records in the format that DSpace expects for its
batch import, according to:

http://www.tdl.org/wp-content/uploads/2009/04/DSpaceBatchImportFormat.pdf

Users can have as many schemas as they like for the output, in case they have added more schemas
to their DSpace apart from the default “dc” schema.

This output generator accepts as input parameters the correspondence between the names of the
fields and the dc elements.

DublinCoreDSpaceOutputGenerator
(gr.ekt.transformationengine.outputGenerators.DublinCoreDSpaceOutputGenerator)

This OutputGenerator is an enhancement of the previous one, in that the mapping between the
keys of the fields and the dc elements is not needed. It recognizes field keys of the format
“dc.element.[qualifier]” and creates the mapping automatically. Of course, in order to have keys
like these, you have to include a modifier in the workflow that changes the key names from the
existing ones to ones that can be recognized by this output generator.

ESEOutputGenerator (gr.ekt.transformationengine.outputGenerators.ESEOutputGenerator)
This OutputGenerator is responsible for outputting records in the OAI_PMH format that Europeana
uses to import records into its database. This output generator accepts as input parameters the
correspondence between the names of the fields and the “dc” or “europeana” elements, so that the
ESE output can be created.

Condition
Class Condition (gr.ekt.transformationengine.conditions.Condition) is responsible for checking a
condition that needs to be met by the record set; if it is not met, the workflow state is returned to the data
loading task. In that case, the data loader is called again to load more records into the system and the
whole transformation workflow runs from the beginning. The user needs to implement the following
abstract method:

public abstract boolean check(RecordSet recordSet);

In this method the user checks the record set and if the condition is not met, he must return false.
Otherwise, he must return true.

Conditions are mostly used before the output generation takes place to check if the record set meets
some conditions (e.g. the size of the record set). If the condition is not met, the processing goes back to
the data loading state (by setting correctly the data loading specs) to load more data.
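
For example, a condition that lets the workflow proceed to the output only once a minimum number of
records has been gathered might look like the sketch below; the threshold and the RecordSet.getRecords
accessor are assumptions made solely for this example:

import gr.ekt.transformationengine.conditions.Condition;
import gr.ekt.transformationengine.core.RecordSet;

// Hypothetical condition that requires a minimum record set size before output.
public class MinimumSizeCondition extends Condition {

    private int minimumSize = 100; // hypothetical threshold

    @Override
    public boolean check(RecordSet recordSet) {
        // true  -> condition met, proceed to the output generation
        // false -> condition not met, go back to data loading for more records
        return recordSet.getRecords().size() >= minimumSize; // assumed accessor
    }
}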

TransformationWorkflow
Class TransformationWorkflow (gr.ekt.transformationengine.core.TransformationWorkflow) includes
the core logic of the transformation engine. It accepts as input some workflow steps (filters or modifiers),
which we are going to see in a while, and the user can specify how these steps will work together. In the
transformation engine library, you can find the following built-in implementation:

ConjuctionTransformationWorkflow (gr.ekt.transformationengine.core.ConjuctionTransformationWorkflow):

This workflow runs all the steps serially, and if any of them (typically a filter) returns true, the
record is filtered out. In the case of a modifier, the return value is always false, meaning that the
record is kept in the system.

ProcessingStep
Class ProcessingStep (gr.ekt.transformationengine.core.ProcessingStep) is the minimal module executed
in the system. It is the basic ingredient of a TransformationWorkflow. There can be two types of
ProcessingStep in the system: filters and modifiers, as explained below.

Filter (extends gr.ekt.transformationengine.core.ProcessingStep)

Class Filter (gr.ekt.transformationengine.core.Filter) is responsible for deciding whether a Record will be
driven to the output or not. It includes the following abstract method:

public abstract boolean filter(Record record)

which the user needs to implement. It accepts a Record as an input parameter and needs to return false if
the record must be included in the output, or true otherwise.

The logic behind filters is to cut some records from the output; however, the user can always return false
(meaning that no filtering takes place) and just modify the record by changing the values of some fields.
The getByName, addField, removeField and updateField methods of the Record class, discussed
previously, can be useful here.

A Filter can contain a list of strings (loaded at the very beginning through an initializer) and a
comparator. Given the field that the user defines, such a filter returns false (meaning no filtering) if the
value of the field is found in the list of strings. The comparison is done using the comparator, as explained
later on.
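
As a simple illustration, the sketch below filters out every record whose “language” field does not contain
one of two allowed values; the field name and the values are hypothetical, and it is assumed that filter is
the only method that needs to be implemented:

import gr.ekt.transformationengine.core.Filter;
import gr.ekt.transformationengine.core.Record;

import java.util.Arrays;
import java.util.List;

// Hypothetical filter that keeps only records with an allowed language value.
public class LanguageFilter extends Filter {

    private List<String> allowedLanguages = Arrays.asList("en", "el");

    @Override
    public boolean filter(Record record) {
        for (String value : record.getByName("language")) {
            if (allowedLanguages.contains(value)) {
                return false; // keep the record (no filtering takes place)
            }
        }
        return true; // filter the record out
    }
}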

Modifier (extends gr.ekt.transformationengine.core.ProcessingStep)

Class Modifier (gr.ekt.transformationengine.core.Modifier) is responsible for modifying a Record by
changing the values of some of its fields. It includes the following abstract method:

public abstract void modify(Record record);

which the user needs to implement. It accepts as an input argument a Record that needs to be edited. The
user can work with the getByName, addField, removeField and updateField methods of the Record class.
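
As a simple illustration, the following sketch is a modifier that trims surrounding whitespace from every
value of a hypothetical “title” field:

import gr.ekt.transformationengine.core.Modifier;
import gr.ekt.transformationengine.core.Record;

import java.util.List;

// Hypothetical modifier that cleans up the values of the "title" field.
public class TitleTrimModifier extends Modifier {

    @Override
    public void modify(Record record) {
        List<String> titles = record.getByName("title");
        // Rebuild the field with the cleaned values.
        record.removeField("title");
        for (String title : titles) {
            record.addValueToField("title", title.trim());
        }
    }
}

In the transformation engine library, you can find the following built-in implementations: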

KeyRenameModifier (gr.ekt.transformationengine.modifiers.KeyRenameModifier):

This modifier updates the key names of the various fields of the record with new names. It is an
abstract class which defines the following abstract method:

public abstract void loadMapping();

The mapping is a map of keys (old key value) and values (the new key value).

SimpleKeyRenameModifier (gr.ekt.transformationengine.modifiers.SimpleKeyRenameModifier):

It is an implementation of KeyRenameModifier, where the user can load the mapping between the
old and the new values directly within the Spring XML configuration file.

FileKeyRenameModifier (gr.ekt.transformationengine.modifiers.FileKeyRenameModifier):

It is an implementation of the KeyRenameModifier, where the user can provide the filename of an
XML file that includes the mapping. Here is an example of such a file:
The framework includes such file mappings for the following transformations:

• TSV to DC mapping
• CSV to DC mapping
• RIS to DC mapping
• Endnote to DC mapping
• Bibtex to DC mapping
• ISI HTML to DC mapping

Comparator
Class Comparator (gr.ekt.dataprocessing.core.Comparator) is responsible for comparing the values of two
strings. More specifically, the abstract Comparator class includes the following method:

public boolean compare(List<Object> valuesList, String recordValue)

in which the user feeds a list of strings to be compared with the record value. The comparator must return
true in case the value is found among the values of the list; otherwise, it must return false. The actual
implementation of the comparator is left to the user, as well as the logic of how to combine the separate
comparison results between each value of the list and the record value.
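
A minimal sketch of such a comparator, performing a case-insensitive equality check against every value of
the list, might look as follows; the class name is hypothetical and the base class path is taken from the
description above:

import gr.ekt.dataprocessing.core.Comparator;

import java.util.List;

// Hypothetical comparator: case-insensitive equality against the list values.
public class CaseInsensitiveEqualsComparator extends Comparator {

    @Override
    public boolean compare(List<Object> valuesList, String recordValue) {
        for (Object value : valuesList) {
            if (String.valueOf(value).equalsIgnoreCase(recordValue)) {
                return true; // the record value matches a value of the list
            }
        }
        return false; // no value of the list matched the record value
    }
}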

The Transformation Engine library includes the following implementations of the Comparator class:

EqualsComparator (gr.ekt.transformationengine.comparators.EqualsComparator):

It checks if the record value is contained exactly as is in the list of strings. If yes, it returns true, otherwise
false.

IncludesComparator (gr.ekt.transformationengine.comparators.IncludesComparator):

It checks if the record value is included (using the Java String contains method) in a value of the list of
strings. If yes, it returns true, otherwise false.

Initializer
Class Initializer (gr.ekt.transformationengine.initializers.Initializer) is used to initialize a filter with a list
of string values. The usual requirement is to have a filter with some string values and check whether the
value of a field of a record is contained in these values. As mentioned later, the word “contained” does not
always mean equality, but a similarity that is expressed by a comparator and possibly by a string metric
algorithm. The user, in their own implementation of Initializer, needs to implement the following abstract
method:

public abstract List<Object> initialize()

where the user needs to return a list of values (not necessarily strings) that are going to be used by a filter
for comparison purposes.
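
A minimal sketch of an initializer that simply returns a hard-coded list of values follows; in practice the list
could equally well be read from a file or a database, and the values shown are hypothetical:

import gr.ekt.transformationengine.initializers.Initializer;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical initializer returning a fixed list of accepted values.
public class StaticListInitializer extends Initializer {

    @Override
    public List<Object> initialize() {
        return new ArrayList<Object>(Arrays.asList("article", "book", "thesis"));
    }
}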
