Reconciling Schema Matching Networks: Thèse N 6033 (2013)
Abstract
Pay-as-you-go reconciliation. A single human expert is involved to validate the generated correspondences. In Chapter 4, we develop a reconciliation guiding method, in which the correspondences are validated in an order determined by the expected benefit of knowing their correctness. To further reduce the human effort involved and to detect erroneous input, we propose a reasoning technique based on the integrity constraints to derive the consequences of each validation. Moreover, as the availability of such expert work is often limited, we also develop heuristics to construct, with high probability, a set of good-quality correspondences even if the expert does not validate all the necessary correspondences.
Collaborative reconciliation. While the pay-as-you-go setting above considers only a single expert, this setting employs a group of experts who work simultaneously to validate a set of generated correspondences. As the experts might have conflicting views on whether a given correspondence is correct, there is a need to support discussion and negotiation among the experts. In Chapter 5, we leverage the theoretical advances and multiagent nature of argumentation to build supporting tools for collaborative reconciliation. More precisely, we construct an abstract argumentation from the experts' inputs and encode the integrity constraints to detect validation conflicts. Then we guide the conflict resolution by presenting meaningful interpretations of the conflicts and offering various metrics to help the experts understand the consequences of their own decisions as well as those of others.
Crowdsourced reconciliation. In the two settings above, we elicit knowledge from experts for reconciliation. However, experts are not always available, for instance due to a limited effort budget. To overcome this limitation, we adopt a crowdsourcing approach that employs a large number of crowd users online, with the advantages of low monetary cost and high availability. In Chapter 6, we propose techniques to obtain high-quality validation results while minimizing the labour effort of the crowd, including (i) contextual question design, which incorporates contextual information into the questions, and (ii) constraint-based answer aggregation, which aggregates the different answers of the crowd based on the dependencies between correspondences.
Through theoretical and empirical findings, this thesis highlights the importance and robustness of schema matching networks in reconciling the erroneous matches produced by automatic matching. Such matching networks are independent of the matching tools used and of the reconciliation setting, leading to more potential applications in the future. Especially in the era of Big Data, our techniques are well suited for big data integration, in which more and more data sources are incorporated over time.
Keywords: data integration, schema matching, reconciliation, crowdsourcing, collaborative work, argumentation
Résumé
These integrity constraints create a number of dependencies between correspondences that are difficult to take into account in the case of large networks. Despite these complications, the dependencies also bring additional information that helps guide the validation of the correspondences by providing a quality criterion. The contributions of this work propose a solution to the problem of reconciling schema matching networks in three given settings.
Pay-as-you-go reconciliation. A single person is involved in validating the automatically generated correspondences. In Chapter 4, we develop a method to guide the reconciliation phase, in which the correspondences are validated in an order that depends on the potential benefit they can bring if they turn out to be correct. To further reduce the human involvement in the process and to detect possibly erroneous input, we propose a reasoning technique that derives the consequences of a validation based on the integrity constraints of the schema network. Moreover, since the availability of expert advice is often limited, we also develop heuristics to construct, with high probability, sets of good-quality correspondences even when the expert does not validate all the necessary correspondences.
Collaborative reconciliation. In contrast to the previous reconciliation setting, which involves only a single expert, this setting considers a group of experts who work simultaneously on validating a set of automatically generated correspondences. Since the experts may disagree on the validity of a correspondence, a tool supporting discussion and negotiation is necessary. In Chapter 5, we use the theoretical advances and the multi-agent nature of argumentation to build tools supporting collective reconciliation. More precisely, we construct an abstract argumentation from the experts' inputs and integrate the integrity constraints drawn from the topology of the schema network in order to detect validation conflicts. Finally, we guide the resolution of conflicts by presenting a coherent interpretation of the conflicts and by offering various measures intended to help the experts understand the consequences of their own decisions as well as those of the other experts.
Crowdsourced reconciliation. In the two previous settings, we use the knowledge of experts to validate the correspondences. However, this knowledge is not always available, for example for budget reasons. To overcome this limitation, we use crowdsourcing, which relies on a large number of online users and offers the advantages of being inexpensive and highly available. In Chapter 6, we propose a method to obtain high-quality correspondence validations while minimizing the users' work, including (i) a contextualized question design, which incorporates contextual elements into the questions, and (ii) a constraint-based aggregation of results, which aggregates the users' answers based on the dependencies between correspondences.
Through research that is both empirical and theoretical, this thesis highlights the importance and the robustness of schema matching networks for the reconciliation of erroneous automatically generated correspondences. These matching networks are independent of the tools used to generate the correspondences as well as of the reconciliation setting, leading to many potential applications. In particular, with the advent of the Big Data era in recent years, the presented techniques are well suited to the integration of this deluge of data, in which more and more sources are incorporated over time.
Keywords: data integration, schema matching, reconciliation, crowdsourcing, collaborative work, argumentation
Contents

Abstract
Résumé

1 Introduction
1.1 Motivation
1.1.1 Applications of Schema Matching Networks
1.1.2 The Need of Schema Matching Reconciliation
1.2 Goals and Research Questions
1.3 Contributions and Thesis Organization
1.4 Selected Publications

2 Background
2.1 Schema Matching
2.1.1 Matching Techniques
2.1.1.1 Techniques for Pair-wise Matching
2.1.1.2 Techniques for Multiple Schemas
2.1.1.3 Techniques for Large Schemas
2.1.1.4 Combined Techniques
2.1.1.5 Semi-Automatic Techniques

4 Pay-as-you-go Reconciliation
4.1 Introduction
4.2 Model and Approach
4.2.1 Motivating Example
4.2.2 Framework
4.2.3 Reconciliation Process
4.3 Minimize User Effort
4.3.1 Effort Minimization by Ordering
4.3.2 Effort Minimization by Reasoning
4.4 Instantiate Selective Matching
4.4.1 Problem Statement
4.4.2 Heuristic-based Algorithm
4.5 Empirical Evaluation
4.5.1 Experimental Setup
4.5.2 Evaluations on Minimizing User Effort
4.5.3 Evaluations on Instantiating Selective Matching
4.6 Summary

5 Collaborative Reconciliation
5.1 Introduction
5.1.1 Motivating Example
5.1.2 Overall Approach
5.2 Model and System Overview
5.2.1 Task Partitioning

6 Crowdsourced Reconciliation
6.1 Introduction
6.2 Model and Overview
6.3 Question Design
6.4 Answer Aggregation
6.4.1 Aggregating Without Constraints
6.4.2 Leveraging Constraints to Reduce Error Rate
6.4.2.1 Aggregating with Constraints
6.4.2.2 Aggregating with 1-1 Constraint
6.4.2.3 Aggregating with Cycle Constraint
6.4.2.4 Aggregating with Multiple Constraints
6.5 Worker Assessment
6.5.1 Detect Spammers
6.5.2 Detect Worker Dependency

7 Conclusion
7.1 Summary of the Work
7.2 Future Directions
7.2.1 Managing Schema Matching Networks
7.2.2 Big Data Integration
7.2.3 Generalizing Reconciliation for Crowdsourced Models

Bibliography
Chapter 1
Introduction
More and more online services enable users to upload and share structured data, including Google Fusion Tables [GHJ+10], Freebase [BEP+08], and Factual [fac]. These services primarily offer easy visualization of uploaded data as well as tools to embed the visualization into blogs or Web pages. As the number of publicly available datasets grows rapidly and the fragmentation of data across different sources becomes a common phenomenon, it is essential to create interlinks between them [DSFG+12]. An example is the often quoted coffee consumption data found in Google Fusion Tables, which is distributed among different tables, each representing a specific region [GHJ+10]. Extracting information over all regions requires means to query and aggregate across multiple tables, raising the need to interconnect table schemas to achieve an integrated view of the data. This task is often labeled schema matching: the process of generating a set of correspondences between attributes of the involved schemas.
Not only is schema matching essential for such online services, but it is also crucial for data integration in large enterprises. Specifically, schema matching makes it possible to integrate data distributed across the different subsidiaries of an enterprise, for instance to provide access to all data via a Web portal. Moreover, it also enables collaboration between the information systems of different companies through a seamless exchange of the data residing in their databases. In fact, the market for data integration has been growing rapidly and attracting large amounts of capital in recent years. Statistically, data integration was thought to consume about 40% of the IT budget in large enterprises [BM07, Haa06]. The market was worth about $2.5 billion in 2007, with an average annual growth rate of 8.7% [HAB+05]. Those numbers reflect the high importance of schema matching in large enterprises.
Technically, schema matching is the problem of generating correspondences between the attributes of two given schemas. A schema is a formal structure designed by human beings, such as a relational or an XML schema. Each schema contains several attributes, each of which reflects the meaning of a particular aspect of the schema. An attribute correspondence between a pair of schemas captures the equivalence relationship of two given attributes, implying that they have the same meaning. For example, Figure 1.1 shows three XML schemas that were developed independently in the same domain. Despite the lexical difference between the two attributes releaseDate and availabilityDate, they have the same meaning, and thus a correspondence (c2) is generated between them. A set of attribute correspondences is typically the outcome of the schema matching process. It is noteworthy that although we mainly use XML schemas for illustration in this thesis, our proposed techniques are applicable to a wide range of crowdsourced models designed by humans, such as ontologies and web services.
[Figure 1.1: Three XML schemas developed independently in the same domain (including s1: EoverI and s3: DVDizzy), with attributes such as a1: releaseDate, a3: availabilityDate, and a4: productionDate, connected by correspondences c1, ..., c5.]
Satisfying the network-level integrity constraints is a must to enforce the natural expectations for consistency purposes in data integration. The presence of such integrity constraints creates a number of dependencies between correspondences, making it challenging to guarantee overall consistency, especially in large-scale networks. Despite this challenge, the dependencies between correspondences create an opportunity to improve the quality of the matching by providing evidence for detecting problematic correspondences. Prioritizing the problematic correspondences for user validation is crucial to reduce the necessary effort, since those correspondences are the most likely cause of poor matching results.
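For a concrete flavor of such dependencies, here is a minimal sketch (with illustrative candidate correspondences, not the reconciliation algorithm developed in this thesis) of how a 1-1 matching constraint lets one validated correspondence rule out competing ones:

```python
# Candidate correspondences between two schemas, keyed by id (hypothetical data).
candidates = {
    "c2": ("s1.releaseDate", "s3.availabilityDate"),
    "c6": ("s1.releaseDate", "s3.productionDate"),  # competes with c2 on s1.releaseDate
    "c7": ("s1.title",       "s3.name"),
}

def approve(cid, candidates):
    """Approving cid rejects every candidate that shares an attribute with it,
    since a 1-1 constraint allows each attribute at most one correspondence."""
    a, b = candidates[cid]
    return {other for other, (x, y) in candidates.items()
            if other != cid and (x == a or y == b)}

print(approve("c2", candidates))  # -> {'c6'}
```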
In this chapter, we first argue that reconciliation in schema matching networks is important. To support this claim, we show a wide range of applications for schema matching networks as well as the need to reconcile these networks. Then we introduce the goals of this thesis and the research questions tackled to achieve these goals. After that, we summarize the contributions, which are the proposed solutions to those research questions. Finally, we present the thesis organization and selected publications.
1.1 Motivation
Before diving into the goals of this thesis and the research questions, we would like to convince the reader that reconciliation in schema matching networks (SMNs) is important. To do this, we show the presence of SMNs in many applications and the need for reconciliation.
There are many applications that require schema matching, ranging from large enterprises to cloud platforms.
Large enterprises. Large enterprises often consist of many subsidiaries whose databases are developed independently for targeted business needs. Hence, data reside in multiple sources throughout an enterprise, rather than sitting in one neatly organized database. Consequently, there is a need to query across different databases to provide a unified view of the data in the whole enterprise. To support these cross-queries, we need to identify the matchings between the schemas of the involved databases [LMR90, SMM+09a]; schema matching is thus a valuable tool to realize this identification.
Dataspaces. Dataspaces have been proposed [FHM05, HFM06] as a new abstraction for data management to meet the growing demands of pervasive data. One successful application of dataspaces is personal information management, which involves highly heterogeneous data typically stored across multiple file systems (local or network) and multiple machines (desktops or mobile devices). The goal of dataspaces is to provide primitive functionality over all data sources. To realize such functionality, schema matching is an essential component for establishing the connections between all data sources in the dataspace.
effort involved. The main difference between the literature and our work is that we study reconciliation for schema matching networks, which involve many pair-wise matchings at the same time.
• How to propagate user input? The presence of integrity constraints creates a number of dependencies between correspondences, which are hard to keep track of during reconciliation. Despite this challenge, the dependencies between correspondences open an opportunity to derive the consequences of user input. As a result, redundant validation of correspondences can be avoided, saving the expert's work.
• How to detect conflicting inputs? As experts might have different opinions about the correctness of correspondences, their inputs will inevitably contain conflicts. Moreover, since they work on local parts of the network, some global consistency conditions can be violated. Consequently, we regard detecting conflicts as an important task to eliminate these violations.
• How to guide conflict resolution? Since conflicts are inevitable, we need to define a guiding mechanism that assists experts in exchanging knowledge, debugging, and explaining the possible validations. Through this guidance, an expert can better trust his own decisions and those of the others, resulting in rapid conflict resolution.
• How to design and post questions to the crowd? The validation questions need to be properly designed so that the human workers actually give the correct answers that the reconciliation needs. Moreover, the more understandable the questions are for workers, the better the quality of the answers is likely to be; consequently, the monetary cost and completion time can also be reduced.
• How to aggregate the answers from the crowd effectively? Workers often give different answers to the same question. Since they have a wide range of expertise, it is often difficult to aggregate their answers. A good answer aggregation method should be able to compute aggregated answers of high quality.
• How to reduce the monetary cost and completion time? To achieve a high-quality validation, answers should be obtained from as many workers as possible. However, obtaining a large set of worker answers takes much time and money. The challenge then becomes how to avoid redundant questions in order to reduce the monetary cost.
1.3 Contributions and Thesis Organization
• We propose a reasoning technique to reduce the human effort involved and to detect erroneous input in most cases. Due to the presence of integrity constraints, there are dependencies between correspondences. Based on these dependencies, we derive the consequences of user input as well as detect inconsistencies in the validation results.
• We model the schema matching network and the reconciliation process, where we relate the experts' assertions and the constraints of the matching network to an argumentation framework [Dun95]. Our representation not only captures the experts' beliefs and their explanations, but also enables reasoning about these captured inputs.
• We develop supporting techniques for experts to detect conflicts in the set of their assertions. To do so, we construct an abstract argumentation [Dun95] from the experts' inputs. In terms of this abstract argumentation, we define rules for analyzing and detecting conflicts between correspondence assertions.
1.4 Selected Publications
• Nguyen Quoc Viet Hung, Tri Kurniawan Wijaya, Zoltan Miklos, Karl Aberer,
Eliezer Levy, Victor Shafran, Avigdor Gal and Matthias Weidlich. Minimizing
Human Effort in Reconciling Match Networks. The 32nd International Conference
on Conceptual Modeling (ER), 2013. (Chapter 3, 4)
• Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Zoltan Miklos, Karl Aberer, Avigdor Gal and Matthias Weidlich. Pay-as-you-go Reconciliation in Schema Matching Networks. The 30th IEEE International Conference on Data Engineering (ICDE), 2014. (Chapter 4)
• Nguyen Quoc Viet Hung, Xuan Hoai Luong, Zoltan Miklos, Tho Quan Thanh, Karl
Aberer. Collaborative Schema Matching Reconciliation. The 21st International
Conference on Cooperative Information Systems (CoopIS), 2013. (Chapter 5)
• Nguyen Quoc Viet Hung, Xuan Hoai Luong, Zoltan Miklos, Tho Quan Thanh,
Karl Aberer. An MAS Negotiation Support Tool for Schema Matching. The
12th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS), 2013. (Chapter 5)
• Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Zoltan Miklos, Karl Aberer. On
Leveraging Crowdsourcing Techniques for Schema Matching Networks. The 18th
International Conference on Database Systems for Advanced Applications (DASFAA), 2013 (Best Student Paper Award). (Chapter 6)
• Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, Karl Aberer. An
Evaluation of Aggregation Techniques in Crowdsourcing. The 14th International
Conference on Web Information System Engineering (WISE), 2013. (Chapter 6)
• Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, Karl Aberer. A
Benchmark for Aggregation Techniques in Crowdsourcing. The 35th International
ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR), 2013. (Chapter 6)
Chapter 2
Background
In this chapter, we review the literature related to this thesis. For a better understanding and a clear organization, we present four topics as the research background of this thesis: schema matching, answer set programming, argumentation, and crowdsourcing. For each topic, we describe characteristics, issues, and state-of-the-art approaches. This chapter is organized as follows. In Section 2.1, we survey important techniques and tools for schema matching. In Section 2.2, we discuss the essentials of Answer Set Programming. In Section 2.3, we summarize theoretical advances in argumentation. Finally, Section 2.4 contains an overview of crowdsourcing issues and techniques. It is noteworthy that nothing in this chapter is new to this dissertation. Important citations are provided for major definitions and results. We present here the most relevant literature, without being exhaustive.
There are numerous techniques for the schema matching problem, ranging from general-purpose approaches to domain-specific algorithms. Each of them has different characteristics and addresses different problems; therefore, having an overview is desirable. In this section, various schema matching techniques are presented, aiming to show the essence of their approaches and targeted problems.
[Figure 2.1: A classification of schema matching techniques: pair-wise matching (element-based, structure-based, rule-based), multiple schemas (reuse-based, holistic), large schemas (partition-based, parallel), combined techniques (hybrid, composite), and semi-automatic techniques (post-matching validation, active learning and reasoning).]
Element-based techniques. The techniques falling into this category match two given schema elements solely based on their own characteristics (name, description, data type, value range, cardinality, term, key, relationship, etc.). We describe the most representative techniques in what follows.
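As a simple illustration of one element-level signal, the following sketch scores two attribute names by string similarity after a naive normalization (both the normalization and the use of difflib are assumptions for illustration, not a prescribed technique):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # Normalize: drop underscores and case, then compare the strings.
    norm = lambda s: s.replace("_", "").lower()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

print(round(name_similarity("releaseDate", "release_date"), 2))  # -> 1.0
print(round(name_similarity("releaseDate", "availabilityDate"), 2))
```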
Beside pair-wise matching techniques, there are other approaches that consider multiple schemas at the same time and collectively generate the matches between them. The core idea is that each pair of schemas is still matched by existing techniques, and the generated matchings are collected and refined by each other. The two main categories are reuse-based matching and holistic matching.
Reuse-based Matching. This approach stores and reuses information from existing matchings to aid the generation of new matchings [DR07, SSC10a, SBH08]. The reuse-oriented approach is promising, since schemas in the same domain are often similar to each other. One kind of reusable information is groups of similar schema attributes that are matched frequently. For example, the attributes Buyer and Purchaser are matched frequently in the schemas of purchase order documents, and the more often they appear, the higher the confidence value of their matches. Another kind of reusable information is generated correspondences that have been validated by users. For example, given three schemas s1, s2, s3, if the correspondence between the attributes s1.PersonName and s2.Person is already validated by a user, then the correspondence between the attributes s1.PersonName and s3.Person can be generated automatically, assuming that the name Person has a unique meaning.
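A minimal sketch of this kind of reuse, under the stated assumption that an attribute name has a unique meaning (schema and attribute names follow the example above):

```python
# Validated correspondences, written as (schema.attribute, schema.attribute) pairs.
validated = {("s1.PersonName", "s2.Person")}

def propagate(validated, attributes):
    """Propose s1.a ~ s3.b whenever s1.a ~ s2.b is validated and another
    schema has an attribute with the same name b (unique-meaning assumption)."""
    proposed = set()
    for a, b in validated:
        b_name = b.split(".", 1)[1]
        for attr in attributes:
            if attr != b and attr.split(".", 1)[1] == b_name:
                proposed.add((a, attr))
    return proposed

print(propagate(validated, ["s3.Person", "s3.Address"]))
# -> {('s1.PersonName', 's3.Person')}
```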
Holistic Matching. The main idea of this approach is to extract collective properties of the collected schemas in order to refine each pair-wise matching. There are several methods to realize this approach. One method is to construct a single mediated schema for web forms in the same domain [HMYW03]. Another is to compute statistical co-occurrences of attributes in different schemas and use them to derive complex correspondences [SWL06, HC03]. Last, but not least, is the method that uses a 'corpus' of schemas to augment the evidence, improving existing matchings and exploiting constraints between attributes by applying statistical techniques [MBDH05]. Moreover, further collective properties such as network-level transitivity are considered in [ACMH03, CMAF06b], in which the establishment of attribute correspondences is studied in large-scale P2P networks.
To improve the quality of matching results, it is common practice to use multiple schema matching techniques at the same time. An individual matching technique is often developed for a specific problem and is thus unlikely to cover all good candidate correspondences. Combining different individual matching techniques is a natural solution, since the strengths of one technique can complement the weaknesses of another and vice-versa. There are two possible combination approaches:
Composite Matching. In this approach, different matchers are executed independently and their matching results are then combined. Many composite matching techniques have been developed [DDH01a, ADMR05b, DR02b, DMD+03], such as combining the correspondences at attribute-level (i.e., the confidence value is aggregated by average, max, or min functions) and at schema-level (i.e., all correspondences are aggregated and refined by each other, taking the importance of individual matchers into account). Compared to hybrid matching, the composite approach provides a higher degree of customization. The matchers can be selected or combined either automatically or manually, executed either in sequence or in parallel, and plugged in or out on demand (a hybrid matcher can even be plugged into a composite matcher, but not vice-versa). Nevertheless, a composite matcher can be slower than a hybrid one, since the matching between two schemas has to be performed in multiple passes.
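A minimal sketch of attribute-level combination (matcher names and scores are illustrative; average is just one of the aggregation functions mentioned above):

```python
# Per-matcher confidence values for two candidate correspondences (toy data).
scores = {
    ("releaseDate", "availabilityDate"): {"name": 0.55, "datatype": 0.90, "instance": 0.80},
    ("releaseDate", "productionDate"):   {"name": 0.60, "datatype": 0.90, "instance": 0.30},
}

def combine(scores, agg=lambda vals: sum(vals) / len(vals)):
    # Aggregate each candidate's matcher scores with the given function.
    return {pair: agg(list(per_matcher.values())) for pair, per_matcher in scores.items()}

print(combine(scores))
# -> {('releaseDate', 'availabilityDate'): 0.75, ('releaseDate', 'productionDate'): 0.6}
```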
Beside automatic matching, semi-automatic and manual approaches have also received considerable attention in the literature. Strategies have been proposed to incorporate user interaction and feedback into the matching process. Since these techniques are designed for human users, visualizing schemas and related information is a must. In many systems, a graphical user interface (desktop-based or web-based GUI) is provided to support the interactive inspection and correction of generated correspondences. For example, in [ADMR05b, BMC06, CAS09, FN11], the GUI maintains all generated matchings and offers various functions to manipulate, merge, combine, and evaluate attribute correspondences. Besides visualization, there are other approaches in this category, described in what follows.
matching parameters, and manually creating difficult matches. Up to the 2011 survey [BMR11a], more and more state-of-the-art matching tools had been developed, including commercial and research prototypes. While commercial tools focus on supporting manual matching due to strict quality requirements, the research prototypes attempt to automate the matching process to reduce user effort. For a broad view, we hereby provide a catalogue of important schema matching tools developed in this period.
The increasing growth of commercial tools underlines the highly important role of schema matching in practice. In many commercial information systems, schema matching is typically a first step for generating pragmatic attribute correspondences between schemas, for the purpose of transforming and migrating data from legacy systems into enterprise applications. The common features of such tools include a GUI-based matching editor and manual specification of attribute correspondences. Due to strict quality requirements (i.e., data must be transformed correctly), most commercial tools do not support automatic matching techniques (e.g., reuse-based, partition-based, or parallel matching); thus, a huge manual matching effort is usually required, especially for large-scale matching tasks.
Table 2.1 presents our catalogue of commercial tools, including IBM Infosphere Data Architect, Microsoft Biztalk Server, SAP Netweaver Process Integration, and Altova MapForce. The details of these tools are given below.
IBM Infosphere Data Architect. This tool (http://www-03.ibm.com/software/products/us/en/ibminfodataarch/price) was developed from its research prototype Clio [PVH+02] and has been available since 2009, at a price of $6,270 for the enterprise edition. It has a mapping editor that supports linguistic matching and different types of databases such as Oracle, DB2, Sybase, Microsoft SQL Server, and MySQL.
Microsoft Biztalk server. This tool (http://www.microsoft.com/biztalk/en/us/) has been available since 2009, with a selling price of $10,835 for the enterprise edition. Its user interface supports visualizing large schemas and complex matchings [BMC06]. The Biztalk server is also integrated with Microsoft Visual Studio and the .NET framework.
SAP Netweaver Process Integration. This tool (http://www.sdn.sap.com/irj/scn/nw-downloads) has been available since 2010 as part of the SAP NetWeaver enterprise solution. It reduces the cost and development time of deployment projects in which data is migrated from a legacy system into the SAP ERP system. In addition, it supports many B2B document standards and business rules, which are explicitly enforced and brought to the user's attention during the migration process.
Altova MapForce. This tool (http://www.altova.com/mapforce.html) has been available since 2008, at a price of €799 for the enterprise edition. It has a graphical schema mapping interface supporting XML, relational databases, flat files, EDI, and Microsoft Excel. Compared to other tools, it supports more user-interaction features such as matching filters, structural matching, and functional matching (union, intersection, sum, and other operators).
Jitterbit. This tool (http://www.jitterbit.com/) has been available since 2009, at a price of $4,000/month for the enterprise edition. Besides user-friendly interfaces and wizard tools, Jitterbit supports not only XML but also Web services.
Now we briefly compare a few research prototypes that have been successfully applied in the literature. Table 2.2 depicts the characteristics of the compared prototypes. In particular, the year of introduction is based on the oldest publication of each tool.
COMA. This tool has been developed for more than ten years, starting from the first version COMA [DR02a] through the second version COMA++ [ADMR05a] to the current version COMA 3.0 [MRA+11]. COMA is one of the most comprehensive research prototypes, integrating most of the aforementioned schema matching techniques. It has a wide range of successful applications, including matching XML schemas, web directories, UML models, and ontologies. While COMA++ is free for research purposes, the COMA 3.0 community edition has become an open-source project with worldwide usage.
Harmony (OpenII). Harmony is the matching tool inside the OpenII framework [SMH+10]. It provides a graphical user interface and supports many well-known matching techniques. In particular, it supports composite matching, where the confidence values proposed by individual matchers are aggregated at schema-level.
2.2 Answer Set Programming
2.2.1 Introduction
For over ten years, answer set programming (ASP) has been a popular paradigm for declarative knowledge representation and reasoning [GL91a]. ASP is a form of declarative programming that uses mathematical logic to describe the problem we want to solve. By describing what the problem is instead of how it should be solved, ASP brings many benefits. First, ASP has no side effects, since all logic statements are evaluated at once; i.e., the answer is unique and independent of the evaluation order of the statements. Second, ASP has high expressive and reasoning power, and is thus recognized as a programming paradigm in its own right despite being a subset of logic programming. Because of these benefits, ASP has been applied to various problem-solving domains, such as graph algorithms, model checking, and other specialized reasoning tasks.
ASP is rooted in logic programming and non-monotonic reasoning, in particular the stable model semantics for logic programs [GL88, GL91b] and default logic [Rei80]. Specifically, ASP is based on the disjunctive Datalog language, which is a subset of the Prolog language. Although Datalog does not support compound terms in logic rules, this restriction guarantees that ASP is decidable, a desirable property for declarative programming. Another important property of ASP is the principle of negation as failure, under which an atomic proposition is false by default if it cannot otherwise be proven in the program. This principle makes programs more expressive, but it can also result in a more complex set of logic rules.
Syntax. A rule has the form

a1 ∨ · · · ∨ ak ← b1, . . . , bm, not c1, . . . , not cn.   (2.1)

In other words, whenever the atoms b1, . . . , bm hold and the atoms c1, . . . , cn do not hold, at least one of the atoms a1, . . . , ak must hold. We call a1, . . . , ak the head of the rule, whereas b1, . . . , bm and c1, . . . , cn form the body of the rule. A rule with an empty body is a fact, since its head has to be satisfied in any case. A rule with an empty head is a constraint; its body should never be satisfied. For illustration, consider the following example.
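One such program Π, written in the notation of (2.1) (the exact rules here are one representative choice), is:

Π :  p(c).
     q(X) ← p(X), not r(X).
     ← r(c).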
Program Π defines three predicates p, q, and r. The first rule is a fact and the third rule denotes a constraint. Further, p(c) and r(c) are ground atoms, whereas p(X) and q(X) are non-ground atoms. Informally, an answer set of a program is a minimal set of ground atoms, i.e., predicates defined only over constants, that satisfies all rules of the program. An example of an answer set of program Π would be {p(c), q(c)}.
Semantics. Now we define the formal semantics of an answer set program. Consider an ASP P, a set of rules r of the form (2.1). We refer to the head a1, . . . , ak of a rule r as H(r) and denote the parts of the body as follows: b1, . . . , bm is denoted by B+(r) and c1, . . . , cn by B−(r). We define an interpretation M of P as a set of ground atoms that can be formed from the predicates and constants in P. An interpretation M is a model of
• a ground rule r, denoted as M |= r, if H(r) ∩ M ≠ ∅ whenever B+(r) ⊆ M and B−(r) ∩ M = ∅;
• a rule r, denoted as M |= r, if M |= r′ for each ground instance r′ ∈ gr(r);
• a program P, denoted as M |= P, if M |= r for each r ∈ P.
A model M of P is minimal if there is no M′ ⊂ M that is also a model of P. The reduct [GL88] of a ground program P with respect to an interpretation M, denoted as P^M, is obtained from P by:
(i) removing every rule with not p in the body if p ∈ M; and
(ii) removing the atoms not q from all rules if q ∉ M.
An interpretation M of P is a stable model of P if M is a minimal model of P^M. An answer set of P is a stable model of P.
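For intuition, these definitions can be checked by brute force on small ground programs. The following is a minimal sketch (the rule encoding and the grounding of the example program over the single constant c are assumptions for illustration):

```python
from itertools import combinations

# Ground rules as (head, positive body, negative body) triples of atom sets.
rules = [
    ({"p(c)"}, set(), set()),          # fact: empty body
    ({"q(c)"}, {"p(c)"}, {"r(c)"}),    # q(c) <- p(c), not r(c)
    (set(), {"r(c)"}, set()),          # constraint: empty head
]
atoms = {"p(c)", "q(c)", "r(c)"}

def is_model(m, prog):
    # M |= r: if B+(r) is inside M and B-(r) misses M, then H(r) must meet M.
    return all(not (bpos <= m and not (bneg & m) and not (head & m))
               for head, bpos, bneg in prog)

def reduct(prog, m):
    # Gelfond-Lifschitz reduct: drop rules whose negative body meets M,
    # then drop the remaining (satisfied) negative bodies.
    return [(h, bp, set()) for h, bp, bn in prog if not (bn & m)]

def stable_models(prog, atoms):
    subsets = [set(c) for n in range(len(atoms) + 1)
               for c in combinations(sorted(atoms), n)]
    for m in subsets:                     # M is stable iff M is a minimal
        pm = reduct(prog, m)              # model of the reduct P^M
        if is_model(m, pm) and not any(x < m and is_model(x, pm) for x in subsets):
            yield m

print(list(stable_models(rules, atoms)))  # -> [{'p(c)', 'q(c)'}]
```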
Reasoning. Finally, we recall the notions of cautious and brave entailment for ASPs [EIK09a]. An ASP Π cautiously entails a ground atom a, denoted by Π |=c a, if a is satisfied by all answer sets of Π. For a set of ground atoms A, Π |=c A if Π |=c a for each a ∈ A. An ASP Π bravely entails a ground atom a, denoted by Π |=b a, if a is satisfied by some answer set of Π. For a set of ground atoms A, Π |=b A if for each a ∈ A some answer set M satisfies a.
ASP Solvers. Regarding underlying models for implementing ASP, there is a large body of work in the literature [LTT99]. Many sophisticated solvers have been implemented as underlying reasoning engines, including the DLV system [LPF+06], the Smodels system [NS97], and the Clingo system [GKK+08], which provide front-ends and back-ends for reasoning as well as for computing the semantics of logic programs.
2.2.3 Applications
In spite of its young age, ASP has undergone an outstanding development. It has proven widely applicable not only in the field of knowledge representation and reasoning but also in other domains. In what follows, we summarize the most popular achievements of ASP.
Planning. ASP has been widely applied to classical planning problems such as conformant planning, conditional planning, planning under uncertainty, incomplete information, action costs, and weak constraints [Lif02, Bar03, LRS01]. Many ASP solvers have been implemented for this purpose, especially the planning extension DLVK of the DLV system [EFL+01].
World-Wide-Web. ASP has been used to provide preference reasoning services, such as recommendation systems [ICL+03, BE99]. Moreover, ASP is well suited for representing dynamic knowledge bases of domain-specific languages and ontologies in the Semantic Web [EFST01]. The work in [HV03] also integrated ASP into ontology languages to extend their expressive and reasoning power.
Software Verification. ASP has also been applied to constraint programming, software verification, and software configuration [Nie99, SN98, Syr00], where formal concepts such as configuration models and requirements are represented with declarative semantics in ASP. Moreover, ASP is also used for symbolic model checking [Hel99] and Boolean equation systems [KN05].
Multi-agent Systems. The work in [DVV03] presented a multi-agent system that uses ASP to model the interactions between agents for reaching an agreement. Agents communicate with each other by exchanging answer sets with their information integrated. In some works [CT04], ASP is also integrated into the agent language interpreter to provide more advanced features.
Security. ASP has been applied to security and cryptography. The work in [CAM01] showed how security protocols can be verified by specifying actions and rules in terms of a logic program and reasoning using ASP. The work in [HMN00] presented ASP encodings for the Data Encryption Standard (DES). Another line of research [KM03] uses ASP as the inference engine for access control (e.g., credential abduction, trust negotiation, and declarative policies). In [GMM03], ASP was used to simulate security systems, including visual design, integrity analysis, and the detection of security weaknesses.
Game Theory. ASP has also been extended for studying finite extensive games [DVV99]. Games are transformed into logic programs such that the answer sets of the program correspond to either the Nash equilibria or the perfect equilibria of the game. Some works [BDV03] also integrate ASP-based planning into game applications.
2.3 Argumentation
It is worth recalling that the second reconciliation setting studied in this thesis is collaborative reconciliation, in which a group of expert users collaborate with each other to reconcile a schema matching network. Generally, in such a collaborative setting, the participants pursue their own goals with different points of view. In order to reach an agreement among them, there is a need for formal methods to analyze the participants' opinions. Negotiation techniques could largely benefit from such representations. In this work, we study one such method, called argumentation.
The process of argumentation is an iterative exchange of arguments towards reducing conflicts and promoting agreement. In order to participate, a user needs the ability to (i) express and maintain his own beliefs, (ii) derive consequences by combining them with other users' beliefs, and (iii) affect other users' beliefs. Over repeated exchanges, users may observe and analyze each other's intentions to establish a better understanding and stronger trust. Through these analyses and observations, the users dynamically update and refine their knowledge and individual goals. In the following, we describe the concept, the proposed techniques, and the applications of argumentation.
2.3.1 Introduction
Argumentation is essential to reach an agreement in multi-user settings, especially when users have incomplete knowledge about each other. It helps a user trust his own decisions and those of the others, resulting in a rapid and reliable negotiation process. There is a large body of research on argumentation that tries to answer two important questions: (i) how to represent arguments, and (ii) how to analyze the relationships between arguments. While the former question is addressed by techniques for logical argumentation, the latter is addressed by work based on abstract argumentation frameworks. A survey of developments in argumentation research can be found in [Pra12].
Example 2. Let c1 and c2 be two real-life facts that cannot be true at the same time. Now assume that one is able to prove that c1 is true; thus c2 must be false. This point of view can be represented as an argument ⟨{c1, ¬c1 ∨ ¬c2}, ¬c2⟩.
An argument can attack other arguments. To model this relation, the literature defines two types of attack, namely undercut and rebuttal. The undercut relation between two arguments captures the case where the claim of one argument directly contradicts the support of another argument. The rebuttal relation between two arguments represents that their claims contradict one another. Formally, let ⟨Φ, α⟩ and ⟨Ψ, β⟩ be two arguments. ⟨Φ, α⟩ rebuts ⟨Ψ, β⟩ if α ↔ ¬β is a tautology. ⟨Φ, α⟩ undercuts ⟨Ψ, β⟩ if ¬α ∈ Ψ.
It is important to note that the logic-based nature of logical argumentation leads to scalability issues in generating arguments and attack relations efficiently. Although many approaches have been proposed in the literature, the problem itself remains largely unsettled. Here we summarize a few of these approaches. While the authors in [EH11] relied on connection graphs and resolution, the work in [BGPR10] gave a SAT-based approach. However, even the most advanced SAT solvers cannot achieve reasonable computational performance on large data sets. Another approach is based on Answer Set Programming (ASP), as proposed in Vispartix [CWW13]. This approach takes advantage of the declarative nature of ASP to model the argumentation formalism. As mentioned above, although ASP brings the advantage of added expressive power to the formulation, it has some difficulties in describing complex inference rules.
2.3.3.1 Model
[Figure 2.2: Graph representation of an argumentation framework with arguments a, b, c, d, and e; edges denote attack relations.]
2.3.3.2 Semantics
Given a set of arguments and their attack relations, we can construct many possible conflict-free subsets of arguments. An interesting observation is that, given an attack relation between two arguments, it is not straightforward to choose one argument and discard the other. Intuitively, an argument is selected or rejected according to how well it is defended against the attacking arguments. For example, one could argue that an argument should be discarded because it is attacked by many arguments, while another could argue that the same argument should be selected because it is defended by many other arguments. As a result, there are many criteria to select the 'appropriate' arguments, which are defined by semantics on argumentation frameworks, so-called acceptability semantics in the literature.
Acceptability semantics formally define which arguments are selected in an argumentation framework. In other words, among the possible subsets of arguments, the conflicts between arguments are analyzed and resolved differently. Formally, given an argumentation framework AF = ⟨A, R⟩, an acceptability semantics divides A into one or many (possibly overlapping) subsets that satisfy certain constraints. Such subsets are referred to as extensions. In the abstract argumentation literature, several acceptability semantics have been proposed; most of them are summarized or introduced in [Dun95]. In the following, we describe the formal definitions of the most representative semantics.
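For small frameworks, such semantics can be checked by brute force. The sketch below assumes the standard notions of conflict-freeness and admissibility from [Dun95] and an illustrative attack relation over the arguments of Figure 2.2:

```python
from itertools import combinations

A = {"a", "b", "c", "d", "e"}
R = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}  # hypothetical attacks

def conflict_free(S):
    # No argument in S attacks another argument in S.
    return not any((x, y) in R for x in S for y in S)

def defends(S, x):
    # Every attacker of x is itself attacked by some member of S.
    return all(any((z, y) in R for z in S) for (y, t) in R if t == x)

def admissible(S):
    return conflict_free(S) and all(defends(S, x) for x in S)

subsets = [set(c) for n in range(len(A) + 1) for c in combinations(sorted(A), n)]
print([s for s in subsets if admissible(s)])
# -> [set(), {'a'}, {'a', 'c'}, {'a', 'c', 'e'}]
```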
Scalability issues appear when computing acceptability semantics for large numbers of arguments and attack relations. Generally, algorithms that compute semantics that are more demanding in terms of defense and conflict-freeness have higher complexity. A survey on the computational complexity of such algorithms can be found in [BDG11]. It is worth noting that the representation of argumentation frameworks as graphs is extremely helpful when it comes to implementing algorithms for acceptability semantics. We can implement different acceptability semantics using not only existing graph-based algorithms but also the representational power of graphs. The analysis of relations between arguments and the selection of 'appropriate' arguments becomes a matter of traversing graph edges and retaining the graph nodes that satisfy pre-defined criteria.
2.3.4 Applications
The argumentation-based approach has been successfully applied in many practical applications. In e-commerce systems [BAMN10], argumentation is used for solving conflicts that may arise among distributed providers in large-scale networks of web services and resources, thus improving the automation level of business processes. In collaborative and cooperative planning [ENP11], argumentation can be combined with other techniques (e.g., machine learning) to help participants collaborate in solving problems by determining which policies each participant is operating under. In social-network platforms [GCM12], arguments can be extracted from natural language, and argumentation is then used to determine the social agreements among participants. In cloud computing [HdlPR+12], argumentation can be used to help cloud providers, who manage computational resources in the platform, to reach an agreement on reacting to physical failures. In the Semantic Web [RZR07], argumentation has been modeled under the Argument Interchange Format (AIF) ontology, which forms the foundation for a large-scale collection of interconnected arguments on the Web. In this work, we apply argumentation in the data integration domain, which has been an active research field for more than ten years [BMR11b].
2.4 Crowdsourcing
As the volume of AI problems involving human knowledge is likely to soar, crowdsourcing has become essential in a wide range of applications. Its benefits range from the nearly unlimited labour resources of online user communities to cost-effective business models. The book [vA09] summarizes problems and challenges in crowdsourcing as well as promising research directions for the future. A wide range of crowdsourcing platforms, which allow users to work together in large-scale online communities, have been developed. In this section, we first describe the concept of crowdsourcing and its elements.
2.4.1 Introduction
In recent years, more and more AI problems cannot be solved completely by computers, such as image labeling, text annotation, and product recommendation. Moreover, as the scale of these problems is beyond the effort of a single person, there is a need to connect a massive number of people in the form of an open call. Such a process of employing an undefined network of workers is called crowdsourcing [How06]. Rather than requiring a complicated algorithm, it has been shown that a large-scale problem can be solved by people on the Web, where the tedious work is split into small units that each person can solve individually.
In general, a crowdsourcing system comprises three elements: (1) Requester (those who have work to be done), (2) Worker (those who work on tasks for rewards), and (3) Platform (the system that manages the tasks and the answers). Figure 2.3 depicts the architecture of a general crowdsourcing system.
[Figure 2.3: Architecture of a general crowdsourcing system: requesters publish tasks to the platform, and workers complete them.]
• Worker: A worker is a person who accepts the offer and completes the crowdsourced tasks in exchange for monetary or non-monetary rewards. Workers have wide-ranging levels of expertise and knowledge, so they can make errors while answering the questions. Statistically, the overall quality of workers is not high.
even allow requesters and workers to work together. One such famous platform is Amazon Mechanical Turk (AMT), which is the one mainly used in our work.
Using crowdsourcing has both pros and cons. On the one hand, its advantages include the ability to assemble a large number of users online, which reduces completion time while minimizing labour expenses. Crowdsourcing also encourages creativity, since many ideas are contributed in the same place. On the other hand, its disadvantages concern management issues. The crowd are not traditional employees, so it is not possible to control their commitment or how they work. Besides, crowd workers also need satisfaction and reputation, which are often ignored by requesters.
In this work, we are only concerned with two aspects of quality control in crowdsourcing, namely worker quality and answer aggregation.
Crowd workers differ widely in characteristics such as education, age, and nationality. Many surveys have been conducted on existing crowdsourcing platforms for statistical analysis. For example, the findings in [RIS+10] suggest that the workers on Amazon Mechanical Turk (AMT) are highly educated, as more than 50% of workers hold Bachelor or Master degrees. According to the statistical data in the same survey, more than 50% of AMT workers are below the age of 34, and the majority of workers on AMT are Indian or US citizens.
Based on those survey results, a large body of work has been proposed to capture and formalize worker characteristics. The most relevant studies are [KKMF11, VdVE11], which characterize different types of crowd workers based on their expertise: workers with high expertise often give correct answers, while low-expertise workers intentionally or unintentionally give incorrect answers. Following the statistical data in [VdVE11], we classify five worker types, as depicted in Figure 2.4.
(1) Experts: who have deep knowledge about specific domains and answer the questions with high reliability.
(2) Normal workers: who have enough general knowledge to give correct answers, with few occasional mistakes.
(3) Sloppy workers: who have very little knowledge and thus often, but unintentionally, give wrong answers. Their answers should be eliminated to ensure the overall quality.
(4) Uniform spammers: who intentionally give the same answer to every question.
(5) Random spammers: who carelessly give a random answer to any question.
[Figure 2.4: Worker types plotted by sensitivity (true positive rate) against specificity (true negative rate): experts, normal workers, sloppy workers, random spammers, and uniform spammers.]
The first three worker types are often called truthful workers, whereas the last two are often called untruthful workers (or simply spammers). Spammers just want to earn as much money as possible without spending much time. Classifying workers is important for controlling the quality of validation answers. For example, the trustworthiness of answers can be evaluated not only by the majority but also by the expertise of the associated workers.
To simulate the expertise of crowd workers, there are two well-known models, namely the one-coin model [MSD08] and the two-coin model [IPW10]. The former represents the expertise of a worker by a probability p ∈ [0, 1] that his answer is correct. The latter uses two parameters: sensitivity, the proportion of actual positives that are correctly identified, and specificity, the proportion of negatives that are correctly identified. Following the statistical results in [KKMF11], we randomly set the sensitivity and specificity of each worker type as follows. For experts, the range is [0.9, 1]. For normal workers, it falls into [0.6, 0.9]. For sloppy workers, the range [0.1, 0.4] is selected. For random spammers, it varies from 0.4 to 0.6. For uniform spammers, there are two regions: (i) sensitivity ∈ [0.8, 1], specificity ∈ [0, 0.2] and (ii) sensitivity ∈ [0, 0.2], specificity ∈ [0.8, 1].
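A minimal simulation sketch of the two-coin model under these ranges (the worker construction and uniform parameter draws are assumptions for illustration; uniform spammers would need the two-region rule above):

```python
import random

# Sensitivity/specificity ranges per worker type, as quoted above.
TYPE_RANGES = {
    "expert": (0.9, 1.0),
    "normal": (0.6, 0.9),
    "sloppy": (0.1, 0.4),
    "random_spammer": (0.4, 0.6),
}

def make_worker(worker_type):
    lo, hi = TYPE_RANGES[worker_type]
    sens, spec = random.uniform(lo, hi), random.uniform(lo, hi)
    def answer(truth: bool) -> bool:
        # Correct with prob. sensitivity on true items, specificity on false ones.
        p = sens if truth else spec
        return truth if random.random() < p else not truth
    return answer

worker = make_worker("normal")
print([worker(True) for _ in range(5)])  # five simulated answers to true items
```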
Crowdsourcing relies on human workers to complete a problem, but humans are prone to errors, which can make the results of crowdsourcing arbitrarily bad. The reason is two-fold.
questions. This can significantly degrade the quality of the results. Second, for a complex
problem, the worker may lack the required knowledge for handling it. As a result, an
incorrect answer may be provided. To address the above issues, a problem is split into
many tasks and each task is assigned to multiple workers so that replicated answers
are obtained. If conflicting answers are observed, the answers of different workers are
aggregated to determine a final result.
In the domain of crowdsourcing, a large body of work has studied the problem of aggregating worker answers, which is formulated as follows. There are n objects {o1, . . . , on}, where each object can be assigned by each of k workers {w1, . . . , wk} into one of m possible labels L = {l1, l2, . . . , lm}. The aggregation techniques take as input the set of all worker answers, which is represented by an answer matrix:
M = \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nk} \end{pmatrix} \qquad (2.2)
where aij ∈ L is the answer of worker wj for object oi. The output of aggregation techniques is a set of aggregated values {γo1, γo2, ..., γon}, where γoi ∈ L is the unique label assigned to object oi. In order to compute the aggregated values, we first derive the probability of each possible aggregation P(Xoi = lz), where Xoi is a random variable over the aggregated value γoi and its domain is L. Each technique applies a different model to estimate these probabilities. For simplicity's sake, we denote γoi and Xoi as γi and Xi, respectively. After obtaining all probabilities, the aggregated value is computed as the most probable label, γi = argmax_{lz∈L} P(Xi = lz).7
A rich body of research has proposed different techniques for answer aggregation. In what follows, we describe the most representative ones.
Majority Decision. Majority Decision (MD) [KWS03] is a straightforward method that aggregates each object independently. Given an object oi, among the k received answers for oi, we count the number of answers for each possible label lz. The probability P(Xi = lz) of a label lz is the fraction of its count over k, i.e., P(Xi = lz) = (1/k) Σ_{j=1}^{k} 1_{aij = lz}.
7 Note that Σ_{lz∈L} P(Xi = lz) = 1.
However, MD does not take into account the fact that workers might have different levels of expertise, which is especially problematic if most of them are spammers.
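As a minimal illustration, MD for one object reduces to counting answers (Python sketch; the names are ours):

    from collections import Counter

    def majority_decision(answers):
        """Aggregate the k answers given for one object; the probability of
        a label is its relative frequency among the k answers."""
        counts = Counter(answers)
        label, count = counts.most_common(1)[0]
        return label, count / len(answers)

    print(majority_decision(["yes", "yes", "no", "yes", "no"]))  # ('yes', 0.6)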
Honeypot. In principle, Honeypot (HP) [LCW10] operates like MD, except that untrustworthy workers are filtered out in a preprocessing step. In this step, HP randomly merges a set of trapping questions Ω (whose true answers are known in advance) into the original questions. Workers who fail to answer a specified number of trapping questions are flagged as spammers and removed. Then, the probability of a possible label assigned to each object oi is computed by MD over the remaining workers. However, this approach has some disadvantages: Ω is not always available and is often constructed subjectively; e.g., truthful workers might be misidentified as spammers if the trapping questions are too difficult.
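A sketch of the HP preprocessing step (illustrative only; the threshold handling is our assumption):

    def honeypot_filter(worker_answers, trap_truth, max_failures):
        """Keep only workers who fail at most max_failures trapping questions.
        worker_answers maps worker -> {question: answer}; trap_truth maps each
        trapping question to its known true answer."""
        kept = {}
        for worker, answers in worker_answers.items():
            failures = sum(1 for q, t in trap_truth.items() if answers.get(q) != t)
            if failures <= max_failures:
                kept[worker] = answers
        return kept  # the remaining answers are then aggregated with MD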
Expert Label Injected Crowd Estimation. Expert Label Injected Crowd Estimation (ELICE) [KSA11] is an extension of HP. Like HP, ELICE uses trapping questions Ω, but to estimate the expertise level of each worker by measuring the ratio of his answers that are identical to the true answers of Ω. Then, it estimates the difficulty level of each question by the expected number of workers who correctly answer a specified number of the trapping questions. Finally, it computes the object probability P(Xi = lz) by logistic regression [HL00], which is widely applied in machine learning. In brief, ELICE considers not only the worker expertise (α ∈ [−1, 1]) but also the question difficulty (β ∈ [0, 1]). The benefit is that each answer is weighted by both the worker expertise and the question difficulty; thus, the object probability P(Xi = lz) is well-adjusted. However, ELICE has the same disadvantages regarding the trapping set Ω as HP, described previously.
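The exact ELICE estimator is not reproduced here; the following sketch merely conveys the intuition of weighting answers by a logistic combination of worker expertise and question difficulty, and should not be read as the published formula:

    import math

    def weighted_vote(answers, alpha, beta_i):
        """Illustrative ELICE-style aggregation for one binary question i:
        each answer in {+1, -1} is weighted by a logistic function of the
        worker's expertise alpha[w] and the question difficulty beta_i."""
        logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
        score = sum(a * logistic(alpha[w] * beta_i) for w, a in answers.items())
        return +1 if score >= 0 else -1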
In general, a problem is well-suited to crowdsourcing if it exhibits the following characteristics:
• Easy for humans. Since humans will perform the computation, it is beneficial if
the problem (or small parts of the problem) can be easily solved by a human.
• Hard for computers. Human computation is typically much slower and more expensive than machine computation. Crowdsourcing is therefore most usefully applied to computational problems that cannot easily be solved by a computer.
• Reasonable working effort. Even though the number of crowd workers on the Web is large, it is still infeasible to solve problems that require an exponential number of working hours. Thus, the problem should be solvable with a reasonable amount of effort by a sufficiently large group of people.
• The work is decomposable. Since human cognitive load is limited, the problem at
hand should be split into micro tasks. A task that takes an average worker several
hours to complete should be avoided.
There are many surveys and tutorials [DFKK11, DRH11, LY12, QB11] on applying crowdsourcing to various research areas. In what follows, we highlight some well-known problems and applications that use crowdsourcing successfully.
ESP game. The ESP Game [VAD04] is an example of a novel interactive system that allows people to label images while enjoying themselves. In this game, the participant labels an input image with the keyword, from a set of provided keywords, that most properly describes the image. The ultimate goal is to obtain proper labels for each image. The collection of proper labels for images on the Web is invaluable data for information retrieval applications.
Enterprises. Employing a crowd of users is not a new paradigm in enterprises. In the past, many companies have engaged end-users to contribute towards their products and marketplaces, often in the form of a competition, such as designing advertising campaigns, vetting new product ideas, and solving challenging R&D problems [Sur05]. Examples include Xerox's Eureka system [BW02] – which uses internal employees as the crowd for extracting business knowledge – and the ReferralWeb system [KSS97] – which allows users to collaborate online in creating Web content.
Crowd-based websites. Crowdsourcing is widely adopted in Web 2.0 sites. For instance, Wikipedia has thousands of editors, who continually edit articles and contribute knowledge to build a free encyclopedia. Another site is Yahoo! Answers, where users submit and answer questions. Yet another example is the Goldcorp Challenge 8, which employed geological experts distributed all over the world to identify the locations of gold deposits. Crowdsourcing is also used in software development life cycles, as on Topcoder.com. Recently, many piggy-back applications, which use the programming APIs of crowdsourcing platforms, have been developed, such as CrowdDB [FKK+11], HumanGS [PSGM+11], and CrowdSearch [YKG10].
8 http://www.goldcorpchallenge.com/
Chapter 3
Schema Matching Network - Modeling,
Quantifying, and Partitioning
3.1 Introduction
Let us consider the following scenario, where three video content providers EoverI, BBC, and DVDizzy would like to create a shared website to publicize their offerings, with links back to each provider's own website for purchases. The shared website needs information from the individual content providers (e.g. title, date, review) so that consumers searching on the site can find the products they want. Although it would be conceivable to construct a global schema for the three providers, as more providers join the shared site, such a global schema would become impractical. We instead assume a scenario where the correspondences are established in a pairwise manner.
Figure 3.1: A full matching network over the schemas s1 (EoverI), s2 (BBC), and s3 (DVDizzy). Figure 3.2: Simplified matching network over the date-related attributes a1: s1.releaseDate, a2: s2.screeningDate, a3: s3.availabilityDate, and a4: s3.productionDate, with candidate correspondences c1, ..., c5.
Figure 3.1 shows simplified schemas to illustrate this scenario. The three boxes
represent the schemas of EoverI (s1 ), BBC (s2 ), and DVDizzy (s3 ) respectively. Schema
s1 has five attributes: title, releaseDate, review, userComment, and overallScore. Schema
s2 contains five attributes: title, screeningDate, review, userMemo, and averageScore.
Schema s3 consists of six attributes: title, productionDate, availabilityDate, assessment, userJudgment, and totalScore.
To interconnect the data, we need to find equivalent attributes for each pair of
schemas. This equivalence relation is binary and represented by a correspondence be-
tween two attributes. The attribute correspondences can be automatically generated
by typical schema matching tools such as COMA++ [ADMR05a] and AMC [PER11].
Combining generated correspondences, we have a notion of schema matching network —
a network of connected schemas in which two schemas to be matched do not exist in
isolation but participate in a larger interaction and connect to several other schemas at
the same time. For presentation purposes, we do not draw all correspondences here.
For simplicity's sake, we now consider only a small portion of the network in Figure 3.2, with the date-related attributes s1.releaseDate, s2.screeningDate, s3.productionDate, and s3.availabilityDate. The figure shows five correspondences, denoted by c1, ..., c5. As the names of the involved attributes are rather similar, automatic matchers fail to give the attribute correspondences without ambiguity. For example, the attribute s1.releaseDate is matched to both the attribute s3.productionDate and the attribute s3.availabilityDate; hence, it is difficult to judge which correspondence (c2 or c4)
is correct. This problem, among others, will be addressed more precisely in subsequent
chapters.
In this chapter, we formulate the concept of schema matching network to provide
the fundamental model for other chapters. To this end, we identify elements of the
matching network and attempt to define them in a domain-independent way. After
that, we opt for representing the matching network in Answer Set Programming (ASP) and point out the advantages of this representation in the context of this work. With the benefits of ASP, we can modify the network easily, e.g. assert correspondences or update constraints. Finally, we introduce some measures to quantify the network quality.
During the course of the chapter, we use the motivating example (Figure 3.2) as an
informal illustration to help readers understand formal definitions.
3.2.1 Schema
We model a schema as a tuple s = ⟨As, δs⟩, such that As = {a1, ..., an} is a finite set of attributes and δs ⊆ As × As is a relation capturing attribute dependencies. This
model largely abstracts from the peculiarities of schema definition formalisms, such as
relational or XML-based models. As such, we do not impose specific assumptions on δs ,
which may capture different kinds of dependencies, e.g., composition or specialization of
attributes. For example, in Figure 3.1, s1 is a schema, in which the attribute review is a
complex attribute composed of two simple attributes userComment and overallScore; i.e.,
the dependency between the attribute review and each of the two attributes userComment,
overallScore is composition.
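The schema model translates directly into code; a minimal Python rendering of s = ⟨As, δs⟩ for the schema s1 (the class and field names are ours):

    from dataclasses import dataclass, field

    @dataclass
    class Schema:
        attributes: set                                  # A_s
        dependencies: set = field(default_factory=set)   # delta_s, attribute pairs

    s1 = Schema(
        attributes={"title", "releaseDate", "review", "userComment", "overallScore"},
        dependencies={("review", "userComment"), ("review", "overallScore")},  # composition
    )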
Let S = {s1, ..., sn} be a set of schemas that are built of unique attributes 1, i.e. Asi ∩ Asj = ∅ for all 1 ≤ i, j ≤ n with i ≠ j, and let AS denote the set of attributes in S, i.e. AS = ⋃i Asi. In a schema matching network, S represents the set of involved schemas that are matched against each other. To represent the connections between these schemas, we lay out the network in terms of an interaction graph in the next subsection.
consists of correspondences whose associated value is above a given threshold. The set of candidate correspondences C for an interaction graph GS consists of all candidates for pairs corresponding to its edges, i.e. C = ⋃_{(si,sj)∈E(GS)} ci,j. C is typically the outcome
of schema matchers [Gal11, GSW+ 12]. Most such matchers generate simple one-to-
one attribute correspondences, which relate an attribute of one schema to at most one
attribute in another schema. However, our model does not preclude handling of one-to-
many or many-to-many relations among sets of attributes. A natural approach to handle
these complex relations is treating the set of involved attributes as a complex attribute.
• One-to-one constraint. In some cases, one expects that each attribute of one of the schemas is matched to at most one attribute of any other schema. For example, in Figure 3.2, the attribute s1.releaseDate is only allowed to match either s3.productionDate or s3.availabilityDate. However, both correspondences c2 (s1.releaseDate ↔ s3.availabilityDate) and c4 (s1.releaseDate ↔ s3.productionDate) are generated by automatic matchers and we have to eliminate one of them in the reconciliation.
• Cycle constraint. If the schemas are matched in a cycle, the matched attributes should form a closed cycle. This is a natural expectation if one would like to exchange data stored in the databases corresponding to the schemas. Such network-level constraints describe important consistency conditions; one would like to avoid constraint violations. For example, among the automatically generated correspondences in Figure 3.2, the set of correspondences {c3, c1, c4} violates the cycle constraint (see the sketch after this list).
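To illustrate, the following Python sketch detects one-to-one violations on the running example; it is a simplified re-implementation for illustration, not the ASP-based detection used in this thesis, and the cycle constraint can be checked analogously via the reachability reasoning of Section 3.3.

    from itertools import combinations

    corrs = {   # endpoints of the candidate correspondences in Figure 3.2
        "c1": (("s1", "releaseDate"), ("s2", "screeningDate")),
        "c2": (("s1", "releaseDate"), ("s3", "availabilityDate")),
        "c3": (("s2", "screeningDate"), ("s3", "availabilityDate")),
        "c4": (("s1", "releaseDate"), ("s3", "productionDate")),
        "c5": (("s2", "screeningDate"), ("s3", "productionDate")),
    }

    def one_to_one_violations(corrs):
        """Find pairs of correspondences that match one attribute to two
        different attributes of the same other schema."""
        viols = []
        for (n1, e1), (n2, e2) in combinations(corrs.items(), 2):
            shared = set(e1) & set(e2)
            if len(shared) != 1:
                continue
            (o1,) = set(e1) - shared
            (o2,) = set(e2) - shared
            if o1[0] == o2[0] and o1 != o2:   # same schema, different attributes
                viols.append({n1, n2})
        return viols

    print(one_to_one_violations(corrs))   # [{'c2', 'c4'}, {'c3', 'c5'}]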
3.3 Representation in ASP
ΠS ={attr(a, si ) | si ∈ S, a ∈ Asi }
∪ {dep(a1 , a2 ) | si ∈ S, (a1 , a2 ) ∈ δsi }
ΠC = {cor(a1 , a2 ) | (a1 , a2 ) ∈ C}
Basic assumptions (Πbasic). We describe our basic assumptions as rules in the program Πbasic:
• An attribute cannot occur in more than one schema. We encode this knowledge by
adding a rule with an empty head, i.e., a constraint, so that no computed answer
set will satisfy the rule body. For each attribute a ∈ AS and schemas s1 , s2 ∈ S,
we add the following rule to Πbasic :
Example 3. Back to the running example in Figure 3.2, we have the follow-
ing ASP programs respectively. The program of schemas and attributes is ΠS =
{attr(a1 , s1 ), attr(a2 , s2 ), attr(a3 , s3 ), attr(a4 , s3 )}. The program of candidate correspon-
dences is ΠC = {cor(a1 , a2 ), cor(a1 , a3 ), cor(a1 , a4 ), cor(a2 , a3 ), cor(a2 , a4 )}. No rule in
the program of basic assumptions Πbasic is violated.
reach(X, Y) ← match(X, Y)
reach(X, Z) ← reach(X, Y), match(Y, Z)
← reach(X, Y), attr(X, S), attr(Y, S), X ≠ Y.
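For readers unfamiliar with ASP, the following Python fragment re-implements the semantics of these three rules: it computes the transitive closure of match and reports pairs of distinct attributes of the same schema that reach each other. It is an illustrative sketch; in this work the reasoning is delegated to an ASP solver.

    def closure_violations(matches, attr_schema):
        """matches: undirected attribute pairs; attr_schema: attribute -> schema.
        Returns pairs of distinct same-schema attributes that become reachable,
        i.e. violations of the constraint rule above."""
        reach = set(matches) | {(b, a) for a, b in matches}
        changed = True
        while changed:      # fixpoint of reach(X,Z) <- reach(X,Y), match(Y,Z)
            changed = False
            for x, y in list(reach):
                for y2, z in list(reach):
                    if y == y2 and (x, z) not in reach:
                        reach.add((x, z))
                        changed = True
        return {(x, y) for x, y in reach
                if x != y and attr_schema[x] == attr_schema[y]}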
3.4 Quantifying the Network Uncertainty
In large matching networks, detecting such constraint violations is far from trivial and automatic support is crucial. Adopting the introduced representation enables us to compute violations of constraints automatically. Technically, a violation V of an integrity constraint γ is encoded in ASP as ΠS ∪ Πbasic ∪ Πγ ⊭b ΠV. With the help of ASP solvers, we can detect these constraint violations, each of which has the form {corr(ai, aj), ..., corr(ak, al)}, and then easily convert them into the set Violation(C).
Moreover, the ASP representation also allows for expressing reconciliation goals. A frequent goal of experts is to eliminate all violations, which we can express as ∆NoViol = {Π(i) |=b ΠC(i)}, i.e., the joint ASP program bravely entails the program of the correspondences at the i-th step of the reconciliation process.
Example 5. We depict how to detect constraint violations by using the running example
in Figure 3.2. With respect to one-to-one constraint and cycle constraint, we have a set of
minimal violations V iolation({c1 , c2 , c3 , c4 , c5 }) = {{c2 , c4 }, {c3 , c5 }, {c3 , c1 , c4 }, {c5 , c1 , c2 }}.
probabilistic model that maintains a set of probabilities P where each element pc is asso-
ciated with a correspondence c ∈ C. Combining the introduced notions, we extend a schema matching network N = (S, GS, C, Γ) into the notion of a probabilistic matching network, denoted as ⟨N, P⟩.
Our probabilistic model provides a unified way to encode all relevant information
on top of a schema matching network. Usually schema matchers deliver a so called
confidence value to each candidate correspondence [RB01a]. A confidence value may
be interpreted as an indicator for the uncertainty of the correspondence in the match-
ing. However, it has been observed that these confidence values are not normalized, often unreliable, dependent on the matcher used, and unrelated to the application goals [BMR11a]. Thus, we take a different approach to measure the uncertainty of corre-
spondences. In the context of this work, we assume that the integrity constraints and
user input are of paramount importance for the applications. To unify the constraints
and user input, we compute the probabilities of correspondences from both, which will
be described more precisely below. For example, having all probability values equal to
one means that the associated correspondences satisfy all integrity constraints and re-
spect user input. Moreover, on top of this model, potential applications can exploit the
embedded information easily and leverage the theoretical advances of probability theory.
This section is dedicated to discussing the problem of establishing a probabilistic
matching network. In particular, we introduce how to compute the probabilities as-
sociated to the correspondences. Moreover, as computing the exact probability values
for correspondences can be computationally expensive, we then also develop techniques
to approximate these values. Finally, we quantify the overall uncertainty of the network by summing the individual uncertainties of its correspondences.
We assume that the integrity constraints and user input are of paramount importance
to the data integration applications. Here we denote user input as ⟨F+, F−⟩, where F+ is a set of approved correspondences and F− is a set of disapproved correspondences,
regardless of the reconciliation setting. In a schema matching network, we call a set of
correspondences, which satisfies all the integrity constraints and respects user input, a
matching of the network. If all correspondences in the schema matching network are
validated by the user, we call the set of all approved correspondences the final matching,
assuming that user input is correct and consistent (i.e. no constraint violation). From this starting point, we adopt a model in which a correspondence is more likely to occur in the final matching if it is present in many matchings that qualify as approximations
of final matching. This property should also hold in the presence of user input (ap-
provals or disapprovals of correspondences) that we consider correct. That is, for the
computation of probabilities, we consider possible matchings that include all approved
correspondences and exclude all disapproved correspondences (all possible matchings are
considered as equally probable). We capture the intuition of a matching that qualifies as
an approximation of the final matching with the notion of a matching instance, defined
as follows:
Using a Venn diagram, Figure 3.3 illustrates the relationship of matching instances
with candidate correspondences and user input. Any matching instance includes all ap-
proved correspondences and excludes all disapproved correspondences. The number of all possible matching instances is at most 2^|C|, as they are subsets of the candidate correspondences. F+ and F− are disjoint since a correspondence cannot be approved and disapproved at the same time.
Figure 3.3: Relationship between the set of candidate correspondences C, the user input ⟨F+, F−⟩ (the sets of approved/disapproved correspondences), and the matching instances I1, ..., In.
pc = lim_{Ω∗ → Ω(F+,F−)} |{I ∈ Ω∗ | c ∈ I}| / |Ω∗|   (3.2)
In order to design an efficient sampling technique for a stream of user assertions, two
factors have to be considered:
1. The sample space: it is critical to draw good samples that well capture the exact
distribution. Because of the integrity constraints, some correspondences always
go together, whereas some others never do. These correlations between corre-
spondences create a complex joint distribution incorporating all possible matching
instances.
2. The running time: we consider reconciliation as a pay-as-you-go process where
only a few changes are made at a time. Hence, it is not reasonable to re-sample
the matching instances from scratch for each user assertion. Instead, we need a
technique to maintain a set of preceding samples and update it upon the arrival of
a new user assertion.
Addressing these aspects, we rely on a sampling technique that supports non-uniform
sampling (to approximate the sample space of matching instances) and view mainte-
nance [BLT86] (to improve the running time).
Non-Uniform Sampling. Because of the complex joint distribution of matching in-
stances, uniform sampling methods like Monte Carlo are insufficient [GHS07] for prob-
ability estimation. Our non-uniform sampling overcomes this limitation by making use
of a random-walk strategy and simulated annealing. The role of random-walk is to ex-
plore the sample space by generating a next instance from the previous one. Technically,
the next instance is computed by randomly adding a correspondence to the current in-
stance and resolving all constraint violations created by this correspondence. However, a
random-walk may get trapped in the sample regions with high density [DLK+ 08]. Hence,
the role of simulated annealing is to “jump” out of such regions. Due to the dependencies
(i.e., integrity constraints) between correspondences in our set-up, the space of match-
ing instances is divided into regions of different magnitude which are not reachable from
each other. As a result, combining random-walk and simulated annealing ensures that
an instance in a high-magnitude region should be sampled with a high chance and that
an instance in a low-magnitude region should be sampled with a low chance.
We show the details of our non-uniform sampling in Algorithm 3.1. The algorithm
has two parameters (n, k), four inputs (C, Γ, F + , F − ), and returns a set of sampling
instances Ω∗ as output. First, it starts with a trivial sample which contains all the
approved correspondences (line 1). Next, it generates the next n samples, each of them
being computed from the previous one using random-walk [MSK97, WES04] (line 4).
View Maintenance. To realize view maintenance, we always keep the set of preceding
samples Ω0 and update it based on the new assertion of a correspondence c. The idea is
that if a correspondence is approved, all the samples not containing this correspondence
are discarded; otherwise, we remove all the samples which contain this correspondence.
More precisely, the set of samples Ω∗ is derived as follows, depending on whether c is approved (Ω∗(F+ ∪ {c}, F−)) or disapproved (Ω∗(F+, F− ∪ {c})):
Ω∗(F+ ∪ {c}, F−) = Ω0(F+, F−) \ {I ∈ Ω0(F+, F−) | c ∉ I}
Ω∗(F+, F− ∪ {c}) = Ω0(F+, F−) \ {I ∈ Ω0(F+, F−) | c ∈ I}
Maintenance may reduce the number of samples as we do not generate any further
sampling instances, leading to poor estimation of the probabilities. To cope with this
limitation, we define a tolerance threshold nmin , such that more samples are generated
if |Ω∗ | < nmin . Moreover, if the size of Ω∗ is still smaller than nmin after two consecutive
samplings, it implies that the actual number of all matching instances is smaller than
nmin and it holds Ω∗ = Ω. Hence, it is not necessary to re-sample since all matching
instances have already been generated.
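A compact sketch of this maintenance step (Python; the resample hook stands in for Algorithm 3.1 and, for brevity, re-samples wholesale instead of topping up the sample set):

    def maintain_samples(samples, c, approved, n_min, resample):
        """Keep the preceding samples consistent with the new assertion on c;
        re-sample only if fewer than n_min samples remain."""
        kept = [I for I in samples if (c in I) == approved]
        return kept if len(kept) >= n_min else resample()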
H(C, P) = − Σ_{c∈C} [pc log pc + (1 − pc) log(1 − pc)]   (3.3)
A network uncertainty H(C, P ) = 0 means that all probabilities are equal to one or
zero; or in other words, there is only one matching instance remaining. In that case, all
remaining candidate correspondences, except the disapproved ones, construct a matching
that satisfies all the integrity constraints, the final matching. Hence, our goal in the
reconciliation of a probabilistic matching network is to reduce the network uncertainty
to zero. It is worth noting that the user input F is not included in eq. (3.3) since it is
already incorporated in the correspondence probabilities P (implicitly by the probability
computation). It is also noteworthy that asserted correspondences do not contribute to
the network uncertainty since their probability is either one or zero. In other words, the
network uncertainty can be computed on the set of non-asserted correspondences only,
i.e., H(C, P ) = H(C \ (F + ∪ F − ), P ).
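In code, eq. (3.3) is a one-liner over the non-asserted correspondences (Python sketch; we use the base-2 logarithm, which the equation leaves unspecified):

    from math import log2

    def network_uncertainty(probs):
        """H(C, P): correspondences with p in {0, 1} contribute nothing and
        are skipped, which also covers all asserted correspondences."""
        return -sum(p * log2(p) + (1 - p) * log2(1 - p)
                    for p in probs.values() if 0.0 < p < 1.0)

    print(network_uncertainty({"c1": 0.5, "c2": 0.9, "c3": 1.0}))  # ~1.469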
3.5 Network Partitioning
these works is that even if each sub-schema is small enough, the number of schemas in
the network is still too large for the computation.
In our work, we study a fine-grained approach that decomposes the original network into regions that are small enough for computational tractability. This approach offers several advantages. First, it avoids the potential computational problems of detecting and resolving large constraint violations. Second, we can devise means for multiple users to work in parallel on different regions of the network. To realize this approach,
we propose two graph-based techniques: (i) decomposition by connected components
and (ii) decomposition by k-way partitioning. While the former focuses on the connec-
tion between attributes, the latter preserves the correlation between correspondences in
terms of integrity constraints. In the end, we additionally mention another decomposition technique of ours, based on schema cover [GKS+13], which can be combined with these
two graph-based techniques.
Definition 3.3. An attribute graph is an overlay graph derived from a schema matching
network. Let (S, GS , C, Γ) be a schema matching network with k schemas. Then, an
attribute graph is defined as a k-partite graph G = (V1 ∪ V2 ∪ . . . ∪ Vk , E), where each
partite set Vi represents all attributes of the schema si ∈ S and each edge e = (ai , aj ) ∈ E
represents a candidate correspondence between two attributes ai and aj (i.e. (ai , aj ) ∈
C).
must not exceed the number of schemas. When the number of violations increases, the
size of connected components also increases but the number of them decreases. In
the worst case, there could be only one connected component which is also the whole
attribute graph. For example in Figure 3.2, since all attributes are connected, there
is only one connected component {c1 , c2 , c3 , c4 , c5 }. As a result, there is no bound on
the size of connected components, motivating us to devise other decomposition schemes with better modularity. In what follows, we present one such scheme based on k-way hypergraph partitioning to construct disjoint sets of correspondences of bounded size.
Figure 3.4: Violation hypergraph for the running example, with violations v1 = {c2, c4} and v2 = {c3, c5} among others, and its k-way partitioning for k = 3.
size of a sub-network is small enough for computational tractability. On the other hand, increasing k would shorten the imbalance gap between correspondence subsets but might also enlarge their dependency, i.e. increase the number of violations that are split across partitions.
Hypergraph Reformulation. Problem 1 can be transformed directly into the k-way hypergraph partitioning problem [GJ90]. Based on this reformulation, we are able to show the NP-hardness of our problem, since the transformation is one-to-one. Moreover, it allows us to apply heuristic-based algorithms, which have been proposed by a large body of work in the literature. Formally, we represent a set of correspondences C and a set of violations V in terms of a hypergraph H = (N, E). Each node n ∈ N represents a correspondence c ∈ C and is associated with a weight equal to the number of violations involving c: w(n) = |{v ∈ V | c ∈ v}|. Each hyperedge e ∈ E is a subset of nodes that share at least one violation; formally, E = {e ⊆ N | ∃v ∈ V, ∀cj ∈ e, cj ∈ v}. Now, given a hypergraph H(N, E) transformed from (V, C), the dependence minimization on (V, C) is equivalent to the k-way hypergraph partitioning on H(N, E). In that, the goal is to partition the set N into k disjoint subsets N1, N2, ..., Nk such that:
• Each subset is balanced: (1 − δ) · N̄ ≤ |Ni| ≤ (1 + δ) · N̄, where |Ni| = Σ_{nj∈Ni} w(nj) is the total weight of the nodes in Ni and N̄ = (Σ_{i=1}^{k} |Ni|) / k is the average subset weight.
• The sum of external degrees Σ_i |E(Ni)| is minimized, where the external degree |E(Ni)| of a subset Ni is defined as the number of hyperedges that are incident to Ni but not fully contained in it.
Example 6. In Figure 3.4, we have a schema matching network with five correspondences C = {c1, c2, ..., c5}. If we only consider the one-to-one constraint and the cycle constraint (i.e. Γ = {γ1−1, γcyc}), we have four constraint violations V = {v1, v2, v3, v4}, where v1 = {c2, c4}, v2 = {c3, c5}, v3 = {c1, c3, c4}, and v4 = {c1, c2, c5}. From these, we can construct the corresponding violation hypergraph. With the number of partitions k = 3 and the balance tolerance δ = 0.2, the optimal partitioning contains the three partitions {v1, v2}, {v3}, and {v4}, with a minimal total number of interconnected violations between partitions |{c1, c2, c3, c4, c5}| = 5.
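Following the formal definition above (nodes for correspondences, hyperedges for violations), the transformation is a few lines of Python; writing the result in hMETIS's input format is then straightforward. The names are ours.

    def build_hypergraph(corrs, violations):
        """One weighted node per correspondence (weight = number of violations
        it participates in) and one hyperedge per violation."""
        weights = {c: sum(1 for v in violations if c in v) for c in corrs}
        hyperedges = [set(v) for v in violations]
        return weights, hyperedges

    violations = [{"c2", "c4"}, {"c3", "c5"},
                  {"c1", "c3", "c4"}, {"c1", "c2", "c5"}]
    weights, hyperedges = build_hypergraph(["c1", "c2", "c3", "c4", "c5"], violations)
    print(weights)  # every correspondence participates in exactly 2 violations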
In our system, we use hMETIS [KAKS99] to compute the k-way partition. hMETIS uses novel approaches to successively reduce the size of the hypergraph as well as to refine the quality of partitions. Compared to similar algorithms, hMETIS provides very high quality partitions of hypergraphs with thousands of nodes in extremely fast computing time.
middleware, providing a shared set of abstractions that facilitates data integration, especially schema matching. Informally, a concept is a collection of attributes that frequently appear together. Each concept has a specific meaning, and schemas can be built by combining several concepts. For instance, “street”, “city” and “zip code” often go together, and all of them describe the specific meaning “address”. This is a promising approach since a concept is more meaningful than separate attributes.
Schema cover matches parts of schemas (called subschemas) with concepts, using schema matching techniques (see Chapter 2), and then adds cover-level constraints. Let SBs = {sb1, ..., sbm} be a set of subschemas of s, where a subschema contains a subset of the attributes of s (i.e. sbi ⊆ s). Let CP = {cp1, ..., cpn} be a set of concepts, where a concept is a schema by itself. A cover σ(s, CP) between a schema and a set of concepts is a set of pairs, where each pair in the set is a matching between a subschema and a concept. For example, in Figure 3.5, {({s1.title, s1.releaseDate}, cp1), ({s1.review, s1.userComment, s1.overallScore}, cp2)} is a schema cover between the schema s1 and the concept repository.
In our work, we study two cover-level constraints: (i) ambiguity and (ii) completeness. The former examines the amount of overlap between concepts in a cover, and the latter measures the part of a schema that can be covered by a set of concepts. Ambiguity and completeness constraints, formally defined below, are embedded as (either hard or soft) constraints in an optimization cover problem.
• Ambiguity: represents the number of times an attribute is matched to attributes in several concepts. Ambiguity was first introduced in [SSC10c] as a phenomenon where several concepts may give different semantic interpretations to an attribute in a schema. In our setting, we define the ambiguity of a cover to be the sum of duplicate appearances of an attribute in the cover:
Aσ(s) = Σ_{a∈s} (|{(sb, cp) ∈ σ : a ∈ sb}| − 1)   (3.4)
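A direct Python transcription of eq. (3.4), evaluated on the cover of s1 above; we clamp the per-attribute term at zero so that uncovered attributes do not contribute, which we take to be the intent of the definition:

    def ambiguity(schema_attrs, cover):
        """A_sigma(s): duplicate appearances of each attribute across the
        subschemas of the cover, summed over the schema's attributes."""
        return sum(max(0, sum(1 for sb, _cp in cover if a in sb) - 1)
                   for a in schema_attrs)

    cover = [({"title", "releaseDate"}, "cp1"),
             ({"review", "userComment", "overallScore"}, "cp2")]
    print(ambiguity({"title", "releaseDate", "review", "userComment",
                     "overallScore"}, cover))  # 0: the subschemas do not overlap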
3.6 Evaluation Methodology
These datasets are publicly available 11 and descriptive statistics for the schemas are
given in Table 3.1.
Tools. To generate candidate correspondences, we used two well-known schema matchers, COMA++ [DR02b, ADMR05b] and AMC [PER11]. All experiments ran on an Intel Core i7 system (2.8 GHz, 4 GB RAM). ASP reasoning tasks were conducted with the DLV system 12, release 2010-10-14, a state-of-the-art ASP interpreter. For network partitioning, we worked with a state-of-the-art hypergraph partitioning tool, namely the hMETIS system 13, release 2007-05-25. We used the MOSEK system [20] for the network partitioning by schema cover.
Metrics. We rely on the following evaluation measures.
2 openTRANS E-business document standards, http://www.opentrans.de/
3 XML Common Business Library
4 The RosettaNet Standard, http://www.rosettanet.org/
5 SAP Purchase Order Standard, http://www.sap.com
6 Centro Ricerche Fiat, http://www.crf.it
7 Common Application, https://www.commonapp.org
8 Universal College Application, https://www.universalcollegeapp.com/
9 Embark, http://www.embark.com
10 CollegeNet, http://www.collegenet.com/elect/app/app
11 http://lsir.epfl.ch/schema_matching
12 http://www.dlvsystem.com
13 http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview
• Precision & Recall: For defining precision and recall, we rely on an exact matching G, containing the correct correspondences (validated beforehand). In our context, precision and recall are defined for a set of correspondences V. Formally, Prec(V) = |V ∩ G| / |V| and Rec(V) = |V ∩ G| / |G|, where G is the exact matching (i.e. ground truth) given by the dataset provider.
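Both measures are one-liners over correspondence sets (Python sketch):

    def prec_rec(V, G):
        """Precision and recall of a correspondence set V against the exact
        matching G."""
        tp = len(V & G)
        return tp / len(V), tp / len(G)

    print(prec_rec({"c1", "c2", "c4"}, {"c1", "c2", "c3"}))  # (0.666..., 0.666...)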
3.7 Summary
In this chapter, for the first time, we defined and proposed the notion of schema matching
network and its elements. The model formally and declaratively represents the network
and generic network-level constraints. We deployed this model as an answer set program.
Using the reasoning capabilities of ASP and simple yet generic constraints, we prepared
the means to reduce the necessary user efforts in subsequent chapters. Moreover, we
proposed a probabilistic model that allows the capture of the uncertainty in the matching network in a systematic way, independent of the schema matching tools used and the data integration tasks to be solved. Finally, we developed network modularity techniques to decompose a schema matching network into small regions. The empirical results in
our original publications showed the effectiveness of our model and techniques in real
datasets.
Chapter 4
Pay-as-you-go Reconciliation
4.1 Introduction
In this chapter, we study reconciliation in the simplest setting, in which a single expert user (i.e. one whose input is absolutely correct) validates the correspondences generated by automatic matchers. Specifically, we go beyond the common practice of human reconciliation, namely improving and validating matchings for a pair of schemas. Instead, we study reconciliation for a schema matching network, in which the participating expert should respect the network-level integrity constraints to guarantee the overall matching quality. The presence of such integrity constraints creates a number of dependencies between correspondences, which may be hard to oversee, especially in large-scale networks. Despite this challenge, dependencies between correspondences open an opportunity to guide the expert's work by defining the order in which the expert gives his input (e.g. in which order to assert whether a correspondence is correct).
The approach taken in this chapter strives to reduce the user effort needed for reconciliation. We achieve this objective by relying on two strategies: (i) ordering – defining the order of input sequences in which the user is presented with carefully selected correspondences, and (ii) reasoning – for certain correspondences, we never elicit any feedback, since reasoning may allow us to conclude the assertions for these correspondences as consequences of the remaining user input. Using the reasoning capabilities of Answer Set Programming (ASP) and simple yet generic constraints, as well as a heuristic ordering of the issues a human has to resolve, we are able to reduce the necessary user interactions.
Guiding and minimizing user effort is essential for effective reconciliation. Yet, our
ultimate goal is to instantiate a selective matching – a high-quality set of correspon-
dences that satisfies all the integrity constraints – even if not all necessary input is
collected. This is because we can expect that in real-world settings, an expert has a
limited effort-budget and applications require fast setup time, so that waiting for full
reconciliation is not a feasible option. Indeed, if the schema matching network contains
a lot of problematic correspondences, the reconciliation effort can be considerable. Often, however, one does not need the entire network: for certain applications a subset of
We now give an example to illustrate that different validation sequences might lead to different amounts of necessary user effort, and why we need to instantiate a selective matching. Continuing the running example in Figure 3.2, Figure 4.1 illustrates user validation for the presented schema matching network, in which the correspondences are approved (as true) and disapproved (as false) by a single user. Only the one-to-one constraint and the cycle constraint are considered. In this specific case, we assume that c1, c2, c3 are true (solid lines) and c4, c5 are false (dashed lines). We also assume that the user can validate only two correspondences, c3 and c5. Now consider two possible validation sequences:
Figure 4.1: Two possible validation sequences over the network of Figure 3.2.
As a result, the second validation order requires fewer steps than the first. In other words, the user effort (i.e. the number of necessary validation steps) depends on the order in which correspondences are validated. This, in turn, motivates us to design ordering and reasoning techniques that minimize user effort for a given reconciliation goal (e.g. all the integrity constraints are satisfied).
Moreover, we can observe that after sequence (ii), four correspondences c1, c2, c3, c4 remain. This ‘matching’ cannot be used since it still contains constraint violations. To make the system ready for operation, we have to select a single trusted set of correspondences without any violation, which we coin a ‘selective matching’. A possible selective matching for this example is {c1, c2, c3}.
4.2.2 Framework
For an overview of our approach, we describe our framework before giving a precise model of the reconciliation process in the next subsection. Figure 4.2 presents a simplified architecture of our framework. The system involves a human expert in the reconciliation, who works as long as the selective matching has not reached a quality he is satisfied with. We start from a set of matching candidates, generated by automatic schema matchers. The matching candidates are used to construct the schema matching network. Upon any change of the network, Instantiation automatically instantiates a selective matching as the output of the system. For the purpose of guiding the user and minimizing his validation effort, we build the Effort Minimization component, consisting of an Ordering engine and a Reasoning engine. While the ordering engine is responsible for generating and ranking all correspondence candidates shown to the user, the reasoning engine derives the validation consequences from user input. The interaction between the framework and the user is repeated incrementally as the system runs. The more user assertions, the higher the quality of the selective matching.
Figure 4.2: Simplified architecture of the pay-as-you-go reconciliation framework: schemas are matched automatically, the resulting candidates form the schema matching network, the Effort Minimization component (Ordering and Reasoning engines) guides user validation, and the Instantiation component outputs a selective matching.
To realize the pay-as-you-go methods for reconciliation, we have to cope with the following problems: (1) How to choose a good order of correspondences? (2) How to derive the validation consequences from user input? (3) How to obtain a high-quality, possibly complete set of correspondences during the reconciliation process, based on potentially incomplete input? These questions are addressed by the three main components, described in detail as follows.
Ordering. This engine takes the role of guiding the expert's work in validating the schema matching network. It generates and ranks the correspondence candidates shown to the user. The correspondences are prioritized for user assertion in an incremental order that brings the most “benefit” to the network. As a result, the user effort is minimized. The details of how to define the network benefit and design the ordering criteria are described in Section 4.3.
Reasoning. Based on the ASP formalization of user input, we use reasoning techniques to conclude the consequences of user input. This avoids eliciting feedback for any correspondence for which there is an assertion in the consequences. In contrast to the baseline (without reasoning), the assertions are no longer limited to the correspondences for which user input has been elicited. Thus, when updating the active set as part of one step in the reconciliation process, we also consider correspondences for which assertions have been derived by reasoning. As a result, the user effort is minimized. The details of this component are described in Section 4.3.
Instantiation. Instead of using heuristics or arbitrary rules, the Instantiation component systematizes the use of the network uncertainty of Section 3.4 to make sensible decisions about which correspondences are kept in the selective matching. This component instantiates the selective matching by harnessing the computed correspondence probabilities, via the maximum likelihood principle. Technically, from the original set of correspondences (generated by matchers), we eliminate a minimal subset of correspondences with low probability such that the remaining ones satisfy all integrity constraints. The instantiation problem is described in detail in Section 4.4.
at step i and Ui is the set of user input assertions until step i, i.e. Ui = {uj | 0 ≤ j ≤ i}.
The consequences of such user input assertions Ui are modeled as a set Cons(Ui ) ⊆ UC
of positive or negative assertions for correspondences. They represent all assertions that
can be concluded from the user input assertions.
• Selection: select(C \ Cons(Ui )). A user selects a random correspondence from the
set C \ Cons(Ui ).
• Consequence: conclude(Ui ). The consequences of user input are given by the input
assertions Ui , that is conclude(Ui ) = Ui .
We are going to formulate two problems. The first problem concerns effort minimization (Section 4.3), in which we implement the two routines select and conclude to minimize user effort for a given reconciliation goal. The second problem concerns selective matching instantiation (Section 4.4), in which we implement the instantiate routine to select a set of most probable correspondences at an arbitrary step of the reconciliation process.
We now consider minimization of user effort based on the selection strategy that is used
for identifying the correspondence that should be presented to the user. In Section 4.2.3,
we showed that without any tool support, a random correspondence would be chosen.
Depending on the order of steps in the reconciliation process, however, the number of
necessary input steps might vary. Some input sequences may help to reduce the required
user feedback more efficiently. In this section, we focus on a heuristic selection strategy
that exploits a ranking of correspondences for which feedback shall be elicited.
4.3 Minimize User Effort
This section approaches the minimization of user effort by providing two heuristics for the selection function in the reconciliation process. The first heuristic exploits the constraint violations associated with each correspondence. The second heuristic employs the concept of information gain, which measures the amount of uncertainty reduction if the correctness of a certain correspondence is given.
Ordering by Min-violation. Our selection function is based on a min-violation scoring that refers to the number of violations caused by a correspondence. The intuition behind this heuristic is that eliciting feedback on correspondences that violate a high number of constraints is particularly beneficial for the reconciliation of a matching network. Given a set C′ ⊆ C of candidate correspondences, the function minViol assigns to each correspondence c ∈ C′ the number of minimal violations (Violation(C′), cf. Section 3.3.2) that involve c:
This scoring function is the basis for the selection of correspondences for eliciting user feedback. As defined in Algorithm 4.1, selection is applied to the set of candidate correspondences C once all correspondences for which we already have information as the consequence of earlier feedback (represented by Cons(Ui)) have been removed. Thus, selection is applied to C′ = C \ {c | uc+ ∈ Cons(Ui) ∨ uc− ∈ Cons(Ui)} and defined as c∗ = argmax_{c∈C′} minViol(c); if the highest score is observed for multiple correspondences, one is chosen randomly.
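A minimal sketch of this selection step in Python (Violation(C′) is assumed to be available, e.g. from the ASP-based detection of Section 3.3):

    def min_viol(c, violations):
        """Number of minimal violations that involve c."""
        return sum(1 for v in violations if c in v)

    def select_min_violation(candidates, concluded, violations):
        """Pick, among the correspondences not yet concluded, one that is
        involved in the most violations; max() breaks ties arbitrarily."""
        remaining = [c for c in candidates if c not in concluded]
        return max(remaining, key=lambda c: min_viol(c, violations))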
This ordering function is the basis for the selection of correspondences in eliciting user feedback. As defined in Algorithm 4.1, selection is applied to the set of candidate correspondences once all correspondences that the user has already asserted have been removed. Thus, the selection is applied to C′ = C \ (F+ ∪ F−) and defined as follows: c∗ = argmax_{c∈C′} IG(c). If the highest information gain is observed for multiple correspondences, one is randomly chosen.
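The IG computation is not spelled out in this excerpt; one natural instantiation, consistent with the network uncertainty of Section 3.4, is the expected drop of H(C, P) when c is asserted, with probabilities re-estimated from the maintained samples. The sketch below assumes such a re-estimation hook.

    def info_gain(c, probs, entropy, reestimate):
        """Expected reduction of H(C, P) if c is asserted: the two possible
        outcomes are weighted by the current probability of c."""
        p = probs[c]
        return entropy(probs) - (p * entropy(reestimate(c, True))
                                 + (1 - p) * entropy(reestimate(c, False)))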
Example 7 (Reasoning with user input). Consider two schemas, s1 and s2 , and three
of their attributes, encoded in ASP as attr (x, s1 ), attr (y, s2 ), and attr (z, s2 ). Assume
that a matcher generated candidate correspondences that are encoded as C = {cor (x, y),
cor (x, z)}. Further, assume that Γ consists of the one-to-one constraint. By approving
correspondence (x, y), we can conclude that candidate correspondence (x, z) must be false
and should not be included in any of the answer sets. Hence, in addition to validation of
correspondence (x, y), falsification of correspondence (x, z) is also a consequence of the
user input on (x, y).
To leverage reasoning for effort minimization, we first explain how to represent the
user input assertions in ASP, then turn to the actual reasoning about them, and finally
discuss how to detect inconsistencies in user input for avoiding wasted efforts.
Representing user input assertions. We represent user input assertions with an ASP program Πp(i). The construction of this program is close to the one presented in the previous
section. In fact, we largely rely on the same subprograms. We only change the way
correspondences and constraints are connected, i.e., Πcc is replaced by Π0cc , and add a
program to capture the user assertions, ΠU (i). Then, the program is constructed as
Πp (i) = ΠS ∪ ΠC ∪ ΠD (i) ∪ Πbasic ∪ ΠΓ ∪ Π0cc ∪ ΠU (i).
• Connecting correspondences and constraints (Π0cc ). To reason about the user input,
we use a slightly different rule to compute the set of correspondences satisfying
the constraints of the matching network. In contrast to the previous rule (cf.,
Section 3.3), this rule considers all candidate correspondences (cor (X, Y )) and not
only those from the active set (corD(X, Y )):
• User input (ΠU(i)). For representing the user input, we add distinguished atoms for the approval or disapproval of a correspondence to ΠU(i). We have chosen not to represent the user assertion of approving (disapproving) a correspondence cor(a, b) directly as match(a, b) (noMatch(a, b)), in order to be able to detect problems with the user input w.r.t. the integrity constraints. Further, for the case of disapproval, ASP enables the use of ‘strong negation’ (in ASP syntax: ¬), which directly corresponds to our intention: if a user disapproves a correspondence, then it should not appear in any of the answer sets of the joint program Πp(i). We define the atoms for the user input and the rules that connect the approval or disapproval of correspondences with their correctness as follows (the program ΠU′(i) with U′(i) ⊆ U(i) is defined analogously):
Reasoning mechanism. Based on the ASP formalization of user input, we use reasoning techniques to conclude the consequences of user input. Let Πp(i) be the ASP program at some stage of the reconciliation process. We define ΠPU = {incl_cand(a, b), ¬incl_cand(a, b) | cor(a, b) ∈ ΠC} as the ASP representation of potential user inputs, i.e. the set of ground atoms that correspond to an approval or disapproval of a correspondence from C. Then, we capture the consequences of user input ΠU(i) in ASP by the set ΠCons(Ui) ⊆ ΠPU, such that
• the consequences cover at least the user input, ΠU(i) ⊆ ΠCons(Ui), and
• the reconciliation process cautiously entails the consequences, Πp(i) |=c ΠCons(Ui), and
• the consequences are maximal: for each t ∈ ΠPU \ ΠCons(Ui) it holds that Πp(i) ⊭c ΠCons(Ui) ∪ {t}.
Based on the consequences captured in ASP representation, we define the function
for concluding on consequences as follows:
This instantiation of the conclude routine avoids eliciting feedback for any correspondence for which there is an assertion (in the ASP representation) in the set ΠCons(Ui) \ ΠU(i). We conclude those assertions automatically. In contrast to the aforementioned conclude function (Section 4.2.3) used as a baseline, the assertions are therefore no longer limited to the correspondences for which user input has been elicited. Thus, when
updating the active set as part of one step in the reconciliation process, we also consider
correspondences for which assertions have been derived by reasoning.
Detecting problems in user input. The ASP-based reasoning can also assist in detecting problems in the user input. Assume that a user provides input ui such that Πp(i) ⊭b ΠU(i + 1). Then, the new input ui is inconsistent with the previous input. In this case, we determine the root cause of the inconsistency as a set of witness correspondences W(i) ⊆ Ui. Intuitively, W(i) represents a set of correspondences that, together with ui, caused the inconsistency. We characterize such a set of witness correspondences by Πp(i) ⊭b ΠU′(i) with U′(i) = W(i) ∪ {ui} in ASP representation. To provide meaningful feedback to the user, we require that W(i) be minimal. Then, presenting W(i) to the user helps in resolving the respective problem by highlighting which inputs jointly caused the inconsistency.
Enhancing Scalability. In general, invoking ASP is an expensive computational task. It is inefficient to send the whole program Πp(i) to the ASP solver at each reconciliation step i. In fact, reconciliation is an incremental process where only a few changes are made at a time. These changes affect a small region of the network (see Figure 4.3). Therefore, rather than performing reasoning over the whole network, we maintain the preceding reasoning result and perform reasoning only over this small region. In the following, we introduce how to determine the small region affected by a new feedback as well as how to construct the corresponding ASP program for this region.
Figure 4.3: C is the set of correspondences, Cons(U) is the set of consequences of user input, c is the correspondence on which new feedback is given, and ∆U,c is the set of correspondences needed for reasoning.
First of all, computing the set of correspondences needed for reasoning depends on the user-input consequences Cons(U) and the correspondence c on which new feedback is given. We denote by ∆U,c the set of correspondences affected by c and U in reasoning (|∆U,c| ≪ |C|). With the integrity constraints and reasoning technique mentioned above, a correspondence ĉ ∈ ∆U,c must satisfy two conditions: (i) there exists at least one simple (non-repeating) path c, c1, c2, ..., ĉ connecting c and ĉ, and (ii) only one correspondence on this path does not belong to Cons(U). After obtaining ∆U,c, we compute ∆Πp(i), which is sent to the ASP solver in order to compute the new consequences. Formally, we have:
∆Πp (i) = ∆ΠC ∪ ∆ΠD (i) ∪ Πbasic ∪ ΠΓ ∪ Π0cc ∪ ∆ΠU (i) (4.3)
where
• ∆ΠC = {attr(a, si ) | si ∈ S, a ∈ si , ∃b : (a, b) ∈ ∆U,c } ∪ {cor(a, b) | (a, b) ∈ ∆U,c }
4.4 Instantiate Selective Matching
Note that the instantiation problem is defined only on the probabilities P assigned to
correspondences since user assertions are already incorporated implicitly in P . Solving
the instantiation problem requires knowledge about the integrity constraints in the net-
work. Unfortunately, even under the simplistic one-to-one constraint and even without
the maximal likelihood condition, the instantiation problem is computationally hard.
Proof. To prove the NP-completeness of our decision problem, we show that: (i) it is in NP and (ii) it is NP-hard. Given a matching instance I, one can check in polynomial time whether its repair distance ∆(I, C) is less than θ; so (i) is true. By definition, a matching instance I must not violate the one-to-one constraint; i.e. ∀c ∈ I, ∄c′ ∈ I such that c and c′ share exactly one common attribute and their remaining attributes belong
to the same schema. This case can be represented as an undirected graph G = (V, E)
where each vertex v ∈ V is a correspondence and each edge e = (vi , vj ) ∈ E represents
a constraint violation between vi and vj . G can be constructed in polynomial time by
iterating over all the correspondences and creating an edge between any two correspondences that match one attribute of a schema to two different attributes of another schema.
Finding a matching instance I with minimal repair distance is equivalent to finding
a maximum independent set (MIS) of G, as no two vertices being adjacent means no
violations and ∆(I, C) = |C| − |I|. Since the MIS problem is NP-complete [Kar72a], (ii)
is true.
Recall that the user input F = ⟨F+, F−⟩ of Section 3.4 is the set of user assertions of Section 4.2.3. Starting with the best sample, the local search is repeated until the termination condition is satisfied (line 3). At each iteration, we first generate the set of remaining
correspondences and their probabilities. Among these correspondences, we add one into
the current instance I based on Roulette wheel selection [Gol89]. The rationale behind this heuristic is that the chosen correspondence has a high chance of being consistent with the others. When a certain correspondence is inserted, it might produce some constraint violations. Thus, the repair() function (defined formally below) is invoked to eliminate the new violations by removing problematic correspondences from I (line 7). However, a correspondence could be added to I and then removed immediately by the repair function, leaving I unchanged, which could trap the algorithm in a local optimum. For this reason, we employ the Tabu search method [GM86], which uses a fixed-size “tabu” (forbidden) list of correspondences so that the algorithm does not consider
these correspondences repeatedly (line 6). Finally, a matching instance H is returned
by evaluating the repair distance and likelihood of matching instances explored so far.
Proof. Termination follows from the fact that the termination condition of the routine
between line 3 and line 11 can be defined as a constant number k of iterations.
Correctness follows directly from the following points. (1) A new correspondence is not
chosen from disapproved correspondences (line 4). (2) When a new correspondence ĉ
added to I (line 5) causes constraint violations, I is repaired immediately (line 7). (3) H
always maintains the instance with smallest repair distance (line 8) and largest likelihood
(line 10). Therefore, the algorithm’s output is a near optimal matching instance that
satisfies all the integrity constraints and respects the user assertions.
Finally, we observe that the presented heuristic indeed allows for efficient instantiation. In fact, the algorithm requires time quadratic in the number of candidate correspondences, which is tractable for real datasets.
Proof. The most expensive operation in Algorithm 4.2 is the function repair(), which takes at most O(|I|^2) time, as outlined below. Since I ⊆ C and there are at most k iterations of the local search, we have O(k × |C|^2).
Repair Heuristic by Greedy Removal. Algorithm 4.3 shows the details of our repair
heuristic, which implements the repair() function in Algorithms 4.2 and 3.1. This repair
algorithm is used, for a particular instance, to resolve all violations caused by the new
correspondence added to that instance. Its key idea is to greedily remove the
correspondences involved in new violations, one by one, until no violation remains (line
2): in each step, we remove the correspondence that causes the most constraint violations
(lines 5 and 6). The insight behind this greediness is that removing correspondences
involved in a high number of violations should minimize the repair distance.
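A minimal sketch of the greedy removal idea follows; violations_of is an assumed helper that returns the current constraint violations of an instance, each as a set of correspondences.

def greedy_repair(I, violations_of):
    """Greedily remove correspondences until no constraint violation remains."""
    while True:
        violations = violations_of(I)   # each violation: a set of correspondences
        if not violations:
            return I
        counts = {}
        for violation in violations:
            for c in violation:
                counts[c] = counts.get(c, 0) + 1
        # drop the correspondence involved in the most violations
        I.discard(max(counts, key=counts.get))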
4.5 Empirical Evaluation
We use the same datasets and tools as in Section 3.6. We evaluated our reconciliation
framework in different settings. We varied the construction of schema matching networks
in terms of dataset, matcher, and network topology. For the reconciliation process, we
considered different types of users and reconciliation goals. We measured the quality
improvements achieved by reconciliation and the required human efforts as follows:
Precision measures quality improvements. Similar to Section 3.6, precision is defined for
the i-th step in the reconciliation process, given an exact matching G. Then, the
precision of the active set at step i is defined as Pi = |Di ∩ G| / |Di|.
User effort is measured in terms of feedback steps. Since a user examines one cor-
respondence at a time, the number of feedback steps is the number of asserted
correspondences. For a better comparison, we express this number i relative to
the size of the matcher output C, i.e., Ei = i/|C|.
In this set of experiments, we study the extent to which our approach reduces user ef-
fort. For each dataset, we generate a complete interaction graph and obtain candidate
correspondences using automatic matchers. Then, we simulate the pay-as-you-go recon-
ciliation process where user assertions are generated using the exact matching, which is
constructed in advance by the dataset provider.
User guiding strategies. We explored how the quality of the match result in terms
of precision improved when eliciting user feedback according to different strategies. For
the BP and PO datasets, Figure 4.4 depicts the improvements in precision (Y-axis) with
increasing feedback percentage (X-axis, out of the total number of correspondences) using
five strategies, namely
(1) Rand NoReason: feedback on each correspondence in random order, consequences
of feedback are defined as the user input assertions (this is the baseline described in
Section 4.2.3);
(2) Rand Reason: reconciliation using random selection of correspondences, but applying
reasoning to conclude consequences;
(3) MinViol NoReason: reconciliation with selection of correspondences based on the
min-violation heuristic; consequences of feedback are defined as the user input assertions;
(4) MinViol Reason: reconciliation combining the min-violation heuristic for selecting
correspondences with reasoning for concluding consequences; and finally
(5) Info Reason: reconciliation combining the info-gain heuristic for selecting
correspondences with reasoning for concluding consequences.
The results depicted in Figure 4.4 show the average over 50 experiment runs. The
dotted line in the last segment of each curve represents the situation where no correspon-
dence in the active set violated any constraints, i.e., the reconciliation goal ∆NoViol has
been reached. In those cases, we used random selection for the remaining correspon-
dences until we reached a precision of 100%. The other datasets show similar results
and are omitted for the sake of brevity.
The results show a significant reduction of user effort for all strategies with respect to
the baseline. Our results further reveal that most improvements are achieved by apply-
ing reasoning to conclude on the consequences of user input. Applying the min-violation
heuristic or the info-gain heuristic for selecting correspondences provides additional ben-
efits. The combined strategies (MinViol Reason and Info Reason) showed the highest
potential to reduce human effort, requiring only 40% or less of the user interaction steps
of the baseline.
Figure 4.4: Precision (Y-axis) as a function of user effort (X-axis, percentage of feedback out of all
correspondences, 0%-100%) for the BusinessPartner and PurchaseOrder datasets, comparing the strategies
Rand_NoReason, Rand_Reason, MinViol_NoReason, MinViol_Reason, and Info_Reason.
Effects of the network topology. We have analyzed the influence of the topology
of the interaction graph on the reduction of the necessary effort. For this purpose, we
used randomly generated interaction graphs instead of the complete graphs (i.e.,
cliques) of the previous experiments. We constructed these graphs G(|S|, p) using
the Erdős-Rényi random graph model [ER60], where p is the inclusion probability. We
constructed 10 graphs and applied the reconciliation procedure; the reported results are
averages over 5 runs per graph.
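Generating such random interaction graphs is straightforward; the sketch below, with hypothetical schema identifiers, draws each possible schema pair independently with inclusion probability p.

import random
from itertools import combinations

def random_interaction_graph(schemas, p, seed=None):
    """G(|S|, p): include every possible schema pair as an edge of the
    interaction graph with inclusion probability p (Erdos-Renyi model)."""
    rng = random.Random(seed)
    return [(s1, s2) for s1, s2 in combinations(schemas, 2)
            if rng.random() < p]

# e.g., 10 random topologies over 8 schemas with p = 0.5:
graphs = [random_interaction_graph(range(8), p=0.5, seed=i) for i in range(10)]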
Figure 4.5: User effort (Y-axis, 0%-100%) required to reach the different reconciliation criteria (X-axis):
NoViol, Precision=1.0, and AllConcluded.
Figure 4.6 (for the PurchaseOrder and UAF datasets) depicts the required user effort
for the different graphs. The X-axis corresponds to the inclusion probability (the
probability that a given edge is included in the graph) used to construct the interaction
graphs, while the Y-axis shows the user effort. One can observe that the techniques
that use reasoning significantly reduce the necessary effort, independently of the
topology of the interaction graph; these methods are thus robust w.r.t. the structure
of the graph. Moreover, the reduction is more pronounced in cases where the interaction
graph is denser.
Figure 4.6: User effort (Y-axis, 0%-100%) as a function of the inclusion probability p (X-axis, 0.3-1)
for the PurchaseOrder and UAF datasets.
So far, we have assumed that it is always possible to elicit a user input assertion for any
correspondence. One may argue that in many practical scenarios, however, this
assumption does not hold: users often have only partial knowledge of a domain, which
means that for some correspondences a user cannot provide any feedback. We studied
the performance of our approach in this setting by including the possibility of skipping
a correspondence in the reconciliation process. Thus, for certain correspondences, we
never elicit any feedback. However, the application of reasoning may still allow us to
conclude the assertions for these correspondences as consequences of the remaining
user input.
p: skipping probability
Dataset     5%     10%    15%    20%    25%    30%
BP          0.29   0.26   0.27   0.23   0.20   0.18
PO          0.31   0.30   0.26   0.22   0.22   0.16
UAF         0.21   0.20   0.16   0.15   0.14   0.11
WebForm     0.31   0.32   0.26   0.19   0.16   0.20
p: probability of noise
Dataset     5%     10%    15%    20%    25%    30%
BP          0.99   0.97   0.98   0.98   0.98   0.98
PO          0.75   0.76   0.80   0.80   0.81   0.81
UAF         0.48   0.46   0.49   0.53   0.54   0.55
WebForm     0.75   0.74   0.80   0.83   0.82   0.83
Figure 4.7: Effects of correspondence ordering strategies on instantiation: precision and recall of H
versus user effort (0-15%). H is the instantiated matching of our algorithm.
Figure 4.8: Effects of the likelihood function on instantiation: precision and recall of H versus user
effort (0-15%). H is the instantiated matching of our algorithm.
Effects of Ordering Strategies. Clearly, the two ordering strategies above, used for
reducing the network uncertainty, have a great influence on the quality of the instantiated
matching. We investigate this aspect with an experiment in which, given a pre-defined
budget of user effort (e.g., 5% of all candidate correspondences), we reduce network
uncertainty with these strategies (i.e., Random and Heuristic). Then, we compare the
results in terms of precision and recall of the matching derived by instantiation according
to Algorithm 4.2.
Figure 4.7 illustrates the influence of the ordering strategies on quality of the instan-
tiated matching for the BP dataset (again, the other datasets showed the same trend
and are omitted for brevity). Here, we varied the budget of user effort (x-axis) from
0% to 15%. A key finding is that our heuristic ordering outperforms the baseline with
an average difference of 0.12 (for precision) and 0.08 (for recall). At the beginning (0%
user effort), there is no difference between the two ordering strategies, because no
correspondence has been selected for user validation. We conclude that our heuristic ordering plays an
important role in improving the quality of the matching that approximates the selective
matching and is derived by instantiation.
Effects of Maximal Likelihood. Instantiation is guided by the repair distance (num-
ber of candidate correspondences that are removed to satisfy the integrity constraints)
and the likelihood of a particular matching (cf. Section 4.4). We argued that the repair
distance shall be minimal in any case, to keep as much information on correspondences
as possible. Yet, in this experiment, we study the importance of also considering the
likelihood for instantiation. To this end, we compare the result of instantiation with and
without the likelihood criterion. We quantify the results in terms of precision and recall
for the derived matching.
Figure 4.8 shows the obtained results: the percentage of user effort relative to the
quality of the matching measured by precision and recall. We observe that considering
the likelihood criterion indeed leads to a matching of better quality. This result
underlines the benefits of our probabilistic model in quantifying the uncertainty of cor-
respondences as well as of the network as a whole.
4.6 Summary
This chapter presents the first reconciliation setting of this thesis, pay-as-you-go
reconciliation. The main outcome is a pay-as-you-go framework that makes it possible to
reduce user effort and to instantiate a trusted set of correspondences over time. As such,
the approach can be used to support data integration at any point in time, while
continuously improving the quality of the instantiated matching by reconciliation of
the network. We presented a comprehensive experimental evaluation, indicating that
the approach is applicable to large, real-world datasets and allows for effective and
efficient reconciliation. We showed that our approach significantly reduces the number
of user interactions, to as little as 40% of those required by the baseline of an expert
working without assistance. Our experimental results also indicate that the reduction
remains significant across different reconciliation goals, different network topologies, and
limited user knowledge. In some cases, our framework can even detect erroneous user
feedback based on reasoning over the schema matching constraints.
Chapter 5
Collaborative Reconciliation
5.1 Introduction
Until this chapter, the task of reconciling a schema matching network was performed
by a single expert. As the size of networks in data integration grows, complex
reconciliation tasks should be performed not by one but by several experts, both to avoid
overloading a single expert and to assign each expert the parts of the problem
with which he is most familiar. Moreover, typical information systems need to involve a
wide range of expert knowledge, since schemas are often designed by different persons
and for different domain purposes. As a result, there is a need for a mechanism that
allows not a single expert but a team of experts to work collaboratively to reconcile the
output of automatic matchers.
In this chapter, we develop such a multi-user mechanism to enable a collaborative
reconciliation process. Achieving this goal is challenging, since we have to face the three
following issues. Note that, in the following, the two terms experts and users are used
interchangeably to denote the participants in this process.
1. How to encode user inputs? The inputs of users should be encoded so as to not only
fully capture the information given by the users but also support reasoning on that
information. In other words, from the user inputs, we can derive consequences and
compute their explanations.
2. How to detect conflicting inputs? As users might have different opinions about the
correctness of correspondences, their inputs inevitably involve conflicts. Detecting
conflicts is an important step towards eliminating inconsistency.
3. How to resolve the detected conflicts? Since the consequences of a decision can
propagate through the network, the experts need guidance to understand, negotiate,
and resolve the conflicts.
To address these issues, we leverage the theoretical advances and the multi-user nature
of argumentation [Dun95, BH08]. The overall contributions of our work are as follows.
We model the schema matching network and the reconciliation process, relating
the experts' assertions and the constraints of the matching network to an argumenta-
tion framework [Dun95]. Our representation not only captures the experts' beliefs and
their explanations, but also enables reasoning about these captured inputs. On top
of this representation, we develop support techniques for the experts to detect conflicts in
a set of their assertions. Then we guide the conflict resolution by offering two primi-
tives: conflict-structure interpretation and what-if analysis. While the former presents
meaningful interpretations for the conflicts and various heuristic metrics, the latter
greatly helps the experts to understand the consequences of their own decisions as well as
those of others. Last but not least, we implement an argumentation-based negotiation
support tool for schema matching (ArgSM) [NLM+ 13], which realizes our methods to
help the experts in the collaborative task.
Figure 5.1: The motivating example. (a) A network of schemas and correspondences generated by
matchers; there are two violations: {c2, c4} w.r.t. the one-to-one constraint and {c1, c3, c4} w.r.t. the
cycle constraint. (b) An illustration of collaborative reconciliation between three video content providers:
EoverI, BBC, and DVDizzy. The assertions (approvals/disapprovals) of BBC and DVDizzy are identical
and differ from those of EoverI.
In this example, the experts might agree or disagree about certain correspondences.
For example, c3 is approved by all the experts, but c4 is approved by only two of them.
To obtain a final decision, we have to resolve the conflicts (i.e., approvals and disapprovals
of the same correspondences). However, simple techniques for conflict resolution such
as majority voting are not applicable if the application requires the integrity constraints
to hold.
For example, the choice of considering a correspondence correct can influence the possible
choices for other correspondences, so the resulting set of correspondences would not
comply with these constraints. To resolve these problems, the experts need to discuss and
negotiate which correspondences to accept or reject. Because of the complex dependencies
between correspondences in the schema matching network, it is very challenging for the
experts to foresee all possible consequences of their decisions. Thus, on the one hand, it is
highly desirable to split the reconciliation task; on the other hand, combining individual
results is very challenging. Our work addresses exactly this problem by proposing a
number of services, and a tool realizing those services, to enable the collaborative process.

5.2 Model and System Overview
Since collaborative reconciliation involves multiple experts validating the generated
correspondences at the same time, we need to distribute the work among them. Espe-
cially if the size of the schema matching network is large, the validation task can be
rather expensive. Moreover, some experts might be more knowledgeable about certain
parts of the network, so in these cases it is very natural to split/partition the task
among multiple experts. In order to realize this task partitioning, we employ the network
modularity techniques described in Section 3.5. It is worth noting that the problem of
how to partition the task efficiently is out of the scope of this work; we focus only on
resolving the conflicts that happen during collaborative reconciliation.
Example 8. Figure 5.2 illustrates the task partitioning problem for a schema matching
network. There are three experts participating in validating five correspondences. In this
specific case, we use the k-way hypergraph partitioning technique mentioned in
Section 3.5. The output contains three subsets of correspondences, one assigned to each
expert: {c2, c3, c4, c5}, {c1, c2, c5}, and {c1, c3, c4}.
Figure 5.2: Task partitioning of the correspondences of a schema matching network by k-way hypergraph
partitioning (k = 3).
• Input combination: In the second phase, the individual inputs are combined. The
goal of the collaborative reconciliation process is to construct a set of correspon-
dences M that satisfies all constraints. If there are conflicting views about corre-
spondences (for example, one expert considers a correspondence correct while another
considers it incorrect), then the experts need to come to a conclusion and choose
which view to accept.
Figure 5.3: The collaborative reconciliation process starts with a set of correspondences C generated
by matchers. In Phase 1, each expert (user/participant) i is responsible for validating a particular set
Ci ⊂ C. It is followed by Phase 2 that has multiple negotiation steps (3.1. to 3.N) to resolve conflicts in
user inputs.
We leverage existing techniques from a large body of research [JFH08, Bel11, QCS07] for
the first phase. In this chapter, we focus on the second phase. More precisely, we apply
the theoretical advances of argumentation to detect conflicts in user inputs and to guide
the users in resolving them. Sections 5.3, 5.4, and 5.5 describe those functionalities in
detail.
5.3 Detecting Conflicts in User Inputs
Let us consider a setting where several experts assess a set of attribute correspondences
in a schema matching network. They might have different views on whether a given
correspondence is correct or not. To complete the reconciliation task, they need to
discuss and resolve these conflicts to obtain a globally consistent set of correspondences.
The conflicts between the different user views can be rather complex in the presence of
integrity constraints. We call a situation direct conflict if two experts disagree about
a given correspondence (one of them thinks it is correct, while the other claims that it
is incorrect). In the presence of integrity constraints, we can also talk about indirect
conflicts. For example, in Figure 5.1, if we assume the one-to-one constraint between
S1 and S3 and an expert considers c4 correct, then c2 must be incorrect (otherwise the
constraint would be violated). We call a situation where a second expert thinks that c2
is correct an indirect conflict.
Encoding the experts' assertions as arguments allows us to systematically detect the
conflicts that might exist in the experts' inputs. The argumentation framework enables
even more complex tasks, which we explain in the following sections.
In general, an argument reflects not only the current beliefs of users but also the conse-
quences of those beliefs. Formally, an argument consists of two elements: (i) a claim that
states the belief and (ii) a support that provides the belief's explanation. In our setting,
a claim is either an approval or a disapproval of a correspondence, while a support is a
conjunction of approvals, disapprovals, and integrity constraints. For example, consider
the argument w4e = ⟨{c2, ¬c2 ∨ ¬c4}, ¬c4⟩ in Figure 5.1: the claim is a disapproval (¬c4)
and the support contains an approval (c2) and a one-to-one constraint (¬c2 ∨ ¬c4). The
formal definition of argumentation was proposed in [BH08]:
Combining the inputs of participants usually involves conflicts. In fact, there are two
types of conflict: direct conflict and indirect conflict. A direct conflict is a contradic-
tion regarding two opposite assertions on a particular correspondence; in other words,
given a particular correspondence, the simultaneous existence of an approval and a dis-
approval of this correspondence is called a direct conflict. In contrast, an indirect conflict
is a contradiction that emerges only after some reasoning steps. Recall the running example
in Figure 5.1: the approval of EoverI on c2 and that of BBC on c4 form an indirect conflict
according to the one-to-one constraint.
Detecting direct conflicts is trivial. For indirect ones, however, this is not the case,
as reasoning ability is required, especially when the set of integrity constraints is con-
tinuously modified. We harness the power of abstract argumentation [Dun95] to detect
both types. More precisely, we analyze the attack relation between arguments based
on the concept of an abstract argumentation framework.
• A is a set of arguments, A ⊆ ∆.
• R ⊆ A × A is the attack relation between arguments.
For brevity, we reserve the term argumentation framework for the abstract one from
this point on, as more about this framework will be discussed later in this chapter.
Continuing the running example in Figure 5.1b, w2b attacks w6e and vice
versa, while w5e attacks w2b. Hence, the argumentation framework has {w1b, w3e, w5e, w6b} ⊂
A and {w2b ↔ w6e, w5e → w2b} ⊂ R. More details about attacks and the complete argu-
mentation framework will be presented later.
Now we show how to detect conflicts using abstract argumentation, given a set of user
inputs {F1, F2, ..., Fn}. For each user i, we maintain a framework instance ⟨Li, ℓ, ∆i⟩ based
on his assertions Fi. Then, we combine the arguments of all users into a single set A = ⋃i ∆i
and construct an argumentation framework ⟨A, R⟩. By analyzing the attack relations in
R, not only do we know which correspondences contradict each other, but also
the reasons for the contradictions. The construction of R, specific to schema matching, is
described below. It is worth noting that a set of user inputs is conflict-free if and only
if the set of attack relations R = ∅.
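As an illustration of this conflict-freeness test, the following sketch encodes arguments propositionally and derives a simplified attack relation; it only covers literal-level rebuts and undercuts, whereas the actual framework relies on the richer attack types of [BH08] computed by the ASP tooling of Section 5.5.

from typing import NamedTuple, FrozenSet

class Argument(NamedTuple):
    support: FrozenSet[str]   # approvals 'c2', disapprovals '~c4', constraints
    claim: str                # e.g. 'c4' or '~c4'

def neg(lit: str) -> str:
    """Negate a propositional literal."""
    return lit[1:] if lit.startswith('~') else '~' + lit

def attacks(a: Argument, b: Argument) -> bool:
    """a attacks b if a's claim contradicts b's claim (rebut)
    or contradicts some element of b's support (undercut)."""
    return a.claim == neg(b.claim) or neg(a.claim) in b.support

def attack_relation(arguments):
    return {(a, b) for a in arguments for b in arguments
            if a is not b and attacks(a, b)}

def conflict_free(arguments) -> bool:
    """The user inputs are conflict-free iff the attack relation is empty."""
    return not attack_relation(arguments)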
5.4 Guiding the Conflict Resolution
The techniques of the previous section allow us to detect conflicts in the experts' assertions
and determine the reasons for these problems. In this section, we focus on techniques that
exploit this information to guide the experts in resolving these conflicts.
In particular, we describe two services that can greatly help the collaborating
experts: (1) the interpretation of conflict structures, with which we can present
meaningful interpretations for the conflicts together with associated metrics that support
the negotiation among the experts, and (2) the what-if analysis, with which we can
compute (and, in the tool, visualize) the consequences of a particular potential decision.
The details of these services are provided in what follows.
For instance, in Figure 5.1, the correspondence c4 constitutes not only a one-to-one
constraint violation (with c2) but also a cycle constraint violation (with c1 and c3).
We can not only compute the extensions and witnesses to facilitate the discussion among
the experts, but also associate heuristic metrics with them to further support their work.
Furthermore, we enable the users to rank decisions based on the strengths of their
explanations (arguments) or of the decisions themselves.
Argument strength. Computing extensions is a preliminary step to evaluate argu-
ments. From the occurrences of an argument in the extensions, we compute the argument
strength. Given the set of extensions E of an argumentation framework with respect to an
acceptability semantics, the strength of an argument a is the number of its occurrences
in the extensions divided by the total number of extensions:

strength(a) = |{E ∈ E : a ∈ E}| / |E|
With argument strength, we have a fine-grained metric to rank arguments and
assist the users in making wiser decisions. Indeed, we were motivated by the notion of
argument acceptance [DM04], which evaluates arguments based on their occurrence
in all extensions of a given acceptability semantics. However, this is a coarse metric,
which does not take into account the differences between the numbers of occurrences of
arguments in the extensions. This shortcoming prevents the users from taking a detailed
look at the credibility of arguments when comparing them.
Decision strength. Providing explanations might be overwhelming for the users, es-
pecially when there are too many arguments. Therefore, we associate each decision with
a quantitative metric reflecting the decision strength, which is computed by applying
aggregate operators (max, min, avg, etc.) to the set of supporting arguments. Based on
this metric, the users can evaluate which decisions should be made in a specific cir-
cumstance. It is also important to note that the number of ambiguous correspondences
is generally large. As a result, it is necessary to identify the sequence of correspondences
to negotiate. In practice, one may define such a sequence by taking pairs of decisions for
and against a correspondence in ascending order of their level of ambiguity, which can
be measured by the difference between the strengths of the associated decisions.
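Both metrics are easy to compute once the extensions are available, as the following sketch shows; extensions is assumed to be a list of sets of arguments, and aggregate is one of the operators mentioned above.

def argument_strength(a, extensions):
    """strength(a): fraction of extensions that contain argument a."""
    return sum(1 for E in extensions if a in E) / len(extensions)

def decision_strength(supporting_arguments, extensions, aggregate=sum):
    """Aggregate (sum, max, min, ...) of the strengths of the arguments
    supporting a decision."""
    return aggregate(argument_strength(a, extensions)
                     for a in supporting_arguments)

# e.g., with extensions E1 = {a1, a2} and E2 = {a1}:
# argument_strength(a1, [E1, E2]) == 1.0, argument_strength(a2, [E1, E2]) == 0.5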
Example 10. An example of argument and decision strength can be found in Figure 5.4.
In that figure, we have on the left the decisions (circle shapes) supporting and opposing
each correspondence in the network, as well as the associated arguments (square shapes).
We follow the complete acceptability semantics to compute argument strengths and apply
the sum operator to evaluate decision strengths. Those values are displayed right above
the corresponding shapes. We can observe that the decision to disapprove c4 possesses
the higher strength; thus, disapproving c4 would be the better decision to take. In fact, this
outcome aligns with what the users would achieve using the qualitative metrics alone.
Having the quantitative metrics (argument and decision strength), however, gives the
users more detail and thus increases their confidence in making decisions.
By answering these questions, we can provide the experts with two important views.
The (1) Local view reflects the relationships among the inputs of a single participant. Each
participating expert can check whether his new assertion conflicts with his previous
inputs. Technically, we use the answers to Q1 and Q2 to construct this view. When a user
gives a new assertion, his own arguments are maintained. If any attack between those
arguments is found, the user is notified to adjust his inputs to avoid further inconsis-
tencies. Besides, the (2) Global view reflects the connections between the inputs of multiple
users. All participants can observe the negotiation progress. To construct this view, we
use the answers to Q2 and Q3; the numbers of attacks and extensions are maintained.
On the one hand, the users understand the current conflicts (attacks) among their ar-
guments and the impact of those inconsistencies. On the other hand, keeping track of
the extensions lets them know the current state of the system and when it reaches an
agreement.
Example 11. In Figure 5.1, the three video content providers now attempt to change their
assertions to reach an agreement. From the viewpoint of EoverI, he might change his disap-
proval of c4, since the others both approve c4. If EoverI approves c4, two new arguments
w7e = ⟨{c4}, c4⟩ and w8e = ⟨{c4, ¬c2 ∨ ¬c4}, ¬c2⟩ and two new attacks w7e ↔ w4e, w8e ↔ w2e will
be added. Through the local view, EoverI can foresee these new arguments and attacks and
realize the contradiction by himself. From the viewpoint of BBC and DVDizzy, they might
change their approval of c4 because of EoverI. If they disapprove c4, two arguments w2d
and w2b will be deleted; hence, there is no attack between the remaining arguments and
only one extension remains. Through the global view, they can foresee this consequence and
feel more confident about making changes. In addition, they might also agree with EoverI on
c1 and c2, since no further contradiction exists.
5.5 Implementation
In the previous sections, we discussed how to support collaborative reconciliation
by detecting conflicts in the assertions of multiple experts and guiding the resolu-
tion of these conflicts. To accomplish these tasks, we need to realize the argumentation
framework, which can only be achieved after generating arguments and computing the
attack relation. This section serves a two-fold purpose. First, we present how to instan-
tiate an argumentation framework using ASP-based tools. Second, we describe how to
implement the proposed services on top of this argumentation framework.
Second, we use rules to express the network-level integrity constraints. Atoms prefixed
with # are built-in functions of DLV-Complex.
• Cycle constraint: a path of correspondences must not make two different attributes of
the same schema reachable from each other. Each atom rch(S, a, a′) signifies the
reachability between attributes a and a′ via the correspondences in S; this
reachability relation is encoded in the program πcycle.
Π_inputs^EoverI = {app(c1). app(c2). app(c3). dis(c4).}
Π_inputs^BBC = Π_inputs^DVDizzy = {app(c3). app(c4).}
Constructing the set of formulae (ΠΦ). We construct this set by extracting propo-
sitional formulae from either the assertions (through πsimple) or the detected violations
(through πextract). Formally, ΠΦ = πsimple ∪ πextract. Each atom kb(·) captures a for-
mula. For assertions, we consider each assertion app(c) (or dis(c)) as a simple formula
c (or ¬c); this is captured by the program πsimple.
Things are more complex in the case of the constraint violations (detected by the
πcycle and π1−1 of ΠΓ). From a violation {c1, · · · , cn}, we can state that at least one of
the assertions must be false, which is expressed formally by the formula ¬c1 ∨ · · · ∨ ¬cn.
Such formulae are extracted from the detected violations by the program πextract, whose
process is described below:
• From each list L, we create sublists starting from the first to the (n − 1)th element
([c1, · · · , cn], [c2, · · · , cn], · · · , [cn−1, cn]).
Example 12. In Figure 5.1b, we have collected the inputs of EoverI. Through πsimple,
we obtain the formulae kb(c1), kb(c2), kb(c3), and kb(neg(c4)). Besides, we have also detected
the violations in the schema matching network, from which πextract forms two formulae:
kb(or(neg(c2), neg(c4))) and kb(or(neg(c1), or(neg(c3), neg(c4)))). These ground atoms
kb(·) compose the set of formulae of EoverI.
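The essence of πsimple and πextract can be mirrored in a few lines of ordinary code; the sketch below reproduces the term encoding of Example 12 and is only an illustration of the extraction logic, not the actual ASP programs.

def assertion_formulas(approved, disapproved):
    """pi_simple: each approval app(c) yields the formula c; each dis(c)
    yields neg(c)."""
    return list(approved) + [f"neg({c})" for c in disapproved]

def violation_formulas(violations):
    """pi_extract: a violation {c1,...,cn} means at least one assertion is
    false, i.e. neg(c1) v ... v neg(cn), nested here as
    or(neg(c1), or(neg(c2), ...)) to mirror the term encoding."""
    formulas = []
    for violation in violations:
        lits = [f"neg({c})" for c in violation]
        f = lits[-1]
        for lit in reversed(lits[:-1]):
            f = f"or({lit}, {f})"
        formulas.append(f)
    return formulas

# Example 12, EoverI's inputs:
kb = assertion_formulas(["c1", "c2", "c3"], ["c4"]) \
   + violation_formulas([["c2", "c4"], ["c1", "c3", "c4"]])
# kb contains 'or(neg(c2), neg(c4))' and 'or(neg(c1), or(neg(c3), neg(c4)))'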
With the set of formulae, we proceed to generate first the arguments and then the attack
relation, with the goal of composing an argumentation framework. One could consider
the formulae in the set we just obtained as candidates for argument claims. This ap-
proach, however, would easily overwhelm the users due to the huge number of generated
arguments, since many formulae are syntactically different but semantically equivalent.
To avoid this scenario, we limit the candidates for argument claims: in practice,
users are concerned more with arguments claiming to approve or disapprove correspondences,
so we select the set of possible claims from the assertions. In the motivating example,
the possible claims for EoverI are cl(c1), cl(c2), cl(c3), and cl(neg(c4)).
We take advantage of Vispartix [CWW12], an ASP-based tool, to not only generate
arguments but also compute the attack relation. For argument generation, the tool
considers only subsets of the set of formulae and the set of possible claims. It then looks
for pairs which can be considered as arguments (Figure 5.1b presents an example of
these generated arguments). Once the set of arguments is ready, we compute
the attack relation. This is done by invoking the corresponding feature of Vispartix
with the set of all arguments (the union of the arguments of each user) as the input.
Vispartix provides the users with several attack types [BH08], such as defeat, undercut,
and rebut.
In Sections 5.3 and 5.4, we presented the elements of an argumentation framework and the
services (conflict detection, interpretation of conflict structures, what-if analysis) offered
on top of this framework. In this subsection, we describe how to realize these services,
with a focus on the technical aspects.
5.5.2.1 Conflict Detection.
We detect conflicts based on the results of the ASP solver described in Section 5.5.1,
where we encoded the integrity constraints in the language of ASP. The solver
DLV-Complex is responsible for detecting violations based on our encodings. From the
results of the solver, we obtain the atoms vioList(L) as lists of violations, each of
which contains a set of involved correspondences. Moreover, in our system, we show not
only the violations but also the explanations for them; in doing so, we analyze
the attack relation R of the argumentation framework. The user inputs are valid if the
vioList is empty or R is empty.
5.5.2.2 Interpretation of Conflict Structures.
To realize this service, we need to compute four elements: the extensions, the witnesses,
the argument strengths, and the decision strengths. In Section 5.5.1, we already generated
the set of arguments. As previously defined, a witness of a claim is a set of arguments having
this claim; by grouping arguments sharing the same claim, we obtain the witnesses
for all possible claims. Then, we employ Vispartix [CWW12] to generate all possible
extensions under different semantics. After obtaining all extensions, we compute the
argument and decision strengths as described in Section 5.4.1.
5.5.2.3 What-If Analysis.
To realize this service, we need to recompute the argumentation framework and all
possible extensions when a user modifies an assertion. Then, we compare the differences
between the newly computed results and the current ones. Based on these differences,
we know which arguments, attacks, and extensions are added or deleted, which answers
the three what-if questions of Section 5.4.2. This service is implemented with the support
of Vispartix [CWW12], which allows efficient recomputation.
All the above-mentioned services are integrated in our argumentation-based negotiation
support tool, ArgSM. This tool not only implements these services but also provides a
graphical user interface. The details are described in the next section.

5.6 Tool - ArgSM
• Schema view. This view shows the schema matching network, intended for users who
do not have a deep understanding of argumentation; hence its other name, User view.
In this view, correspondences are highlighted according to their status. There
are three possible statuses: (1) all approved and (2) all disapproved, respectively, for
correspondences that are approved or disapproved by all users, and (3) ambiguous
for those that are approved by some users and disapproved by the others.
1 JUNG - http://jung.sourceforge.net
2 https://code.google.com/p/argsm/wiki/ArgSM
• Argumentation view. Also called the technical view, it is intended for those who
have knowledge of argumentation. In this view, the numbers outside the shapes in-
dicate the strengths of decisions (circles) or witnesses (squares), respectively. Two
perspectives are supported.
Those views are further supported by three view modes. Apart from the Normal mode,
which is set by default and allows no interaction at all, the other modes allow users to
interact with the network and the arguments.
For a better understanding and a stronger feeling of trust, ArgSM not only generates
explanations but also shows the foreseeable effects of each decision. Technically, we
keep the strengths of arguments and the possible decisions up to date during negotiation.
• Partitioning. We divide the correspondences into subsets that are small enough
to be manageable for the experts. The details are described in Section 5.2.
Figure 5.4: The GUI of ArgSM, with Argumentation view (left) and Schema view (right)
• Filtering. It is not useful to generate each and every argument; we only generate
arguments for predefined claims. This not only reduces computation time but
also avoids overwhelming the users. Every argumentation process operates on
the filtered set, which may be refined by modifying the predefined claims.
5.7 Summary
We presented an argumentation-based tool to support collaborative reconciliation, where
multiple users, with differing opinions, cooperate to validate the outputs of au-
tomatic matchers. While splitting the reconciliation task is highly desirable, combining
the individual results in the presence of consistency constraints is very challenging for
the collaborating experts. Our tool and its services facilitate this collaboration. In par-
ticular, we systematically detect conflicts and provide the experts with visual information
to understand the causes of the problems. Moreover, we offer services to better understand
decision consequences and make collaborative reconciliation more transparent.
Our work opens up several future research directions. First, we will design a negotiation
protocol to enable negotiation within our tool. Second, we would like to extend the set
of proposed constraints and consider further integrity constraints that are relevant in
practice (e.g., functional dependencies, domain-specific constraints). Third, we would like
to apply our methods to other problems: while our work focuses on schema matching,
our techniques, especially the argumentation-based reconciliation, could be applicable
to other tasks such as entity resolution or business process matching.
Chapter 6
Crowdsourced Reconciliation
6.1 Introduction
In this chapter, we approach the reconciliation of a schema matching network by leverag-
ing the "wisdom of the crowd" to assert correspondences. Getting assertion
feedback from experts can be expensive and time-consuming; especially with a large
number of schemas and (possible) connections between them, the validation task would
require an extreme effort. Addressing this problem, this chapter demonstrates the use
of crowdsourcing for reconciling a schema matching network. We employ a
large number of online users, so-called crowd workers, to assert the correspondences by
answering validation questions. With the advent of crowdsourcing platforms such as
Amazon Mechanical Turk and CloudCrowd, it has become faster and cheaper to acquire
assertions from several crowd workers.
Crowdsourcing techniques have been successfully applied to several data manage-
ment problems, for example in CrowdSearch [YK10] or CrowdScreen [PGMP+ 12].
McCann et al. [MSD08] have already applied crowdsourcing methods to schema matching.
In their work, they focused on matching a pair of schemas, but their methods are not
directly applicable to the matching networks that are our main interest. On top of such
networks, we exploit the relations between correspondences to define the network-level
integrity constraints. Leveraging these constraints opens up several opportunities to not
only guide the crowd workers effectively but also reduce the necessary human effort
significantly.
The main takeaways of this chapter are as follows. First of all, we develop a crowd-
sourcing framework built on top of the schema matching network. Secondly, we design
the questions presented to the crowd workers in a systematic way. In our design, we focus
on providing contextual information for the questions, especially the transitivity relations
between correspondences. The aim of this contextual information is to reduce question
ambiguity so that workers can answer more rapidly and accurately. Finally, we design
an aggregation mechanism to combine the answers from multiple crowd workers. In
particular, we study how to aggregate answers in the presence of integrity constraints. Our
theoretical and empirical results show that by harnessing the network-level constraints,
the worker effort can be lowered considerably.
The outline of the chapter is as follows. We first present an overview of our
framework in Section 6.2. In Section 6.3, we describe how to design the questions
that should be presented to crowd workers. In Section 6.4, we formulate the problem
of aggregating the answers obtained from multiple workers and describe our aggregation
methods that exploit the presence of integrity constraints. In Section 6.5, we show how
to evaluate and control worker quality, given that crowd workers have wide-
ranging levels of expertise. Section 6.6 presents experimental results, while Section 6.7
concludes the chapter.
Figure 6.1: Architecture of the crowdsourced reconciliation framework
For realizing this process, we propose the framework depicted in Figure 6.1. The
input to our framework is a set of correspondences C. These correspondences are fed
to the Question Builder component, which generates the questions presented to crowd
workers. A worker's answer is the validation by worker ui of a particular correspondence
cj ∈ C, denoted as a tuple ⟨ui, cj, a⟩, where a is the answer of worker ui on correspondence
cj. The domain of a is {true, false}, where true/false indicates that cj is approved/dis-
approved. In general, the answers from crowd workers might be incorrect. There are
several reasons for this: the workers might misunderstand their tasks, they may
accidentally make errors, or they simply do not know the answers. To cope with
possibly incorrect answers, we need aggregation mechanisms, realized in the answer
aggregation component of our framework.
6.3 Question Design
• Transitive closure: We not only display all alternatives, but also the tran-
sitive closure of correspondences. The goal of displaying the transitive clo-
sure is to provide a context that helps workers resolve the ambiguity when
the alternatives are otherwise hard to distinguish. For example, in Fig-
ure 6.2(B), workers might not be able to decide which one of the two attributes
DVDizzy.productionDate and DVDizzy.availabilityDate corresponds to the attribute
EoverI.releaseDate. Thanks to the transitive closure DVDizzy.availabilityDate →
BBC.screeningDate → EoverI.releaseDate, workers can confidently confirm the cor-
rectness of the match between EoverI.releaseDate and DVDizzy.availabilityDate.
Figure 6.2: Question designs with three different types of contextual information: (A) all alternative
targets, (B) transitive closure, (C) transitive violation.
6.4 Answer Aggregation
There are several methods to compute this probability, such as majority voting [vA09]
and expectation maximization (EM) [DS79]. While majority voting aggregates each
correspondence independently, the EM method aggregates all correspondences
simultaneously. More precisely, the input of majority voting is the worker answers πc
for a particular correspondence c, whereas the input of EM is the worker answers
π = ⋃_{c∈C} πc for all correspondences.
In this chapter, we use EM as the main aggregation method to compute the probability
P r(Xc). The EM method differs from majority voting in that it considers the quality of
workers, which is estimated by comparing the answers of each worker against the other
workers' answers. More precisely, the EM method uses maximum likelihood estimation
to infer the aggregated value of each correspondence and to measure the quality of that
value. The reason behind this choice is that the EM model is quite effective for labeling
tasks and robust to noisy workers [SP08].
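To illustrate the difference from majority voting, the following is a simplified EM sketch with a single reliability parameter per worker; the actual model of [DS79] estimates richer confusion parameters, so this is only an approximation of the method used here.

def em_aggregate(answers, iters=20):
    """Minimal EM sketch for binary answer aggregation.
    `answers` maps (worker, correspondence) -> True/False."""
    workers = {w for w, _ in answers}
    corrs = {c for _, c in answers}
    rel = {w: 0.8 for w in workers}          # initial worker reliability
    prob = {c: 0.5 for c in corrs}           # Pr(Xc = true)
    for _ in range(iters):
        # E-step: posterior of Xc given the answers and reliabilities
        for c in corrs:
            p_true, p_false = 1.0, 1.0
            for w in workers:
                a = answers.get((w, c))
                if a is None:
                    continue
                p_true *= rel[w] if a else 1 - rel[w]
                p_false *= 1 - rel[w] if a else rel[w]
            prob[c] = p_true / (p_true + p_false)
        # M-step: reliability = expected fraction of correct answers
        for w in workers:
            ws = [(c, a) for (w2, c), a in answers.items() if w2 == w]
            rel[w] = sum(prob[c] if a else 1 - prob[c] for c, a in ws) / len(ws)
    return prob, rel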
After deriving the probability Pr(Xc) for each correspondence c ∈ C, we compute
the aggregation decision ⟨ac, ec⟩ = gπ(c), where ac is the aggregated value and ec is the
error rate. This decision is formulated as follows:
gπ(c) = ⟨true, 1 − Pr(Xc = true)⟩    if Pr(Xc = true) ≥ 0.5
        ⟨false, 1 − Pr(Xc = false)⟩  otherwise                    (6.1)
In equation 6.1, the error rate is the probability of making a wrong decision. In order
to reduce the error rate, we need to reduce the uncertainty of Xc (i.e., the entropy value
H(Xc)); if the entropy H(Xc) is close to 0, the error rate is close to 0. For the experiments
described in Section 6.6, achieving a lower error rate requires asking more questions.
However, given a requirement of low error rate, the monetary cost is limited and needs
to be reduced. In the next section, we will leverage the constraints to solve this problem.

6.4.2 Leveraging Constraints to Reduce Error Rate

Many studies in the literature [IPW10, PGMP+ 12] have shown that achieving a lower
error rate requires more answers. This is, in fact, the trade-off between cost and
accuracy [YK10]. The upper curve of Figure 6.3 empirically depicts a general case of
this trade-off.

Figure 6.3: Optimization goal: error rate (Y-axis) as a function of the number of answers and of the gap
between 'yes' and 'no' answers (X-axes), with curves for worker reliabilities r = 0.6, 0.7, 0.8. Without
constraints, lowering the error rate requires more answers; our goal is to reach the same error rate with
fewer answers by leveraging constraints.
In Section 6.4.1, we formulated the answer aggregation. Now we leverage con-
straints to adjust the error rate of the aggregation decision. More precisely, we show
that by using constraints, we need fewer answers to obtain an aggregated result with the
same error rate. In other words, given the same answer set on a certain correspondence,
the error rate of aggregation with constraints is lower than the one without constraints.
We consider very natural constraints that we assume to hold; in other words, we treat
them as hard constraints.
Given the aggregation gπ(c) of a correspondence c, we compute the justified ag-
gregation gπ^γ(c) when taking into account the constraint γ. The aggregation gπ^γ(c) is
obtained similarly to equation 6.1, except that the probability Pr(Xc) is replaced by the
conditional probability Pr(Xc|γ) given that the constraint γ holds. Formally,

gπ^γ(c) = ⟨true, 1 − Pr(Xc = true|γ)⟩    if Pr(Xc = true|γ) ≥ 0.5
          ⟨false, 1 − Pr(Xc = false|γ)⟩  otherwise                  (6.2)
In the following, we describe how to compute Pr(Xc|γ) for the 1-1 constraint and the
cycle constraint. Then, we show why the constraints can reduce the error rate. We
leave the investigation of other types of constraints for future work.
Our approach is based on the intuition illustrated in Figure 6.4(A), depicting two corre-
spondences c1 and c2 with the same source attribute. After receiving the answer set from
the workers and applying the probabilistic model (Section 6.4.1), we obtain the probabilities
Pr(Xc1 = true) = 0.8 and Pr(Xc2 = false) = 0.5. When considering c2 independently,
it is hard to conclude whether c2 should be approved or disapproved. However, when taking
into account c1 and the 1-1 constraint, c2 tends to be disapproved, since c1 and c2 cannot
be approved simultaneously. Indeed, following probability theory, the conditional probability
Pr(Xc2 = false|γ1−1) ≈ 0.83 > Pr(Xc2 = false).
In what follows, we formulate the 1-1 constraint in terms of probability and then
show how to compute the conditional probability Pr(Xc|γ1−1).
Formulating the 1-1 constraint. Given a matching between two schemas, let us have
a set of correspondences {c0, c1, . . . , ck} that share a common source attribute.
(A) Pr(Xc1 = true) = 0.8, Pr(Xc2 = true) = 0.5:
c1  c2  | Prob | γ1−1
T   T   | 0.4  | Δ
T   F   | 0.4  | 1.0
F   T   | 0.1  | 1.0
F   F   | 0.1  | 1.0

(B) Pr(Xc1 = true) = Pr(Xc2 = true) = 0.8, Pr(Xc3 = true) = 0.5:
c1  c2  c3  | Prob | γ↻
T   T   T   | 0.32 | 1.0
T   T   F   | 0.32 | 0.0
T   F   T   | 0.08 | 0.0
T   F   F   | 0.08 | Δ
F   T   T   | 0.08 | 0.0
F   T   F   | 0.08 | Δ
F   F   T   | 0.02 | Δ
F   F   F   | 0.02 | Δ

Figure 6.4: Computing the conditional probability with (A) the 1-1 constraint and (B) the cycle constraint.
With respect to the 1-1 constraint definition, at most one ci is approved (i.e., Xci = true).
However, there are some exceptions where this constraint does not hold: for instance,
the attribute name might be matched with both firstname and lastname. Such cases
happen only with low probability. In order to capture this observation, we formulate the
1-1 constraint as follows:

Pr(γ1−1 | Xc0, Xc1, . . . , Xck) = 1            if m ≤ 1
                                  Δ ∈ [0, 1]   if m > 1             (6.3)

where m is the number of Xci assigned true. When Δ = 0, there is no constraint
exception. In general, Δ is close to 0. An approximate value of Δ can be obtained
through a statistical model [CMAF06a].
x can be interpreted as the probability that all correspondences except c0 are dis-
approved. y can be interpreted as the probability that all correspondences are disapproved
or that exactly one of them is approved. The precise derivation of equation (6.4) is as
follows. According to Bayes' theorem,

Pr(Xc0 | γ1−1) = Pr(γ1−1 | Xc0) × Pr(Xc0) / Pr(γ1−1).

Now we need to compute Pr(γ1−1) and Pr(γ1−1 | Xc0). Denote pi = Pr(Xci = true) for
short. In order to compute Pr(γ1−1), we take the following steps: (1) express Pr(γ1−1)
as the sum over the full joint distribution of γ1−1, c0, c1, . . . , ck; (2) express the joint as
a product of conditionals. Formally, we have:

Pr(γ1−1) = Σ_{c0,c1,...,ck} Pr(γ1−1, Xc0, Xc1, . . . , Xck)
         = Σ_{c0,c1,...,ck} Pr(γ1−1 | Xc0, Xc1, . . . , Xck) × Pr(Xc0, Xc1, . . . , Xck)
         = 1 × Pr(m(Xc0, . . . , Xck) ≤ 1) + Δ × Pr(m(Xc0, . . . , Xck) > 1)
         = y + Δ × (1 − y)

Similarly, we express Pr(γ1−1 | Xc0) as the sum over the full joint of γ1−1, c1, . . . , ck and
then express the joint as a product of conditionals. After these steps, we have
Pr(γ1−1 | Xc0 = true) = x + Δ × (1 − x), where x = ∏_{i=1}^{k} (1 − pi). Having Pr(γ1−1)
and Pr(γ1−1 | Xc0), we can compute Pr(Xc0 | γ1−1) as in equation (6.4):

Pr(Xc0 = true | γ1−1) = p0 × (x + Δ(1 − x)) / (y + Δ(1 − y))        (6.4)
Theorem 2. The conditional probability of a correspondence c being false with the 1-1
constraint is greater than or equal to the probability of c being false without the constraint.
Formally, Pr(Xc = false|γ1−1) ≥ Pr(Xc = false).
Proof. From equation (6.4), we can rewrite y = x + Σ_{i=1}^{k} [pi ∏_{j=0, j≠i}^{k} (1 − pj)].
Since Σ_{i=1}^{k} [pi ∏_{j=0, j≠i}^{k} (1 − pj)] ≥ 0 and Δ ≤ 1, we have
x + Δ(1 − x) ≤ y + Δ(1 − y). Following this inequality and equation (6.4), we conclude
Pr(Xc = true|γ1−1) ≤ Pr(Xc = true) ⇔ Pr(Xc = false|γ1−1) ≥ Pr(Xc = false).
From this theorem, we conclude that the error rate is reduced only when the ag-
gregated value is false. From equation 6.1 and 6.2, the error rate with 1-1 constraint
(i.e. 1 − P r(Xc = false|γ1−1 )) is less than or equal to the one without constraint (i.e.
1 − P r(Xc = false)). In other words, the 1-1 constraint supports reducing the error rate
when the aggregated value is false.
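The computation behind equation (6.4) can be sketched as follows; the helper names are hypothetical, and the example at the end reproduces the numbers of Figure 6.4(A).

from math import prod

def at_most_one_true(ps):
    """Probability that at most one of independent Bernoulli(ps) is true."""
    none = prod(1 - p for p in ps)
    one = sum(p * prod(1 - q for j, q in enumerate(ps) if j != i)
              for i, p in enumerate(ps))
    return none + one

def pr_true_given_one_to_one(p0, others, delta=0.0):
    """Pr(Xc0 = true | gamma_1-1), following equation (6.4):
    Pr(gamma|Xc0=true) = x + delta*(1-x), x = prod(1-pi) over the others;
    Pr(gamma) = y + delta*(1-y), y = Pr(at most one of c0..ck true)."""
    x = prod(1 - p for p in others)
    y = at_most_one_true([p0] + list(others))
    return p0 * (x + delta * (1 - x)) / (y + delta * (1 - y))

# Figure 6.4(A): p(c1) = 0.8, p(c2) = 0.5, delta = 0, so
# Pr(Xc2 = false | gamma) = 1 - pr_true_given_one_to_one(0.5, [0.8]) ~ 0.83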
Pr(γ↻ | Xc0, Xc1, . . . , Xck) = 1            if m = k + 1
                                 0            if m = k
                                 Δ ∈ [0, 1]   otherwise             (6.5)

where m is the number of Xci assigned true and Δ is the probability of compensating
errors along the cycle (i.e., two or more incorrect assignments that compensate each
other and still yield a consistent cycle).
Computing the conditional probability. Given a closed cycle along c0, c1, . . . , ck, let
us denote the constraint on this cycle by γ↻ and write pi = Pr(Xci = true) for short.
Without loss of generality, we consider c0 to be the favorite correspondence whose
probability p0 is obtained from the workers' answers in the crowdsourcing process.
Following Bayes' theorem and equation (6.5), the conditional probability of correspondence
c0 under the cycle constraint is computed as in equation (6.6).
x can be interpreted as the probability that exactly one correspondence among
c1, . . . , ck (excluding c0) is disapproved. y can be interpreted as the probability that
exactly one correspondence among c0, c1, . . . , ck is disapproved. The detailed derivation
of equation (6.6) is as follows. According to Bayes' theorem,
Pr(Xc0 | γ↻) = Pr(γ↻ | Xc0) × Pr(Xc0) / Pr(γ↻). In order to compute Pr(γ↻ | Xc0) and
Pr(γ↻), we again express each as the sum over the full joint of γ↻, c0, c1, . . . , ck and
then express the joint as a product of conditionals. After some transformations, we
obtain equation (6.6).
Theorem 3. With Δ sufficiently small, the conditional probability of a correspondence c
being true with the cycle constraint is greater than or equal to the probability of c being
true without the constraint: Pr(Xc = true|γ↻) ≥ Pr(Xc = true).
Proof. After some transformations, we can derive that Pr(Xc = true|γ↻) ≥ Pr(Xc = true)
is equivalent to (1 − p0) ∏_{i=1}^{k} pi ≥ Δ(x − y). Moreover, we have
x − y = (1 − p0)(x − ∏_{i=1}^{k} pi), so the condition reduces to
∏_{i=1}^{k} pi ≥ Δ(x − ∏_{i=1}^{k} pi). Note that this condition on Δ is often satisfied,
since Δ is close to 0 and the pi are close to 1.
From this theorem, we conclude that the error rate is reduced only when the aggregated
value is true. With an appropriately chosen Δ, in equations 6.1 and 6.2, the error rate
with the cycle constraint (i.e., 1 − Pr(Xc = true|γ↻)) is less than or equal to the one
without the constraint (i.e., 1 − Pr(Xc = true)). In other words, the cycle constraint
supports reducing the error rate when the aggregated value is true.
In general settings, we could have a finite set of constraints Γ = {γ1, . . . , γn}. Let
gπ^γi(c) = ⟨ai_c, ei_c⟩ denote the aggregation with a constraint γi ∈ Γ, whereas the
aggregation without any constraint is simply written gπ(c) = ⟨ac, ec⟩. Since the
constraints are different, not only can the aggregated values differ (ai_c ≠ aj_c) but also
the error rates (ei_c ≠ ej_c). In order to reach a single decision, the challenge
then becomes how to define the multiple-constraint aggregation gπ^Γ(c) as a combination
of the single-constraint aggregations gπ^γi(c).
Since the role of the constraints is to support reducing the error rate and the aggregation
gπ(c) is the base decision, we compute the multiple-constraint aggregation as gπ^Γ(c) =
⟨ac, e^Γ_c⟩, where e^Γ_c = min({ei_c | ai_c = ac} ∪ {ec}). We take the minimum of the
error rates in order to emphasize the importance of the integrity constraints, which are
the focus of this work. Therefore, the error rate of the final aggregated value is reduced
by harnessing constraints. In the experiments with real datasets described in the next
section, we will show that this aggregation halves the worker effort while preserving the
quality of the aggregated results.
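The combination rule is a one-line minimum over the agreeing single-constraint error rates, as the following sketch shows.

def combine_aggregations(base, per_constraint):
    """g_pi^Gamma(c): keep the base aggregated value a_c and take the minimum
    error rate among the base decision and every single-constraint
    aggregation that agrees with a_c.
    base = (a_c, e_c); per_constraint = [(a_c^i, e_c^i), ...]."""
    a_c, e_c = base
    errors = [e for a, e in per_constraint if a == a_c] + [e_c]
    return a_c, min(errors)

# e.g., the base decision is (False, 0.5) and the 1-1 constraint yields
# (False, 0.17): combine_aggregations((False, 0.5), [(False, 0.17)])
# returns (False, 0.17).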
6.5 Worker Assessment
Crowd workers have wide-ranging levels of expertise, and some of them, so-called
spammers, answer carelessly or even randomly. Such workers increase the cost (since
they need to be paid) and at the same time decrease the accuracy of the final aggregated
results.
A mechanism to detect and eliminate spammers is a desirable feature for any crowd-
sourcing application. Once spammers are detected, we can greatly improve the result
accuracy (e.g., by rejecting the answers of spammers). In the following, we propose
three such mechanisms to detect spammers.
Detect by trapping questions. We use a set of trapping questions, whose true answers
are already known, to test the expertise of workers. Workers who fail to answer a specified
number of trapping questions are flagged as spammers and removed. Specifically, the
trapping questions can be used in two schemes:
• Random injection: the trapping questions are injected randomly into the question
set. The reliability of workers is evaluated as above, and spammers are
filtered out by a pre-defined reliability threshold. The advantage of this scheme is
that the spammers are not prepared for being tested; however, this scheme incurs
more cost, as we might still have to pay for the trapping questions.
Detect by thresholding. Sometimes trapping questions are not available for
detecting spammers; e.g., even the requester does not know the ground truth. We then
need an algorithm that detects spammers based only on the crowd itself. More
specifically, we define a scalar metric, called reliability, to represent the expertise of
workers. Spammers have a reliability close to zero, while good workers have a reliability
close to one. Since the same question is answered by multiple workers and a worker
answers multiple questions, we can use the answers of good workers to eliminate those
of spammers. The core idea is that spammers often give answers that differ from all of
the others; thus, their reliability will decrease while the others' increases.
We leverage the results of aggregation techniques to realize our algorithm. Most
answer aggregation techniques return the reliability of each worker, besides the aggregated
answer and the error rate of each question [NNTLNA13]. However, they have no
mechanism to explicitly detect spammers. To overcome this limitation, we devise such
a mechanism by iteratively thresholding on the worker reliability. In particular, we eliminate
the workers with reliability below a pre-defined threshold (e.g., 0.1), re-estimate
the reliability of the remaining workers, and repeat this procedure until no further spammer
is removed.
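The iterative procedure can be sketched as follows. For self-containedness, reliability is estimated here as agreement with the majority answer; in our setting, the reliability returned by an aggregation technique such as those evaluated in [NNTLNA13] would be used instead.

from collections import Counter

def estimate_reliability(answers: dict) -> dict:
    """Simple stand-in: a worker's reliability is his agreement with the majority vote."""
    questions = {q for given in answers.values() for q in given}
    majority = {}
    for q in questions:
        votes = Counter(given[q] for given in answers.values() if q in given)
        majority[q] = votes.most_common(1)[0][0]
    return {w: sum(given[q] == majority[q] for q in given) / len(given)
            for w, given in answers.items() if given}

def remove_spammers(answers: dict, threshold: float = 0.1):
    """Iteratively drop workers below the reliability threshold and re-estimate."""
    while True:
        reliability = estimate_reliability(answers)
        spammers = [w for w, r in reliability.items() if r < threshold]
        if not spammers:
            return answers, reliability
        for w in spammers:
            del answers[w]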
Detect by integrity constraints. We can further refine the reliability of a worker via
the integrity constraints. So far we have assumed that these constraints are of paramount
importance for controlling the quality of schema matching networks. As described in
Section 4.3.2, we can detect constraint violations in the input of any worker. Based
on this detection, spammers can be identified as workers with more than a pre-defined
number of violations, or as workers with distinctly more violations than the others.
For the sake of brevity, we omit the formal details of this mechanism.
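The essence of the omitted mechanism can nevertheless be sketched as follows, assuming the violation counts from Section 4.3.2 are available; the two bounds are illustrative.

from statistics import median

# violations: {worker: number of constraint violations detected in his input}
def flag_by_violations(violations: dict, max_violations: int = 5,
                       outlier_factor: float = 3.0) -> set:
    med = median(violations.values())
    # A worker is suspicious if he exceeds an absolute bound, or has
    # distinctly more violations than the typical (median) worker.
    return {w for w, v in violations.items()
            if v > max_violations or v > outlier_factor * max(med, 1)}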
Consider two workers w_1 and w_2 and partition their answers into three sets: A_t denotes
the set of true answers that both w_1 and w_2 provide, A_f denotes the set of false answers
that both w_1 and w_2 provide, and A_d denotes the set of answers on which w_1 and w_2
provide different values. Let A = A_t ∪ A_d ∪ A_f, which is also the union of all answers
of w_1 and w_2. We apply Bayes' theorem to compute the probability that w_1 and w_2
are dependent given their answers:
\[
\Pr(w_1 \sim w_2 \mid A) = \frac{\Pr(A \mid w_1 \sim w_2)\,\Pr(w_1 \sim w_2)}{\Pr(A \mid w_1 \sim w_2)\,\Pr(w_1 \sim w_2) + \Pr(A \mid w_1 \perp w_2)\,\Pr(w_1 \perp w_2)} \tag{6.7}
\]

\[
\Pr(w_1 \sim w_2 \mid A) = \left(1 + \frac{1-\alpha}{\alpha}\; r^{k_t - |V|}\,(1-r)^{k_f + |V|}\,\bigl(2(1-r)\bigr)^{k_d}\, r^{k_d}\right)^{-1} \tag{6.11}
\]
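A direct transcription of Equation 6.11, reading k_t, k_f, k_d as |A_t|, |A_f|, |A_d| and α as the prior Pr(w_1 ∼ w_2); the function and parameter names are ours.

def dependence_probability(k_t: int, k_f: int, k_d: int, n_v: int,
                           r: float, alpha: float) -> float:
    """Posterior Pr(w1 ~ w2 | A) per Equation 6.11; n_v stands for |V|."""
    ratio = (((1 - alpha) / alpha)
             * r ** (k_t - n_v)
             * (1 - r) ** (k_f + n_v)
             * (2 * (1 - r)) ** k_d
             * r ** k_d)
    return 1.0 / (1.0 + ratio)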
6.6 Experiments
The main goal of the following evaluation is to analyze the use of crowdsourcing tech-
niques for schema matching networks. To verify the effectiveness of our approach, four
experiments are performed: (i) effects of contextual information on reducing question
ambiguity, (ii) relationship between the error rate and the matching accuracy, (iii) effects
of the constraints on worker effort, and (iv) effects of constraints on detecting worker de-
pendency. We proceed to report the results on the real datasets using both real workers
and simulated workers.
We use the same datasets and tools as in Section 3.6. To simulate workers, we assume
that the ground truth is known in advance (i.e., known to the experimenter, but not to
the simulated crowd workers). Each simulated worker is associated with a pre-defined
reliability r, which is the probability that his answer is correct with respect to the
ground truth.
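Under this model, a simulated answer can be drawn as follows (a minimal sketch of the worker model, not the actual experimental harness).

import random

def simulated_answer(ground_truth: bool, r: float) -> bool:
    """A worker with reliability r agrees with the ground truth with probability r."""
    return ground_truth if random.random() < r else not ground_truth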
In this experiment, we select 25 correct correspondences (i.e., correspondences that exist
in the ground truth) and 25 incorrect correspondences (i.e., correspondences that do not).
For each correspondence, we ask 30 workers (Bachelor students), presenting three different
contextual information elements: (a) all alternatives, (b) transitive closure, (c) transitive
violation. Then, we collect the worker answers for each correspondence.
Figure 6.5 presents the results of this experiment. The worker answers of each case
are represented by a collection of ‘x’ and ‘o’ points in the plots: ‘o’ points indicate
correspondences that exist in the ground truth, whereas ‘x’ points indicate correspondences
that do not. For a specific point, the X-value and Y-value are the numbers of approvals
and disapprovals it received, respectively.

[Figure 6.5: Effects of contextual information. Panels: (a) all alternatives, (b) transitive closure, (c) transitive violation; each panel plots #Approvals (0–30) against #Disapprovals (0–30).]
In order to assess the matching accuracy, we borrow the precision metric from information
retrieval: the ratio of (true) correspondences existing in the ground truth
among all correspondences whose aggregated value is true. However, the ground truth is
not known in general. Therefore, we use an indirect metric, the error rate, to estimate the
matching quality. We expect that the lower the error rate, the higher the quality of the
matching results.
The following empirical results aim to validate this hypothesis. We conduct the
experiment with a population of 100 simulated workers whose reliability scores are
generated according to a normal distribution N(0.7, 0.04). Since the purpose of this
experiment is to study the relationship between error rate and matching accuracy only, we
do not consider spammers or worker dependency in the crowd. Figure 6.6 depicts the
relationship between the error rate and precision. We vary the error threshold ε from 0.05
to 0.3, meaning that questions are posted to workers until the error rate of the aggregated
value is less than the given threshold ε. The precision is plotted as a function of
ε. We aggregate the worker answers by two strategies: without constraints and with
constraints. Here we consider both the 1-1 constraint and the cycle constraint as hard
constraints, thus ∆ = 0.

[Figure 6.6: Precision as a function of the error rate (0.05–0.3), for aggregation with and without constraints.]
The key observation is that as the error rate decreases, the precision approaches 1.
Conversely, as the error rate increases, the precision is reduced but remains greater
than 1 − ε. Another interesting finding is that for decreasing error rates, the distribution
of precision values with and without constraints is quite similar. This
indicates that our method of updating the error rate is sound.
In summary, the error rate is a good indicator of the quality of the aggregated results.
Since the ground truth is hidden, our goal was to verify whether the error rate is a useful
proxy for matching quality. The results indicate that there is no significant difference
between the two metrics: in terms of precision, the quality value is always around 1 − ε.
In other words, the error threshold can be used to control the real matching quality.
In this set of experiments, we study the effects of constraints on the expected cost in real
datasets. In Section 6.4.2, we already saw the benefit of using constraints to reduce the
error rate. Therefore, given a requirement of low error, the constraints help to reduce
the number of questions (i.e., the expected cost) that need to be asked of the workers.
More precisely, given an error threshold (ε = 0.15, 0.1, 0.05), we iteratively post questions
to workers and aggregate the worker answers until the error rate is less than ε. Similar
to the above experiment, we use simulated workers with reliability r varying from 0.6
to 0.8 and we set ∆ = 0. For simplicity, we do not simulate spammers or worker
dependency in the worker population. The results are presented in Figure 6.7.
A significant observation is that for all values of the error threshold and
worker reliability, the expected cost of the aggregation with constraints is considerably
smaller (approximately half) than in the case without constraints. For example, with
worker reliability r = 0.6 and error threshold ε = 0.1, the expected number of questions
is reduced from 31 (without constraints) to 16 (with constraints). This confirms
that the constraints help to reduce the error rate and, subsequently, the expected
cost.

[Figure 6.7: Expected cost as a function of worker reliability (0.6–0.8), with and without constraints, for the three error thresholds.]
Another key finding in Figure 6.7 is that, in both cases (using vs. not using constraints
in the aggregation), the expected cost increases significantly as the error threshold
decreases. For example, it requires about 20 questions (without constraints)
or 10 questions (with constraints) to satisfy the error threshold ε = 0.15, whereas
it takes about 40 questions (without constraints) or 20 questions (with constraints) to
satisfy ε = 0.05. This supports the intuition that reducing the error rate requires
asking more questions.
In this experiment, we study worker dependency as described in Section 6.5.2. To do
so, we simulate 100 workers, 40 of whom are copiers (i.e., workers who copy answers
from others). A copier is simulated by randomly choosing one of the 60 independent
workers and copying all of his answers. Two workers are dependent if one copies from
the other (one independent and one copier) or if both copy from the same worker (two
copiers). There are 50 correspondences given to all workers for validation. We detect
dependence for each pair of workers and count the numbers of true-positive and false-positive
detections. We use the F-score as metric (F-score = (2 × precision × recall) /
(precision + recall)), in which precision is computed as the number of true-positive
detections over the total number of detections, and recall as the number of true-positive
detections over the total number of dependent worker pairs. The results are averaged
over 100 runs.
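The simulated population can be sketched as follows; the names are illustrative, and the answer model is the reliability model described above.

import random

def build_population(truth: dict, n_independent: int = 60,
                     n_copiers: int = 40, r: float = 0.7) -> dict:
    """truth: {correspondence: bool}. Returns {worker: {correspondence: answer}}."""
    def answer(t: bool) -> bool:  # correct with probability r
        return t if random.random() < r else not t
    independents = {f"w{i}": {q: answer(t) for q, t in truth.items()}
                    for i in range(n_independent)}
    # Each copier clones the full answer sheet of a random independent worker.
    copiers = {f"c{j}": dict(random.choice(list(independents.values())))
               for j in range(n_copiers)}
    return {**independents, **copiers}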
[Figure 6.8: F-score of dependency detection as a function of the reliability of the independent workers (0.5–0.9), with and without constraints.]
Figure 6.8 illustrates the results. The X-axis is the reliability of the 60 independent
workers (normally distributed with variance 0.1), varying from 0.5 to 0.9. The Y-axis
is the F-score. Two settings are studied: (i) NoConstraint, where the probability of
dependence is calculated without constraints, and (ii) Constraint, where the integrity
constraints are considered when computing the probability of dependence. A key observation
is that the presence of integrity constraints makes the dependency detection more
accurate, improving the F-score by roughly 37% to 55%. This is because the constraints
help to identify the incorrect answers that are copied across the answer set.
Another interesting finding is that although the detection accuracy goes up as
the worker reliability increases, it drops slightly once the worker reliability exceeds 0.8.
This is plausible for two reasons. First, the detection is
only useful if many incorrect answers are copied; when the worker reliability is low,
correct and incorrect answers are mixed, leading to misidentification between copied
and independent answers. Second, the detection becomes insignificant if there
are too many correct answers: when the worker reliability is higher than 0.8,
most answers (both independent and copied ones) are correct, and it is difficult
to identify which correct answers are copied. Moreover, in practice the
average reliability of crowd workers (not counting spammers) is around 0.75, as surveyed
in [VdVE11]. This finding suggests that our dependence detection
technique works well in practical applications, as the F-score is already high (nearly
0.85) at worker reliability 0.75.
6.7 Summary
We have presented a crowdsourcing platform that supports reconciling a schema
matching network. The platform takes the candidate correspondences that are generated
by schema matching tools and generates questions for crowd workers. The structure of
the matching network can be exploited in several ways. First, as it provides contextual
information about the particular matching problem, it can be used to generate questions
that guide the crowd workers and help them answer more accurately. Second,
natural constraints on the attribute correspondences at the level of the network
make it possible to reduce the necessary effort, as we demonstrated through our experiments.
While our focus was on reducing the effort in the post-matching phase, we should remember
that the schema matchers themselves reduce the necessary work: constructing all the
correspondences manually would require significant effort.
In our work we did not assume any particular formalism for the schemas. We have worked
with relational schemas and relational schema matchers, but one could use other
formalisms, e.g., ontologies (with their corresponding alignment tools). There is no
limitation with respect to the formalism to which our techniques apply; however, the type of
constraints we used for reducing the necessary effort, especially the cycle constraint, assumes
certain consistency conditions. It is thus most promising to aim for effort reduction in
settings where such consistency conditions are relevant.
Our work opens up several future research directions. First, one can extend our
notion of schema matching network to represent more general integrity
constraints (e.g., functional dependencies or domain-specific constraints). Second, one
can devise further applications that can be cast as schema matching
networks. While our work focuses on schema matching, our techniques, especially the
constraint-based aggregation method, can be applied to other tasks such as entity resolution,
business process matching, or Web service discovery.
Chapter 7
Conclusion
• Reconciliation with single expert. A single expert was employed to validate the correctness
of correspondences in a pay-as-you-go fashion. We aimed at minimizing
user effort for the reconciliation and at instantiating a single trusted set of correspondences
even when not all necessary user input has been collected. The empirical results
highlighted that the presented approach supports pay-as-you-go reconciliation: we
were able to guide user feedback precisely, observing improvements of up
to 48% over the baselines. We also demonstrated that the approach significantly improves
the quality of instantiated matchings in both precision and recall.
Through the above reconciliation settings, the notion of a schema matching network
has proven to be robust and beneficial in its own right. The presented findings highlighted
the ability of schema matching networks to capture semantic interoperability in a
uniform fashion, independent of the matching tools used and of the data integration tasks.
Schema matching networks also enable collaborative integration scenarios, as well as scenarios
where the monolithic mediated schema approach is too costly or simply infeasible. Moreover,
because of their natural representation, schema matching networks make it possible to formulate
and extend integrity constraints that relate to user expectations in both the literature and
practice.
7.2 Future Directions
On top of schema matching networks, we can develop a wide range of potential applications,
for instance in e-commerce, enterprise information integration, and information reuse.
To reduce the development complexity, it is desirable to bootstrap a management system
and supporting tools for all of these applications. To design and implement such a
management system, we envision the following key questions:
• Selecting and filtering data sources: Big data integration opens an opportunity
to construct a diverse schema matching network, as each source might contain a
wide range of data in different domains. The more data sources, the higher the diversity
of application domains. However, it is imprudent to integrate all data
sources, for three reasons: (i) integration cost: the expenses for purchasing,
cleaning, reformatting, and integrating data; (ii) source dependency: some
data sources may copy data from other sources and publish the copied data without
provenance; and (iii) irrelevant data: not all collected data is relevant and
consistent for particular application domains. As a result, the goal of this step
is to select and filter the data sources before integration, balancing the
diversity of the collected data against the integration cost. The data sources
should be prioritized according to their potential “benefit” for the integration. In
this way, the resulting schema matching network can be scaled up incrementally and
systematically.
• Ontology alignment. The rapid growth of the Web has brought about the field of the
Semantic Web, in which the contents of Web pages are enriched with semantic descriptions.
The enrichment is done by using ontologies to define the semantics and
concepts of the Web contents. An ontology can be viewed as a set of vocabularies
defining the conceptual model of some particular domain. The distinctive feature
of ontologies compared with other semantic models (e.g., database schemas, XML schemas)
is that their logical representation is independent of the underlying systems and the
relationships between concepts are specified explicitly. However, ontologies from
different Web pages are heterogeneous due to syntactic, terminological,
and conceptual differences between the sources. This motivates the
need for establishing semantic correspondences between the ontologies of the sources,
namely ontology alignment, to reduce the heterogeneity. The resulting network
of ontologies resembles the concept of a schema matching network, in which
the notion of schema is replaced by the ontology representation. This resemblance
enables us to apply the proposed techniques for modeling and reconciling
such an “ontology matching network”, and opens up further research directions
in the field.
Bibliography
[ACMO+ 04] Karl Aberer, Philippe Cudre-Mauroux, Aris M. Ouksel, Tiziana Catarci,
Mohand-Said Hacid, Arantza Illarramendi, Vipul Kashyap, Massimo Me-
cella, Eduardo Mena, Erich J. Neuhold, Olga De Troyer, Thomas Risse,
Monica Scannapieco, Felix Saltor, Luca De Santis, Stefano Spaccapietra,
Steffen Staab, and Rudi Studer. Emergent semantics principles and issues.
In DASFAA’04, pages 25–38, 2004. 4
[ADMR05a] David Aumueller, Hong-Hai Do, Sabine Massmann, and Erhard Rahm. Schema
and ontology matching with COMA++. In SIGMOD, pages 906–908,
2005. 20, 38
[ADMR05b] David Aumueller, Hong-Hai Do, Sabine Massmann, and Erhard Rahm.
Schema and ontology matching with COMA++. In SIGMOD, pages 906–
908, 2005. 17, 39, 55
[AGPS09] Liliana Ardissono, Anna Goy, Giovanna Petrone, and Marino Segnan.
From service clouds to user-centric personal clouds. In Cloud Computing,
2009. CLOUD’09. IEEE International Conference on, pages 1–8. IEEE,
2009. 4
[AHPT12] Bogdan Alexe, Mauricio Hernández, Lucian Popa, and Wang-Chiew Tan.
Mapmerge: correlating independent schema mappings. JVLDB, pages
191–211, 2012. 48
[ASS09] Alsayed Algergawy, Eike Schallehn, and Gunter Saake. Improving xml
schema matching performance using prüfer sequences. Data Knowl.
Eng., pages 728–747, 2009. 16
[BAMN10] Jamal Bentahar, Rafiul Alam, Zakaria Maamar, and Nanjangud C. Naren-
dra. Using argumentation to model and deploy agent-based b2b applica-
tions. KBS, pages 677–692, 2010. 29
[BDV03] Martin Brain and Marina De Vos. Implementing oclp as a front-end for
answer set solvers: From theory to practice. In Answer Set Programming,
2003. 24
[BE99] Gerhard Brewka and Thomas Eiter. Preferred answer sets for extended
logic programs. Artificial intelligence, 109(1):297–356, 1999. 24
[BEP+ 08] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie
Taylor. Freebase: a collaboratively created graph database for structuring
human knowledge. In SIGMOD, pages 1247–1250, 2008. 1
[BET11] Gerhard Brewka, Thomas Eiter, and Miroslaw Truszczyński. Answer set
programming at a glance. Communications of the ACM, 54(12):92–103,
2011. 90
[BGPR10] Philippe Besnard, Éric Grégoire, Cédric Piette, and Badran Raddaoui.
Mus-based generation of arguments and counter-arguments. In IRI, pages
239–244, 2010. 26
[BM02] Jacob Berlin and Amihai Motro. Database schema matching using ma-
chine learning with feature selection. In CAiSE, volume 2348, pages 452–
466. Springer, 2002. 5, 15
[BM07] Philip A. Bernstein and Sergey Melnik. Model management 2.0: manip-
ulating richer mappings. In SIGMOD, pages 1–12, 2007. 1
[BMR11a] P.A. Bernstein, J. Madhavan, and Erhard Rahm. Generic Schema Match-
ing, Ten Years Later. In VLDB, 2011. 2, 13, 19, 44
[BSZ03] Paolo Bouquet, Luciano Serafini, and Stefano Zanobini. Semantic coordi-
nation: a new approach and an application. In The Semantic Web-ISWC
2003, pages 130–145. Springer, 2003. 15
[CAM01] Luigia Carlucci Aiello and Fabio Massacci. Verifying security protocols
as planning in logic programming. ACM Transactions on Computational
Logic (TOCL), 2(4):542–580, 2001. 24
[CAS09] Isabel F. Cruz, Flavio Palandri Antonelli, and Cosmin Stroe. Agreement-
maker: efficient matching for large real-world schemas and ontologies.
Proc. VLDB Endow., pages 1586–1589, 2009. 17
[CHW+ 08] Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and
Yang Zhang. Webtables: exploring the power of tables on the web. Proc.
VLDB Endow., pages 538–549, 2008. 4
[CL98] Jason Cong and Sung Kyu Lim. Multiway partitioning with pairwise
movement. In Proceedings of the 1998 IEEE/ACM international confer-
ence on Computer-aided design, ICCAD ’98, pages 512–516, New York,
NY, USA, 1998. ACM. 52
[CT04] Stefania Costantini and Arianna Tocchio. The dali logic programming
agent-oriented language. In Logics in Artificial Intelligence, pages 685–
688. Springer, 2004. 24
[CWW12] Gunther Charwat, Johannes Peter Wallner, and Stefan Woltran. Uti-
lizing asp for generating and visualizing argumentation frameworks. In
ASPOCP, pages 51–65, 2012. 92, 93
[CWW13] Günther Charwat, Johannes Peter Wallner, and Stefan Woltran. Utiliz-
ing asp for generating and visualizing argumentation frameworks. arXiv
preprint arXiv:1301.1388, 2013. 26
[DBC11] Fabien Duchateau, Zohra Bellahsene, and Remi Coletta. Matching and
alignment: what is the cost of user post-match effort? In OTM, pages
421–428, 2011. 21
[DCBM09a] Fabien Duchateau, Remi Coletta, Zohra Bellahsene, and Renée J. Miller.
(not) yet another matcher. In CIKM, pages 1537–1540, 2009. 21
[DCBM09b] Fabien Duchateau, Remi Coletta, Zohra Bellahsene, and Renée J. Miller.
Yam: a schema matcher factory. In CIKM, pages 2079–2080, 2009. 21
[DCSW09] Umeshwar Dayal, Malu Castellanos, Alkis Simitsis, and Kevin Wilkinson.
Data integration flows for business intelligence. In Proceedings of the 12th
International Conference on Extending Database Technology: Advances
in Database Technology, pages 1–11. Acm, 2009. 4
[DDH01a] AnHai Doan, Pedro Domingos, and Alon Y Halevy. Reconciling schemas
of disparate data sources: A machine-learning approach. In ACM Sigmod
Record, volume 30, pages 509–520. ACM, 2001. 17
[DDH01b] AnHai Doan, Pedro Domingos, and Alon Y. Halevy. Reconciling schemas
of disparate data sources: a machine-learning approach. In SIGMOD,
pages 509–520, 2001. 40
[DFKK11] AnHai Doan, Michael J. Franklin, Donald Kossmann, and Tim Kraska.
Crowdsourcing applications and platforms: A data management perspec-
tive. PVLDB, 4(12):1508–1509, 2011. 36
[DLHPB09a] Giusy Di Lorenzo, Hakim Hacid, Hye-young Paik, and Boualem Bena-
tallah. Data integration in mashups. SIGMOD Rec., pages 59–66, 2009.
2
[DLHPB09b] Giusy Di Lorenzo, Hakim Hacid, Hye-young Paik, and Boualem Benatal-
lah. Data integration in mashups. SIGMOD, pages 59–66, 2009. 4
[DLK+ 08] Pedro Domingos, Daniel Lowd, Stanley Kok, Hoifung Poon, Matthew
Richardson, and Parag Singla. Just add weights: Markov logic for the
semantic web. In URSW, pages 1–25, 2008. 46
[DM04] Sylvie Doutre and Jérôme Mengin. On sceptical versus credulous accep-
tance for abstract argument systems. In JELIA, pages 462–473, 2004.
88
[DMD+ 03] AnHai Doan, Jayant Madhavan, Robin Dhamankar, Pedro Domingos,
and Alon Halevy. Learning to match ontologies on the semantic web. The
VLDB Journal: The International Journal on Very Large Data Bases,
12(4):303–319, 2003. 17
[DMDH02] AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy.
Learning to map between ontologies on the semantic web. In Proceedings
of the 11th international conference on World Wide Web, pages 662–673.
ACM, 2002. 15
[DR02a] Hong-Hai Do and Erhard Rahm. COMA: a system for flexible combination
of schema matching approaches. In VLDB, pages 610–621, 2002. 5, 14,
15, 17, 20, 55
[DR02b] Hong Hai Do and Erhard Rahm. COMA - A System for Flexible Combi-
nation of Schema Matching Approaches. In VLDB, pages 610–621, 2002.
17, 55
[dro] https://www.dropbox.com. 4
[DSFG+ 12] Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee,
Fei Wu, Reynold Xin, and Cong Yu. Finding related tables. In SIGMOD,
pages 817–828, 2012. 1
[Dun95] Phan Minh Dung. On the acceptability of arguments and its fundamental
role in nonmonotonic reasoning, logic programming and n-person games.
Artif. Intell., pages 321–358, 1995. 9, 26, 28, 29, 80, 81, 84, 86, 87
[DVV99] Marina De Vos and Dirk Vermeir. Choice logic programs and nash equi-
libria in strategic games. In Computer Science Logic, pages 266–276.
Springer, 1999. 24
[DVV03] Marina De Vos and Dirk Vermeir. Logic programming agents playing
games. In Research and Development in Intelligent Systems XIX, pages
323–336. Springer, 2003. 24
[EFL+ 01] Thomas Eiter, Wolfgang Faber, Nicola Leone, Gerald Pfeifer, and Axel
Polleres. System description: The dlvk planning system. In Logic Pro-
gramming and Nonmotonic Reasoning, pages 429–433. Springer, 2001. 24
[EFST01] Thomas Eiter, Michael Fink, Giuliana Sabbatini, and Hans Tompits. A
framework for declarative update specifications in logic programs. In
IJCAI, volume 1, pages 649–654. Citeseer, 2001. 24
[EH11] Vasiliki Efstathiou and Anthony Hunter. Algorithms for generating ar-
guments and counterarguments in propositional logic. Int. J. Approx.
Reasoning, 52(6):672–704, 2011. 26
[ES04] Marc Ehrig and Steffen Staab. Qom - quick ontology mapping. In ISWC,
pages 683–697, 2004. 17
[ESS05] Marc Ehrig, Steffen Staab, and York Sure. Bootstrapping ontology align-
ment methods with apfel. In ISWC, pages 186–200, 2005. 17
[eye] http://www.eyeos.com. 4
[fac] http://www.factual.com. 1
[Fan08] Wenfei Fan. Dependencies revisited for improving data quality. In SIG-
MOD, pages 159–170, 2008. 48
[FHM05] Michael Franklin, Alon Halevy, and David Maier. From databases to
dataspaces: a new abstraction for information management. SIGMOD
Rec., pages 27–33, 2005. 3, 6, 67
[FKK+ 11] Michael J Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and
Reynold Xin. Crowddb: answering queries with crowdsourcing. In Pro-
ceedings of the 2011 ACM SIGMOD International Conference on Man-
agement of data, pages 61–72. ACM, 2011. 36
[Gal06b] Avigdor Gal. Why is schema matching tough and what can we do about
it? SIGMOD Rec., 35:2–5, 2006. 5
[Gal11] Avigdor Gal. Uncertain Schema Matching. Morgan & Claypool, 2011. 40
[GCM12] Kathrin Grosse, Carlos Ivan Chesnevar, and Ana Gabriela Maguitman.
An argument-based approach to mining opinions from twitter. In AT,
pages 408–422, 2012. 29
[GHJ+ 10] Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, Anno Lan-
gen, Jayant Madhavan, Rebecca Shapley, Warren Shen, and Jonathan
Goldberg-Kidon. Google fusion tables: web-centered data management
and collaboration. In SIGMOD, pages 1061–1066, 2010. 1
[GHKR10] Anika Gross, Michael Hartung, Toralf Kirsten, and Erhard Rahm. On
matching large life science ontologies in parallel. In DILS, pages 35–49,
2010. 16
[GKK+ 08] Martin Gebser, Roland Kaminski, Benjamin Kaufmann, Max Ostrowski,
Torsten Schaub, and Sven Thiele. A user's guide to gringo, clasp, clingo,
and iclingo, 2008. 24
[GKS+ 13] Avigdor Gal, Michael Katz, Tomer Sagi, Karl Aberer, Zoltán Miklós,
Quoc Viet Hung Nguyen, Eliezer Levy, and Victor Shafran. Completeness
and ambiguity of schema cover. In CoopIS, 2013. 18, 49
[GL88] Michael Gelfond and Vladimir Lifschitz. The stable model semantics for
logic programming. In ICLP/SLP, pages 1070–1080. MIT Press, 1988.
22, 23
[GL91a] Michael Gelfond and Vladimir Lifschitz. Classical negation in logic pro-
grams and disjunctive databases. New generation computing, 9(3-4):365–
385, 1991. 21
[GL91b] Michael Gelfond and Vladimir Lifschitz. Classical negation in logic pro-
grams and disjunctive databases. Journal of New Generation Computing,
9(3/4):365–386, 1991. 22
[GM86] Fred Glover and Claude McMillan. The general employee scheduling prob-
lem: an integration of MS and AI. COR, pages 563–573, 1986. 70
[GMM03] Paolo Giorgini, Fabio Massacci, and John Mylopoulos. Requirement en-
gineering meets security: A case study on modelling secure electronic
transactions by visa and mastercard. In Conceptual Modeling-ER 2003,
pages 263–276. Springer, 2003. 24
[GSW+ 12] A. Gal, T. Sagi, M. Weidlich, E. Levy, V. Shafran, Z. Miklós, and N.Q.V.
Hung. Making sense of top-k matchings: A unified match graph for
schema matching. In IIWeb, 2012. 40
[Haa06] Laura Haas. Beauty and the beast: the theory and practice of information
integration. In ICDT, pages 28–43, 2006. 1
[HAB+ 05] Alon Y. Halevy, Naveen Ashish, Dina Bitton, Michael Carey, Denise
Draper, Jeff Pollock, Arnon Rosenthal, and Vishal Sikka. Enterprise infor-
mation integration: successes, challenges and controversies. In SIGMOD,
pages 778–787, 2005. 1
[HC03] Bin He and Kevin Chen-Chuan Chang. Statistical schema matching across
web query interfaces. In Proceedings of the 2003 ACM SIGMOD interna-
tional conference on Management of data, SIGMOD ’03, pages 217–228,
2003. 16
[HdlPR+ 12] Stella Heras, Fernando de la Prieta, Sara Rodriguez, Javier Bajo, Vi-
cente J. Botti, and Vicente Julien. The role of argumentation on the
future internet: Reaching agreements on clouds. In AT, pages 393–407,
2012. 29
[Hel99] Keijo Heljanko. Using logic programs with stable model semantics to solve
deadlock and reachability problems for 1-safe petri nets. Fundamenta
Informaticae, 37(3):247–268, 1999. 24
[HFM06] Alon Halevy, Michael Franklin, and David Maier. Principles of datas-
pace systems. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-
SIGART symposium on Principles of database systems, pages 1–9. ACM,
2006. 3
[HMH01] Mauricio A Hernández, Renée J Miller, and Laura M Haas. Clio: A semi-
automatic tool for schema mapping. In ACM SIGMOD Record, volume 30,
page 607. ACM, 2001. 21
[HMN00] Maarit Hietalahti, Fabio Massacci, and Ilkka Niemela. Des: a challenge
problem for nonmonotonic reasoning systems. arXiv preprint cs/0003039,
2000. 24
[HMYW03] Hai He, Weiyi Meng, Clement Yu, and Zonghuan Wu. Wise-integrator:
An automatic integrator of web search interfaces for e-commerce. In Pro-
ceedings of the 29th international conference on Very large data bases-
Volume 29, pages 357–368. VLDB Endowment, 2003. 16
[How06] Jeff Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.
30
[HQC08] Wei Hu, Yuzhong Qu, and Gong Cheng. Matching large ontologies: A
divide-and-conquer approach. Data Knowl. Eng., pages 140–160, 2008.
16
[HSB10] Vaughn Hester, Aaron Shaw, and Lukas Biewald. Scalable crisis relief:
Crowdsourced sms translation and categorization with mission 4636. In
Proceedings of the First ACM Symposium on Computing for Development,
ACM DEV ’10, pages 15:1–15:7, New York, NY, USA, 2010. ACM. 31
[HT73] John Hopcroft and Robert Tarjan. Algorithm 447: efficient algorithms
for graph manipulation. Communications of the ACM, pages 372–378,
1973. 49
[HV03] Stijn Heymans and Dirk Vermeir. Integrating description logics and an-
swer set programming. In PPSWR, pages 146–159. Springer, 2003. 24
[ICL+ 03] Giovambattista Ianni, Francesco Calimeri, Vincenzino Lio, Stefania Gal-
izia, and Agata Bonfa. Reasoning about the semantic web using answer
set programming. In APPIA-GULP-PRODE, pages 324–336, 2003. 24
[IPW10] Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality manage-
ment on amazon mechanical turk. In Proceedings of the ACM SIGKDD
Workshop on Human Computation, HCOMP ’10, pages 64–67, New York,
NY, USA, 2010. ACM. 10, 32, 34, 101
[KKMF11] Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. Worker types
and personality traits in crowdsourcing relevance labels. In CIKM, 2011.
32
[KN05] Misa Keinänen and Ilkka Niemelä. Solving alternating boolean equation
systems in answer set programming. In Applications of Declarative Pro-
gramming and Knowledge Management, pages 134–148. Springer, 2005.
24
[KOS11] DR Karger, S Oh, and D Shah. Iterative learning for reliable crowdsourc-
ing systems. In NIPS, 2011. 35
[KSS97] Henry Kautz, Bart Selman, and Mehul Shah. Referral web: combining
social networks and collaborative filtering. Communications of the ACM,
40(3):63–65, 1997. 36
[LCW10] Kyumin Lee, James Caverlee, and Steve Webb. The social honeypot
project: protecting online communities from spammers. In WWW, 2010.
34
[Lif02] Vladimir Lifschitz. Answer set programming and plan generation. Artif.
Intell., 138(1-2):39–54, 2002. 24
[LPF+ 06] Nicola Leone, Gerald Pfeifer, Wolfgang Faber, Thomas Eiter, Georg Gott-
lob, Simona Perri, and Francesco Scarcello. The dlv system for knowledge
representation and reasoning. ACM Trans. Comput. Logic, 7:499–562,
July 2006. 24
[LRS01] Nicola Leone, Riccardo Rosati, and Francesco Scarcello. Enhancing an-
swer set planning. In IJCAI-01 Workshop on Planning under Uncertainty
and Incomplete Information, pages 33–42, 2001. 24
[LSDR07] Yoonkyong Lee, Mayssam Sayyadian, AnHai Doan, and Arnon S. Rosen-
thal. eTuner: tuning schema matching software using synthetic scenarios.
JVLDB, pages 97–122, 2007. 17
[LTLL09] Juanzi Li, Jie Tang, Yi Li, and Qiong Luo. Rimom: A dynamic multi-
strategy ontology alignment framework. Knowledge and Data Engineer-
ing, IEEE Transactions on, pages 1218–1232, 2009. 17
[LTT99] Vladimir Lifschitz, Lappoon R Tang, and Hudson Turner. Nested expres-
sions in logic programs. Annals of Mathematics and Artificial Intelligence,
25(3-4):369–389, 1999. 23
[LY12] Matthew Lease and Emine Yilmaz. Crowdsourcing for information re-
trieval. SIGIR Forum, 45(2):66–75, January 2012. 36
[LYHY02] Mong Li Lee, Liang Huai Yang, Wynne Hsu, and Xia Yang. Xclust:
Clustering xml schemas for effective integration. In CIKM, pages 292–
299, 2002. 52
[MAL+ 05] Robert McCann, Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and
AnHai Doan. Mapping maintenance for data integration systems. In
Proceedings of the 31st international conference on Very large data bases,
pages 1018–1029. VLDB Endowment, 2005. 4
[MBDH05] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Halevy.
Corpus-based schema matching. In ICDE, pages 57–68, 2005. 16
[MBR01] Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. Generic schema
matching with cupid. In VLDB, volume 1, pages 49–58, 2001. 5, 14, 15,
21
[MRA+ 11] Sabine Massmann, Salvatore Raunich, David Aumüller, Patrick Arnold,
and Erhard Rahm. Evolution of the coma match system. Ontology Match-
ing, page 49, 2011. 20
[MSD08] Robert McCann, Warren Shen, and AnHai Doan. Matching Schemas in
Online Communities: A Web 2.0 Approach. In ICDE, pages 110–119,
2008. 18, 32, 97, 100
[MSK97] David McAllester, Bart Selman, and Henry Kautz. Evidence for invariants
in local search. In AAAI, pages 321–326, 1997. 46
[MWJ99] Prasenjit Mitra, Gio Wiederhold, and Jan Jannink. Semi-automatic inte-
gration of knowledge sources. Proceedings of Fusion’99, July 1999, 1999.
15
[NB12] Duy Hoa Ngo and Zohra Bellahsene. YAM++ : (not) Yet Another
Matcher for Ontology Matching Task. In BDA, 2012. 21
[NFP+ 11] Hoa Nguyen, Ariel Fuxman, Stelios Paparizos, Juliana Freire, and Rakesh
Agrawal. Synthesizing products for online catalogs. PVLDB, pages 409–
418, 2011. 4
[Nie99] Ilkka Niemelä. Logic programs with stable model semantics as a con-
straint programming paradigm. Annals of Mathematics and Artificial
Intelligence, 25(3-4):241–273, 1999. 24
[NLM+ 13] Quoc Viet Hung Nguyen, Xuan Hoai Luong, Zoltán Miklós, Tho Quan
Thanh, and Karl Aberer. An mas negotiation support tool for schema
matching (demonstration). In AAMAS, pages 1391–1392, 2013. 10, 80,
118
[NLMA13] Quoc Viethung Nguyen, Hoai Xuan Luong, Zoltán Miklós, and Karl
Aberer. Collaborative schema matching reconciliation. In CoopIS, 2013.
18
[NNMA13] Quoc Viet Hung Nguyen, Thanh Tam Nguyen, Zoltán Miklós, and Karl
Aberer. On leveraging crowdsourcing techniques for schema matching
networks. In DASFAA, pages 139–154, 2013. 18
[NNTLNA13] Quoc Viet Hung Nguyen, Tam Nguyen Thanh, Tran Lam Ngoc, and Karl
Aberer. An Evaluation of Aggregation Techniques in Crowdsourcing. In
WISE, 2013. 10, 107
[NS97] Ilkka Niemelä and Patrik Simons. Smodels: an implementation of the sta-
ble model and well-founded semantics for normal logic programs. In Logic
Programming and Nonmonotonic Reasoning, pages 420–429. Springer,
1997. 24
[PBR10] Eric Peukert, Henrike Berthold, and Erhard Rahm. Rewrite techniques
for performance optimization of schema matching processes. In EDBT,
pages 453–464, 2010. 17
[Pra12] Henry Prakken. Some reflections on two current trends in formal argu-
mentation. In Logic Programs, Norms and Action, pages 249–272, 2012.
25
[PSGM+ 11] Aditya Parameswaran, Anish Das Sarma, Hector Garcia-Molina, Neoklis
Polyzotis, and Jennifer Widom. Human-assisted graph search: it’s okay
to ask questions. Proceedings of the VLDB Endowment, 4(5):267–278,
2011. 36
[PVH+ 02] Lucian Popa, Yannis Velegrakis, Mauricio A Hernández, Renée J Miller,
and Ronald Fagin. Translating web data. In Proceedings of the 28th
international conference on Very Large Data Bases, pages 598–609. VLDB
Endowment, 2002. 19
[QCS07] Yan Qi, K. Selçuk Candan, and Maria Luisa Sapino. Ficsr: feedback-
based inconsistency resolution and query processing on misaligned data
sources. In SIGMOD, pages 151–162, 2007. 83
[Rah07] Andreas Thor, David Aumueller, and Erhard Rahm. Data integration support
for mashups. 2007. 4
[Rei80] Ray Reiter. A logic for default reasoning. Artificial Intelligence, 13(1-
2):81–132, 1980. 22
[RG06] Haggai Roitman and Avigdor Gal. Ontobuilder: fully automatic extrac-
tion and consolidation of ontologies from web sources using sequence se-
mantics. In EDBT, pages 573–576, 2006. 21, 39, 55
[RIS+ 10] Joel Ross, Lilly Irani, M. Six Silberman, Andrew Zaldivar, and Bill Tom-
linson. Who are the crowdworkers?: shifting demographics in mechanical
turk. In CHI, pages 2863–2872, 2010. 32
[RNC+ 95] Stuart Jonathan Russell, Peter Norvig, John F Canny, Jitendra M Ma-
lik, and Douglas D Edwards. Artificial intelligence: a modern approach.
Prentice hall Englewood Cliffs, 1995. 63
[Rob86] John Michael Robson. Algorithms for maximum independent sets. Jour-
nal of Algorithms, 7(3):425–440, 1986. 110
[RZR07] Iyad Rahwan, Fouad Zablith, and Chris Reed. Towards large scale ar-
gumentation support on the semantic web. In AAAI, pages 1446–1451,
2007. 29
[SBH08] Khalid Saleem, Zohra Bellahsene, and Ela Hunt. Porsche: Performance
oriented schema mediation. Inf. Syst., pages 637–657, 2008. 15
[SDH08] Anish Das Sarma, Xin Dong, and Alon Y. Halevy. Bootstrapping pay-
as-you-go data integration systems. In SIGMOD, pages 861–874, 2008.
18
[SE81] Yossi Shiloach and Shimon Even. An on-line edge-deletion problem. Jour-
nal of the ACM (JACM), pages 1–4, 1981. 49
[SMH+ 10] Len Seligman, Peter Mork, Alon Halevy, Ken Smith, Michael J. Carey,
Kuang Chen, Chris Wolf, Jayant Madhavan, Akshay Kannan, and Doug
Burdick. Openii: an open source information integration toolkit. In
SIGMOD, pages 1057–1060, 2010. 17, 20
[SMM+ 09a] Ken Smith, Michael Morse, Peter Mork, Maya Li, Arnon Rosenthal, David
Allen, Len Seligman, and Chris Wolf. The role of schema matching in large
enterprises. In CIDR, 2009. 3, 52
[SMM+ 09b] Kenneth P. Smith, Michael Morse, Peter Mork, Maya Li, Arnon Rosen-
thal, David Allen, Len Seligman, and Chris Wolf. The role of schema
matching in large enterprises. In CIDR, 2009. 2
[SMM+ 09c] Kenneth P. Smith, Michael Morse, Peter Mork, Maya Hao Li, Arnon
Rosenthal, M. David Allen, and Len Seligman. The role of schema match-
ing in large enterprises. In CIDR, 2009. 21
[SN98] Timo Soininen and Ilkka Niemelä. Developing a declarative rule language
for applications in product configuration. In practical aspects of declara-
tive languages, pages 305–319. Springer, 1998. 24
[SP08] Victor S Sheng and Foster Provost. Get Another Label? Improving Data
Quality and Data Mining Using Multiple, Noisy Labelers. In KDD, 2008.
101
[SSC10a] Barna Saha, Ioana Stanoi, and Kenneth L. Clarkson. Schema covering:
a step towards enabling reuse in information integration. In Proceedings
of the 26th International Conference on Data Engineering, ICDE 2010,
pages 285–296, 2010. 15, 18, 52, 54
[SSC10b] Barna Saha, Ioana Stanoi, and Kenneth L Clarkson. Schema covering: a
step towards enabling reuse in information integration. In ICDE, pages
285–296, 2010. 48
[SSC10c] Barna Saha, Ioana Stanoi, and Kenneth L. Clarkson. Schema covering: a
step towards enabling reuse in information integration. In ICDE, pages
285–296, 2010. 52, 53
[ST97] Horst D. Simon and Shang-Hua Teng. How good is recursive bisection?
SIAM J. Sci. Comput., 18(5):1436–1445, September 1997. 52
[Sur05] James Surowiecki. The wisdom of crowds. Random House Digital, Inc.,
2005. 36
[SWL06] Weifeng Su, Jiying Wang, and Frederick Lochovsky. Holistic schema
matching for web query interfaces. In EDBT, pages 77–94, 2006. 16
[ubu] https://one.ubuntu.com. 4
[VAD04] Luis Von Ahn and Laura Dabbish. Labeling images with a computer
game. In Proceedings of the SIGCHI conference on Human factors in
computing systems, pages 319–326. ACM, 2004. 36
[VdVE11] J. Vuurens, A.P. de Vries, and C. Eickhoff. How much spam can you
take? an analysis of crowdsourcing results to increase accuracy. In CIR,
2011. 32, 106, 114
[WES04] Wei Wei, Jordan Erenrich, and Bart Selman. Towards efficient sampling:
exploiting random walk strategies. In AAAI, pages 670–676, 2004. 46
[WRW09] Jacob Whitehill, Paul Ruvolo, and Tingfan Wu. Whose vote should count
more: Optimal integration of labels from labelers of unknown expertise.
In NIPS, 2009. 35
[YAA08] Jiang Yang, Lada A. Adamic, and Mark S. Ackerman. Crowdsourcing and
knowledge sharing: strategic user behavior on taskcn. In Proceedings of
the 9th ACM conference on Electronic commerce, EC ’08, pages 246–255,
New York, NY, USA, 2008. ACM. 31
[YEN+ 11] Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouz-
zani, and Ihab F. Ilyas. Guided data repair. In VLDB, pages 279–289,
2011. 18
[YK10] Tingxin Yan and Vikas Kumar. CrowdSearch: exploiting crowds for accu-
rate real-time image search on mobile phones. 8th international conference
on Mobile, pages 77–90, 2010. 97, 101
[YKG10] Tingxin Yan, Vikas Kumar, and Deepak Ganesan. Crowdsearch: ex-
ploiting crowds for accurate real-time image search on mobile phones. In
Proceedings of the 8th international conference on Mobile systems, appli-
cations, and services, pages 77–90. ACM, 2010. 36
[ZLL+ 09] Qian Zhong, Hanyu Li, Juanzi Li, Guotong Xie, Jie Tang, Lizhu Zhou,
and Yue Pan. A gauss function based approach for unbalanced ontology
matching. In SIGMOD, pages 669–680, 2009. 16
Nguyen Quoc Viet Hung
EPFL IC LSIR, BC 142, Station 14
1015 Lausanne, Switzerland
Phone: +41 (21) 693 7573
Email: [email protected]
Web: people.epfl.ch/quocviethung.nguyen
Ph.D., EPFL, Switzerland
Research Interests
Data Integration, Spatio-temporal Data Management, Spatio-temporal Data Mining,
Meta-heuristics, and Constraint Programming
Education
2010–now Ph.D., Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
2008–2010 Master of Computer Science, Ecole Polytechnique Fédérale de Lausanne (EPFL),
Switzerland (Specialization: Internet Computing).
2000–2005 Bachelor of Computer Science and Engineering, HCMUT, Vietnam.
Work Experience
2008 Lecturer, Faculty of Computer Science and Engineering (CSE), HCMUT, Vietnam.
2005-2007 Assistant lecturer, CSE, HCMUT, Vietnam.
Honors
2013 Best student paper award - DASFAA 2013
2010 PhD fellowship at EPFL
2008 & 2009 "Excellence Scholarship" for master study at EPFL
2000 First rank in the entrance examination to HCMC University of Technology
(perfect score of 30/30 among more than 100,000 examinees)
Research Projects
2010-2013 FP7 NisB - The Network is the Business
2010-2014 FP7 PlanetData - A European Network of Excellence on Large-scale Data Man-
agement
2011-2015 FP7 EINS - Network of Excellence on Internet Science
Publications
1. Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Zoltan Miklos, Karl Aberer,
Avigdor Gal and Matthias Weidlich, Pay-as-you-go Reconciliation in Schema
Matching Networks. In ICDE 2014.
2. Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, Karl Aberer, An
Evaluation of Aggregation Techniques in Crowdsourcing, In WISE 2013.
3. Nguyen Quoc Viet Hung, Xuan Hoai Luong, Zoltan Miklos, Tho Quan Thanh,
Karl Aberer, Collaborative Schema Matching Reconciliation, In CoopIS 2013.
4. Avigdor Gal, Michael Katz, Tomer Sagi, Karl Aberer, Zoltan Miklos, Nguyen
Quoc Viet Hung, Eliezer Levy, Victor Shafran. Completeness and Ambiguity
of Schema Cover , In CoopIS 2013.
5. Nguyen Quoc Viet Hung, Tri Kurniawan Wijaya, Zoltan Miklos, Karl Aberer,
Eliezer Levy, Victor Shafran, Avigdor Gal and Matthias Weidlich, Minimizing
Human Effort in Reconciling Match Networks, In ER 2013.
6. Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, Karl Aberer, A
Benchmark for Aggregation Techniques in Crowdsourcing, In SIGIR 2013.
7. Nguyen Quoc Viet Hung, Xuan Hoai Luong, Zoltan Miklos, Tho Quan Thanh,
Karl Aberer, An MAS Negotiation Support Tool for Schema Matching, In
AAMAS 2013.
8. Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Zoltan Miklos, Karl Aberer, On
Leveraging Crowdsourcing Techniques for Schema Matching Networks ,
In DASFAA 2013 (Best Student Paper Award)
9. Avigdor Gal, Tomer Sagi, Matthias Weidlich, Eliezer Levy, Victor Shafran, Zoltan
Miklos, Nguyen Quoc Viet Hung, Making Sense of Top-K Matchings. A Unified
Match Graph for Schema Matching , In IIWeb 2012.
10. Nguyen Quoc Viet Hung, Hoyoung Jeung, Karl Aberer, An Evaluation of Model-
Based Approaches to Sensor Data Compression, In TKDE 2013.