A Framework For Content Based Semantic Information Extraction From Multimedia Contents


COMPUTER SCIENCE FACULTY
COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE DEPARTMENT

A FRAMEWORK FOR CONTENT BASED SEMANTIC INFORMATION EXTRACTION FROM MULTIMEDIA CONTENTS

A thesis submitted in fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

by:
Igor García Olaizola

Supervised by:
Prof. Basilio Sierra Araujo
&
Dr. Julián Flórez Esnal


The doctoral candidate

The supervisor

The supervisor

Donostia-San Sebastián, Wednesday 11th September, 2013

A Framework for Content Based Semantic Information Extraction from Multimedia Contents

Author: Igor García Olaizola
Advisor: Basilio Sierra Araujo
Advisor: Julián Flórez Esnal

SVN Version Control Data:

Date: 2013-09-11 01:13:04 +0200 (Wed, 11 Sep 2013)
Author: iolaizola
Revision: 80M

DRAFT 1.7

The following web page contains up-to-date information about this dissertation and related topics:

http://www.vicomtech.org/

Text printed in Donostia-San Sebastián
First edition, September 2013

For you, Dad.

Abstract

One of the main characteristics of the new digital era is the media
big bang, where images (still images or moving pictures) are one of
the main types of data. Moreover, this is an increasing trend, mainly
pushed by the ease of capture offered by the new mobile devices that
include one or more cameras.
From a professional perspective, most content-related sectors face
two main problems in operating efficient content management systems:
a) the need for new technologies to store, process and retrieve huge
and continuously increasing datasets, and b) the lack of effective
methods for the automatic analysis and characterization of
unannotated media.
More specifically, the audiovisual and broadcasting sector, which is
experiencing a radical transformation towards a fully Internet-convergent
ecosystem, requires content-based search and retrieval systems to
browse huge distributed datasets and include content from different
and heterogeneous sources.
On the other hand, Earth observation technologies are improving the
quantity and quality of the sensors installed in new satellites. This
fact implies a much higher input data flow that must be stored and
processed.
In general terms, the aforementioned sectors and many other media-related
activities are good examples of the Big Data phenomenon, where one of
the main problems is the semantic gap: the inability to transform
mathematical descriptors obtained by image processing algorithms into
concepts that humans can naturally understand.
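To make the descriptor side of this gap concrete, here is an illustrative sketch of my own (not part of the dissertation): a global color histogram is a typical mathematical descriptor, a vector of numbers that by itself says nothing about whether the image shows a beach or a mountain.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Global color descriptor: an 8x8x8 RGB histogram.

    `image` is an (H, W, 3) uint8 array. The result is a flat,
    L1-normalized vector of 512 numbers -- a purely mathematical
    description with no semantic label attached to it.
    """
    hist, _ = np.histogramdd(
        image.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.flatten()
    return hist / hist.sum()

# Any image yields a descriptor; bridging the semantic gap means
# mapping such vectors to human concepts, which the numbers alone
# cannot do.
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
vec = color_histogram(img)
```

Closing the semantic gap then amounts to learning a reliable mapping from vectors like `vec` to human-level concepts.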
This dissertation presents an overview of applied research activity
across different R&D projects related to computer vision and multimedia
content management. One of the main outcomes of these research
activities is the Mandragora framework, whose main goal is to reduce
the semantic gap and create automatic annotations based on a previously
defined ontology.
As one of the main problems of the Mandragora framework is the initial
characterization process, where no prior knowledge is available, a
novel domain characterization method (DITEC) has been designed to meet
this requirement. The good results obtained in the experimental tests
have led the research to analyze whether the feature extraction method,
based on the trace transform, can also be adapted for use as a local
descriptor. The local DITEC method is also presented in this
dissertation and, even if its implementation is still at a preliminary
stage, experimental results show that its performance is truly
competitive when compared with the most popular local descriptors
available in the literature.
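Since the trace transform is central to DITEC, a minimal sketch may help fix ideas. The code below is my own toy illustration (the function names and parameters are mine, and a real implementation would use proper line sampling rather than whole-image rotation): the image is scanned along all lines of each orientation and a functional, here a plain sum, is applied along every line.

```python
import numpy as np

def _rotate_nn(image, angle):
    """Nearest-neighbour rotation about the image centre (zero padding)."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = np.cos(angle), np.sin(angle)
    # Inverse mapping: for each output pixel, locate its source pixel.
    src_x = c * (xx - cx) + s * (yy - cy) + cx
    src_y = -s * (xx - cx) + c * (yy - cy) + cy
    sx = np.round(src_x).astype(int)
    sy = np.round(src_y).astype(int)
    inside = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(image)
    out[inside] = image[sy[inside], sx[inside]]
    return out

def trace_transform(image, n_angles=36, functional=np.sum):
    """Toy trace transform of a 2-D grayscale image.

    For each orientation, the image is rotated and `functional` is
    applied along every column, i.e. along all parallel lines of that
    orientation. Rows of the result index angles; columns index the
    discretized normal distance of each line.
    """
    rows = []
    for k in range(n_angles):
        theta = np.pi * k / n_angles
        rotated = _rotate_nn(image, theta)
        rows.append(functional(rotated, axis=0))  # one value per line
    return np.array(rows)  # shape: (n_angles, image_width)

img = np.zeros((64, 64))
img[24:40, 24:40] = 1.0  # a bright 16x16 square
T = trace_transform(img)  # shape (36, 64): angles x line positions
```

Swapping `functional` (for instance `np.max` or a median) produces different trace matrices, which is one of the degrees of freedom that trace-transform-based methods exploit.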

Resumen

One of the main characteristics of the new digital era is the great
explosion produced around multimedia content, where images (both still
and moving) are the main type of data. Moreover, this trend keeps
growing, mainly due to the ease of capture offered by mobile devices
that include one or more cameras.
Regarding the different professional sectors related to digital media,
this dramatic growth of data is causing two main problems. On the one
hand, new technologies are required to store, process and retrieve
content effectively in huge and ever-growing datasets. On the other
hand, automatic methods for the analysis and characterization of
previously unannotated content are also needed.
More specifically, we can highlight the audiovisual sector, which is
undergoing a profound transformation driven mainly by the process of
convergence with the Internet. In this situation, content search and
retrieval systems that allow browsing massive datasets of increasingly
distributed and heterogeneous origin are becoming ever more necessary.
In the case of Earth observation, data acquisition systems are becoming
more numerous and more precise, a fact that generates ever larger
content flows that must be continuously processed and stored.
In general, the sectors mentioned above and other activities related to
multimedia content are clear examples of the Big Data phenomenon, where
one of the main problems consists in closing the semantic gap. We can
define the semantic gap as the so far unbridged difference between the
mathematical concepts derived from the different image processing
techniques and the concepts that humans use to describe those same
contents.
This thesis report presents a review of the applied research activity
carried out through several projects related to computer vision and
multimedia content management. One of the main results of this research
activity has been the Mandragora model, an architecture design whose
goal is to minimize the semantic gap and create automatic annotations
based on a previously defined ontology.
Since one of the main problems faced by the implementation of
Mandragora is that the lack of prior knowledge about the content limits
the initial analysis, we have proposed a new method (DITEC) for the
semantic characterization of images. The good results obtained in the
experimental tests have led to an adaptation of the original method,
based on a global descriptor, so that a variant of that global
descriptor can work effectively as a local descriptor. This document
also describes the local DITEC variant, whose experimental results
(even with an implementation still under development) have shown highly
competitive behavior when compared with the most popular local
descriptors in the scientific literature.

Laburpena

One of the main characteristics of the new digital era is the big bang,
or unrestrained proliferation, of media content, with images (both
photographs and videos) being the main content type. This trend keeps
growing, driven above all by the ease of capture offered by mobile
devices carrying one or two cameras.
From a professional point of view, content-related sectors face two
main problems when it comes to managing content efficiently. On the one
hand, new technologies are needed to store, process and retrieve these
enormous amounts of data. On the other hand, effective methods for the
automatic analysis and characterization of previously unannotated
content are still to be developed.
More specifically, the audiovisual and broadcasting sector, immersed in
the profound transformation brought about by its ongoing convergence
with the Internet, is managing larger volumes of content than ever
before. Moreover, contents show an increasingly diverse origin and
heterogeneous nature, sharply reducing the effectiveness of systems
built around the centralized, monolithic models of the past. For this
reason, the development of flexible systems capable of analyzing the
content itself in order to perform effective searches over these data
is absolutely necessary.
On the other hand, Earth observation technologies employ ever more
numerous and more accurate instrumentation aboard new-generation
satellites. As a result, observation systems emit ever larger data
flows, and the need to store and process all that content is
increasingly hard to meet.
In general, the sectors mentioned above and many other activities
dealing with multimedia content are clear examples of the Big Data
phenomenon, in which the semantic gap, that is, the inability to turn
the mathematical features extracted by image processing algorithms into
concepts understandable to humans, has become one of the main problems.
In this thesis document we present the overall results obtained in
several applied research projects related to computer vision. One of
the main results is the Mandragora architecture, whose main goal is to
create an automatic image annotation system based on an ontology in
order to reduce the semantic gap.
Since one of Mandragora's main problems is having to perform the first
processing blindly, without any initial knowledge, we present a new
method for characterizing the semantic domain of images, called DITEC.
In view of the good results obtained in the experimental tests, we have
also adapted the global descriptor at the core of the DITEC method for
local use, so the local DITEC feature is likewise described in this
document. Although the implementation of the method is still under
development, the experimental results obtained have been very good when
compared with the best-known local descriptors in the scientific
literature.

Acknowledgements

This is not the story of a self-made man. Instead, all the achievements
presented in this work have a long chain behind them, a chain composed
of people who have supported my entire professional career, something
that cannot be separated from personal experiences. At this point, it
is worth acknowledging all these people.
In this sense, both my supervisors, Basilio Sierra and Julián Flórez,
have been an essential part of this work, with an unconditional
commitment and highly valuable scientific guidance. Without a doubt,
Julián, it can be said that I took the first steps of my professional
path with you. The trust and support I have felt from you since the
beginning have not diminished over a whole decade, and that is saying
something. What I have learned from you in all this time is one of the
main foundations of my daily work, and it is reflected in this work as
well. On the other hand, when it came to choosing the right advisor at
the university, I was extremely lucky to meet Basi. From the start he
has known how to adapt to the interruptions and lack of continuity that
my professional situation imposes. I would say he has provided
outstanding advice and scientific direction, and in a friendly and
enthusiastic manner besides, turning what could have been a hard and
heavy process into work done with pleasure.
Within Vicomtech, the environment where most of my professional
activity has taken place and where this thesis is framed, I have had
countless supporters. I will surely leave someone unmentioned (my
apologies from here), but I still want to cite a few, such as Jorge
Posada, Deputy Director, who has supported me at every moment with
encouragement and practical advice that comes in very handy when one
focuses too much on one's own problem. Amalia and David, companions in
hardship who showed me that it is indeed possible to complete a
doctoral thesis while working at a technology centre. Shabs, that
interesting man who always shows me that things may have a non-obvious
point of view worth observing. Edurne, the colleague and friend who,
making difficult things easy, smooths the way for me again and again.
Of course, the Digital TV and Multimedia Services department, of which
I am part and from which I have felt enormous support throughout the
whole process, deserves a special mention. I truly hope I can
reciprocate in the same proportion.
This work has been carried out in strong collaboration with some
colleagues who deserve a special mention. Marco Quartulli, a
Renaissance scientist with whom I play and learn every day. Naiara
Aginako, a travel companion since the day we started working with
images, as rigorous at work as she is pleasant in person; I am looking
forward to your turn. Gorka, whose thesis marked the starting point of
this work. How many interesting discussions that led to good ideas
or... to more discussions :-) I hope to keep enjoying your counterpoint.
And of course Iñigo Barandiaran, a meticulous, creative researcher with
a taste for work well done who remains an example for me. Of the
countless hours we have spent together in this process, there has not
been a minute I have not enjoyed. It makes me happy to know that our
work still requires improvements and further research, because that
will be the way for us to keep collaborating. In the end it looks like
we are going to have that beer!
Looking back a little, I cannot forget other old friends who, although
in a more distant way, have been fundamental for this work to be
completed. Haritz, mein Bruder, so many hours together since we started
our degree, so many projects and discussions, always ready to help each
other; then a year-long adventure together around Wichernstrasse.
Whatever kind of engineer I am, you certainly had a say in it. Jaizki,
old friend from our degree years and, during our German adventure, the
third vertex of the Zarautz triangle. As before, you have helped me no
small amount this time as well. No sooner said than done: as soon as I
asked, you corrected my article, and with great precision too. Offering
help with words is easy; you have shown it to me with deeds.

I am one of those who believe that, in order to work flat out
professionally, achieving balance in one's personal life is absolutely
necessary. That sharing my life with you makes facing each day every
morning a wonderful thing is something I owe to you, Myriam, even if
you sometimes need a crane.
Because they were, we are; and because we are, they will be. Naroa and
Maddi, you are my true project; you may be small, but next to you
everything else looks small to me.
As I said, because they were, we are; and whatever I am I owe to my
family, and above all to my parents (at least whatever good there is in
me). Uncle Joxe, exemplar of the basic principles of my life; after so
many years they have not changed. Mum, it was you who taught me the
value of effort, of taking pleasure in work, of doing things properly.
You still show me this every day. Dad, it is to you that I wish to
dedicate the most important mention of this thesis. Masterfully
combining ingenuity, imagination and knowledge, it was you who steered
me towards engineering. I strive to follow your example: at work as
outside it, upright and sensible, trying to help those around me and
standing coherently by the right decisions, however difficult they may
be. Every day I try to follow, with dignity, the path you showed so
clearly in word and deed.

Thank you all

Igor García Olaizola
September 2013

Contents

List of Figures
List of Tables

I Work Description

1 Introduction
  1.1 Context of this research activity
    1.1.1 Vicomtech-IK4
      1.1.1.1 Relation with other Vicomtech-IK4 PhD processes
    1.1.2 Computer Science and Artificial Intelligence Department of the Computer Engineering Faculty
  1.2 R&D Projects
    1.2.1 Begira
      1.2.1.1 Summary
      1.2.1.2 Conclusions
    1.2.2 Skeye
      1.2.2.1 Summary
      1.2.2.2 Conclusions
    1.2.3 SiRA
      1.2.3.1 Summary
      1.2.3.2 Conclusions
    1.2.4 SIAM
      1.2.4.1 Summary
      1.2.4.2 Conclusions
    1.2.5 Cantata
      1.2.5.1 Summary
      1.2.5.2 Conclusions
    1.2.6 RUSHES
      1.2.6.1 Summary
      1.2.6.2 Conclusions
    1.2.7 Grafema
      1.2.7.1 Summary
      1.2.7.2 Conclusions
    1.2.8 IQCBM
      1.2.8.1 Summary
      1.2.8.2 Conclusions
    1.2.9 Relationship of projects and scientific activity

2 Computer Vision from a Cognitive Point of View
  2.1 Mandragora Framework
  2.2 Image Processing and AI Approach

3 Domain Identification
  3.1 Domain characterization for CBIR
    3.1.1 Broadcasting
      3.1.1.1 Alternative methods for massive content annotation
    3.1.2 Earth Observation, Meteorology
      3.1.2.1 Meteorology
  3.2 Local features vs. global features in domain identification

4 Proposed Method: DITEC
  4.1 Introduction
  4.2 General description of the DITEC method
    4.2.1 Sensor modeling
    4.2.2 Data transformation
      4.2.2.1 Functionals
      4.2.2.2 Geometrical constraints
      4.2.2.3 Quantization effects
    4.2.3 Feature extraction
      4.2.3.1 Statistical descriptors
      4.2.3.2 Cauchy Distribution
    4.2.4 Classification
      4.2.4.1 Feature Subset Selection in Machine Learning
      4.2.4.2 Attribute contribution analysis
  4.3 Experimental results
    4.3.1 Case study 1: Corel 1000 dataset
    4.3.2 Case study 2: Geoeye satellite imagery
  4.4 Computational complexity
    4.4.1 Computational complexity of the trace transform
    4.4.2 Computational complexity of attribute selection and classification
    4.4.3 Scalability
  4.5 Conclusion of the presented method
  4.6 Modified DITEC as local descriptor
  4.7 Implementation of DITEC as local descriptor
    4.7.1 Trace Transformation
    4.7.2 Feature Extraction
    4.7.3 DITEC parameters
      4.7.3.1 Angular and radial sampling
      4.7.3.2 Effects of sampling in the computational cost
    4.7.4 Experimental results
      4.7.4.1 Geometric Transformations
      4.7.4.2 Photometric Transformations
    4.7.5 Current status of the local DITEC algorithm design

5 Main Contributions
  5.1 Mandragora framework
  5.2 DITEC method as global descriptor
  5.3 DITEC feature space analysis
  5.4 DITEC method as local descriptor
  5.5 Other contributions

6 Conclusions and Future Work
  6.1 Future work
    6.1.1 Collaborative filtering, Big Data and Visual Analytics

II Patents & Publications

7 Publications
  7.1 Weather analysis system based on sky images taken from the earth
  7.2 A review on EO mining
  7.3 Acc. Obj. Tracking and 3D Visualization for Sports Events TV Broadcast
  7.4 DITEC: Experimental analysis of an image characterization method based on the trace transform
  7.5 Image Analysis platform for data management in the meteorological domain
  7.6 Architecture for semi-automatic multimedia analysis by hypothesis reinforcement
  7.7 Trace transform based method for color image domain identification
  7.8 On the Image Content of the ESA EUSC JRC Workshop on Image Information Mining
  7.9 Author's other publications

8 Selected Patents
  8.1 Method for detecting the point of impact of a ball in sports events
  8.2 Author's Other Related Patents

III Appendix and Bibliography

A Consideration on the Implementation Aspects of the trace transform
  A.1 Development platforms
  A.2 Sampling
B Calculation of the clipping points in a circular region
Bibliography

List of Figures

1.1 Begira scene definition.
1.2 Example of the cloud segmentation process
1.3 Rushes content analysis workflow
1.4 Grafema Assets
1.5 Grafema System Workflow
1.6 Grafema System Architecture
1.7 IQCBM System Architecture
1.8 Screenshots of the IQCBM user interface
1.9 Relationship between R&D projects and scientific activity in multimedia content analysis.
2.1 Information Retrieval Reference Model
2.2 Mandragora Architecture
2.3 DIKW Pyramid
3.1 Watson DeepQA High-Level Architecture
3.2 Idealized query process decomposition on EO image mining
3.3 Envisat instruments
3.4 General architecture of the meteorological information management system
4.1 DITEC System workflow
4.2 Trace transform, geometrical representation
4.3 Trace transform contribution mask at very high resolution parameters (image resolution: 100x100 px; n = 1000, n = 1000, n = 5000)
4.4 Pixel relevance in the trace transform scanning process with different sampling parameters. Original image resolution: 384x256.
4.5 Trace Transform and subsequent Discrete Cosine Transform of Lenna (Y channel of YCbCr color space)
4.6 Conceptual scheme: DCT matrix transformation into (·, k) pair vector.
4.7 Statistical properties of all Kurtosis measurements made on the distributions obtained by processing the Corel 1000 dataset
4.8 Examples of probability density distributions and histograms obtained from the samples
4.9 Samples of the Corel 1000 dataset. The dataset includes 256x384 or 384x256 images.
4.10 Distance among classes in the Corel 1000 dataset according to misclassified instances.
4.11 Distance among the most inter-related classes in the Corel 1000 dataset according to misclassified instances.
4.12 Corel 1000 picture corresponding to class Architecture and classified as Mountain
4.13 Corel 1000 precision results with different feature extraction algorithms.
4.14 Samples of the satellite footage dataset: 256x256 px patches at different scales.
4.15 Distance among classes in the Geoeye dataset according to misclassified instances.
4.16 Time performance behavior.
4.17 System workflow for DITEC as local feature
4.18 Matching accuracy depending on the number of angular samples
4.19 Matching accuracy depending on the number of radial samples
4.20 Matching accuracy depending on the simultaneous increase of angular and radial sampling
4.21 Computation time depending on the simultaneous increase of angular and radial sampling
4.22 In-plane Rotation Transformation matching results.
4.23 Scale Transformation matching results.
4.24 Projective Transformation matching results.
4.25 Exposure change photometric Transformation matching results.
4.26 Trace transform row and column analysis
A.1 DITEC development platform
A.2 Circular patch image
A.3 Result of (·, ·) space exploration with Bresenham
A.4 First half of the source image is sampled (blue regions) while areas around the vertical and horizontal axes are not considered.
A.5 Second half of the source image is sampled (red and green). These regions are moved to the π/4, 3π/4, 5π/4, 7π/4 areas in order to be sampled with the Bresenham algorithm.
A.6 Result of (·, ·) sampling with the Bresenham algorithm and a single image rotation
A.7 Result of (·, ·) pixelwise sampling with image rotation for each angular iteration.
A.8 Result of different sampling strategies of the (·, ·) space.
B.1 Scanline defined in terms of the line parameters

List of Tables

4.1 List of Trace Transform functionals proposed in [KP01]
4.2 Quantization effects of the trace transform
4.3 Corel 1000 dataset confusion matrix.
4.4 Geoeye dataset confusion matrix.

Part I
Work Description

If our brains were simple enough
for us to understand them, we'd be
so simple that we couldn't.

Ian Stewart

CHAPTER 1

Introduction
Artificial Intelligence (AI) is probably one of the most exciting knowledge fields,
where even the definition of the term becomes controversial due to the manifold
understandings of intelligence, which remains a hard epistemological problem.
Learning, reasoning, understanding, abstract thought, planning, problem
solving and other related topics are all different aspects that imply intelligence.
The emergence of programmable digital computers in the late 1940s offered a
revolutionary way to experimentally explore new methods for formal reasoning
and logic. However, the initial great expectations of AI did not come into reality,
and the prediction made by Herbert A. Simon¹, "machines will be capable, within
twenty years, of doing any work a man can do", still remains a science-fiction
topic.
The fashions of AI over the years have moved from automated theorem proving
to expert systems, which were later substituted by behaviour-based robotics,
and now seem to find the solution in learning from big data [Lev13]. None of
these trends has been able to meet the expectations that the founders of AI
placed on the field [Wan11]. Patrick Winston (director of the MIT Artificial
Intelligence Laboratory from 1972 to 1997) cited the problem of mechanistic
balkanization, with research focusing on ever-narrower specialties such as
neural networks or genetic algorithms: "When you dedicate your conferences to
mechanisms, there's a tendency to not work on fundamental problems, but rather
[just] those problems that the mechanisms can deal with" [Cas11].

¹ Herbert Alexander Simon (June 15, 1916 – February 9, 2001), recipient of the
ACM Turing Award (1975) for making basic contributions to artificial
intelligence, the psychology of human cognition, and list processing, and
considered one of the founders of AI.
However, there has been a great scientific and technological advance in many
AI-related domains (formal logic, reasoning, statistics and data mining, genetic
programming, knowledge representation, etc.) which, without satisfying the
foundations proposed by Winston or Chomsky, has enabled the creation of
technological solutions for different application fields such as natural language
processing, computer vision, drug design, medical diagnosis, genetics, finance
and economy, user recommendation systems and many others.

1.1 Context of this research activity


The research activity described in this dissertation has been mainly carried out
within the applied research perspective given by both basic and applied research
projects developed at Vicomtech-IK4¹. Vicomtech-IK4 is an applied research
centre located in San Sebastián (Basque Country, Spain) that combines excellence in basic research with its application and transfer to industry. In
this sense, some of these projects have been transferred to industry and their
intellectual property has been protected through patents. In other cases, the
scientific progress achieved within the projects has been published in journals
or conferences.
The knowledge of market needs, the technological state of the art and real
integration (which introduces many constraints coming from the real world) are
combined with the scientific method and basic research activities in collaboration with universities. In this case, the collaboration with the University of the
Basque Country² and, more specifically, with the Computer Science and Artificial
Intelligence Department of the Computer Engineering Faculty has been a key
element for the balanced applied/scientific progress of the research work.

¹ http://www.vicomtech.org
² http://www.ehu.es

1.1.1 Vicomtech-IK4
As an applied research centre, Vicomtech-IK4 focuses on all aspects of multimedia and visual communication technologies along the entire content production pipeline: from generation, through processing and transmission, to
rendering, interaction and reproduction. Vicomtech-IK4 is structured in six departments that offer different views and specific technological solutions around
this research activity. These departments are the following:
- Digital Television and Multimedia Services
- Speech and Natural Language Technologies
- eTourism and Cultural Heritage
- Intelligent Transport Systems and Engineering
- 3D Animation and Interactive Virtual Environments
- eHealth and Biomedical Applications
The research described in this dissertation has been carried out within the
Digital TV & Multimedia Services department, in strong collaboration with
the eHealth and Biomedical Applications department. In fact, the problem of
computer vision and multimedia content understanding is one of the main research lines of Vicomtech-IK4. One of the main problems addressed
in this dissertation is aligned with one of the main difficulties of current multimedia management systems in diverse sectors such as broadcasting, remote sensing
and medical imaging: the huge amount of unannotated data and the extremely
broad domains that cannot be explicitly defined.
1.1.1.1 Relation with other Vicomtech-IK4 PhD processes
The research and technological activity performed in this work has been carried
out in strong collaboration with two other PhD processes developed at Vicomtech-IK4. These two works analyze and develop other aspects related to the analysis
and understanding of multimedia. More specifically, Marcos [Mar11] studied the
multimedia retrieval problem from a semantic point of view, creating a semantic-middleware-based approach as an intermediary layer between users' high-level
queries and the system's low-level annotations. Some of the final considerations of
the work carried out by Marcos, and the requirements identified as future work to
feed this semantic middleware from a bottom-up approach, form the
initial context of this dissertation.
On the other hand, Barandiaran [Bar13] focused his work on the analysis of
local descriptors. The collaboration with Barandiaran has resulted in a local
adaptation of the global descriptor proposed as one of the main contributions
of this dissertation (see Section 4.6). This novel local feature has demonstrated
highly robust characteristics as a local descriptor.


1.1.2 Computer Science and Artificial Intelligence Department of the Computer Engineering Faculty

The Robotics and Autonomous Systems Group, which belongs to the Computer
Science and Artificial Intelligence Department of the Computer Engineering
Faculty, is very active in two main areas:

- Mobile Robotics
  - Behavior-based control architectures for mobile robots.
  - Bio-inspired robot navigation.
  - Use of visual information for navigation.
- Machine Learning
  - Dynamic learning mechanisms.
  - Classifier combination.
  - New paradigms for supervised classification.
  - Optimization problems.
The deep knowledge of this group in machine learning science and technologies has provided the scientific foundations for the more technological work
developed at Vicomtech-IK4. This combination provides a high-potential context
for scientific research.

1.2 R&D Projects


This dissertation has been carried out on the basis of R&D projects with common
underlying scientific needs and customer-specific requirements. The knowledge
and experience acquired during these projects has driven the general framework
presented as one of the main contributions of this work.

1.2.1 Begira
Title: Diseño y Desarrollo de un Sistema de Seguimiento Preciso de Objetos
en Transmisiones Deportivas (Design and development of a high-accuracy
object tracking system for sports broadcasting).
Project typology: Industrial project partially supported by the Gaitek programme.
Company name: G93.



Period: 2005-2009.

1.2.1.1 Summary
Augmented reality projects require a deep knowledge of the scene that has to be
extracted/updated in real time. In order to ensure the accuracy and real-time
performance of the system, the knowledge must be explicitly defined.
The goal of the Begira project was to develop a single-camera system to track
the ball trajectory and locate the bouncing point in Basque Pelota live TV
transmissions. The main constraints of the system were:
- A single camera.
- A broadcasting camera (720p@50).
- Tracking, positioning and virtual reconstruction in under 20 seconds.
- A single standard computer for processing purposes.
From an Artificial Intelligence perspective, we can consider it a system
where the knowledge domain is reduced to a single scene (the Basque Pelota
court) and can thus be explicitly defined. The main elements that define this
domain are:
- 3D environment: a court composed of three planar surfaces (front wall, side wall and ground).
  - The position of the camera relative to the court is obtained during a calibration process by placing a checkerboard on the ground.
  - Once the camera is calibrated, its position is fixed during the entire match.
- Dynamic objects: there are only two types of dynamic objects in the scene:
  - Players: there can be two or four players. They are much bigger than the ball, and most of the time their lowest part is touching the ground.
  - Ball: it is white, round and much smaller than the players. Sudden trajectory changes are due to a player's hit or a bounce. The ball is rigid enough that the bounce can be considered elastic.
According to the domain defined by the aforementioned concepts, a homography matrix H is calculated to obtain the camera's extrinsic parameters. Then the


Figure 1.1: Scene definition: ball trajectory samples used to estimate the parametric curves, and the calculation of the bouncing point on the ground once the center position of the ball is obtained (crossing point of the two curves).

ball is initially detected and the tracking system follows its trajectory. Abrupt trajectory changes define the limit between the instants before and after the bounce.
Once the two parametric curves are estimated, their crossing point is calculated
on the image. This two-dimensional position (in pixels) is then converted to
3D space using the inverse of the homography matrix (H⁻¹). To resolve the ambiguity of the 3D position obtained from the 2D projection, the condition Z = 0 is
imposed at the bouncing point. More details of the project can be found in
Section 7.3.
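The geometric core of this pipeline can be sketched in a few lines. This is a minimal illustration rather than the project's code: it assumes H maps ground-plane (Z = 0) coordinates to image pixels, that each trajectory segment is well modelled by a second-degree polynomial per image axis, and the function names are hypothetical.

```python
import numpy as np

def backproject_to_ground(H, px, py):
    """Map an image point (pixels) to ground-plane coordinates, using the
    inverse homography and the Z = 0 constraint of the bouncing point."""
    p = np.linalg.inv(H) @ np.array([px, py, 1.0])
    return p[:2] / p[2]  # dehomogenize

def bounce_point(t_pre, xy_pre, t_post, xy_post):
    """Fit one parabola per image axis to the pre- and post-bounce samples
    and return the image point where the two trajectories cross."""
    pre = [np.polyfit(t_pre, xy_pre[:, i], 2) for i in (0, 1)]
    post = [np.polyfit(t_post, xy_post[:, i], 2) for i in (0, 1)]
    # crossing instant: real root of x_pre(t) - x_post(t) nearest the bounce
    roots = np.roots(np.polysub(pre[0], post[0]))
    t_cross = min((r.real for r in roots if abs(r.imag) < 1e-9),
                  key=lambda t: abs(t - t_pre[-1]))
    return np.polyval(pre[0], t_cross), np.polyval(pre[1], t_cross)
```

With a calibrated H, the returned ground coordinates give the bounce position directly in court units.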

1.2.1.2 Conclusions
The Begira project is a good example of an expert system applied to image processing and computer vision. The technical goals were successfully achieved, and the
results of the project were exploited by the Basque public broadcaster ETB and
the TV content producer G93. However, the knowledge acquired by the system
was so hardcoded that it is very difficult to extend it or integrate it into other, more
general solutions. The good performance and accuracy results rely on the reduced
domain definition and rigid nature of the system.

1.2.2 Skeye
Title: Sistema de análisis meteorológico basado en imágenes del cielo
tomadas desde tierra (Meteorological analysis system based on sky images
taken from the ground).
Project typology: Industrial project supported by the Gaitek programme.



Company name: Dominion.
Period: 2007-2008.
1.2.2.1 Summary
Meteorological stations provide multiple sensor data as well as some more subjective information, such as cloudiness. The goal of Skeye was to provide
an automatic system to accurately estimate the cloudiness factor, avoiding any
human intervention.
As in Begira, the semantic domain was small and quite straightforward
to model. The four classes that compose the domain are: cloud, sun, blue
sky and earth. The project was carried out by analyzing the features
characteristic of each class, and the scene was defined in terms of a dome with
normalized illumination conditions.
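A minimal per-pixel sketch of such a four-class thresholding scheme is shown below. The HSV ranges, threshold values and cloudiness definition are illustrative assumptions for the example, not the calibrated values used in Skeye.

```python
import numpy as np

# Four Skeye classes; thresholds below are assumptions for the sketch.
CLASSES = ("blue sky", "cloud", "sun", "earth")

def classify_pixels(hsv):
    """Label each pixel of an HSV image (all channels in [0, 1])."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    labels = np.full(h.shape, CLASSES.index("earth"))              # default class
    labels[(s < 0.25) & (v > 0.95)] = CLASSES.index("sun")         # bright core
    labels[(s < 0.35) & (v > 0.6) & (v <= 0.95)] = CLASSES.index("cloud")
    labels[(s >= 0.35) & (h > 0.5) & (h < 0.7) & (v > 0.3)] = CLASSES.index("blue sky")
    return labels

def cloudiness(hsv):
    """Fraction of sky pixels (cloud + blue sky) classified as cloud."""
    labels = classify_pixels(hsv)
    sky = np.isin(labels, [CLASSES.index("cloud"), CLASSES.index("blue sky")])
    return (labels[sky] == CLASSES.index("cloud")).mean() if sky.any() else 0.0
```

The dome normalization mentioned above is what makes such fixed thresholds viable at all; without it, the decision boundaries would drift with the illumination.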

Figure 1.2: Example of the cloud segmentation process

1.2.2.2 Conclusions
Similarly to Begira, in this case the feature extraction process provided all the
information needed for a subsequent class assignment by applying specific thresholds.
However, further integration of the developed system into other domains or
scenes would be a difficult task, since the whole development and the selected features
depend entirely on the domain definition and scene conditions. More information
about this work can be found in Section 7.1.

1.2.3 SiRA
Title: Diseño y Desarrollo de un Sistema de Reconocimiento de Marcas
Comerciales en Emisiones Televisivas (Design and development of a system
for commercial brand recognition in TV broadcasts).

Project typology: Industrial project supported by the Gaitek programme.
Company name: Vilau.
Period: 2007-2008.

1.2.3.1 Summary
This project is another example of a system based on a reduced semantic domain,
but in this case the approach was more general and some higher-abstraction-level
elements were introduced. The goal of SiRA was to detect logos in TV content in
order to automate advertisement monitoring tasks. This project was also supported
by the Basque Government, and its industrial application was envisioned
by Vilau, a media communication company.
In this case, the constraints in terms of real-time behavior and equipment
were lighter than in Begira. However, the domain was broader: any type of logo
embedded in any type of content, captured from different perspectives.
The approach followed was to first detect a logo candidate, assuming that a logo would typically be surrounded by a regular shape (square,
circle, triangle, etc.) and composed of very few colors. Once a candidate was detected,
different feature extraction algorithms could be applied in order to compare the
results with the features corresponding to the target logo dataset. Depending
on the extracted features, different distance metrics were applied.
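The candidate test described above can be sketched as a two-part heuristic: few dominant colors plus a near-regular shape. The quantization level and thresholds below are assumptions for illustration, and the bounding-box fill ratio is a simple stand-in for the project's actual shape test.

```python
import numpy as np

def is_logo_candidate(patch_rgb, mask, max_colors=4, min_fill=0.7):
    """Keep a region when it uses few dominant colors and fills most of
    its bounding box (a proxy for the 'regular shape' assumption).
    Thresholds are illustrative, not SiRA's tuned values."""
    # color count: quantize each channel to 3 bits, count distinct triples
    quant = patch_rgb[mask] >> 5
    colors = {tuple(c) for c in quant.reshape(-1, 3)}
    # shape regularity: fraction of the bounding box covered by the mask
    ys, xs = np.nonzero(mask)
    box_area = (ys.ptp() + 1) * (xs.ptp() + 1)
    fill = mask.sum() / box_area
    return len(colors) <= max_colors and fill >= min_fill
```

Regions passing this cheap filter would then be handed to the feature extraction and distance-matching stage against the target logo dataset.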

1.2.3.2 Conclusions
The results of SiRA can be integrated as a new feature in other content analysis
systems. In this case, SiRA would provide information about potential logos existing in a specific video or still image. Moreover, even if the process itself is carried
out using low-level operators, the result of SiRA can be considered a
set of high-level features with valuable semantic content, since in general terms the
presence of a logo means that there is a product or an advertisement related to
it.

1.2.4 SIAM
Title: Diseño y Desarrollo de un Sistema de Análisis Multimedia de Contenido Audiovisual en Plataformas Web Colaborativas (Design and development of a system for multimedia analysis of audiovisual content in
collaborative web platforms).



Project Typology: Industrial project supported by the Gaitek programme.
Company name: Hispavista (http://hispavista.com/).
Period: 2009-2010.

1.2.4.1 Summary
First ideas of this work related to the semantic analysis of multimedia content
were developed in SIAM. The goal of this project was to create content analysis tools to improve the exploitation of large amounts of user-generated content.
The context of the project was www.tu.tv, a YouTube-like video sharing platform
owned by Hispavista. According to this approach, semantic labels can be obtained from unstructured user comments. Then, by finding similar contents, new
untagged content can be assigned to a previously seen label.
As the content analyzed in SIAM could be any kind of video, the semantic
domain was too broad and complex to be defined; one of the main problems was the definition of a semantic unit in a video. The assumption of a video
as a semantic unit is too inconsistent in many cases, as the elements in it can
change over time. Therefore, each video was decomposed into shots, and
each shot was analyzed and labeled. Finally, the entire video was labeled as
the composition of its shot labels.
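The shot-based decomposition described above rests on a shot boundary detector. A minimal version can be sketched with gray-level histogram differences; the bin count and threshold are illustrative assumptions, not the values used in SIAM.

```python
import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.5):
    """Declare a cut wherever the L1 distance between consecutive
    (normalized) gray-level histograms exceeds a threshold."""
    hists = []
    for f in frames:
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]
```

Each returned index marks the first frame of a new shot, so the video can be split into the semantic units that are then labeled independently.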

1.2.4.2 Conclusions
The main outcomes of SIAM were the shot-based content analysis model and a shot
boundary detector that has later been used for semantic analysis purposes. Moreover, the potential of user-generated metadata was addressed in this project. We
identified the potential of this amount of unstructured data, which could be complementary to the perfectly organized but expensive-to-populate professional
taxonomies.

1.2.5 Cantata
Title: Content Aware Networked systems Towards Advanced and Tailored
Assistance
Project typology: ITEA
Period: 2007-2009

Consortium: Bosch Security Systems, Philips Electronics Netherlands, Philips
Medical Systems, Philips Consumer Electronics, TU/e, TU Delft, Multitel,
ACIC, Barco, Traficon, VTT, Solid, Hantro, Capacity Networks, I&IMS, Telefonica, Vicomtech-IK4, University Pompeu Fabra, CRP Henri Tudor, Codasystem, Kingston University, University of York, INRIA.

1.2.5.1 Summary
The goal of Cantata was to create a distributed service for content analysis. The application fields included medical imaging, entertainment and security. Our activity
was focused on the entertainment sector, where the content analysis modules were
connected to user profiles in order to create content recommendation systems.
In this case, the logo detection system was used to provide content information to the main content analysis and recommendation system.

1.2.5.2 Conclusions
The recommendation system was intended to combine user activity information,
content metadata and low-level-feature-based information. However, the broad
domain definition required an unaffordable number of low-level descriptors, and
even the combination of all these descriptors would be a very complex issue. Due
to this complexity, most recommendation systems rely basically on metadata.

1.2.6 RUSHES
Title: Retrieval of mUltimedia Semantic units for enHanced rEuSability.
Project typology: FP6-2005-IST-6.
Period: 2007-2009
Consortium: Heinrich-Hertz-Institut (DE), University of Surrey (UK),
Athens Technology Centre (GR), Vicomtech (ES), Queen Mary University of
London (UK), Telefonica I+D (ES), FAST Search & Transfer (NO), University
of Brescia (IT), ETB (ES).

1.2.6.1 Summary
The overall aim of RUSHES was to design, implement, validate and trial a system
for both the delivery of and access to raw media material (rushes), and the reuse of that
content in the production of new multimedia assets, following a multimodal approach that combines knowledge extracted from the original content, metadata,
visualization and highly interactive browsing functionality.
The core research issues of RUSHES were:
- Knowledge extraction for semantic inference.
- Automatic semantic-based content annotation.
- Scalable multimedia cataloging.
- Interactive navigation over distributed databases.
- Non-linear querying and retrieval techniques using hierarchical descriptors.

Figure 1.3: Rushes content analysis workflow

1.2.6.2 Conclusions
The RUSHES consortium tried to address the semantic gap by creating a powerful
architecture composed of low-level operators. The workflow designed in RUSHES
(Figure 1.3) was able to combine multiple low-level features and multiple types of sources
(video, audio, text). Moreover, the shot was considered the semantic unit of a
video. Due to the fact that different shot boundary operators provide different
shots, extra complexity was added to the metadata model, where each feature
could define its own temporal boundaries.
All the low-level operators were applied to every content item in the database. This
fact introduced a strong limitation in the scalability of the domain. In order to
identify new concepts, more low-level operators might be needed, and as the
dimensionality of the feature space increased, the system became both computationally too demanding and unaffordable for the data mining and ontology
management processes. We presented a potential solution to this problem in
[OMK+09], by splitting the domain into sub-domains that only apply those low-level feature extraction operators suggested by the domain definition (ontology).
However, it requires prior knowledge of the content, which should itself be obtained by
applying low-level operators. This chicken-and-egg problem will be one of the key
topics of this research work.

1.2.7 Grafema
Title: Grafema: Multimodal content search platform
Project typology: Basic research project.
Period: 2012.

1.2.7.1 Summary
The goal of the Grafema project was to create a base platform to store, annotate
and retrieve multimedia content of diverse nature. More than focusing on the
algorithms to obtain content descriptors or methods for automatic content annotation, Grafema was focused on the architectural aspects and the design of a
generic solution to deal with different types of content. In this sense, an asset
could be either text, image, audio, video, 3D or even a combination of these previous elementary units. According to this generic description of a digital asset,
similarity metrics must also adapt to each case or combination. As it can be observed in Figure 1.4, assets containing the label tiger can be considered as similar
if they include this information in the metadata or if this label is found in any of
the elementary units that compose the content.
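One way to make this composition-aware similarity concrete is sketched below: an asset is modelled as metadata labels plus typed elementary units, and two assets are compared through their shared labels and their best same-type unit match. The data layout and the 0.5 label weight are assumptions for illustration, not Grafema's actual model.

```python
# Hedged sketch of Grafema-style composite similarity: an asset is a set
# of typed elementary units plus metadata labels; per-type metrics are
# supplied from outside, one per elementary-unit type.

def asset_similarity(a, b, unit_metrics, label_weight=0.5):
    # Jaccard similarity over metadata labels
    labels_a, labels_b = set(a["labels"]), set(b["labels"])
    union = labels_a | labels_b
    label_sim = len(labels_a & labels_b) / len(union) if union else 0.0
    # best match among pairs of elementary units of the same type
    unit_sims = [unit_metrics[ua["type"]](ua["data"], ub["data"])
                 for ua in a["units"] for ub in b["units"]
                 if ua["type"] == ub["type"] and ua["type"] in unit_metrics]
    unit_sim = max(unit_sims, default=0.0)
    return label_weight * label_sim + (1 - label_weight) * unit_sim
```

The key design point is that the metric dispatches on unit type, so text, image or audio similarity functions can be swapped in independently, matching the generic-asset description above.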
Figure 1.4: Grafema Assets

The workflow designed for Grafema (Figure 1.5) is based on low-level operators that are processed independently. The information obtained from these
operators is then ported to a higher level of abstraction by using data mining
techniques. The resulting information is introduced into a semantic model
and stored in a database. The similarity of two assets can then be computed according to this semantic model, but it is not limited to this metric. An iterative
process starts that enables the calculation of similarity metrics between assets
of the same type that belong to different instances. This iterative process is the
basis of the Grafema architecture (Figure 1.6) and provides a new paradigm of
content search and retrieval, based more on a browsing process than on pure
text-based search.

Figure 1.5: Grafema System Workflow

1.2.7.2 Conclusions
The results of Grafema have shown the great potential of iterative processes for multimedia search. Even though the tests were carried out with datasets limited in
size and domain complexity, the results show that text-based search can
be dramatically improved when datasets include high volumes of multimedia content.

Figure 1.6: Grafema System Architecture

Regarding the state of the art, the annotation and individual metrics, as well
as the unsuitability of the most common database solutions for multimedia data,
are still the main drawbacks that limit the potential of this kind of system.

1.2.8 IQCBM
Title: Image Query by Compression Based Methods
Project typology: Industrial project.
Period: 2011.
Consortium: DLR (German Aerospace Center).

1.2.8.1 Summary
The goal of this project was to create low-level operators and define distance metrics for satellite imagery that would be applied during the ingestion process of the

16

1.2 R&D Projects


delivered streams. The main idea behind this operators was to gather prior characteristic information that could be useful for later retrieval operations. The lack
of knowledge regarding the queries that may be applied during the retrieval process made difficult the definition of low-level features that might not be focused

an adaptable/extensible
experimentation framework

in any specific aspect.

The domain of Remote Sensing is not as broad as those related with the audio-

visual sector, but are still too big and complex to be explicitly defined. Moreover,
new definitions and relationships could be dynamically introduced.
Figure 1.7: IQCBM System Architecture: preprocessing, analysis, indexing, query/ranking and comparison frameworks, with MonetDB used for storage and execution.

In order to address the lack of prior knowledge, global features were considered more adequate than local ones. The first algorithm implemented in this
project was based on the codewords provided by a Lempel-Ziv compressor, as suggested by Watanabe et al. [WSS02]. The L0 distance (Equation 1.1) was used as
a metric for the codewords related to each element (in this case, an element is
represented by each of the patches obtained after a tiling process applied to the
multi-resolution satellite imagery).
d_{L0} = \sum_{i=1}^{n} |x_i - y_i|^0,  where 0^0 = 0    (1.1)
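A toy version of this codeword approach can be sketched with an LZ78-style dictionary standing in for the Lempel-Ziv compressor; Equation 1.1 then counts the codewords whose occurrence counts differ between two patches. This is an illustrative stand-in, not the implementation used in the project.

```python
from collections import Counter

def lz78_codewords(data):
    """Multiset of LZ78 dictionary phrases of a byte string (a simple
    stand-in for the compressor codewords used as a global feature)."""
    phrases, seen, current = Counter(), set(), b""
    for byte in data:
        current += bytes([byte])
        if current not in seen:          # new phrase: emit and restart
            seen.add(current)
            phrases[current] += 1
            current = b""
    if current:                          # flush the trailing phrase
        phrases[current] += 1
    return phrases

def d_l0(x, y):
    """L0 distance of Equation 1.1: |a|^0 is 0 when a == 0 and 1 otherwise,
    so the sum counts codewords whose counts differ between x and y."""
    return sum(1 for k in set(x) | set(y) if x[k] != y[k])
```

Note that the feature vector length varies with the content, which is the complexity attribute (and the scalability concern) discussed in the conclusions below.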

1.2.8.2 Conclusions
The developed system (Figure 1.8) was tested against the Corel 1000 dataset [Cor]
and a subset of the GeoEye imagery [Glo], obtaining good accuracy characteristics.
The length of the feature vector for each item was variable and was an attribute in
itself, as it provides a measure of the complexity of the image. However, in terms
of scalability, the average length (several thousand codewords) obtained by
this algorithm might become a limitation.

Figure 1.8: Screenshots of the IQCBM user interface

As a result of this project, a deep study of the current trends in the community
was carried out [QGO13]. Moreover, a new global feature extraction algorithm
was developed based on the ideas of Kadyrov et al. [KP98, KP01, KP06].

1.2.9 Relationship of projects and scientific activity


Figure 1.9 shows a summary of the main scientific activity around the aforementioned projects. It can be observed that different projects share the same research
activities and scientific background. However, this activity sharing does not imply
the reusability of previous developments, as different semantic domains require
specific low-level features, metrics, mining techniques, etc.
In order to lower these barriers and achieve higher practical reusability, a new
architecture is proposed in this work.


Figure 1.9: Relationship between R&D projects and scientific activity in multimedia
content analysis.


CHAPTER 2

Computer Vision from a Cognitive Point of View

"By three methods we may learn wisdom: first, by reflection, which is noblest; second, by imitation, which is easiest; and third, by experience, which is the bitterest."
(Confucius)
The main approach for computer vision tasks has been based on the identification of low-level features that can be used to segment, identify or determine
higher abstraction levels. Following the three learning methods stated by Confucius in the quotation above, we can say that this approach provides knowledge
to the system by imitation. Given this explicit knowledge, based on existing
datasets or contents that researchers have used to understand the
relationship between the identification of real-world phenomena and specific
low-level features, the computer vision system merely reproduces the experiments
with different datasets. This process offers high-performance results for narrow
domains, but it cannot be reused in other contexts and is not a scalable approach. The feature space created by these low-level operators easily becomes too
complex for manual (and typically linear) thresholding techniques. For those
cases where the behavior of a set of low-level feature extractors is too complex
to model, data mining techniques are applied. In those cases, we move to
the third level of Confucius's statement: the experience of dealing with the data
(training for supervised classification, and other specific metrics or criteria for
clustering) provides adaptive behavior within the system to create the regions
or hyperplanes that best fit each specific problem.
The use of ontologies introduces a new way of adding formal explicit knowledge to the system. This is typically carried out by establishing concepts and
relationships among them, defining a domain in this way. One common use of
ontologies is to establish shared vocabularies and taxonomies between scientists
or professionals. However, from a cognitive-system perspective, the most powerful characteristic of ontologies is the capability of inference, which creates new
rules that were not explicitly defined. The main drawback of ontologies comes
from the fact that broad, complex domains such as those related to common
vision understanding cannot be specifically defined, mainly because of the size,
complexity and fuzziness of such domains.

Figure 2.1: Information Retrieval Reference Model [Mar11]
Content Based Image Retrieval (CBIR) systems can be considered one of the
branches of cognitive vision, since they require the four functionalities considered
the pillars of a cognitive vision system: detection, localization, recognition and
understanding [Ver06]. Marcos et al. propose a reference model that addresses the
use of ontologies for multimedia retrieval purposes [MIOF11]. This work presents
a reference model (Figure 2.1) based on a semantic middleware. The main goal
of this approach is to create a layer that deals with semantic functionalities (e.g.
knowledge extraction, semantic query expansion, etc.).
Marcos proposes in his PhD work [Mar11] the use of the semantic middleware
to automatically generate annotations of the multimedia assets. This approach,
initiated in the RUSHES project, used a set of low-level features and applied
fuzzy reasoning to the information provided by those modules. It offered good results for narrow domains, but the system was unable to deal with a large number
of different low-level features, and broad, complex domains did not show good
performance. One of the main drawbacks of this architecture was the fact that all
low-level features were considered at the same level when no prior information
was given.

2.1 Mandragora Framework


In order to overcome this scalability drawback, we presented a novel architecture called Mandragora[OMK+ 09]. This architecture enhances the metadata with
new labels that can be ported to the semantic layer by using a two step iterative approach. The implicit and explicit knowledge about a certain domain can
be introduced in the system with a combination of classifiers and the semantic
middleware. This combination allows the modeling of bigger and more complex
domains[SASK08] and reduces the semantic gap by connecting low-level features
with high-level hypothesis and reinforcement factors. The reinforcement factors
allow to extend the dimensionality of the domain and provide the framework for
specific analysis methods.
The main idea behind this two step approach is to break big domains in subdomains that are more homogeneous both semantically and in terms of low-level
features. Then, specific feature extractors and semantic definitions can be used
with much higher precision. One of the key aspects of this framework is the initial
domain estimation, the hypothesis that will be considered by the next layer to
launch domain specific analyzers that afterwards will feed the semantic middleware. If the results of this second step confirm the characteristics of the estimated
domain, the hypothesis will be accepted and the elements identified in the content will be considered as descriptors of this specific asset. Otherwise, the process
will be restarted with a different hypothesis.
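The accept-or-retry loop of this two-step approach can be sketched as follows; the hypothesis generator, the analyzer registry and the confirmation test are hypothetical stand-ins, not Mandragora's actual interfaces.

```python
# Hedged sketch of the two-step Mandragora loop: estimate a domain
# hypothesis, run that domain's specific analyzers, and accept or retry.

def annotate(asset, estimate_domains, analyzers, confirms):
    """Try domain hypotheses in order of initial confidence; return the
    first (domain, descriptors) pair whose analysis confirms the domain."""
    for domain in estimate_domains(asset):       # ranked hypotheses
        descriptors = analyzers[domain](asset)   # domain-specific step
        if confirms(domain, descriptors):        # hypothesis check
            return domain, descriptors
    return None, []                              # no hypothesis survived
```

The point of the structure is that the expensive, domain-specific extractors run only for the hypothesized sub-domain, instead of applying every low-level operator to every asset.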

2.2 Image Processing and AI Approach


From an Artificial Intelligence perspective, we can consider climbing the
DIKW Pyramid (Figure 2.3) as the process that our system has to follow, from raw
unannotated images to structured content with semantic information. The main
issue is the semantic gap between the mathematical representation obtained by the developed operators and the high-abstraction-level concepts that
are intended to be discovered by using such low-level features. Smeulders et al.
define the semantic gap as "the lack of coincidence between the information
that one can extract from the visual data and the interpretation that the same data
have for a user in a given situation". According to this definition, we can consider
the semantic gap as the distance between data and wisdom in the DIKW pyramid.

Figure 2.2: Mandragora Architecture for automatic video annotation [OMK+09]
The typical approaches are both top-down (ontologically driven approaches
that build domain definitions by creating relationships between high-level concepts) and bottom-up (automatic annotation or labeling approaches that try to
discover correspondences between high-level annotations and automatically extracted features [HSL+06]). These two approaches can also be combined in the
same process.

Figure 2.3: DIKW Pyramid (Data, Information, Knowledge, Wisdom)

Most bottom-up approaches rely on data mining techniques to move from a
low-level mathematical representation to classes at a higher abstraction level. There is an enormous diversity of supervised and unsupervised
classification and regression techniques, methods for feature-space analysis, algorithms for attribute selection, etc. Thus, each specific problem requires the set
of tools and algorithms that best suits its characteristics and requirements
(type of attributes and classes, dimensionality, dataset size, computational cost,
etc.).


CHAPTER 3

Domain Identification

"The difference between stupidity and genius is that genius has its limits."
(Albert Einstein)
As stated in the previous section, domain identification is one of the key issues of cognitive vision, as it allows the use of contextual information. The current best performing systems are mainly those where the size and the complexity of the domain are relatively low. Deng et al. [DBLFF10] perform a study of the effects of dealing with more than 10,000 categories. The results show that:

- Computational issues become crucial in algorithm design.
- Conventional wisdom from a couple of hundred image categories on the relative performance of different classifiers does not necessarily hold when the number of categories increases.
- There is a surprisingly strong relationship between the structure of WordNet and the difficulty of visual categorization.
- Classification can be improved by exploiting the semantic hierarchy.

The process carried out by Deng et al. is based on state-of-the-art descriptors such as GIST [OT01] and SIFT [Low99]. The classification process uses Support Vector Machines, and the dataset includes more than 9 million assets.
Popular AI milestones such as Deep Blue against Kasparov [Dee], commonly considered a great step in AI because a machine was able to beat a human mind, are clear cases where the domain and the rules that define it are rather simple, while the combinatorial space derived from them becomes huge. In those cases, brute force algorithms can defeat human experience and heuristic capabilities.

Figure 3.1: Watson DeepQA High-Level Architecture [FBCC+10]
In the case of Deep Blue, its domain dependence was so high that some hardware components were specifically designed for chess playing. A step forward was taken by Watson [Wat], which in 2011 won the Jeopardy! quiz show against former champions. Watson was able to process natural language by identifying keywords and accessing 200 million pages of structured and unstructured content. As the IBM DeepQA Research Team (the developers of Watson) state: "This is no easy task for a computer, given the need to perform over an enormously broad domain, with consistently high precision and amazingly accurate confidence estimations in the correctness of its answers". However, even if the constraints of this task are much harder than those of chess playing, apart from the natural language processing module, playing Jeopardy! can be considered an advanced text-search problem that does not require prior contextual knowledge, as can be observed in its architectural design (Figure 3.1).
The current state of the art is full of AI approaches that face the same limitation observed in these two examples: they obtain very good performance in a specific narrow domain but fail when the problem scales up or when the same system is applied to a different problem. Current multimedia information retrieval systems are in exactly this situation, where contents belonging to specific contexts can be successfully managed, but with strong limitations in flexibility and scalability.


3.1 Domain characterization for CBIR


The importance of semantic context is very well known in Content Based Image Retrieval (CBIR) [SF91, TS01]. It is especially relevant for broad-domain, data-intensive multimedia retrieval activities such as TV production and marketing, or large-scale Earth observation archive navigation and exploitation. Most modeling approaches rely on local low-level features based on shape, texture, color, etc. The drawback of these methods is that the characterization of the context itself requires prior contextual information, introducing a chicken-and-egg problem [TMF10]. A possible approach to reduce this dependency involves the exploitation of global image context characterization for semantic domain inference. This prior information on scene context can be a valuable asset in computer vision for purposes ranging from regularization to the pre-selection of local primitive feature extractors [SWS+00].

3.1.1 Broadcasting
The broadcasting sector has experienced a deep transformation with the introduction of digital technologies: all internal work-flows have been affected by the digital representation of content. Regarding Multimedia Asset Management (MAM) systems, before content was digital all assets were centralized and managed by documentalists/librarians, professionals who, following a rigid taxonomy, were responsible for annotating, storing and retrieving the content. The work-flow was therefore organized so that documentalists offered the content management service to editors. Since the digitalization of the ingest and delivery processes, editors can directly and concurrently access the content they are looking for. This offers great advantages in terms of efficiency, allowing non-linear editing and minimizing access times. However, this new work style introduces many more inconsistencies in the metadata, since contents are concurrently annotated by users who do not strictly follow a given taxonomy. Moreover, in order to create direct search and retrieval services, content annotations must be richer and better, since editors do not have the documentalists' knowledge to browse among millions of assets. To obtain this improved metadata, manual annotation is too expensive in most cases, and automatic annotation systems are not able to characterize high abstraction level categories, especially due to the size and complexity of the broadcasting context.

From a technical point of view, there are many industrial solutions and standards for metadata (SMEF, BMF, Dublin Core, TV-Anytime, MPEG-7, SMPTE Descriptive Metadata, PBCore, MXF DMS-1, XMP, etc.) that offer good retrieval characteristics. However, all these technologies and specifications rely on a previously annotated dataset that in most practical cases cannot be populated at an affordable cost.

3.1.1.1 Alternative methods for massive content annotation


The explosion of prosumers and web video portals offers a new way of enriching content with metadata. Most of these platforms offer the possibility of leaving comments that can later be used as annotations. However, these annotations are always unstructured and their confidence is much lower; therefore, they cannot be used directly as a source of metadata.
On the other hand, speech processing tools, nowadays used to create subtitles, offer another source of textual information that is very representative of the content. Using the audio channel to create metadata faces the same unstructured-text problem as users' comments. However, it offers very rich, highly related text that fits very well with current text-based search engines.

3.1.2 Earth Observation, Meteorology


An extensive review of the state of the art of content-based retrieval in Earth Observation (EO) image archives is presented in Section 7.2. Compared with the broadcasting application field, EO archives deal with even bigger data volumes (approaching the zettabyte)¹. The assets they contain are largely under-exploited: according to figures reported in conferences, up to 95% of records have never been accessed. The situation is exacerbated by the growing interest in and availability of metric and sub-metric resolution sensors, due to the ever-expanding data volumes and the extreme diversity of content in the imaged scenes at these scales. As in the broadcasting sector, interpreters who manually annotate archived content are expensive and tend to operate in applicative domains with stable,

¹ The data volume for the EOC DIMS Archive in Oberpfaffenhofen is projected to be about 2 petabytes in 2013 (Christoph Reck, DLR-DFD, presentation during the ESA EOLib User Requirements workshop, ESRIN, November 17, 2011).

well-formalized requirements rather than the open-ended needs of the remote sensing community at large or of broad efforts like GEOSS [KYDN11].

Figure 3.2: Idealized query process decomposition into processing modules and basic operations, based on an adaptation of Smeulders et al. [SWS+00].
Regarding the domain, the EO semantic space is much more focused than the one required for broadcasting content. In fact, domain-specific ontologies help to define concepts at a finer granularity. For specific uses, such as disaster management in coastal areas, ontologies for Landsat¹ and MODIS² imagery based on the Anderson classification system [And76] have been developed. However, the semantic gap remains an issue when it comes to automatically populating these specific ontologies from the huge amounts of data. A general decomposition of a theoretical query process is depicted in Figure 3.2.

¹ http://landsat.gsfc.nasa.gov/
² http://modis.gsfc.nasa.gov/

A particularity of the EO domain is the diversity of the types of data provided by the instruments installed in a satellite, most of which are affected by noise and distortions produced by the distance, the atmosphere, etc. Envisat (Environmental Satellite), launched in 2002 and operated by ESA (European Space Agency), includes the following instruments¹ (Figure 3.3):
ASAR: Advanced Synthetic Aperture Radar. Operating at C-band, ASAR ensures continuity with the image mode (SAR) and the wave mode of the ERS-1/2 AMI.
MERIS: a programmable, medium-spectral-resolution imaging spectrometer operating in the solar reflective spectral range. Fifteen spectral bands can be selected by ground command, each of which has a programmable width and a programmable location in the 390 nm to 1040 nm spectral range.
AATSR: Advanced Along Track Scanning Radiometer. Ensures continuity of the ATSR-1 and ATSR-2 data sets of precise sea surface temperature (SST), with accuracy levels of 0.3 K or better.
RA-2: the Radar Altimeter 2 is an instrument for determining the two-way delay of the radar echo from the Earth's surface to a very high precision: less than a nanosecond. It also measures the power and the shape of the reflected radar pulses.
MWR: a microwave radiometer for the measurement of the integrated atmospheric water vapour column and cloud liquid water content, used as correction terms for the radar altimeter signal. In addition, MWR measurement data are useful for the determination of surface emissivity and soil moisture over land, for surface energy budget investigations supporting atmospheric studies, and for ice characterization.
GOMOS: measures atmospheric constituents by spectral analysis of the bands between 250 nm and 675 nm, 756 nm and 773 nm, and 926 nm and 952 nm. Additionally, two photometers operate in two spectral channels, between 470 nm and 520 nm and between 650 nm and 700 nm, respectively.
¹ https://earth.esa.int/web/guest/missions/esa-operational-eo-missions/envisat

Figure 3.3: Envisat instruments

MIPAS: the Michelson Interferometer for Passive Atmospheric Sounding is a Fourier transform spectrometer for the measurement of high-resolution gaseous emission spectra at the Earth's limb. It operates in the near to mid infrared, where many of the atmospheric trace gases that play a major role in atmospheric chemistry have important emission features.
SCIAMACHY: an imaging spectrometer whose primary mission objective is to perform global measurements of trace gases in the troposphere and in the stratosphere.
DORIS: the Doppler Orbitography and Radio-positioning Integrated by Satellite instrument is a microwave tracking system that can be utilized to determine the precise location of the Envisat satellite.
LRR: a passive device which is used as a reflector by ground-based SLR stations using high-power pulsed lasers. In the case of Envisat, tracking using the LRR is principally accomplished by the International Laser Ranging Service (ILRS).

Figure 3.4: General architecture of the meteorological information management system [OAL09].

3.1.2.1 Meteorology
Weather analysis combines satellite information with terrestrial instruments typically located in weather stations. The classical instruments (thermometers, hygrometers, anemometers, barometers, rain gauges, ceilometers, etc.) are complemented by devices that provide more complex information (Doppler radars, wind profilers), and video cameras are also being used to obtain extra information. An extensive analysis of image data management in the meteorological domain is detailed in Section 7.5. Since the meteorological domain is tractable enough to be explicitly defined, the image analysis process can be performed automatically. Figure 3.4 shows an architecture to integrate multimedia information into a meteorological information management system. The results of a project for cloudiness estimation are presented in Section 7.1.

3.2 Local features vs. global features in domain identification


Local features have been used broadly for context categorization [SS10, vGVSG10]; SIFT [Low99] and SURF [BTG06] are among the most popular choices in this respect.



A two-step approach for the efficient use of local features has been proposed by several authors, such as Rabinovich et al. [RVG+07] and Choi et al. [CLTW10]. Olaizola et al. [OMK+09] have proposed an architecture for hypothesis reinforcement based on an initial analysis of low-level features for context categorization and further hypothesis creation. This architecture can exploit context-specific feature extractors to validate or refuse the initial context hypothesis, which stresses the value of global descriptors for initial domain categorization purposes. Once a specific domain has been identified, different low-level features can be extracted. However, these features cannot be combined in a simple way, and the obtained multi-attribute spaces must be normalized in order to be used in similarity search or retrieval tasks [SKBB12].
Among global descriptors such as histograms of several local features [BH11], texture features and self-similarity [SI07], some specific algorithms in the literature have shown great potential. GIST [OT01, TFW08] is probably one of the most popular approaches. Watanabe et al. [WSS02] have proposed a global descriptor based on the codewords provided by Lempel-Ziv entropy coders [AKS11, CMGD10], exploiting the relationship between the complexity of an image and the context to which it may belong. The Ridgelet transform [MAMD09, MAMD10, NC11] has been successfully used as a global feature for image categorization and handwritten character recognition. In typical operational implementations, all these algorithms are combined with other global or local features.
The trace transform has already been proposed for several computer vision applications. Indeed, a method based on this transformation has been included in the MPEG-7 [MPE04] standard specification for image fingerprinting [BO07, OBO08]. Other applications (mostly with monochrome images) include face recognition [Fah06, SPKK03, LW07, LW09], character recognition [NPK10] and sign recognition [TBFO05]. The approach based on a recursive application of the trace transform to reduce the dimensionality of the obtained feature space (known as the triple feature) offers excellent performance for image fingerprinting, but does not offer good discriminative characteristics as a method for domain characterization, due to the high data loss incurred by the diametrical and circus functionals [KP01].
The approach proposed by Liu and Wang [LW07] reduces the number of attributes by using Principal Component Analysis (PCA) to select the most relevant coefficients and reduce the dimensionality of the feature space. However, this approach does not take into account the frequency relationships among the different coefficients, and it increases the feature extraction complexity, as it requires the covariance matrix of all previous samples. Moreover, the feature relevance of each individual DCT coefficient is too low and also sensitive to noise and variations.
Li et al. [LZC09] have proposed a generalization of the Radon and trace transforms by introducing prior knowledge of specific identification or fingerprinting tasks and extending the geometric sets from straight lines to arbitrary choices. This approach provides a complete set of resources for non-rigid object identification and has been successfully tested for pedestrian recognition, segmentation and video retrieval. However, its broad set of configuration parameters and pre-processing tasks is not suitable for domain identification purposes, where the lack of a priori knowledge is one of the main issues.

"If you tell the truth, you don't have to remember anything."
Mark Twain

CHAPTER 4

Proposed Method: DITEC


4.1 Introduction
In this section, a new method for global image description is presented. The main objective of this method is to characterize content domains when no prior information is available.
The major contributions made in this section are:

- A new methodology for global feature extraction, based on a statistical modeling of the trace transform in the frequency domain.
- An analysis of the effects of the discrete trace transform, in order to establish the best sampling parameters.
- An analysis of the resulting feature space, to estimate the results that can be obtained with machine learning techniques.
- A demonstration that DITEC provides highly discriminative global descriptors at very low dimensionality, a key factor for efficient retrieval in massive content databases [Haa11, LK10].

DITEC is very suitable for domain classification, especially in cases where the lack of prior knowledge does not allow the effective use of specific local features.


4.2 General description of the DITEC method


We introduce a hierarchical probabilistic model in terms of random variables D, I, T, E and C. The fundamental objective of DITEC is to derive an appropriate estimate Ĉ of the unknown global image semantic concept C from an observed data set D (Figure 4.1). Geometric and radio/colorimetric indeterminacies are treated by introducing the concept of an unknown pre-processed image I, whose parameters depend on the elementary scene descriptors T, which depend on the scene content E, which in turn depends on the context C. The conditional probabilistic links between the different layers of the workflow correspond to the main processing steps of the DITEC method, which is composed of four main steps (Figure 4.1).
The four DITEC steps are the following:

- Sensor modeling: image acquisition and pre-processing (radiometric noise, color space, geometric quantization and image lattice finiteness effects).
- Data transformation: the trace transform (detailed in Section 4.2.2) is applied to the pre-processed image I. The result depends on the chosen functional (e.g. (4.14)) and on the selected geometric parameters (detailed in Section 4.2.2.3). The outcome T of the trace transform of an image is a two-dimensional signal represented by means of sinusoids with particular amplitudes, phases, frequencies and intensities. This characterization process represents one of the key steps in the overall information extraction process.
- Feature extraction: summarization of the extracted features T, compressed and adapted into a manageable set E of object-based descriptors. The wave features contained in the resulting image must be characterized. To do so, the 2D trace signal Tk is transformed to the frequency domain. To concentrate the signal energy in the lowest spatial frequencies, a two-dimensional DCT (Discrete Cosine Transform) is applied. Then, the DCT is compressed to a vector of two components (average value and kurtosis of all the orthogonal elements of the main diagonal, Figure 4.6). This
transformation considers the DCT space as representable by a superposition of leptokurtic distributions. It aims to reduce the descriptor space dimensionality while preserving essential information, in order to allow good performance in the subsequent classification process.

Figure 4.1: DITEC System workflow

The last
n values of the obtained data-pair vector can be disregarded, for the empirical reason that, given the low-pass behaviour of most natural images, the DCT concentrates the highest values in the lowest coefficients [BYR10].
- Class assignment: the vectors obtained in the previous step are processed to improve the performance of classifiers in the defined feature space. All the obtained vectors are statistically analyzed to select their most representative attributes. Then, a supervised classification process is carried out to obtain an estimate Ĉ of the unknown global image semantic concept C.
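The feature extraction step above can be sketched in plain numpy. The anti-diagonal grouping of DCT coefficients and the `keep` truncation parameter below are illustrative assumptions, not the exact DITEC configuration:

```python
import numpy as np

def dct2(x):
    """Orthonormal 2D DCT-II built from the separable 1D transform matrix."""
    def dct_matrix(N):
        k = np.arange(N)[:, None]
        t = np.arange(N)[None, :]
        M = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * t + 1) * k / (2 * N))
        M[0, :] = 1.0 / np.sqrt(N)
        return M
    n, m = x.shape
    return dct_matrix(n) @ x @ dct_matrix(m).T

def kurtosis(v):
    """Fourth standardized moment; 0 is returned for constant vectors."""
    v = np.asarray(v, dtype=float)
    s = v.std()
    return 0.0 if s == 0 else float(((v - v.mean()) ** 4).mean() / s ** 4)

def ditec_descriptor(sinogram, keep=8):
    """Compress a trace-transform sinogram T into (mean, kurtosis) pairs.

    Each anti-diagonal of the DCT of T (the elements orthogonal to the main
    diagonal) is summarized by its average value and kurtosis; only the
    first `keep` low-frequency anti-diagonals are retained.
    """
    c = dct2(np.asarray(sinogram, dtype=float))
    feats = []
    for d in range(keep):
        # anti-diagonal d collects the coefficients with i + j == d
        vals = [c[i, d - i] for i in range(d + 1)
                if i < c.shape[0] and d - i < c.shape[1]]
        feats.extend([float(np.mean(vals)), kurtosis(vals)])
    return np.array(feats)
```

The resulting vector (2 values per retained anti-diagonal) is the kind of low-dimensional descriptor fed to the class assignment step.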



By applying the probability chain decomposition rule, the probability (4.1) of an asset belonging to a given class can be decomposed in terms of the different layers of the model. Estimates for C_i, E_j, T_k, I_l and D_m are the results obtained for the processes p(C|E), p(E|T), p(T|I) and p(I|D), given the usual conditional independence assumptions implied by a hierarchical model:

p(C_i | D_m) = p(C_i | E_j, T_k, I_l, D_m) = p(C_i | E_j) p(E_j | T_k) p(T_k | I_l) p(I_l | D_m) p(D_m)    (4.1)

where:

0 < i ≤ n_classes,         i ∈ ℕ
0 < j,                     j ∈ ℕ
0 < k,                     k ∈ ℕ
0 < l ≤ n_(orig. images),  l ∈ ℕ
0 < m ≤ n_(orig. images),  m ∈ ℕ    (4.2)

p(C_i | E_j) is the probability of the data mining process correctly determining the class to which the image belongs, given the feature set E. Following Bayes' theorem, the second element p(E_j | T_k) can be rewritten as p(T_k | E_j) p(E_j) / p(T_k); it shows that this model layer is linked to the information representativeness of the extracted features. p(T_k | I_l) corresponds to the trace transform, a deterministic process with a slight denoising effect. The quality of the data D_m and of the pre-processed image I_l is fundamental for an effective feature extraction process. In fact, the joint inference/estimation process depends on the trace transform, which can be regarded as a data re-ordering, compression and feature-space optimization process.
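A toy numeric evaluation of the factorization in (4.1), with hypothetical probabilities for each layer, illustrates how the per-layer qualities combine multiplicatively:

```python
# Hypothetical conditional probabilities for one candidate class C_i,
# illustrating the chain factorization in (4.1).
p_C_given_E = 0.80   # classifier quality: p(C_i | E_j)
p_E_given_T = 0.90   # feature representativeness: p(E_j | T_k)
p_T_given_I = 1.00   # trace transform is deterministic: p(T_k | I_l)
p_I_given_D = 0.95   # pre-processing quality: p(I_l | D_m)
p_D = 1.00           # observed data

p_C_given_D = p_C_given_E * p_E_given_T * p_T_given_I * p_I_given_D * p_D
print(round(p_C_given_D, 4))  # 0.684
```

A weak link in any layer (e.g. noisy data or unrepresentative features) caps the final classification confidence, regardless of how good the other stages are.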

4.2.1 Sensor modeling


The first pre-processing step transforms the RGB color space into YCbCr [Poy96]. The luminance channel (Y) is used as the most relevant channel to encode shape-related features, while color distribution information is encoded by processing the chrominance channels (Cb, Cr).
In order to reduce the effects introduced by radiometric noise, the image lattice and quantization, a low-pass filter is applied to each channel.
HSV [Poy96] color space information is encoded by obtaining the mean and variance values (μ, σ) of the corresponding intensity distributions in each H, S, V channel. In the attribute selection process, this (μ, σ) information is added to the obtained descriptor E.
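A minimal sketch of this sensor-modeling step follows; the 3×3 mean filter and the BT.601 conversion constants are assumptions, since the thesis does not fix the low-pass kernel:

```python
import colorsys
import numpy as np

def sensor_model(rgb):
    """Sketch of the DITEC sensor-modeling step on an RGB image in [0, 1].

    Returns the low-pass filtered YCbCr channels plus the (mean, std)
    statistics of each HSV channel.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # ITU-R BT.601 RGB -> YCbCr (floating point, chroma centered at 0.5)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 + (b - y) * 0.564
    cr = 0.5 + (r - y) * 0.713

    # 3x3 mean filter as a simple low-pass against noise/quantization effects
    def lowpass(ch):
        p = np.pad(ch, 1, mode="edge")
        return sum(p[i:i + ch.shape[0], j:j + ch.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    ycbcr = np.stack([lowpass(c) for c in (y, cb, cr)], axis=-1)
    # (mu, sigma) of each HSV channel via the stdlib converter
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in rgb.reshape(-1, 3)])
    stats = [(float(hsv[:, k].mean()), float(hsv[:, k].std())) for k in range(3)]
    return ycbcr, stats
```

The three (μ, σ) pairs returned in `stats` are the HSV statistics that are later appended to the descriptor E.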

Figure 4.2: Trace transform, geometrical representation

4.2.2 Data transformation


The data transformation process is carried out through the trace transform, a generalization of the Radon transform (4.3), where the integral of the function is substituted by any other functional [KP98, KP01, PK04, TBFO05, BB08].

R(φ, ρ) = ∫∫ f(x, y) δ(ρ − x cos φ − y sin φ) dx dy    (4.3)

The trace transform (originally proposed by Fedotov et al.¹ [FK95]) consists of applying a functional along a straight line (L in Figure 4.2). This line is moved tangentially to a circle of radius ρ, covering the set of all tangential lines defined by φ. The Radon transform has been used to characterize images [PG92] in well defined domains [LLL10], in image fingerprinting [SHKY04] and as a primitive feature for general image description. The trace transform extends the Radon transform by enabling the definition of the functional, and thus enhancing the control over the feature space. These features can be set up to show scale or rotation/affine transformation invariance, or high discriminance for specific content domains.
The outcome T of the trace transform of a 2D image is another 2D signal composed of a set of sinusoidal shapes that vary in amplitude, phase, frequency, intensity and thickness. These sinusoidal signals encode the pre-processed image I with a given level of distortion, depending on the functional and the quantization parameters.

¹ In this contribution, Fedotov et al. proposed an approach based on image transformation as a solution to a pattern recognition problem: the identification of different types of blood cells, such as erythrocytes. They proposed converting the image space S into a parameter space by intersecting several lines l₀, represented in polar coordinates, with S.
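The transform described above can be sketched as follows. The tangential lines are parameterized directly as p(t) = ρ·(cos φ, sin φ) + t·(−sin φ, cos φ) around the image center; the default resolutions and nearest-neighbour sampling are simplifying assumptions:

```python
import numpy as np

def trace_transform(img, n_phi=64, n_rho=64, n_t=64, functional=np.sum):
    """Minimal discrete trace transform of a grayscale image.

    For each (phi, rho) pair, the chosen functional is applied to the image
    samples taken along the line tangential to the circle of radius rho.
    With functional=np.sum this reduces to a Radon-like transform.
    """
    h, w = img.shape
    cx, cy = w / 2.0, h / 2.0
    r_max = min(cx, cy)
    sinogram = np.zeros((n_phi, n_rho))
    t = np.linspace(-r_max, r_max, n_t)
    for i, phi in enumerate(np.linspace(0.0, 2 * np.pi, n_phi, endpoint=False)):
        for j, rho in enumerate(np.linspace(-r_max, r_max, n_rho)):
            # sample points of the tangential line L(phi, rho)
            x = cx + rho * np.cos(phi) - t * np.sin(phi)
            y = cy + rho * np.sin(phi) + t * np.cos(phi)
            inside = (x >= 0) & (x < w) & (y >= 0) & (y < h)
            if inside.any():
                sinogram[i, j] = functional(img[y[inside].astype(int),
                                                x[inside].astype(int)])
    return sinogram
```

Swapping `functional` for any of the entries in Table 4.1 changes the invariance properties of the resulting sinogram.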

4.2.2.1 Functionals
A functional Ξ of a function ξ(t), evaluated along the line L, will have different properties depending on the characteristics of ξ(t) (e.g. invariance to rotation, translation and scaling [FKT09]). Kadyrov et al. [KP06] propose several functionals with different invariance or sensitivity properties. These invariant functionals have been used in expert systems for traffic sign recognition [TBFO05], face authentication [SPKK03, SDH10] and fingerprinting [KP01]. Clearly, the definition and combination of different trace and circus functionals result in different properties of the final descriptor.
Name   Functional
IF1    ∫ ξ(t) dt
IF2    ( ∫ |ξ(t)|^q dt )^r
IF3    ∫ |ξ′(t)| dt
IF4    ∫ ( t − ∫ t ξ(t) dt / IF1 )² ξ(t) dt
IF5    ( IF4 / IF1 )^(1/2)
IF6    max( ξ(t) )
IF7    IF6 − min( ξ(t) )
IF8    Amplitude of 1st harmonic of ξ(t)
IF9    Amplitude of 2nd harmonic of ξ(t)
IF10   Amplitude of 3rd harmonic of ξ(t)
IF11   Amplitude of 4th harmonic of ξ(t)

Table 4.1: List of trace transform functionals proposed in [KP01]
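A few of the Table 4.1 functionals can be sketched directly on the vector of samples ξ along a line; the discrete sums approximate the integrals, and the uniform step dt is an assumption:

```python
import numpy as np

# Discrete sketches of some Table 4.1 trace functionals, applied to the
# vector xi of image samples along a line (uniform sampling step assumed).
def if1(xi, dt=1.0):                # IF1: integral of xi(t)
    return float(np.sum(xi) * dt)

def if2(xi, q=2.0, r=0.5, dt=1.0):  # IF2: (integral of |xi|^q)^r
    return float((np.sum(np.abs(xi) ** q) * dt) ** r)

def if3(xi):                        # IF3: integral of |xi'(t)| (total variation)
    return float(np.sum(np.abs(np.diff(xi))))

def if6(xi):                        # IF6: max of xi(t)
    return float(np.max(xi))

def if7(xi):                        # IF7: IF6 - min (dynamic range)
    return float(np.max(xi) - np.min(xi))

def if8(xi):                        # IF8: amplitude of the 1st harmonic
    return float(2.0 * np.abs(np.fft.rfft(xi)[1]) / len(xi))
```

IF1 recovers the Radon transform, while IF6/IF7 are examples of functionals that discard position information along the line.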

4.2.2.2 Geometrical constraints


The result of the discrete trace transform strongly depends on the selected geometrical parameters. The three resolution parameters, denoted Δφ, Δρ and Δ(L) for the angle, the radius and the sampling rate along the line L respectively, determine the distortion and aliasing effects that will affect the final result of the trace transform.
The final resolution of the sinogram T obtained by applying the trace transform will be defined by n_φ and n_ρ, where:

n_φ = 2π / Δφ    (4.4)

n_ρ = min(X, Y) / Δρ    (4.5)

with X and Y denoting the horizontal and vertical resolutions of the image I_l. Low (n_φ, n_ρ, n_t) values will have a non-linear downsampling effect on the original image, where n_t is defined as:

n_t = 1 / Δ(L)    (4.6)

The set of points used to evaluate each functional is described (assuming (0,0) as the center of the image) by:

y = ρ / sin(φ) − x / tan(φ)    (4.7)

A singularity can be observed at φ = 0 and φ = π. For these cases it can be assumed that:

x = ρ   ∀y,  if φ = 0
x = −ρ  ∀y,  if φ = π    (4.8)

The ranges of the parameters are:

φ ∈ [0, 2π]    (4.9)

ρ ∈ [−r, r],  r = min( |X / (2 cos φ)|, |Y / (2 sin φ)| )    (4.10)

x ∈ [−X/2, X/2]  for φ ∈ [−π/4, π/4] ∪ [3π/4, 5π/4]
y ∈ [−Y/2, Y/2]  for φ ∈ [π/4, 3π/4] ∪ [5π/4, 7π/4]    (4.11)

Equation (4.7) shows a symmetrical result, since the same lines are obtained for φ ∈ [0, π] and φ ∈ [π, 2π]. However, this is only true for functionals that do not consider position (like the Radon transform). Depending on the selected functional and on the desired properties of the trace transform (e.g. rotational invariance), the ranges of φ and ρ can be modified to φ ∈ [0, π] or ρ ∈ [0, r].
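Equations (4.7) and (4.8) can be sketched as a small helper that returns the sample points of the line for a given (φ, ρ), handling the singular angles explicitly; the image-centered coordinate convention is an assumption:

```python
import numpy as np

def line_points(phi, rho, xs):
    """Points of the trace line per (4.7), with the singular cases of (4.8).

    xs are sample abscissae in image-centered coordinates; returns (x, y)
    arrays. A vertical line at x = +/-rho is returned for phi = 0 and phi = pi.
    """
    if np.isclose(phi % np.pi, 0.0):
        # singularity: the line x*cos(phi) + y*sin(phi) = rho is vertical
        x = np.full_like(xs, rho if np.isclose(phi, 0.0) else -rho)
        return x, xs
    y = rho / np.sin(phi) - xs / np.tan(phi)
    return xs, y
```

For φ = π/2 the helper yields the horizontal line y = ρ, as expected from the underlying line equation x cos φ + y sin φ = ρ.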


Figure 4.3: Trace transform contribution mask at very high resolution parameters (image resolution: 100×100 px; n_φ = 1000, n_ρ = 1000, n_t = 5000).

4.2.2.3 Quantization effects


Digital images are affected by two main effects during trace transformation:

- Some pixels might never be used by any functional, due to the geometrical setup of the transformation and to its integration nature.
- Some pixels may have a much higher cumulated contribution to the functionals than others.

In this section we analyze the effects that need to be taken into account in order to preserve the homogeneity of the results, avoiding pixels or areas with higher relevance than others. Even for very high (n_φ, n_ρ, n_t) values in relation to the original image resolution, the trace transform introduces a contribution intensity map that encodes the relevance of the different regions of the input picture. As shown in Figure 4.3, high resolution values of the trace transform parameters tend to create a convex contribution intensity map. Therefore, high parameter values do not necessarily imply an optimal representation of the image content in the trace transform.
High values of n_φ improve the rotational invariance of the trace transform (although in a manner that depends on the selected functional), while very low values of n_φ cannot be considered as producing a valid trace transform, since there is not enough angular information.

Figure 4.4: Pixel relevance in the trace transform scanning process for different parameters (n_φ, n_ρ, n_t): (a) original; (b) (64,64,15); (c) (64,64,45); (d) (64,64,185); (e) (5,300,45); (f) (5,300,151); (g) (300,5,45); (h) (300,5,151). Original image resolution = 384×256.

Ideally, the trace transform should satisfy the following constraints (considering M as the matrix that contains the number of times each pixel is used during the trace transform):

- Coverage: all pixels of the image (including those located at its corners) have to be included in at least one functional: min(M) > 0.
- Homogeneity: all pixels are used the same number of times: Var(M) = 0.
- High pixel repetition degree: each pixel has to be included in as many traces as possible (high values of mean(M)).
Table 4.2 shows some example values for coverage, homogeneity and repetition degree at different n , n , n resolutions. Note that the best ratios are
obtained for lower variations in as the angle is the main factor to increase the

45

4. PROPOSED METHOD: DITEC


Table 4.2: Quantization effects of the trace transform

n_θ   n_ρ   n_t (L)   % pixels used   Mean     Var
64    64    15        16.60           0.63     15.71
64    64    45        44.30           1.88     32.72
64    64    85        67.53           3.54     53.61
64    64    185       93.40           7.71     52.51
5     300   45        28.62           0.69     10.28
5     300   151       69.84           2.30     31.80
300   5     45        40.59           0.68     0.20
300   5     151       88.43           2.30     0.42
300   5     218       97.34           3.33     0.40
300   5     251       99.18           3.83     0.30
384   256   15        83.76           15.00    1.2·10^6
100   100   85        85.55           8.65     872.47
100   100   185       98.72           18.82    708.64
100   100   218       99.55           22.18    511.61
100   100   2185      100.00          222.27   3.6·10^6
42    75    12000     99.77           384.52   38.6·10^6

variance. The pixel repetition degree is also strongly conditioned by the angular resolution. This fact makes n_θ the main factor to balance the homogeneity
and repetition degree (e.g., low repetition degrees show weaker rotational invariance). Once n_θ is set, n_ρ can be adjusted to ensure the optimal coverage. n_t has
an almost asymptotic behavior once the other two parameters are set and can
be optimized by ensuring a minimum pixel-wise sampling. However, the different
sampling techniques (e.g., fixed sampling step or the Bresenham algorithm [Bre65])
can also introduce some distortions produced by the different number of samples for each (θ, ρ) combination. Figure 4.4 shows some cases applied to a real
image and the convex contribution intensity mask effect for different values of
n_t.
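The three constraints above can be checked empirically for a given parameter choice. The following sketch (an illustration only, not the thesis implementation: it assumes nearest-pixel rounding and a centered circular sampling window) builds the repetition matrix M and evaluates coverage, homogeneity and repetition degree:

```python
import numpy as np

def repetition_map(h, w, n_theta, n_rho, n_t):
    """Count how often each pixel is sampled by a (theta, rho, t) grid,
    using nearest-pixel rounding (a simplified stand-in sampler)."""
    M = np.zeros((h, w), dtype=np.int64)
    R = min(h, w) / 2.0
    cx, cy = w / 2.0, h / 2.0
    for theta in np.linspace(0.0, np.pi, n_theta, endpoint=False):
        d = np.array([np.cos(theta), np.sin(theta)])    # along the scan line
        nrm = np.array([-np.sin(theta), np.cos(theta)])  # rho axis
        for rho in np.linspace(-R, R, n_rho):
            for t in np.linspace(-R, R, n_t):
                x = cx + rho * nrm[0] + t * d[0]
                y = cy + rho * nrm[1] + t * d[1]
                i, j = int(round(y)), int(round(x))
                if 0 <= i < h and 0 <= j < w:
                    M[i, j] += 1
    return M

M = repetition_map(64, 64, 32, 32, 32)
coverage = (M > 0).mean() * 100   # % of pixels used (min(M) > 0 wanted)
homogeneity = M.var()             # 0 would be perfectly homogeneous
repetition = M.mean()             # mean pixel repetition degree
```

Running the sketch for different (n_θ, n_ρ, n_t) triples reproduces the qualitative trends of Table 4.2: the angular resolution dominates the repetition degree, while the radial resolution mainly drives coverage.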


4.2.3 Feature extraction


The diametric and circus transforms have been used in the literature [PK04,
KP06, BB08, BO07] mostly to reduce the dimensionality of the descriptors. However, even if
this approach provides good results for similarity search or image hashing, the
diametric and circus transforms do not preserve the information. In fact, there is
no inverse transform for these two operators.
In order to characterize the sinograms obtained from the trace transform
(T), we propose a frequency analysis of the obtained signal and a representation based on statistical descriptors of the frequency distribution. To do this, the
Discrete Fourier Transform (DFT) or the Discrete Cosine Transform (DCT) can be applied. The DCT [ANR74], which has become one of the most popular transforms
for audio and image coding, has two main properties that make it more suitable
than the DFT for the feature extraction process: energy compaction and decorrelation
[Ric02, BYR10]. Energy compaction means that the signal energy is accumulated in a small number of coefficients and that these coefficients are typically the
lowest coefficients of the DCT. Taking into account that the trace transform does not introduce high frequencies into the transformed image, the DCT
provides a good method to efficiently represent the wave-like signal information
contained in the resulting images. The decorrelation property of the DCT implies
that there is a very low interdependency among the coefficients. This property
matches the common needs of a number of data mining algorithms whose
performance depends strongly on input attribute correlation. Moreover,
the coefficients obtained by applying a DCT are real values, while the DFT provides coefficients in the complex domain. The DCT thus allows us to encode
information in lower dimensionality code spaces with better compaction characteristics. Finally, from the computational cost point of view, there are efficient
DCT implementations that make it suitable for real-time applications without
high computing performance requirements.
The 2D forward DCT is given by:

NX
1 1 NX
2 1

k 1 (2n 1 + 1)
k 2 (2n 2 + 1)
X k1 k2 = k1 k2
x n1 n2 cos
cos
2N1
2N2
n 1 =0 n 2 =0

47

(4.12)

4. PROPOSED METHOD: DITEC


where:

Ni
s

Ni

ki = 0
(4.13)
k i 6= 0
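Equations (4.12)-(4.13) can be implemented directly as a sanity check; with the orthonormal α(k) factors the transform preserves energy, and for a smooth, wave-like input (the kind of signal the trace transform produces) the energy concentrates in the low-frequency corner. A minimal numpy sketch, for illustration only:

```python
import numpy as np

def dct2(x):
    """Direct orthonormal 2D DCT-II following equations (4.12)-(4.13).
    Quadratic per coefficient: for illustration; use a fast library in practice."""
    N1, N2 = x.shape
    X = np.zeros((N1, N2))
    for k1 in range(N1):
        a1 = np.sqrt((1.0 if k1 == 0 else 2.0) / N1)
        c1 = np.cos(np.pi * k1 * (2 * np.arange(N1) + 1) / (2 * N1))
        for k2 in range(N2):
            a2 = np.sqrt((1.0 if k2 == 0 else 2.0) / N2)
            c2 = np.cos(np.pi * k2 * (2 * np.arange(N2) + 1) / (2 * N2))
            X[k1, k2] = a1 * a2 * (c1 @ x @ c2)
    return X

# A smooth, slowly varying image, similar in spirit to a sinogram.
n = 16
g = np.cos(np.pi * np.arange(n) / n)   # one slow half-cycle
img = 1.0 + 0.5 * np.outer(g, g)
X = dct2(img)
low_energy_share = (X[:4, :4] ** 2).sum() / (X ** 2).sum()
```

For this input nearly all the energy falls in the 4x4 lowest-frequency block, which is the compaction property exploited by the feature extraction step.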

4.2.3.1 Statistical descriptors


As a consequence of the properties of the DCT and of the nature of the 2D signals
resulting from the trace transform, the 2D DCT stores more energy in its lower
frequencies.

Figure 4.5: Trace transform and subsequent Discrete Cosine Transform of Lenna (Y channel of the YCbCr color space): (a) original; (b) trace transform (F3); (c) 2D DCT with 6-level quantization.

Figure 4.5 shows the process of trace transform evaluation and its 2D DCT,
where the intensity is quantized into 6 different levels. The functional used is the
one enumerated by Kadyrov et al. [KP01] as invariant functional IF2 (4.14):

T_IF2 = ∫ |ξ(t)| dt   (4.14)

This functional has invariance properties for independent variable and function
scaling (4.15):

Ξ(ξ(a·x)) = λ(a) · Ξ(ξ(x)),  a > 0
Ξ(c·ξ(x)) = μ(c) · Ξ(ξ(x)),  c > 0   (4.15)

where:

λ(a) = a^(r_λ) and μ(c) = c^(r_μ)   (4.16)


In the particular case of the IF2 functional, the invariance relation is determined by
the exponents r_λ and r_μ. Experimental tests demonstrate that the best performance is
obtained with r_λ = 0.5 and r_μ = 2. These values match those proposed by Kadyrov et
al. [KP06].
In order to reduce the dimensionality of the obtained coefficients, the first n
lines orthogonal to the main diagonal of the transformed signal T are
statistically characterized (Figure 4.6). The coefficients along each such line correspond to
similar frequency bands, can be computed very efficiently and provide a high
dimensionality reduction ratio.
Figure 4.6: Conceptual scheme: transformation of the DCT coefficient matrix (a_11 ... a_mn) into a vector of (μ, k) pairs, one pair per band of anti-diagonal coefficients.

To study these statistical properties, over 50,000 sample vectors have been analyzed using the 1,000 sample images of the Corel 1000 dataset (described in Section
4.3.1). The analysis of the obtained histograms shows strong leptokurtic distributions for all samples. Equation (4.17) defines the kurtosis of a distribution, which
is expressed by (4.18) for a discrete set of elements. A distribution is considered leptokurtic when k > 3. For all analyzed distributions, the minimum kurtosis
value has been greater than 30. More detailed statistical properties are shown in
Figure 4.7.
k = E[(x − μ)^4] / σ^4   (4.17)

k = [ (1/n) · Σ_{i=1..n} (x_i − x̄)^4 ] / [ (1/n) · Σ_{i=1..n} (x_i − x̄)^2 ]^2   (4.18)
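Equation (4.18) translates directly into code. A minimal numpy sketch (illustrative, not the thesis implementation) showing why spike-like band vectors score as leptokurtic:

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis as in equation (4.18): normalized fourth central moment."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2

# Spike-like data (most mass near zero, rare large values) is strongly
# leptokurtic (k > 3), the behavior observed for the DCT band vectors.
spiky = np.array([0.0] * 99 + [10.0])
flat = np.array([-1.0, 1.0] * 50)   # two-point symmetric data has k = 1
```

Here `kurtosis(spiky)` is far above the Gaussian reference value of 3, while the two-point vector gives exactly 1 (platykurtic).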

Figure 4.7: Statistical properties (minimum, maximum, mean, standard deviation) of all kurtosis measurements made on the distributions obtained by processing the Corel 1000 dataset.

Given the leptokurtic nature of the obtained distributions, the list of values can be represented by the mean value and the kurtosis of each vector. The
(μ, k) pair of descriptors of the first element (corresponding to the DC value of
the DCT) is substituted by the mean and variance of the original image in HSV
space. Considering that the mean and kurtosis values encode the information of
coefficients corresponding to approximately similar frequencies, the
dimensionality of the transformed (μ, k) pairs is given by (4.19).

nDims = sqrt(n_θ² + n_ρ²) · n_c · n_f   (4.19)

where n_c is the number of channels of the original image and n_f the number
of features extracted from each vector (2 in the case of using [μ, k]). Thus, the
dimensionality reduction is given by (4.20):

r_f = (n_θ · n_ρ) / ( sqrt(n_θ² + n_ρ²) · n_f )   (4.20)

For square resolutions and considering n_f = 2, the reduction factor increases
linearly with the resolution (4.21):

r_f = n² / ( sqrt(2) · n · n_f ) = n / (2 · sqrt(2))   (4.21)
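Plugging in the Corel 1000 settings used later (n_θ = n_ρ = 71, n_c = 3, n_f = 2) reproduces the 606 attributes and the roughly 25x reduction reported in Section 4.3.1. A small sketch of (4.19)-(4.20); rounding the band count up to the next integer is an assumption made here so the figures match:

```python
import math

def n_dims(n_theta, n_rho, n_c, n_f):
    """Descriptor length following equation (4.19): one band per anti-diagonal,
    n_f features per band and channel (ceiling assumed for a whole band count)."""
    return math.ceil(math.sqrt(n_theta ** 2 + n_rho ** 2)) * n_c * n_f

def reduction_factor(n_theta, n_rho, n_f):
    """Dimensionality reduction ratio of equation (4.20)."""
    return n_theta * n_rho / (math.sqrt(n_theta ** 2 + n_rho ** 2) * n_f)

dims = n_dims(71, 71, 3, 2)        # Corel 1000 settings of section 4.3.1
rf = reduction_factor(71, 71, 2)   # roughly 25x per channel
```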


4.2.3.2 Cauchy Distribution
Due to the highly leptokurtic characteristics of the data distribution, the
Kolmogorov-Smirnov test [KSt] applied to the obtained data shows a better fit
with the Cauchy distribution (4.22) than with a Gaussian model. Therefore, the Cauchy
distribution parameters can be used as an alternative to the mean value and kurtosis.

f(x; x_0, γ) = 1 / [ π · γ · (1 + ((x − x_0) / γ)²) ]   (4.22)

where x_0 is known as the location parameter and is equal to the median, and γ
represents the scale parameter. Moreover, γ is equal to half of the interquartile
range.
Experimental results have demonstrated that the median is not a representative value of the distribution for short vectors. Therefore, the Hodges-Lehmann
estimator [HJL63] (Equation 4.23) has been introduced instead of the median
value:

hl(X) = median( (x_i + x_j) / 2 ),  1 ≤ i < j ≤ n   (4.23)
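Equation (4.23) is straightforward to implement. A minimal sketch (O(n²) in the vector length, which is adequate for the short vectors discussed here):

```python
import statistics

def hodges_lehmann(xs):
    """Hodges-Lehmann estimator of equation (4.23): the median of all
    pairwise averages (x_i + x_j)/2 with 1 <= i < j <= n."""
    pairs = [(xs[i] + xs[j]) / 2.0
             for i in range(len(xs))
             for j in range(i + 1, len(xs))]
    return statistics.median(pairs)
```

For example, `hodges_lehmann([1, 2, 3])` returns the median of the pairwise means (1.5, 2.0, 2.5), i.e. 2.0.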

In general, the experiments show similar results for the [mean value, kurtosis] and
[hl(X), iqr/2] value pairs, while in some cases the assumption of the Cauchy
distribution behaves more robustly for different combinations of the resolution
parameters (n_θ, n_ρ, n_t).

4.2.4 Classification
After the feature extraction process explained in the previous section, a set of
descriptors E is obtained. The dimensionality of E can be reduced by attribute
selection strategies in order to improve the efficiency of subsequent classification
steps.

4.2.4.1 Feature Subset Selection in Machine Learning


Considering machine learning as a set of techniques to discover and extract
knowledge in an automated way [Mit97], the basic problem is concerned with
the induction of a model which classifies a given object into one of several known
classes. In order to induce the classification model, each element E i described
by a pattern of d features is simplified by applying the Feature Subset Selection


Figure 4.8: Examples of probability density distributions and histograms obtained from the samples.

(FSS) [LM98] approach. FSS can be formulated as follows: given a set of candidate features, select the best subset for a classification problem. In our case, the
best subset will be the one with the best predictive accuracy.
Most of the supervised learning algorithms perform rather poorly when faced
with many irrelevant or redundant (depending on the specific characteristics of
the classifier) features. The FSS method thus proposes additional mechanisms to reduce the number of features so as to improve the performance of the
supervised classification algorithm.
There are two main approaches to tackle the Feature Subset Selection (FSS)
problem from the machine learning point of view, namely wrapper and filter
methods [ILRE00].
Wrapper approaches [BLIS04] try to identify the subset of variables that, given
a classification paradigm and a dataset, provides the best classification function.
The process consists of searching an optimal feature subspace based on a performance measure (typically the accuracy, though other measures can be used).
Each subset is evaluated by testing the performance of the chosen paradigm in


the dataset, using only the variables in the subset for evaluation. The estimation
of the performance of the classifiers requires a validation scheme, such as cross-validation
or bootstrap estimation. As a result, the evaluation of each subset involves the training and testing of several classification functions, increasing the
computational time required for the FSS process.
The filter approaches search for the best variable subset independently of
the classification paradigm, considering the relationship between the predicting
variables and the class, and occasionally the relationships among the predicting
variables. One of the simplest approaches consists of ranking the variables according to their usefulness and selecting only those at the top of the ranking. The
usefulness of a variable is measured univariately by means of different metrics.
Once the features are ranked, a threshold must be set to obtain the final subset. The ranking methods are only concerned with the relevance of the features
considered and, thus, they do not filter out redundant variables.
In our study, Bayesian Networks [SLJI09] and Support Vector Machines (SVM)
[MLH03] have been used to perform the supervised classification during the FSS
process.
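The filter idea can be illustrated with a simple univariate ranking. The sketch below uses a between-class over within-class variance score as the ranking metric; this particular metric is one possible choice for illustration, not the one prescribed by the experiments:

```python
import numpy as np

def fisher_scores(X, y):
    """Univariate filter metric: between-class variance over within-class
    variance, computed independently for each feature."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2
                  for c in classes)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes)
    return between / (within + 1e-12)

# Synthetic check: only feature 2 carries class information.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3.0 * y
ranking = np.argsort(fisher_scores(X, y))[::-1]
```

The informative feature ends up at the top of the ranking; a threshold on the ranked list then yields the final subset. Note that, as discussed above, such a ranking does not remove features that are redundant with each other.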

4.2.4.2 Attribute contribution analysis


In order to estimate the attribute contributions to the classification task,
different approaches can be taken. On the one hand, the attributes used could be
considered as specifically constructed for the classification problem at hand,
leading to a study similar to that of Kumar et al. [KBBN11]. On the other
hand, the individual contribution of each attribute could be analyzed, using the
same classifier and looking at the accuracies obtained [DLS11].
The study of Kumar et al. [KBBN11] is based on the analysis of the behavior
of binary classifiers and therefore can hardly be adapted to multi-class problems,
although it is possible to perform a similar study based on a pair-wise division of
the classification task [DF02].
To do so, the accuracy of each individual attribute is computed first. Next, a greedy experiment is performed to
select the best attribute at each step, continuing to add features until no
further improvement is reached.
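The greedy procedure above can be sketched as a forward-selection loop. The nearest-centroid classifier used here is a cheap stand-in for the classifiers actually wrapped in the experiments, and resubstitution accuracy replaces the cross-validated estimate, purely to keep the illustration short:

```python
import numpy as np

def nc_accuracy(X, y):
    """Resubstitution accuracy of a nearest-centroid classifier."""
    classes = np.unique(y)
    cents = np.array([X[y == c].mean(axis=0) for c in classes])
    d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return (classes[d.argmin(axis=1)] == y).mean()

def greedy_fss(X, y):
    """Forward selection: add the best attribute at each step and stop as
    soon as the wrapped accuracy no longer improves."""
    selected, best = [], 0.0
    while len(selected) < X.shape[1]:
        gains = {j: nc_accuracy(X[:, selected + [j]], y)
                 for j in range(X.shape[1]) if j not in selected}
        j, acc = max(gains.items(), key=lambda kv: kv[1])
        if acc <= best:
            break
        selected.append(j)
        best = acc
    return selected, best

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 4))
X[:, 1] += 4.0 * y                 # only feature 1 is informative
selected, best = greedy_fss(X, y)
```

On this synthetic data the loop picks the informative attribute in the first step and stops once adding noise features no longer helps, mirroring the stopping criterion described above.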


4.3 Experimental results


The presented method has been tested with two different datasets. The first
(Corel 1000 [Cor]) is a standard dataset that allows comparison of the obtained
validation data with other methods in the literature. The second
(earth observation data) is used to show the potential of the proposed
method under diverse conditions. In both cases, the obtained feature space is
analyzed by establishing metrics that help predict the behavior of the applied
machine learning techniques. A 10-fold cross-validation has been used in
both cases to split the data into training and testing sets.

4.3.1 Case study 1: Corel 1000 dataset


The Corel 1000 dataset is composed of 1,000 images distributed in 10 classes (100
instances per class). The class tags are: Africans, Beach, Architecture,
Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains and Food. Figure 4.9
shows one sample per class. Even though the classes are semantically separated,
visual similarities may be found among some of them. For example, people and
trees can be found in the Africans, Beach and Mountains categories.

Figure 4.9: Samples of the Corel 1000 dataset, one per class: (a) Africans, (b) Beach, (c) Architecture, (d) Buses, (e) Dinosaurs, (f) Elephants, (g) Flowers, (h) Horses, (i) Mountains, (j) Food. The dataset includes 256x384 or 384x256 images.

The following parameters have been selected: n_θ = 71, n_ρ = 71, n_c = 3, n_f = 2.
This choice results in 15,123 trace transform coefficients per image. By obtaining
the mean and kurtosis values as described in the previous section, the number of
attributes is reduced to 606 (by a factor of 25).



Based on the fact that the DCT gathers signal energy in the lower frequencies (see Figure 4.5c), the highest coefficients are removed. Moreover, it can be
assumed that the chrominance channels (Cb and Cr) contain less visual information [Poy03, Poy96], and therefore more coefficients can be removed from these
channels than from the luminance signal (Y). Experimental results carried out
with different combinations of YCbCr coefficients demonstrate that luminance-related
attributes have more relevance than chrominance-related ones. The selected parameters for this example result in 202 attributes per channel. After an
exhaustive experimental test with different numbers of attributes per channel, we
select the first 104 for Y and 60 for each of Cb and Cr, thus reducing the total
number of attributes to 224.
The best performance has been obtained by applying an SVM classifier (accuracy = 84.8% in a 10-fold cross-validation test). 117 attributes have been selected for the final
feature space by applying FSS. The information provided by the confusion matrix
(Table 4.3) can be represented graphically in order to show the qualitative
behavior of the method. We have selected the Force Atlas 2 algorithm [JHVB11]
to distribute the classes on a 2D plane. Force Atlas 2 establishes a force-directed
layout simulating a physical system where nodes (classes) repulse each other and
edges apply an attraction force. For the method presented, the repulsion force is
adjusted to scale the layout to a convenient size, while edge forces are derived
from the error information stored in the confusion matrix. Thus, the attraction force
between two nodes is proportional to their mutual misclassifications.
For the Corel 1000 dataset, it can be observed in Figure 4.10 that Dinosaurs,
Flowers and Horses are clearly separated from the rest of the categories. This result can also be verified via the precision and recall data: precision is above 94%
and there are very few instances of other classes estimated as Dinosaurs, Flowers
or Horses.
A deeper analysis of the class distribution can be performed by removing the aforementioned three categories. Figure 4.11 shows that there is one group formed by
Beach, Mountains and Architecture, and another formed by Africans, which links to Elephants
and Food, although these two are not directly connected.
Figure 4.12 shows an example of one of the classification errors. As can be
seen, the presence of vegetation and trees associates the image with the Mountains
class even though it belongs to Architecture. These semantic overlaps of the Corel 1000
categories put some visually similar images in different classes.



Table 4.3: Corel 1000 classification results derived from the confusion matrix (ground truth in rows, predicted labels in columns; the Correct column is the matrix diagonal). Labels correspond to the assignment in Figure 4.9. F-Measure is the harmonic mean F = 2 · precision · recall / (precision + recall).

Class          Correct   Precision   Recall   F-Measure
Africans       75        0.750       0.75     0.750
Beach          79        0.752       0.79     0.771
Architecture   78        0.772       0.78     0.776
Buses          81        0.900       0.81     0.853
Dinosaurs      100       0.980       1.00     0.990
Elephants      83        0.806       0.83     0.818
Flowers        95        0.941       0.95     0.945
Horses         97        0.942       0.97     0.956
Mountains      78        0.813       0.78     0.796
Food           82        0.828       0.82     0.824
Average                  0.848       0.848    0.848

Figure 4.10: Distance among classes in the Corel 1000 dataset according to misclassified instances.

Comparing the obtained results with other feature extraction approaches
(Mean-Shift and Gaussian Mixtures based on Weighted Color Histograms [BH11],
Reduced Feature Vector with Relevance Feedback [ZKRR08] and SIFT-based Gaussian Naïve Bayesian Network [BKB10]), DITEC shows the best performance for
most categories (Figure 4.13) and the highest mean precision value. Other performance parameters (such as recall and F-Measure) have not been compared since
they are not reported in the papers describing the other methods.

Figure 4.11: Distance among the most inter-related classes in the Corel 1000 dataset according to misclassified instances.

4.3.2 Case study 2: Geoeye satellite imagery

The Geoeye [Glo] dataset is composed of 1,003 multi-resolution patches of Digital Globe Earth observation satellite imagery with up to 1 m spatial resolution.
The dataset is categorized into 7 classes corresponding to different geographical
locations (Figure 4.14). All the resolutions have been processed with the same
trace transform parameters.


Figure 4.12: Corel 1000 picture corresponding to the class Architecture and classified as Mountains.

Figure 4.13: Corel 1000 precision results (%) per class (Africa, Beach, Architecture, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountain, Food, Average) with different feature extraction algorithms. WHMSGM: Mean-Shift and Gaussian Mixtures based on Weighted Color Histograms; FVR: Reduced Feature Vector with Relevance Feedback; Gaussian NBN: SIFT-based Gaussian Naïve Bayesian Network.

During the data mining process, Bayesian networks provide the best performance, reaching an accuracy of 94.51% in a 10-fold cross-validation test. The final dimensionality
of the feature space has been reduced to 61 attributes. Table 4.4 shows the confusion matrix of the classification results.
Applying the Force Atlas 2 method to the Geoeye classification errors, we obtain the distribution shown in Figure 4.15. It can be observed that Risalpur and
Rome (two cities) are the categories with the highest mutual similarity. The Davis-


Figure 4.14: Samples of the satellite footage dataset (256x256 px patches at different scales): (a) Athens, (b) Davis, (c) Manama, (d) Midway, (e) Nyragongo, (f) Risalpur, (g) Rome.

Table 4.4: Geoeye classification results derived from the confusion matrix (ground truth in rows, predicted labels in columns; the Correct column is the matrix diagonal). Labels correspond to the assignment in Figure 4.14. F-Measure is the harmonic mean F = 2 · precision · recall / (precision + recall).

Class           Correct   Precision   Recall   F-Measure
(a) Athens      74        0.961       0.961    0.961
(b) Davis       183       1.000       0.943    0.971
(c) Manama      193       0.970       0.995    0.982
(d) Midway      62        1.000       0.954    0.976
(e) Nyragongo   77        0.939       0.906    0.922
(f) Risalpur    177       0.898       0.912    0.905
(g) Rome        182       0.897       0.938    0.917
Average                   0.946       0.945    0.945

Monthan aircraft boneyard has shown a remarkable similarity with Risalpur due
to the fact that wide areas of bare soil are a common element in both Risalpur



Figure 4.15: Distance among classes in the Geoeye dataset (Midway, Athens, Manama, Nyragongo, Rome, Risalpur, Davis) according to misclassified instances.

and Davis.
The Midway atoll is the most distinguishable category of the Geoeye dataset.
It contains particular colors, textures and shapes that make it singular within the
dataset. All these characteristics have been successfully detected by the method
(Precision = 100%, Recall = 95.4%).

4.4 Computational complexity


From the complexity point of view, there are two critical steps in the DITEC
pipeline: the calculation of the trace transform and the machine learning process.
The computational requirements of the remaining steps (pre-processing, DCT,
(μ, k) extraction) are not significant compared with the trace transform or
with the different techniques employed for attribute selection, training and
classification.

4.4.1 Computational complexity of the trace transform


The computational complexity of the trace transform has been previously analyzed in the literature [KP01, Fah06]. In general terms, the complexity of the trace transform
is given by the pixel extraction process, determined by (n_θ, n_ρ, n_t), and
the computation of the N_T trace functionals. Thus, the number of operations
needed is n_θ · n_ρ · n_t · N_T.
From an experimental point of view, different approaches have been tested to
obtain the pixel values:
Sequential extraction of (θ, ρ) using bilinear interpolation
Sequential extraction of (θ, ρ) using the nearest pixel value
Sequential extraction of (θ, ρ) using the Bresenham algorithm (OpenCV
implementation)
Image rotation for block-wise data access
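The sequential extraction scheme can be sketched compactly. The code below is an illustration, not the thesis implementation: it uses nearest-pixel sampling and a simple sum-of-absolute-values functional as a discrete stand-in for an integral-type functional, and it performs exactly n_θ · n_ρ · n_t pixel lookups, matching the operation count given above:

```python
import numpy as np

def trace_transform(img, n_theta, n_rho, n_t):
    """Minimal trace transform: sequential (theta, rho) extraction with
    nearest-pixel sampling; the functional is sum(|xi(t)|) over each line."""
    h, w = img.shape
    cx, cy = w / 2.0, h / 2.0
    R = min(h, w) / 2.0
    T = np.zeros((n_theta, n_rho))
    t = np.linspace(-R, R, n_t)
    for a, theta in enumerate(np.linspace(0.0, np.pi, n_theta, endpoint=False)):
        d = np.array([np.cos(theta), np.sin(theta)])     # along the scan line
        nrm = np.array([-np.sin(theta), np.cos(theta)])  # rho axis
        for r, rho in enumerate(np.linspace(-R, R, n_rho)):
            xs = np.rint(cx + rho * nrm[0] + t * d[0]).astype(int)
            ys = np.rint(cy + rho * nrm[1] + t * d[1]).astype(int)
            ok = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
            T[a, r] = np.abs(img[ys[ok], xs[ok]]).sum()  # functional over the line
    return T

T = trace_transform(np.ones((32, 32)), 8, 8, 16)  # n_theta * n_rho * n_t lookups
```

Because the sampling grid is fixed by the parameters, the cost is independent of the input resolution, which is consistent with the near-constant timings reported below.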
The analyzed datasets have been processed using sequential extraction of
(θ, ρ) and bilinear interpolation. In order to give an idea of the time spent in a
real execution, performance tests have been carried out using a computer with
an Intel Core i7-740QM processor at 1.73 GHz and 8 GB RAM. For Corel 1000, the
average time to perform the trace transform of each image is 42 ms with n_θ = 71,
n_ρ = 71 and n_t(L) = 251. There is very low variance in this processing time since,
once the parameters are set, this mode of performing the trace transform does
not depend on the input image resolution. In the case of Geoeye, the average time



is 21 ms and the parameters employed are n_θ = 51, n_ρ = 71 and n_t(L) = 151. Figure 4.16 shows the behavior of the trace transform implementation for different
values of (n_θ, n_ρ, n_t(L)).

Figure 4.16: Time performance (ms) of the trace transform depending on the applied sampling parameters: (a) as a function of n_t(L), the number of samples per functional vector; (b) as a function of n_θ = n_ρ, the angular and radial resolution (both with the same value). Experiments carried out on a computer with an Intel Core i7-740QM processor at 1.73 GHz and 8 GB RAM.


4.4.2 Computational complexity of attribute selection and classification


Attribute selection is the most computationally expensive task of the process. The
training time for the Corel 1000 and Geoeye datasets is under one second (609 ms on average for Corel 1000 and 370 ms for Geoeye) and the classification time per sample
is around 1 s.
The attribute selection task can last for some hours and, as part of the training
process, this fact can lead to strong limitations for some specific uses. Therefore,
it can be cut down (as an iterative optimization process, it can be interrupted at any
point) or even removed. Experimental tests have demonstrated that the attribute
selection process can improve the final accuracy by up to 10%.

4.4.3 Scalability
Each of the datasets used during the validation process contains around 1,000
items. Scaling the presented framework to larger datasets can be achieved by parallelizing the most critical part of the process, the trace
transform. In fact, the calculation of each functional can be executed independently, as demonstrated by Meena et al. [MPL11], where an FPGA implementation
of the trace transform operator obtained a throughput of 2,725 images
per second.

4.5 Conclusion of the presented method


We have shown that the DITEC method provides highly discriminant features
for context categorization purposes that can be encoded as considerably short
feature vectors. We have presented the geometrical constraints of the trace transform, which can be optimized to efficiently represent the information contained in
the original images. We have also demonstrated that the dimensionality reduction, in terms of mean and kurtosis value pairs of frequency coefficients, results in a
very robust set of features in terms of precision. For most (n_θ, n_ρ, n_t(L)) resolution
settings maintaining acceptable coverage, homogeneity and redundancy conditions, the accuracy has remained around 82% for the Corel 1000 dataset and
92% for Geoeye.
Moreover, the method has successfully identified visual similarities within
the datasets and, as shown in the validation section, some incorrectly classified
instances are in fact visually similar to those pointed out by the classifier. The
error analysis has also shown some semantic proximity between visually similar
categories, a fact that can be used for context modeling and automatic ontology
building.

4.6 Modified DITEC as local descriptor


The good results of DITEC when applied globally have shown that it captures information inherent to the content. Therefore, the suitability of the DITEC method as
a local feature will be discussed in this section. To adapt the DITEC method to
fit into a local feature extraction framework, some modifications have to be made
both in the global method and in the inner part of the algorithm. While the goal
of DITEC as a global descriptor is to keep the semantic information and present
it in an adequate feature space, local features must act as unique descriptors
that are robust to geometric and photometric transformations.
The DITEC method uses a supervised classification process for domain classification. This process relies on training the domain with different content
examples, which in turn build the implicit definition of the class in the feature space.
However, many applications of local features (tracking, stereo disparity calculation, stitching, etc.) require real-time pairing where no prior data are available
and therefore the data cannot be trained. This implies the definition of distance metrics
and the creation of feature vectors that are suitable for these metrics.
In order to fulfill these new constraints, a new framework has been defined
for the local version of DITEC (Figure 4.17).

4.7 Implementation of DITEC as local descriptor


Since DITEC does not perform feature point detection, this task is carried out by
using other state-of-the-art detectors, such as the detection of the extrema of a Laplacian-based function applied in a scale-space framework, similar to SIFT [Low99]
or BRISK [LCS11]. While the scale estimation is used to improve the robustness
of the descriptor, its high rotational invariance means that no prior angular
estimation is required.
Once the interest point has been identified, a patch around it is extracted according to the scale estimation provided by the detector. We propose to use an


Figure 4.17: System workflow for DITEC as a local feature. For both the reference and the candidate image: load image, rgb2gray, interest point extraction, circular patch normalization, trace transform (with parameters n_θ, n_ρ, n_t(L)), row-wise DFT, magnitude of the first n coefficients; the resulting descriptors are then compared through distance metrics for pairing.

approach similar to SIFT [Low99] for scale normalization. After scale normalization, we proceed to a histogram equalization in order to normalize the dynamic
range of the patch. This normalization improves the robustness of the descriptor against light intensity photometric transformations. It is worth mentioning
that many functionals, such as the integral functional used in the Radon transform, are
not invariant to exposure or light intensity photometric transformations.
Once the patch is normalized, the trace transform is applied within a circular
area contained in the obtained rectangular patch. This process improves the
rotational invariance, since no new elements are introduced or lost in the area
when the image is rotated. The use of a circular patch is also applied in other similar approaches such as ORB [RRKB11], where a mask is used for central moments



computation for dominant orientation estimation. It is important to note that
masking a circular patch before performing the trace transform only works
for those functionals that have a neutral or identity element defined. For the
rest of the functionals, the trace transform scanning area must be limited to a circle. Thus, the clipping points of the scan line have to be calculated by solving
the system defined by the maximum circle contained in the patch and
the equation of the scan line. Equation system (4.24) has to be solved in order to
extract the pair of clipping points corresponding to each (n_θ, n_ρ) position:

C(x, y)_(θ,ρ):  x² + y² = R²,  x·cos θ + y·sin θ = ρ,  with R = (1/2) · min(height(I), width(I))   (4.24)

Solving the previous equation system for x and y (demonstration in Appendix
B), we obtain the solution depicted by Equation (4.25):

C(θ, ρ) = ρ · (cos θ, sin θ) ± sqrt(R² − ρ²) · (−sin θ, cos θ)   (4.25)

The obtained clipping points are scanned in the range θ ∈ [0, 2π], ρ ∈ (−R, R).
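With the scan line written in normal form, the two clipping points follow in closed form. A short sketch (equivalent to (4.25) up to the chosen line parametrization):

```python
import numpy as np

def clipping_points(theta, rho, R):
    """Intersections of the scan line x*cos(theta) + y*sin(theta) = rho with
    the circle x^2 + y^2 = R^2, following equations (4.24)-(4.25)."""
    if abs(rho) > R:
        return None                      # the line misses the circular patch
    half = np.sqrt(R ** 2 - rho ** 2)    # half-chord length
    base = rho * np.array([np.cos(theta), np.sin(theta)])   # foot of the normal
    along = np.array([-np.sin(theta), np.cos(theta)])       # direction of the line
    return base - half * along, base + half * along

p0, p1 = clipping_points(0.7, 3.0, 10.0)
```

Both returned points lie on the circle of radius R and satisfy the line equation, so the scanline sampler can iterate strictly between them.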

4.7.1 Trace Transformation

As observed for the global descriptor, the core of the process resides in
the trace transformation. The trace transformation is formed by the computation
of all trace projections, corresponding to the number of samples of the parameter
space for θ and ρ. This process is carried out in a similar way to the global case
explained in Section 4.2.2.

4.7.2 Feature Extraction

The final stage of the DITEC descriptor computation is the extraction of the features
that will finally represent the descriptor. The reduced size of the patches obtained
from the original images, compared with those used in global feature extraction,
decreases the statistical significance of the previously used descriptors (mean
value, kurtosis, iqr/2, median and the Hodges-Lehmann estimator). Moreover, the dimensionality reduction is not a critical factor in this case, as the number of elements
in the resulting trace transform matrix is much lower.



Therefore, a new approach has been taken in order to build the local descriptor.
We propose to use the one-dimensional Discrete Fourier Transform applied to
the rows of the trace transform image T, as described in Equation 4.26, where
F(n) represents the frequency-domain transform of the time-domain or spatial-domain signal T.
    F(n) = Σ_{k=0}^{N−1} T(k) · e^(−i·2π·k·n / N)        (4.26)

As the result of the DFT belongs to the complex space ℂ, the obtained result
is split into its phase and magnitude components. The horizontal representation
of the phase contains the information related to the orientation of the image;
thus, the magnitude is normalized with respect to rotation. This phase-normalized
signal characterization is the set of coefficients that composes the descriptor.
Finally, the descriptor is constructed by taking the first n magnitude coefficients
of the DFT of each of the trace transform's rows. To have more control over
the length of the final descriptor and the number of coefficients taken in each
row, the trace transform image T can be vertically down-sampled, creating
bands instead of single rows. By avoiding the inclusion of the DC value of the
DFT (first element), we obtain a stronger luminosity invariance.
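The row-wise DFT feature extraction described above can be sketched as follows. This is a minimal interpretation with assumed names; with a 16 x 16 trace matrix and 8 magnitudes per row it yields a 128-dimensional descriptor, matching the dimensionality used in the experiments:

```python
import numpy as np

def local_ditec_descriptor(T, n_coeffs=8):
    """Build the local descriptor from the trace-transform matrix T (sketch).

    Each row of T is transformed with a 1-D DFT; only the magnitudes are
    kept (the phase carries the orientation, so dropping it normalizes with
    respect to rotation), and the DC term is skipped for luminosity
    invariance. The first n_coeffs magnitudes per row are concatenated.
    """
    spectra = np.fft.rfft(T, axis=1)            # per-row DFT
    mags = np.abs(spectra)[:, 1:1 + n_coeffs]   # drop DC, keep n_coeffs magnitudes
    return mags.ravel()
```

Because DFT magnitudes are invariant to circular shifts along the transformed axis, and the DC bin absorbs constant offsets, the resulting vector is unchanged by cyclic row shifts and by global brightness offsets.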

4.7.3 DITEC parameters


The main parameters of the local DITEC method are:

Number of φ samples: number of angular samples obtained from a given
image when computing the trace transform.

Number of ρ samples: number of radial samples obtained from a given
image when computing the trace transform.

Scanline sampling strategy: method for obtaining the pixel values that
correspond to each scanline between the computed clipping points.

Size of image patches: square size of the local image patches extracted
around detected interest points.

Dimensionality of the DITEC local descriptor: number of features extracted
from the DFT of the trace transform.

Functional: the mathematical operation performed along the scanline.


4. PROPOSED METHOD: DITEC


While the patch size and descriptor length can be considered external factors
(parameters that do not directly depend on the DITEC process), the angular
and radial sampling, the scanline sampling and the functional are the core factors
for the success of the method. The accuracy and the computational cost directly
depend on these parameters.
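For illustration, these parameters can be bundled into a single configuration object. The field names and defaults below are assumptions taken from the experimental settings reported later, not part of the DITEC specification:

```python
from dataclasses import dataclass

@dataclass
class LocalDitecParams:
    """Bundles the local-DITEC parameters listed above (names are illustrative)."""
    n_phi: int = 16                     # angular samples of the trace transform
    n_rho: int = 16                     # radial samples of the trace transform
    scanline_sampling: str = "nearest"  # how pixel values are read along a scan line
    patch_size: int = 20                # square size of the patch around an interest point
    descriptor_dim: int = 128           # features kept from the DFT of the trace transform
    functional: str = "sum"             # operation applied along each scan line
```

Grouping them this way makes it easy to sweep the core factors (sampling, functional) while holding the external ones (patch size, descriptor length) fixed.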

4.7.3.1 Angular and radial sampling


From a mathematical point of view, in order to ensure a symmetric trace
transform, the angular increment has to be an exact divisor of π. Otherwise,
the second half of the angular scanning (which is radially symmetric with the
first half) does not start at π but slightly later, introducing an asymmetry in
the final result of the trace transform. However, since we only consider the
rows of the trace transform, this symmetry is not a constraint for the accuracy
of the method. Experimental results prove that good performance can be
obtained with 15 to 20 angular samples.
[Figure: matching accuracy (% correct matches) as a function of the number of φ samples, from 10 to 50]

Figure 4.18: Matching accuracy depending on the number of φ samples. Experiments have been performed on 6 different image sub-datasets of 5 to 10 images each, covering different image sizes as well as geometric and photometric transformations, such as image translation, rotation and projection, or image blurring, noise or light-exposure changes. n_ρ = 16, patch size equal to 20, descriptor dimensionality equal to 128, and sampling strategy based on a single-rotation approach.

Regarding the radial sampling, experimental results have shown that the accuracy
convergence starts at around 15 samples. The similarity of the values for
n_φ and n_ρ allows us to use the same value for both sampling parameters,
yielding a square trace transform.
[Figure: matching accuracy (% correct matches) as a function of the number of ρ samples, from 10 to 50]

Figure 4.19: Matching accuracy depending on the number of ρ samples. Experiments have been performed on 6 different image sub-datasets of 5 to 10 images each, covering different image sizes as well as geometric and photometric transformations, such as image translation, rotation and projection, or image blurring, noise or light-exposure changes. n_φ = 16, patch size equal to 20, descriptor dimensionality equal to 128, and sampling strategy based on a single-rotation approach.

[Figure: matching accuracy (% correct matches) as a function of the simultaneous value of the φ and ρ sampling parameters, from 10 to 50]

Figure 4.20: Matching accuracy depending on the simultaneous increase of angular and radial sampling.

We also conducted an experiment where n_φ and n_ρ are changed simultaneously
and by the same quantity. Figure 4.20 shows that convergence is reached at
about 15 to 20, and that the performance does not degenerate until both values
are around 40. As can be observed, this degeneration of the accuracy follows the
behavior of n_φ, as n_ρ remains invariant for higher sampling values.

4.7.3.2 Effects of sampling in the computational cost


The computational cost increases linearly with n_φ and n_ρ, since they determine
the two main iteration processes of the algorithm. Therefore, the parallel increase
of both has a quadratic effect on the computational cost, as can be observed in
Figure 4.21.
[Figure: computation time per image (ms) as a function of the simultaneous value of the φ and ρ sampling parameters, from 10 to 50]

Figure 4.21: Computation time depending on the simultaneous increase of angular and radial sampling.
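The quadratic effect follows directly from the loop structure: the work is proportional to n_φ · n_ρ · (samples per scan line), so doubling both sampling parameters quadruples the cost. A toy operation count illustrates this (names are illustrative):

```python
def trace_transform_ops(n_phi, n_rho, samples_per_line=32):
    """Rough operation count of the two nested (theta, rho) loops."""
    return n_phi * n_rho * samples_per_line

base = trace_transform_ops(10, 10)
# cost is linear in each parameter, so raising both doubles twice: 4x
assert trace_transform_ops(20, 10) == 2 * base   # linear in n_phi
assert trace_transform_ops(20, 20) == 4 * base   # quadratic when both grow
```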

4.7.4 Experimental results


In this section we show an evaluation comparing the most relevant feature
descriptors in the state of the art. The evaluation is performed against several
geometric and photometric transformations. In this case, performance is evaluated
as the ratio between wrong and correct matches, i.e., accuracy, between several
real and generated images, given the ground-truth data. The evaluation method
has been designed and developed by Barandiaran [Bar13, BCN+13].



4.7.4.1 Geometric Transformations

Figure 4.22 shows the results obtained in the evaluation of the in-plane rotation
geometric transformation of an input image. The best performance is obtained
by the DITEC descriptor, with a mean accuracy of around 90%, which decreases
to 85% when the rotation is at 180°.
[Figure: normalized number of correct matches vs. rotation angle for SIFT, SURF, DAISY, BRISK, ORB, BRIEF, DITEC and FREAK]

Figure 4.22: In-plane Rotation Transformation matching results.

[Figure: normalized number of correct matches vs. scale value for SIFT, SURF, DAISY, BRISK, ORB, BRIEF, DITEC and FREAK]

Figure 4.23: Scale Transformation matching results.

Figure 4.23 shows the results obtained in the evaluation of the scale transformation
of an input image. In this case, even if DITEC shows a very good performance,
it is worth noting that the algorithm relies on the scale estimation given by the
interest point detector (SIFT). Nevertheless, DITEC shows a better result than the
SIFT descriptor when both are based on the same scale estimation.
Figure 4.24 shows the results over the first four images of the Graffiti data set
([MS02]). This test mainly measures robustness against perspective transformations.
It can be observed that for non-severe perspective transformations
DITEC shows overall the best performance.
[Figure: normalized number of correct matches per image of the data set for SIFT, SURF, DAISY, BRISK, ORB, BRIEF, DITEC and FREAK]

Figure 4.24: Projective Transformation matching results.

4.7.4.2 Photometric Transformations


Figure 4.25 shows the results obtained in the evaluation of the exposure-change
data set. As can be seen, DITEC shows a robust behavior against exposure
variations.

4.7.5 Current status of the local DITEC algorithm design


As shown in the previous section, we have designed a very robust local descriptor.
The main drawback right now is the need for a prior scale estimation mechanism
during the interest point extraction process.

[Figure: normalized number of correct matches vs. f-stops of light reduction for SIFT, SURF, DAISY, BRISK, ORB, BRIEF, DITEC and FREAK]

Figure 4.25: Exposure change photometric transformation matching results.

At the time of writing this document, we are developing an approach based on
the column analysis of the trace transform image. Based on the idea that the
row analysis characterizes angular information while the columns encode the
radial description of the image, we consider that an effective characterization
of this radial data can provide both a scale estimation method and a descriptor
with a higher scale invariance. Figure 4.26 shows the relationship of the trace
transform's rows and columns with the angular and radial representation in the
original image.
          DFT along columns → scale invariance

        ┌ t11 t12 t13 t14 t15 … t1n ┐
        │ t21 t22 t23 t24 t25 … t2n │
  T  =  │ t31 t32 t33 t34 t35 … t3n │     DFT along rows → rotational invariance
        │ t41 t42 t43 t44 t45 … t4n │
        │  ⋮   ⋮   ⋮   ⋮   ⋮  ⋱  ⋮  │
        └ tm1 tm2 tm3 tm4 tm5 … tmn ┘

Figure 4.26: Trace transform row and column analysis

Both optimists and pessimists contribute to society. The optimist invents the aeroplane, the pessimist the parachute.
George Bernard Shaw

CHAPTER 5

Main Contributions

The main contributions of this research work are described in this chapter.
Most of these contributions have also been presented in journals and conferences,
as can be seen in Part II of this document. Moreover, some of the technological
results derived from the R&D activity have been filed as patents (see Chapter 8).
The single-camera ball tracking system has already been accepted, and the local
DITEC method is in process at this moment.

5.1 Mandragora framework


One of the main contributions of this work is a framework design that integrates
solutions for automatic content analysis into the same system. The Mandragora
approach tries to overcome the limitations of targeted low-level operators when
no prior knowledge is provided, and also allows splitting big domains (such as
those employed in the broadcasting sector) into separate sub-domains that can
be modeled more efficiently.

5.2 DITEC method as global descriptor


The first step of Mandragora is to determine the sub-domain to which a specific
asset may belong. Based on the trace transform, we present a new method
(DITEC) for domain categorization. This method uses the trace transform
operator as a global feature and builds a descriptor by statistically modeling the
obtained coefficients. Experiments have shown very good results in terms
of accuracy and robustness (see Chapter 4).

5.3 DITEC feature space analysis


The study of the feature space created by DITEC descriptors is also a contribution
of this work. This analysis has provided the foundation for a proper understanding
of the behavior of the trace transform and of its properties as a new space where
data are better ordered for classification purposes. In a way, we have shown
that the semantic information contained in images can be ported to sinograms,
which facilitates their modeling in feature spaces that can be separated by using
mathematical tools such as SVM hyperplanes, Bayesian probabilistic approaches,
etc. In essence, this is a step towards bridging the semantic gap.

5.4 DITEC method as local descriptor


All the knowledge acquired around the trace transform has been used to adapt the
global DITEC to perform as a local descriptor. The main differences between these
two cases have been analyzed, and the feature extraction method has been adapted
to the new constraints introduced by the local patches. As a result, we have
developed a new method that, in experimental tests, performs as well as the best
state-of-the-art local descriptors, such as SIFT or SURF.

5.5 Other contributions


There are other contributions, mainly reflected in the papers, that propose new
specific image analysis methods for different application domains (broadcast,
earth observation, meteorology), or intensive reviews of image analysis trends in
EO mining (Section 7.2).


Experience is simply the name we give our mistakes.
Oscar Wilde

CHAPTER 6

Conclusions and Future Work
This work has been devoted to making a step forward in the cognitive analysis
of visual content. In particular, the approach to the semantic gap from a
contextual perspective has been pursued in order to create common foundations
for different application fields and types of content. The experience in industrial
projects and their scientific analysis has been at the basis of the ideas developed
in this dissertation. In this sense, besides the scientific methodology, other aspects
such as technical feasibility, potential benefit to industry and experimental
validation have been key drivers of this work.
Following the requirements and constraints of these real projects, we ended
up with the need for a prior-knowledge extractor that in most cases will not have
the support of human beings. The Mandragora architecture tries to integrate
the process of content analysis by introducing a kind of divide-and-conquer
approach, breaking down the problem into more affordable sub-domains since,
as stated by Deng et al. [DBLFF10], the number of classes itself has a very
negative impact on the content classification process.
One of the critical parts of the Mandragora architecture is in fact the creation
of this prior information; in other words, the way of breaking the chicken-and-egg
situation where feature extractors or analyzers require a priori knowledge to
decide which of them makes sense to launch, or to set some basic parameters
that they might include.


6. CONCLUSIONS AND FUTURE WORK


As local-feature extractors are much more dependent on the particular characteristics
of a specific asset, we have created a new global feature extractor based on
the trace transform. Our main contribution in this case has been to modify the
algorithm to keep more semantic/global information and to avoid dimensionality
reduction operations that might remove too much information. While the
original trace transform is very suitable for hashing and fingerprinting purposes,
our DITEC method can include wider class definitions that the system can learn
in a training process.
The experience acquired in the development of the DITEC method has led us
to adapt it as a local descriptor. The statistical considerations made for the global
version of DITEC cannot be applied to much smaller input images (patches
around interest points), so the main modification (apart from the work-flow)
has been the feature extraction method applied once the trace transform has been
performed. The local DITEC has been tested against the best methods known in
the state of the art, and the obtained results have been highly competitive.
Finally, all this work has not only been developed theoretically but actually
implemented with real-time performance in mind. Current implementations
have not been optimized yet, but the obtained performance results are not far
from real-time. Therefore, we expect to have a high-performance version of
both the global and local DITEC in the near future.

6.1 Future work


This PhD process finishes with the contributions presented above. However,
rather than filling holes in the knowledge of the scientific community, we feel
that more and more new research lines arise from our work. The most direct
ones are related to a further development of Mandragora and a better understanding
of DITEC, especially in those aspects related to the functionals. However, there are
also elements that have not been considered in this work, such as the integration
of different types of media (text, 3D, audio, etc.) and a deeper development of
temporal data series and videos.

6.1.1 Collaborative filtering, Big Data and Visual Analytics


One of the main trends during the last years is the massification of the Internet
in terms of audience and active users, data/content, and resources for analysis
and exploitation. All these concepts, together with other buzzwords such as cloud
computing, open linked data, visual analytics, etc., are creating a new way of
understanding the digital universe. We are still in the initial era of this future
Internet, but it is clear that one of the big demands will be a revolutionary method
to browse and search for multimedia content, overcoming text-based search
engines and integrating collaborative filtering and recommendation tools as well
as advanced and intuitive visual representations of a plethora of content. According
to this futuristic view, we consider the integration of collaborative filtering, massive
unstructured user-generated metadata, big data technologies and resources for
visual analytics as the next steps towards cognitive vision systems.


Part II
Patents & Publications


CHAPTER 7

Publications

7.1 Weather analysis system based on sky images taken from the earth

Title: Weather Analysis System Based on Sky Images Taken from the Earth
Authors: Mikel Labayen, Naiara Aginako and Igor García
Booktitle: Proceedings of VIE 2008 - The Fifth International Conference on Visual Information Engineering
Conference Location: Xi'an (China)
Year: 2008
DOI: http://dx.doi.org/10.1049/cp:20080299

7.2 A review on EO mining


Title: A review on EO mining
Authors: Marco Quartulli and Igor García
Journal: ISPRS Journal of Photogrammetry and Remote Sensing
Volume: 75
Pages: 11-28
Publisher: Elsevier
Year: 2013
DOI: http://dx.doi.org/10.1016/j.isprsjprs.2012.09.010


7. PUBLICATIONS

7.3 Accurate Object Tracking and 3D Visualization for Sports Events TV Broadcast

Title: Accurate Object Tracking and 3D Visualization for Sports Events TV Broadcast
Authors: Mikel Labayen, Igor García Olaizola, Julián Flórez, Naiara Aginako
Journal: Multimedia Tools and Applications
Publisher: Springer
Year: 2013
DOI: http://dx.doi.org/10.1007/s11042-013-1558-x

7.4 DITEC: Experimental analysis of an image characterization method based on the trace transform
Title: DITEC: Experimental analysis of an image characterization method
based on the trace transform
Authors: Igor García Olaizola, Iñigo Barandiaran, Basilio Sierra, Manuel Graña
Conference: VISAPP 2013, 9th International Conference on Computer Vision Theory and Applications.
Conference Location: Barcelona (Spain)
Year: 2013
URL: http://www.visapp.visigrapp.org/?y=2013

7.5 Image Analysis platform for data management in the meteorological domain

Title: Image Analysis platform for data management in the meteorological domain
Authors: Igor García Olaizola, Naiara Aginako, Mikel Labayen
Conference: Proc. 4th Int. Workshop on Semantic Media Adaptation and Personalization, SMAP '09
Year: 2009
DOI: http://doi.ieeecomputersociety.org/10.1109/SMAP.2009.29

7.6 Architecture for semi-automatic multimedia analysis by hypothesis reinforcement

Title: Architecture for semi-automatic multimedia analysis by hypothesis reinforcement
Authors: Igor García Olaizola, Gorka Marcos, Petra Krämer, Julián Flórez and Basilio Sierra
Proceedings: IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, BMSB 2009
Pages: 1-6
Publisher: IEEE
Year: 2009
DOI: http://dx.doi.org/10.1109/ISBMSB.2009.5133780

7.7 Trace transform based method for color image domain identification
Title: Trace transform based method for color image domain identification
Authors: Igor García Olaizola, Marco Quartulli, Basilio Sierra, Julián Flórez
Journal: IEEE Transactions on Multimedia
Status: Under review after major changes.

7.8 On the Image Content of the ESA-EUSC-JRC Workshop on Image Information Mining

Title: On the Image Content of the ESA-EUSC-JRC Workshop on Image Information Mining
Authors: Marco Quartulli, Igor García Olaizola, Mikel Zorrilla
Proceedings: Proceedings of ESA-EUSC-JRC 8th Conference on Image
Information Mining: Knowledge Discovery from Earth Observation Data
Pages: 70-73

Publisher: JRC, Joint Research Center (European Commission)
Year: 2012
DOI: http://dx.doi.org/10.2788/49465

7.9 Author's other publications


1.

Title: Reference Model for Hybrid Broadcast Web3D TV


Authors: Igor García Olaizola, Josu Pérez, Mikel Zorrilla, Ángel Martín, Maider Laka
Proceedings: Web3D '13: Proceedings of the 18th International Conference on 3D Web Technology
Pages: 177-180
Publisher: ACM
Year: 2013
DOI: http://dx.doi.org/10.1145/2466533.2466560
Abstract: 3DTV can be considered as the biggest technical revolution
in TV content creation since the black and white to color transition.
However, the big commercial success of current TV market has been
produced around the Smart TV concept. Smart TVs connect the TV
set to the web and introduce the main home multimedia display in
the app world, social networks and content related interactive services.
Now, this digital convergence can become the driver for boosting the
success of 3DTV industry. In fact, the integration of stereoscopic TV
production and Web3D seems to be the next natural step of the hybrid
broadband-broadcast services.
We propose in this paper a general reference model to allow the convergence of 3DTV and 3D Web by defining a general architecture and
some extensions of current Smart TV concepts as well as the related
standards.

2.

Title: Visual processing of geographic and environmental information in the Basque Country: two Basque case studies
Authors: Alvaro Segura, Aitor Moreno, Igor García, Naiara Aginako, Mikel Labayen, Jorge Posada, Jose Antonio Aranda, Rubén García De Andoin



Proceedings:GeoSpatial Visual Analytics NATO Science for Peace
and Security Series C: Environmental Security
Publisher: Springer
Pages: 199-207
Year: 2009
URL: http://dx.doi.org/10.1007/978-90-481-2899-0_16
Abstract: The Basque Meteorology Agency is conducting an initiative
to improve the collection, management and analysis of weather information from a large array of sensing devices. This chapter presents
works carried out in this context proposing the application of 3D geographical visualization and image processing for the monitoring of
meteorological phenomena. The tools described allow users to analyze
visually the state of the atmosphere and its interaction with the topography, and process live outdoor images to automatically infer weather
conditions. This kind of systems can be applied in the surveillance of
other environmental events and enable better decision making for several purposes, including important issues related with environmental
security.

3.

Title: HTML5-based System for Interoperable 3D Digital Home Applications


Authors: Mikel Zorrilla, Angel Martin, Jairo R. Sanchez, Iñigo Tamayo, Igor García Olaizola
Journal: Multimedia tools and applications
Publisher: Springer
Year: 2013
DOI: http://dx.doi.org/10.1007/s11042-013-1516-7
Abstract: Digital home application market shifts just about every
month. This means risk for developers struggling to adapt their applications to several platforms and marketplaces while changing how people
experience and use their TVs, smartphones and tablets. New ubiquitous
and context-aware experiences through interactive 3D applications on
these devices engage users to interact with complex 3D scenes in virtual applications. Interactive 3D applications are boosted by emerging

standards such as HTML5 and WebGL removing limitations, and transforming the Web into a horizontal application framework to tackle
interoperability over the heterogeneous digital home platforms. Developers can apply their knowledge of web-based solutions to design
digital home applications, removing learning curve barriers related to
platform-specific APIs. However, constraints to render complex 3D environments are still present especially in home media devices. This paper
provides a state-of-the-art survey of current capabilities and limitations
of the digital home devices and describes a latency-driven system design
based on hybrid remote and local rendering architecture, enhancing
the interactive experience of 3D graphics on these thin devices. It supports interactive navigation of sophisticated 3D scenes while provides
an interoperable solution that can be deployed over the wide digital
home device landscape.

4.

Title: A middleware to enhance current multimedia retrieval systems


with content-based functionalities.
Authors: Gorka Marcos, Arantza Illarramendi, Igor García Olaizola, Julián Flórez
Journal: Multimedia Systems
Volume: 17
Issue: 2
Pages: 149-164
Publisher: Springer
Year: 2012
DOI: http://dx.doi.org/10.1007/s00530-010-0217-6
Abstract: Nowadays the retrieval of multimedia assets is mainly performed by text-based retrieval systems with powerful and stable indexing mechanisms. Migration from those systems to content-aware
multimedia retrieval systems is a common aim for companies from
very diverse sectors. In this paper we present a semantic middleware
designed to achieve a seamless integration with existing systems. This
middleware outsources the semantic functionalities (e.g. knowledge
extraction, semantic query expansion,. . . ) that are not covered by



traditional systems, thereby allowing the use of complementary contentbased techniques. We include a list of key criteria to successfully deploy
this middleware, which provides semantic support to many different
steps of the retrieval process. Both the middleware and the design criteria are validated by two real complementary deployments in two very
different industrial domains.

5.

Title: Ontology Based Middleware for Ranking and Retrieving Information on Locations Adapted for People with Special Needs
Authors: Kevin Alonso, Naiara Aginako, Javier Lozano, Igor García Olaizola
Journal: Lecture Notes in Computer Science, Computers Helping
People with Special Needs
Volume: 7382
Pages: 351-354
Publisher: Springer
Year: 2012
DOI: http://dx.doi.org/10.1007/978-3-642-31522-0_53
Abstract: Current leisure or touristic services searching tools do not
take into account the special needs of large amount of people with
functional diversities. However, the combination of different semantic,
web and storage technologies make possible the enhancement of such
search tools, allowing more personalized searches. This contributes to
the provision of better and more suitable results. In this paper we propose an innovative ontology driven solution for personalized tourism
directed to people with special needs.

6.

Title: DMS-1 Driven Data Model to Enable a Semantic Middleware for


Multimedia Information Retrieval in a Broadcaster
Authors: Gorka Marcos, Kevin Alonso, Igor García Olaizola, Julián Flórez, Arantza Illarramendi
Proceedings: SMAP 09. 4th International Workshop on Semantic
Media Adaptation and Personalization, 2009.
Pages: 33 - 37

Publisher: Springer
Year: 2009
DOI: http://dx.doi.org/10.1109/SMAP.2009.16
Abstract: This article presents the motivation and the implementation
of a semantic model developed to support diverse semantic services in
a multimedia asset management system in a broadcaster. The model
is mainly driven by DMS-1 (descriptive metadata scheme) standard,
which is part of the multimedia exchange format standard defined by
the broadcast industrial community and according to our knowledge
we propose the first implementation of it using the OWL language. This
model has been complemented with other models coming from the
academia in order to cover the diverse nature of the different semantic
needs identified in the whole workflow.

7.

Title: TV Sport Broadcasts: Real Time Virtual Representation in 3D Terrain Models
Authors: Maider Laka Inurrategi, Igor García Olaizola, Alex Ugarte, Ivan Macia
Proceedings: 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video
Pages: 33 - 37
Publisher: Springer
Year: 2008
DOI: http://dx.doi.org/10.1109/3DTV.2008.4547894
Abstract: To enhance the understanding of sports competition broadcast, or to give viewers more information than just images taken from
cameras, graphic representations are more intuitive and have greater
impact than alpha-numeric information. This paper presents the characteristics and real television broadcast results of a project aimed to
improve the understanding of an outdoor sports competition using
virtual reality techniques. The system developed does not require any
special hardware, which would have made the prototype much more
expensive. This goal has been achieved using some new techniques and
technological adoptions that are explained in the paper.



8.

Title: MHP Oriented Interactive Augmented Reality System for Sports


Broadcasting Environments
Authors: Igor García Olaizola, Iñigo Barandiaran Martirena, Tobias D. Kammann
Proceedings: First presented at the 4th European Interactive TV Conference EuroITV 2006, extended and revised for JVRB.
Pages: 33 - 37
Publisher: JVRB
Year: 2006
URL: http://www.jvrb.org/past-issues/3.2006/786
Abstract: Television and movie images have been altered ever since it
was technically possible. Nowadays embedding advertisements, or incorporating text and graphics in TV scenes, are common practice, but
they can not be considered as integrated part of the scene. The introduction of new services for interactive augmented television is discussed in
this paper. We analyse the main aspects related with the whole chain
of augmented reality production. Interactivity is one of the most important added values of the digital television: This paper aims to break
the model where all TV viewers receive the same final image. Thus, we
introduce and discuss the new concept of interactive augmented television, i. e. real time composition of video and computer graphics - e.g.
a real scene and freely selectable images or spatial rendered objects edited and customized by the end user within the context of the users
set top box and TV receiver.

9.

Title: User Interfaces Based on 3D Avatars for Interactive Television


Authors: Alex Ugarte, Igor García, Amalia Ortiz, David Oyarzun
Proceedings: 5th European Conference, EuroITV 2007
Pages: 33 - 37
Publisher: Springer
Year: 2007
DOI: http://dx.doi.org/10.1007/978-3-540-72559-6_12
Abstract: Digital TV has brought interactivity to television. Even though
actual capabilities of interactive applications are quite limited due to
immaturity of the sector and technical restrictions in the standards, the

potential of interactive Television (iTV) as a multimedia and entertainment platform is enormous. The existing gap between PC world and
iTV concerning graphics capabilities, may restrain the development
of iTV platform in favour of the former one. Support for 3D graphics applications in iTV would boost this new platform with plenty of
possibilities to be exploited.


CHAPTER 8

Selected Patents

8.1 Method for detecting the point of impact of a ball in sports events

Publication number: EP2455911 B1


Publication type: Grant
Application number: EP20100382310
Publication date: Mar 13, 2013
Filing date: Nov 23, 2010
Priority date: Nov 23, 2010
Also published as: EP2455911A1
Inventors: Olaizola Igor García, Esnal Julián Flórez, San Román Otegui Juan Carlos, Bengoa Naiara Aginako, Esnaola Mikel Labayen
Applicant: Fundacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech

8.2 Author's Other Related Patents


1. Title: Method for detection and recognition of logos in a video data stream
Publication number: EP2259207 B1
Publication type: Grant

Application number: EP20090382086
Publication date: Oct 17, 2012
Filing date: Jun 2, 2009
Priority date: Jun 2, 2009
Also published as: EP2259207A1, EP2259207B8
Inventors: Bengoa Naiara Aginako, Olaizola Igor Garcia, Esnaola Mikel
Labayen
Applicant: Vicomtech-Visual Interaction and Communication Technologies Center

2. Title: Method and system for analyzing multimedia files


Publication number: WO2011089276 A1
Publication type: Application
Application number: PCT/ES2010/070024
Publication date: Jul 28, 2011
Filing date: Jan 19, 2010
Priority date: Jan 19, 2010
Inventors: Olaizola Igor Garca, Bengoa Naiara Aginako, Ortego Gorka
Marcos
Applicant: Vicomtech-Visual Interaction and Communication Technologies Center

3. Title: Portable television platform


Publication number: EP2538663 A1
Publication type: Application
Application number: EP20110170483
Publication date: Dec 26, 2012
Filing date: Jun 20, 2011
Priority date: Jan 19, 2010
Inventors: Garcia Olaizola Igor, Iñurrategi Maider Laka, Leunda Julen Garcia, Colmenar Jesus M. Perez
Applicant: Asociacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech


Part III
Appendix and Bibliography


APPENDIX A

Consideration on the Implementation Aspects of the Trace Transform

A.1 Development platforms

The first implementation of the trace transform was developed in Octave/Matlab. Both platforms provide the Radon transform as a built-in function but do
not include the generalization to other functionals. First implementations based
on a double loop and a scanline function were too slow for real applications.
Therefore, the algorithm was transformed into matrix operations that allowed us to
remove one of the loops (radial scanning). This approach improved the performance by more than a factor of 10, since all radial samples of a specific angle were
calculated by the same matrix operations.
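The vectorization can be sketched in a few lines of NumPy (illustrative only, with hypothetical function and variable names; the thesis implementations are in Octave/Matlab and C++): for a fixed angle, the sample coordinates of all scanlines are built as a single (ρ, t) grid, so only the angular loop remains.

```python
import numpy as np

def trace_lines_for_angle(img, theta, n_rho, n_samples):
    """Sample all scanlines for one angle in a single matrix operation.

    Every line carries the same number of samples, so the whole
    (rho, t) grid for this theta is built at once. This is the
    fixed-number-of-samples strategy: short chords are oversampled.
    """
    h, w = img.shape
    R = min(h, w) / 2.0                               # circular patch radius
    cx, cy = w / 2.0, h / 2.0
    rho = np.linspace(-R, R, n_rho)[:, None]          # (n_rho, 1)
    half = np.sqrt(np.maximum(R**2 - rho**2, 0.0))    # half-length of each chord
    t = np.linspace(-1.0, 1.0, n_samples)[None, :]    # (1, n_samples)
    # Each point = foot of the perpendicular + t * chord direction
    xs = cx + rho * np.cos(theta) - t * half * np.sin(theta)
    ys = cy + rho * np.sin(theta) + t * half * np.cos(theta)
    xi = np.clip(np.round(xs).astype(int), 0, w - 1)
    yi = np.clip(np.round(ys).astype(int), 0, h - 1)
    return img[yi, xi]                                # (n_rho, n_samples) matrix

# Example functional: integral (sum) of the signal along each line
img = np.ones((64, 64))
samples = trace_lines_for_angle(img, 0.3, n_rho=32, n_samples=32)
profile = samples.sum(axis=1)                         # one trace column per angle
```

A full transform would simply repeat this call over a vector of angles, which is the remaining (angular) loop mentioned above.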
In order to speed up the trace transform operator and also to integrate this
function in the local feature descriptor evaluation platform (which is based on
C++), a C++ implementation has been developed. The C++/OpenCV¹ version includes different approaches for rotation and sampling that make it much more
flexible for different requirements such as speed, quality and control over the
distortions produced by each approach.

1. http://opencv.org/

This C++ implementation has also been wrapped as a dynamically linked
shared object in order to make it compatible with OpenCV/Python. This implementation can directly share OpenCV objects between C++ and Python, taking
advantage of the C++ performance and of Python's interoperability and ease
of scripting. This platform is currently used as one of the development
environments for R&D projects in Vicomtech-IK4. Figure A.1 shows the basic infrastructure and the tools used to perform the global DITEC process. The local
case is exclusively based on the C++ implementation.

Figure A.1: DITEC development platform

A.2 Sampling

The sampling process is one of the key aspects of the discrete trace transform
operation. The sampling is determined by 3 parameters, θ, ρ and t (with scanline length L), as described in Section 4.2.2.2. Aliasing and distortion effects strongly depend on these
parameters.

θ and ρ determine the two main loops of the trace transform calculation.
When translated to matrix operations (in Octave/Matlab), the radial loop (ρ) is
performed by creating a set of clipping points that define the scanlines that will
be processed. It implies that all scanlines have to be included in the same matrix
and therefore must have the same length (a key aspect for performance). However, this process introduces some distortions due to the fact that short scanlines
include the same number of samples as the longer ones.


The C++ implementation avoids these limitations, and 3 different strategies are
followed to perform the scanline sampling:
1. Fixed step: samples are taken at a constant spatial interval, so the number of samples grows with the scanline length
2. Fixed number of samples: as it is done in the Octave/Matlab implementation
3. Bresenham algorithm [Bre65]
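For reference, the integer line traversal of the third strategy can be sketched as follows (a textbook rendering of [Bre65]; the actual C++ version relies on OpenCV's line iterator):

```python
def bresenham(x0, y0, x1, y1):
    """Integer Bresenham line: returns the pixel coordinates of a scanline.

    Each pixel is visited exactly once, which makes the traversal fast,
    but near horizontal/vertical lines fewer neighboring pixels
    contribute -- the source of the angular distortion discussed below.
    """
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pts = []
    while True:
        pts.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:        # step in x
            err += dy
            x0 += sx
        if e2 <= dx:        # step in y
            err += dx
            y0 += sy
    return pts
```

Sampling a scanline then reduces to indexing the image at the returned coordinates.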
The Bresenham implementation of OpenCV performs as the fastest option for scanline sampling. However, it introduces a distortion at around θ = (0, π/2, π, 3π/2).
This is produced when the scanline becomes vertical or
horizontal and thus the number of required neighbor
pixels is lower. In order to appreciate this effect, we can
apply the trace transform with functional ∫ξ(t)dt (with
a circular patch) to a homogeneous white image (Figure A.2).
As can be observed in Figure A.3, there is a radial
degradation of the signal (produced by the shorter scanlines) that can be considered as inherent to the algorithm, but there is also a
decrease of the signal at the vertical and horizontal limits.

Figure A.2: Circular patch image

Figure A.3: Result of (θ, ρ) space exploration with Bresenham

In order to minimize this effect and still keep the performance benefits of the
Bresenham algorithm, we have implemented a variant based on an image rotation. The goal of this rotation is to move the horizontal and vertical scanlines to a
position where the Bresenham algorithm will include neighbor values in the same
way as they are included in the rest of the regions.


Figure A.4: First half of the source image is sampled (blue regions) while areas
around the vertical and horizontal axes are not considered.

Figure A.5: Second half of the source image is sampled (red and green). These regions are moved to the π/4, 3π/4, 5π/4, 7π/4 areas in order to be sampled with the Bresenham
algorithm.

Figure A.6 shows the result of applying the same functional to the same image
but making a single rotation of π/4. As can be observed, the number of areas
with angular distortion is double that of the previous approach (one each nπ/4). However, the gradients observed in these distortions are much smoother than the
previous ones.
Figure A.6: Result of (θ, ρ) sampling with the Bresenham algorithm and a single image
rotation

Figure A.7: Result of (θ, ρ) pixelwise sampling with an image rotation for each angular
iteration.

Extending the idea of rotating the image to account for the differences in
sampling due to line orientations, we performed the experiment of rotating the
image for each angle before each scanline is processed. In this way, each scanline
for a given value of θ corresponds with the image columns. Therefore, all scanlines, i.e.
trace projections, are computed by just rastering image columns, and thus each line
is equally sampled. As shown in Figure A.7, the angular distortions are removed
in this way. However, as an image rotation implies access to all the pixel values of
the input image, this process might be much slower than the previously indicated
methods.
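A minimal sketch of this per-angle rotation strategy (nearest-neighbor rotation for self-containment, and hypothetical function names; the actual implementation relies on OpenCV's warping routines):

```python
import numpy as np

def rotate_nn(img, theta):
    """Nearest-neighbor rotation about the image center (illustrative;
    a production version would use OpenCV's warpAffine)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, locate the source pixel
    xr = np.cos(-theta) * (xs - cx) - np.sin(-theta) * (ys - cy) + cx
    yr = np.sin(-theta) * (xs - cx) + np.cos(-theta) * (ys - cy) + cy
    xi = np.clip(np.round(xr).astype(int), 0, w - 1)
    yi = np.clip(np.round(yr).astype(int), 0, h - 1)
    return img[yi, xi]

def trace_by_rotation(img, n_theta):
    """Trace transform with the sum functional: rotate once per angle
    and integrate straight down the columns, so every scanline is a
    pixel column and all lines are equally sampled."""
    return np.stack([rotate_nn(img, th).sum(axis=0)
                     for th in np.linspace(0, np.pi, n_theta, endpoint=False)])

T = trace_by_rotation(np.ones((32, 32)), n_theta=8)   # shape (n_theta, width)
```

The per-angle rotation touches every pixel of the input, which is exactly why this variant is the slowest of the three, as noted above.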
Figure A.8 shows a single line corresponding to ρ = 0 in order to compare in
a quantitative manner the angular distortion introduced by each method. As
can be observed, the Bresenham algorithm without image rotation introduces a
maximum distortion of around 25% of the original signal, while the single-rotation based method decreases this distortion, providing a maximum value of 7%.

Figure A.8: Result of the different sampling strategies of the (θ, ρ) space (plotted curves: full image rotation, single rotation, no rotation).

For the full-rotation based method, it can be considered that there is no distortion
(the observed minor variations are basically due to numeric errors during the
rotation).


APPENDIX B

Calculation of the clipping points in a circular region

The circular region is defined as a circumference contained within the rectangular patch. Therefore, the radius of the circumference will be equivalent to the
minimum of the patch axes.

In order to simplify the calculation, we will locate the (0, 0) coordinate at the
center of the patch. Once the clipping points are found, a mere translation of the
center will be enough to get the real clipping point positions.

The equation system that has to be solved is composed of the aforementioned
circumference and a straight line defined as the orthogonal to the straight line
defined by the center of the image and the (θ, ρ) position.
C(x, y)(θ,ρ) :
    x² + y² = R²,    y = ax + b,    with R = min(height(I), width(I))    (B.1)

x² + (ax + b)² = R²    (B.2)

(a² + 1)x² + 2abx + (b² − R²) = 0    (B.3)

x = ( −ab ± √( a²b² − (a² + 1)(b² − R²) ) ) / (a² + 1)    (B.4)

We can simplify Equation B.4 to:

x = ( −ab ± √( R²(a² + 1) − b² ) ) / (a² + 1)    (B.5)

Parameters a and b can be represented in terms of θ and ρ, as can be seen
in Figure B.1.

Figure B.1: Scanline defined in terms of θ and ρ

a = tan(θ − π/2) = −1/tan θ = −cos θ / sin θ    (B.6)

b = ρ sin θ + (ρ cos θ)/tan θ = ρ/sin θ    (B.7)

We can now substitute a and b in Equation B.5, obtaining:

C(θ, ρ) = ( ρ/(tan θ sin θ) ± √( R²( (1/tan θ)² + 1 ) − (ρ/sin θ)² ) ) / ( (1/tan θ)² + 1 )    (B.8)

We can proceed by simplifying the terms within the square root:

R²( (1/tan θ)² + 1 ) − ρ²/sin²θ = (R² cos²θ + R² sin²θ − ρ²)/sin²θ = (R² − ρ²)/sin²θ    (B.9)

Now we can substitute this term in Equation B.8:

C(θ, ρ) = ( ρ/(tan θ sin θ) ± √( (R² − ρ²)/sin²θ ) ) / ( (1/tan θ)² + 1 )
        = ( ρ/(tan θ sin θ) ± √(R² − ρ²)/sin θ ) · sin²θ    [since (1/tan θ)² + 1 = 1/sin²θ]
        = ρ sin θ / tan θ ± sin θ √(R² − ρ²)
        = ρ cos θ ± sin θ √(R² − ρ²)    (B.10)
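The closed-form result can be checked numerically: taking both signs of the square root gives the two clipping points, which must lie both on the circle and on the scanline. A small Python sketch (hypothetical function name, assuming sin θ ≠ 0):

```python
import math

def clipping_points(theta, rho, R):
    """Intersections of the scanline (angle theta, distance rho) with
    the circle x^2 + y^2 = R^2, using
        x = rho*cos(theta) +/- sin(theta)*sqrt(R^2 - rho^2),
    with y recovered from the line x*cos(theta) + y*sin(theta) = rho."""
    s = math.sqrt(max(R * R - rho * rho, 0.0))
    pts = []
    for sign in (+1.0, -1.0):
        x = rho * math.cos(theta) + sign * math.sin(theta) * s
        y = (rho - x * math.cos(theta)) / math.sin(theta)  # assumes sin(theta) != 0
        pts.append((x, y))
    return pts

# Both clipping points must lie on the circle of radius R
for x, y in clipping_points(theta=0.7, rho=3.0, R=5.0):
    assert abs(x * x + y * y - 25.0) < 1e-9
```

The vertical-line case (sin θ = 0) has to be handled separately, as the slope-intercept form y = ax + b is undefined there.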

Bibliography
[AKS11] Sultan Ahmed, Md. Khan, and Md. Shahjahan. A filter based feature selection approach using Lempel-Ziv complexity. In Derong Liu, Huaguang Zhang, Marios Polycarpou, Cesare Alippi, and Haibo He, editors, Advances in Neural Networks - ISNN 2011, volume 6676 of Lecture Notes in Computer Science, pages 260–269. Springer Berlin / Heidelberg, 2011. 10.1007/978-3-642-21090-7_31. 35

[And76] James Richard Anderson. A land use and land cover classification system for use with remote sensor data, volume 964. US Government Printing Office, 1976. 31

[ANR74] N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Transactions on Computers, C-23(1):90–93, 1974. 47

[Bar13] Iñigo Barandiaran. Contributions to Local Feature Extraction, Description and Matching in 2D Images. PhD thesis, Department of Computer Science and Artificial Intelligence, University of the Basque Country, 2013. 5, 70

[BB08] P. Brasnett and M. Bober. Fast and robust image identification. In Proc. 19th Int. Conf. Pattern Recognition ICPR 2008, pages 1–5, 2008. 41, 47

[BCN+13] I. Barandiaran, C. Cortes, M. Nieto, M. Graña, and O.E. Ruiz. A new evaluation framework and image dataset for key point extraction and feature descriptor matching. In VISAPP 2013 - International Conference on Computer Vision Theory and Applications, pages 252–257. Scitepress, 2013. 70

[BH11] M. A. Bouker and E. Hervet. Retrieval of images using mean-shift and gaussian mixtures based on weighted color histograms. In Proc. Seventh Int. Signal-Image Technology and Internet-Based Systems (SITIS) Conf., pages 218–222, 2011. 35, 56

[BKB10] W. Bouachir, M. Kardouchi, and N. Belacel. Fuzzy indexing for bag of features scene categorization. In Proc. 5th Int. I/V Communications and Mobile Network (ISVC) Symp., pages 1–4, 2010. 57

[BLIS04] R. Blanco, P. Larrañaga, I. Inza, and B. Sierra. Gene selection for cancer classification using wrapper approaches. International Journal of Pattern Recognition and Artificial Intelligence, 2004. 52

[BO07] M. Bober and R. Oami. Description of MPEG-7 visual core experiments. Technical report, ISO/IEC JTC1/SC29/WG11, 2007. 35, 47

[Bre65] Jack E. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25–30, 1965. 46, 99

[BTG06] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417, 2006. 34

[BYR10] Vladimir Britanak, Patrick C. Yip, and K. R. Rao. Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. Academic Press, 2010. 39, 47

[Cas11] Stephen Cass. Unthinking machines. Technical report, MIT Technology Review, 2011. 4

[CLTW10] Myung Jin Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 129–136, 2010. 35

[CMGD10] D. Cerra, A. Mallet, L. Gueguen, and M. Datcu. Algorithmic information theory-based analysis of earth observation images: An assessment. IEEE Geoscience and Remote Sensing Letters, 7(1):8–12, 2010. 35
[Cor] Corel. Corel 1000 dataset. (http://wang.ist.psu.edu/docs/related.shtml). 17, 54

[DBLFF10] Jia Deng, Alexander C. Berg, Kai Li, and Li Fei-Fei. What does classifying more than 10,000 image categories tell us? In Computer Vision - ECCV 2010, pages 71–84. Springer, 2010. 27, 77


[Dee] Deep Blue. IBM http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/. 27

[DF02] Gerald Dalley and Patrick Flynn. Pair-wise range image registration: A study in outlier classification. Computer Vision and Image Understanding, 87(1-3):104–115, 2002. 53

[DLS11] F. Dornaika, E. Lazkano, and B. Sierra. Improving dynamic facial expression recognition with feature subset selection. Pattern Recognition Letters, 32(5):740–748, 2011. 53

[Fah06] S. A. Fahmy. Investigating trace transform architectures for face authentication. In Proc. Int. Conf. Field Programmable Logic and Applications FPL '06, pages 1–2, 2006. 35, 61

[FBCC+10] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010. 28

[FK95] N. G. Fedotov and Alexander A. Kadyrov. Image scanning in machine vision leads to new understanding of image. In Digital Image Processing and Computer Graphics: Fifth International Workshop, pages 256–261. International Society for Optics and Photonics, 1995. 41

[FKT09] Rerkchai Fooprateepsiri, Werasak Kurutach, and Sutthipong Tamsumpaolerd. An image identifier based on Hausdorff shape trace transform. In Proceedings of the 16th International Conference on Neural Information Processing: Part I, ICONIP '09, pages 788–797, Berlin, Heidelberg, 2009. Springer-Verlag. 42

[Glo] Digital Globe. Geoeye dataset. http://www.geoeye.com. 17, 57

[Haa11] Peter J. Haas. Sketches get sketchier. Commun. ACM, 54:100–100, August 2011. 37

[HJL63] Joseph L. Hodges Jr. and Erich L. Lehmann. Estimates of location based on rank tests. The Annals of Mathematical Statistics, pages 598–611, 1963. 51

[HSL+06] Jonathon S. Hare, Patrick A. S. Sinclair, Paul H. Lewis, Kirk Martinez, Peter G.B. Enser, and Christine J. Sandom. Bridging the semantic gap in multimedia information retrieval: Top-down and bottom-up approaches. In Paolo Bouquet, Roberto Brunelli, Jean-Pierre Chanod, Claudia Niederée, and Heiko Stoermer, editors, Mastering the Gap: From Information Extraction to Semantic Representation / 3rd European Semantic Web Conference, 2006. Event Dates: 12 June 2006. 25

[ILRE00] I. Inza, P. Larrañaga, R. Etxeberria, and B. Sierra. Feature subset selection by Bayesian networks based optimization. Artificial Intelligence, 123(1-2):157–184, 2000. 52

[JHVB11] Mathieu Jacomy, Sebastien Heymann, Tomaso Venturini, and Mathieu Bastian. ForceAtlas2, a graph layout algorithm for handy network visualization. Draft, Gephi Web Atlas, 2011. 55

[KBBN11] Neeraj Kumar, Alexander Berg, Peter N. Belhumeur, and Shree Nayar. Describable visual attributes for face verification and image search. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(10):1962–1977, 2011. 53

[KP98] A. Kadyrov and M. Petrou. The trace transform as a tool to invariant feature construction. In Proc. Fourteenth Int. Pattern Recognition Conf., volume 2, pages 1037–1039, 1998. 18, 41

[KP01] A. Kadyrov and M. Petrou. The trace transform and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8):811–828, 2001. xxi, 18, 35, 41, 42, 48, 61

[KP06] A. Kadyrov and M. Petrou. Affine parameter estimation from the trace transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1631–1645, 2006. 18, 42, 47, 49

[KSt] Kolmogorov-Smirnov test. Encyclopedia of Mathematics. http://www.encyclopediaofmath.org/index.php?title=Kolmogorov-Smirnov_test&oldid=22659. 51

[KYDN11] Roger King, Nicolas Younan, Mihai Datcu, and Ion Nedelcu. Innovative data mining techniques in support of GEOSS: A workshop's findings. In Space Technology (ICST), 2011 2nd International Conference on, pages 1–4. IEEE, 2011. 31

[LCS11] S. Leutenegger, M. Chli, and R.Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2548–2555. IEEE, 2011. 64

[Lev13] Hector J. Levesque. On our best behaviour. International Joint Conference on Artificial Intelligence, IJCAI, 2013. 3

[LK10] Ping Li and Christian König. b-bit minwise hashing. In Proceedings of the 19th international conference on World Wide Web, WWW '10, pages 671–680, New York, NY, USA, 2010. ACM. 37

[LLL10] Shuyang Lin, Shengrui Li, and Cuihua Li. A fast electronic components orientation and identify method via Radon transform. In Proc. IEEE Int. Systems Man and Cybernetics (SMC) Conf., pages 3902–3908, 2010. 41

[LM98] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998. 52

[Low99] David G. Lowe. Object recognition from local Scale-Invariant features. Computer Vision, IEEE International Conference on, 2:1150–1157, August 1999. 27, 34, 64, 65

[LW07] Nan Liu and Han Wang. Recognition of human faces using discrete cosine transform filtered trace features. In Proc. 6th Int. Information, Communications & Signal Processing Conf., pages 1–5, 2007. 35

[LW09] Nan Liu and Han Wang. Modeling images with multiple trace transforms for pattern analysis. IEEE Signal Processing Letters, 16(5):394–397, 2009. 35

[LZC09] Jian Li, Shaohua Kevin Zhou, and Rama Chellappa. Appearance modeling using a geometric transform. IEEE Trans. Image Process., 18(4):889–902, Apr 2009. 36

[MAMD09] M. R. Mustaffa, F. Ahmad, R. Mahmod, and S. Doraisamy. Generalized ridgelet-fourier for m×n images: Determining the normalization criteria. In Proc. IEEE Int. Signal and Image Processing Applications (ICSIPA) Conf., pages 380–384, 2009. 35

[MAMD10] M. R. Mustaffa, F. Ahmad, R. Mahmod, and S. Doraisamy. Invariant generalised ridgelet-fourier for shape-based image retrieval. In Proc. Int. Information Retrieval & Knowledge Management (CAMP) Conf., pages 79–84, 2010. 35

[Mar11] Gorka Marcos. A Semantic Middleware to enhance current Multimedia Retrieval Systems with Content-based functionalities. PhD thesis, University of the Basque Country, Computer Science Faculty, Computer Languages and Systems Department, Donostia - San Sebastian, 2011. 5, 22

[MIOF11] Gorka Marcos, Arantza Illarramendi, Igor G. Olaizola, and Julian Florez. A middleware to enhance current multimedia retrieval systems with content-based functionalities. Multimedia Systems, 17(2):149–164, 2011. 22

[Mit97] T.M. Mitchell. Machine Learning. McGraw Hill, 1997. 51

[MLH03] D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003. 53

[MPE04] MPEG-7 overview, October 2004. 35

[MPL11] M. Meena, K. Pramod, and K. Linganagouda. Optimized trace transform based feature extraction architecture for CBIR. In Ajith Abraham, Jaime Lloret Mauri, John F. Buford, Junichi Suzuki, and Sabu M. Thampi, editors, Advances in Computing and Communications, volume 192 of Communications in Computer and Information Science, pages 444–451. Springer Berlin Heidelberg, 2011. 63

[MS02] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. Computer Vision, ECCV 2002, pages 128–142, 2002. 72
[NC11] H. Nemmour and Y. Chibani. Handwritten arabic word recognition based on ridgelet transform and support vector machines. In High Performance Computing and Simulation (HPCS), 2011 International Conference on, pages 357–361, July 2011. 35

[NPK10] M. F. Nasrudin, M. Petrou, and L. Kotoulas. Jawi character recognition using the trace transform. In Proc. Seventh Int. Computer Graphics, Imaging and Visualization (CGIV) Conf., pages 151–156, 2010. 35

[OAL09] I. G. Olaizola, N. Aginako, and M. Labayen. Image analysis platform for data management in the meteorological domain. In Proc. 4th Int. Workshop Semantic Media Adaptation and Personalization SMAP '09, pages 89–94, 2009. 34

[OBO08] R. O'Callaghan, M. Bober, R. Oami, and P. Brasnett. Information technology - multimedia content description interface - part 3: Visual, amendment 3: Image signature tools, 01 2008. 35

[OMK+09] I. G. Olaizola, G. Marcos, P. Kramer, J. Florez, and B. Sierra. Architecture for semi-automatic multimedia analysis by hypothesis reinforcement. In Proc. IEEE Int. Symp. Broadband Multimedia Systems and Broadcasting BMSB '09, pages 1–6, 2009. 14, 23, 24, 35

[OT01] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001. 27, 35

[PG92] F. Peyrin and R. Goutte. Image invariant via the Radon transform. In Proc. Int. Image Processing and its Applications Conf., pages 458–461, 1992. 41

[PK04] M. Petrou and A. Kadyrov. Affine invariant features from the trace transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):30–44, 2004. 41, 47

[Poy96] C.A. Poynton. A technical introduction to digital video. J. Wiley, 1996. 40, 55

[Poy03] Charles Poynton. Digital Video and HDTV: Algorithms and Interfaces. Morgan Kaufmann, 2003. 55

[QGO13] Marco Quartulli and Igor G. Olaizola. A review of EO image information mining. ISPRS Journal of Photogrammetry and Remote Sensing, 75:11–28, 2013. 18

[Ric02] Iain E. Richardson. Video Codec Design: Developing Image and Video Compression Systems. John Wiley & Sons, Inc., New York, NY, USA, 2002. 47

[RRKB11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011. 65

[RVG+07] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In Proc. IEEE 11th Int. Conf. Computer Vision ICCV 2007, pages 1–8, 2007. 35

[SASK08] N. Simou, Th. Athanasiadis, G. Stoilos, and S. Kollias. Image indexing and retrieval using expressive fuzzy description logics. Signal, Image and Video Processing, 2:321–335, 2008. 23

[SDH10] Zhan Shi, Minghui Du, and Rongbing Huang. A trace transform based on subspace method for face recognition. In Proc. Int. Computer Application and System Modeling (ICCASM) Conf., volume 13, 2010. 42

[SF91] Thomas M. Strat and Martin A. Fischler. Context-based vision: recognizing objects using information from both 2-D and 3-D imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(10):1050–1065, 1991. 29

[SHKY04] Jin S. Seo, Jaap Haitsma, Ton Kalker, and Chang Dong Yoo. A robust image fingerprinting system using the Radon transform. Sig. Proc.: Image Comm., 19(4):325–339, 2004. 41

[SI07] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR '07, pages 1–8, 2007. 35

[SKBB12] W.J. Scheirer, N. Kumar, P.N. Belhumeur, and T.E. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2933–2940, 2012. 35

[SLJI09] B. Sierra, E. Lazkano, E. Jauregi, and I. Irigoien. Histogram distance-based bayesian network structure learning: A supervised classification specific approach. Decision Support Systems, 48(1):180–190, 2009. 53

[SPKK03] S. Srisuk, M. Petrou, W. Kurutach, and A. Kadyrov. Face authentication using the trace transform. In Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, volume 1, 2003. 35, 42

[SS10] Cees G. M. Snoek and Arnold W. M. Smeulders. Visual-concept search solved? Computer, 43(6):76–78, 2010. 34

[SWS+00] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1349–1380, December 2000. 29, 31

[TBFO05] J. Turan, Z. Bojkovic, P. Filo, and L. Ovsenik. Invariant image recognition experiment with trace transform. In Proc. 7th Int. Telecommunications in Modern Satellite, Cable and Broadcasting Services Conf., volume 1, pages 189–192, 2005. 35, 41, 42

[TFW08] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR 2008, pages 1–8, 2008. 35

[TMF10] A. Torralba, K. P. Murphy, and W. T. Freeman. Using the forest to see the trees: exploiting context for visual object detection and localization. Commun. ACM, 53(3):107–114, March 2010. 29

[TS01] Antonio Torralba and Pawan Sinha. Statistical context priming for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2001. 29

[Ver06] David Vernon. The space of cognitive vision. In Cognitive Vision Systems, pages 7–24. Springer, 2006. 22

[vGVSG10] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek. Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1271–1283, 2010. 34

[Wan11] Fei-Yue Wang. A question for AAAI: Does AI need a reboot? Intelligent Systems, IEEE, 26(4):2–4, 2011. 3

[Wat] Watson. IBM http://www-03.ibm.com/innovation/us/watson/. 28

[WSS02] T. Watanabe, K. Sugawara, and H. Sugihara. A new pattern representation scheme using data compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):579–590, 2002. 17, 35

[ZKRR08] G. Zajic, N. Kojic, N. Reljin, and B. Reljin. Experiment with reduced feature vector in CBIR system with relevance feedback. IET Conference Publications, 2008(CP543):176–181, 2008. 56
