Discovering structure without labels

Author:

Damrich, Sebastian

Description:

The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction. The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space. The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success. Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. 
Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global to ...

Year of Publication:

2023

Document Type:

Dissertation ; info:eu-repo/semantics/doctoralThesis ; NonPeerReviewed ; [Doctoral and postdoctoral thesis]

Language:

eng

Subjects:

004 ; 004 Data processing Computer science ; 500 ; 500 Natural sciences and mathematics ; 510 ; 510 Mathematics

DDC:

004 Data processing & computer science ; 500 Natural sciences & mathematics ; 510 Mathematics ; 004 Data processing & computer science (computed)

Rights:

info:eu-repo/semantics/openAccess ; http://archiv.ub.uni-heidelberg.de/volltextserver/help/license_urhg.html

Relations:

https://archiv.ub.uni-heidelberg.de/volltextserver/33875/1/thesis_Damrich_final.pdf ; doi:10.11588/heidok.00033875 ; urn:nbn:de:bsz:16-heidok-338754 ; Damrich, Sebastian (2023) Discovering structure without labels. [Dissertation]

URL:

https://archiv.ub.uni-heidelberg.de/volltextserver/33875/
https://archiv.ub.uni-heidelberg.de/volltextserver/33875/1/thesis_Damrich_final.pdf
https://doi.org/10.11588/heidok.00033875
https://nbn-resolving.org/urn:nbn:de:bsz:16-heidok-338754

Content Provider:

Universität Heidelberg: HeiDok (Heidelberger Dokumentenserver)
Heidelberg University: HeiDok

URL: http://archiv.ub.uni-heidelberg.de/
Research Organization Registry (ROR): Heidelberg University
Continent: Europe
Country: de
Latitude / Longitude: 49.409731 / 8.705869
Number of documents: 14,380
Open Access: 14,380 (100%)
Type: Academic publications
System: Eprints 3
Content provider indexed in BASE since: 2007-04-23
BASE URL: https://www.base-search.net/Search/Results?q=coll:ftunivheidelb
