Open Access
Description:
The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction. The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. Like Mean Shift, it iteratively moves points towards crisp clusters, but it also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space. The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and identify its actual loss function, which differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset-size-dependent tendency to produce overly crisp substructures. Contrary to common belief, we find that UMAP's high-dimensional similarities are not critical to its success. Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between two contrastive loss functions: negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases.
Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global to ...
Year of Publication:
2023
Document Type:
Dissertation ; info:eu-repo/semantics/doctoralThesis ; NonPeerReviewed ; [Doctoral and postdoctoral thesis]
Language:
eng
Subjects:
004 ; 004 Data processing Computer science ; 500 ; 500 Natural sciences and mathematics ; 510 ; 510 Mathematics
Rights:
info:eu-repo/semantics/openAccess ; http://archiv.ub.uni-heidelberg.de/volltextserver/help/license_urhg.html
Relations:
https://archiv.ub.uni-heidelberg.de/volltextserver/33875/1/thesis_Damrich_final.pdf ; doi:10.11588/heidok.00033875 ; urn:nbn:de:bsz:16-heidok-338754 ; Damrich, Sebastian (2023) Discovering structure without labels. [Dissertation]
Content Provider:
Universität Heidelberg: HeiDok (Heidelberger Dokumentenserver)
Further name:
Heidelberg University: HeiDok