Workshop: Geometrical and Topological Representation Learning

Two-dimensional visualization of large document libraries using t-SNE

Rita González-Márquez · Philipp Berens · Dmitry Kobak


We benchmarked different approaches for creating 2D visualizations of large document libraries, using as a use case the MEDLINE (PubMed) database of the entire biomedical literature (19 million scientific papers). Our optimal pipeline is based on a log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
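As an illustration of the first pipeline stage only, the following is a minimal sketch of a log-scaled TF-IDF representation in plain Python. The abstract does not state the exact weighting formula, so the choices here are assumptions: term frequency is scaled as 1 + ln(count), inverse document frequency is smoothed as ln((1 + N) / (1 + df)) + 1 (the convention scikit-learn uses), and each vector is L2-normalized. The function names and the three toy abstracts are hypothetical and are not taken from the MEDLINE data. The SVD and t-SNE stages are not shown.

```python
import math
from collections import Counter


def log_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized.

    Assumed weighting (not specified in the abstract):
      tf  = 1 + ln(count)
      idf = ln((1 + N) / (1 + df)) + 1
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: (1.0 + math.log(c))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical toy abstracts, for illustration only.
abstracts = [
    "neurons in the retina encode visual signals",
    "retina neurons respond to visual stimuli",
    "tumor cells evade the immune response",
]

vectors = log_tfidf(abstracts)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy setting the two retina abstracts end up closer to each other than to the oncology abstract, which is the kind of neighborhood structure the later SVD and t-SNE stages operate on. The logarithmic scaling keeps a term repeated many times within one abstract from dominating the vector.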
