In-Person Poster presentation / poster accept

KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals

Sandeep Silwal ⋅ Sara Ahmadian ⋅ Andrew Nystrom ⋅ Andrew McCallum ⋅ Deepak Ramachandran ⋅ Seyed Mehran Kazemi

Keywords: text clustering budgeted clustering correlation clustering query efficient weak and strong signals learning-augmented algorithms query efficient clustering Unsupervised and Self-supervised learning

2023 In-Person Poster presentation / poster accept

[ Poster] [ OpenReview]

Abstract

The unprecedented rate at which the sizes of machine learning (ML) models are growing necessitates novel approaches to enable efficient and scalable solutions. We contribute to this line of work by studying a novel version of the Budgeted Correlation Clustering problem (\bcc) where along with a limited number of queries to an expensive oracle for node similarities (e.g. a large ML model), we have unlimited access to a cheaper but less accurate second oracle. Our formulation is inspired by many practical scenarios where coarse approximations of the expensive similarity metric can be efficiently obtained via weaker models. We develop a theoretically motivated algorithm in this setting that leverages the cheap oracle to judiciously query the strong oracle while maintaining high clustering quality. We empirically demonstrate gains in query minimization and clustering metrics on a variety of datasets with diverse strong and cheap oracles. Most notably, we demonstrate a practical application in text clustering based on expensive cross-attention language models by showing that cheaper (but weaker) embedding-based models can be leveraged to substantially reduce the number of inference calls to the former.

Video

Chat is not available.