Keywords: [ low-rank approximation ] [ numerical linear algebra ] [ Latent Simplex ]

Abstract:
We consider the problem of learning a latent $k$-vertex simplex $K\in\mathbb{R}^d$, given $\mathbf{A}\in\mathbb{R}^{d\times n}$, which can be viewed as $n$ data points that are formed by randomly perturbing some latent points in $K$, possibly beyond $K$. A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast in this view of learning a latent simplex. Bhattacharyya and Kannan (SODA 2020) give an algorithm for learning such a $k$-vertex latent simplex in time roughly $O(k\cdot\text{nnz}(\mathbf{A}))$, where $\text{nnz}(\mathbf{A})$ is the number of non-zeros in $\mathbf{A}$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $\mathbf{A}$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply a better low rank approximation algorithm than what is known.
We obtain a spectral low-rank approximation to $\mathbf{A}$ in input-sparsity time and show that the column space thus obtained has small $\sin\Theta$ (angular) distance to the right top-$k$ singular space of $\mathbf{A}$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product (in absolute value) with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $\Theta(k\cdot\text{nnz}(\mathbf{A}))$ running time.