Learning Low-Rank Document Embeddings with Weighted Nuclear Norm Regularization

10/27/2020
by Katharina Morik, et al.

Recently, neural embeddings of documents have shown success in various language processing tasks. These low-dimensional, dense feature vectors of text documents capture semantic similarities better than traditional methods. However, the underlying optimization problem is non-convex and usually solved using stochastic gradient descent. Hence solutions are most likely sub-optimal and not reproducible, as they are the result of a randomized algorithm. We present an alternative formulation for learning low-rank representations based on convex optimization. Instead of explicitly learning low-dimensional features, we compute a low-rank representation implicitly by regularizing full-dimensional solutions. Our approach uses the weighted nuclear norm, a regularizer that penalizes the singular values of matrices. We optimize the regularized objective using accelerated proximal gradient descent. We apply the approach to learn embeddings of documents. These embeddings are guaranteed to converge to a global optimum in a deterministic manner. We show that our convex approach outperforms traditional non-convex approaches in a numerical study. Furthermore, we demonstrate that the embeddings are useful for detecting similarities on a standard dataset. Finally, we apply our approach in an interdisciplinary research project to detect topics in religious online discussions. The topic descriptions obtained from a clustering of the embeddings are coherent and insightful.

An earlier version of this work stated that the weighted nuclear norm is a convex regularizer. This is wrong: the weighted nuclear norm is non-convex, even though its name falsely suggests that it is a matrix norm.
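The abstract does not spell out the paper's objective, so the following is a minimal sketch of the described approach under simple assumptions: a squared Frobenius loss against a target matrix M (for instance, a document-term or PPMI co-occurrence matrix; this choice is an assumption, not the paper's stated loss), a weighted nuclear norm penalty whose proximal step reduces to weighted singular value thresholding when the weights are non-descending (Gu et al., 2014), and a FISTA-style accelerated proximal gradient loop. The function names are hypothetical.

```python
import numpy as np

def prox_weighted_nuclear_norm(Y, weights, step):
    """Weighted singular value thresholding.

    Solves min_X 0.5*||X - Y||_F^2 + step * sum_i weights[i] * sigma_i(X)
    in closed form when the weights are non-descending (Gu et al., 2014);
    for general weights this thresholding is only a heuristic.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - step * weights[: s.size], 0.0)
    return (U * s_shrunk) @ Vt  # scale columns of U by shrunken singular values

def learn_embeddings(M, weights, step=1.0, n_iter=200):
    """FISTA-style accelerated proximal gradient descent for
    min_X 0.5*||X - M||_F^2 + sum_i weights[i] * sigma_i(X).

    M is a stand-in target matrix; the paper's actual loss is not
    given in the abstract.
    """
    X = np.zeros_like(M)
    Z = X.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = Z - M  # gradient of the smooth squared-loss term
        X_next = prox_weighted_nuclear_norm(Z - step * grad, weights, step)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = X_next + ((t - 1.0) / t_next) * (X_next - X)  # Nesterov momentum
        X, t = X_next, t_next
    return X

# Toy usage: larger weights on the trailing singular values push them to
# zero, so the regularized solution X is (approximately) low-rank.
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 50))
weights = np.linspace(0.1, 5.0, 50)  # non-descending weights
X = learn_embeddings(M, weights)
print(np.linalg.matrix_rank(X, tol=1e-6))
```

Low-dimensional document embeddings could then be read off the low-rank solution, e.g., via a truncated SVD of X, rather than being parameterized explicitly as in SGD-trained neural embeddings.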
