
MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is the problem of determining two nonnegative low-rank factors W and H, for a given input matrix A, such that A ≈ WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solve alternating nonnegative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices whose sizes span from a few hundred million to a few billion entries. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.
09/28/2016 ∙ by Ramakrishnan Kannan, et al.
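As an illustration of the alternating-updating scheme this framework targets, here is a minimal single-node NumPy sketch of the Multiplicative Update variant. The function name and parameters are illustrative only; this is not the paper's MPI implementation, just the serial update rule it parallelizes.

```python
import numpy as np

def nmf_mu(A, k, iters=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF: find W >= 0, H >= 0 with A ~ W @ H."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Elementwise multiplicative updates keep both factors nonnegative;
        # eps guards against division by zero.
        H *= (W.T @ A) / ((W.T @ W) @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Usage: approximate a small nonnegative matrix with rank-2 factors.
A = np.random.default_rng(1).random((20, 12))
W, H = nmf_mu(A, k=2)
err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Each iteration monotonically decreases the Frobenius-norm error; the distributed framework in the paper reorganizes exactly these matrix products to minimize communication.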

A Practical Randomized CP Tensor Decomposition
The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri-Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization.
01/23/2017 ∙ by Casey Battaglino, et al.
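The core idea of randomizing an overdetermined least squares solve can be shown on a single tall problem. This toy sketch uses plain uniform row sampling (a simplification; the paper exploits Khatri-Rao structure for efficient sampling), and all names here are illustrative:

```python
import numpy as np

def sampled_lstsq(A, b, s, seed=0):
    """Approximately solve min ||Ax - b|| for tall A (m >> n) by solving
    the least squares problem restricted to s uniformly sampled rows."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[0], size=s, replace=False)
    x, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((10000, 10))       # 10000 equations, 10 unknowns
x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(10000)

x_full = np.linalg.lstsq(A, b, rcond=None)[0]  # full solve: 10000 rows
x_samp = sampled_lstsq(A, b, s=500)            # sampled solve: 500 rows
```

The sampled solution is close to the full one at a twentieth of the row count, which is the source of the speedups reported for CP-ALS.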

The geometry of rank decompositions of matrix multiplication II: 3×3 matrices
This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank-23 decompositions for the 3×3 matrix multiplication tensor M_⟨3⟩. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and admits an unexpected symmetry group of order 12. We establish basic information regarding symmetry groups of decompositions and outline two approaches for finding new rank decompositions of M_⟨n⟩ for larger n.
01/02/2018 ∙ by Grey Ballard, et al.
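To make "rank decomposition of the matrix multiplication tensor" concrete, here is the classical rank-7 decomposition of M_⟨2⟩ (Strassen's algorithm) as runnable code: seven bilinear products suffice to multiply 2×2 matrices instead of the naive eight. The paper's rank-23 decompositions of M_⟨3⟩ play the same role with 23 products instead of 27 (this example is a standard illustration, not one of the paper's decompositions).

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply 2x2 matrices with 7 multiplications: a rank-7
    decomposition of the matrix multiplication tensor M_<2>."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Each entry of C = A @ B is a signed sum of the 7 products.
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A = np.random.default_rng(0).random((2, 2))
B = np.random.default_rng(1).random((2, 2))
C = strassen_2x2(A, B)
```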

A 3D Parallel Algorithm for QR Decomposition
Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
05/14/2018 ∙ by Grey Ballard, et al.
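A serial sketch of the communication-avoiding idea behind such algorithms: TSQR factors a tall-skinny matrix by local QRs on row blocks, then a single QR of the stacked small R factors, trading a larger flop count for fewer large messages. This sketch simulates one reduction level on one node and is illustrative only, not the paper's 3D algorithm:

```python
import numpy as np

def tsqr_r(A, blocks=4):
    """Compute the R factor of a tall-skinny A via one level of TSQR:
    local QR on each row block, then QR of the stacked R factors."""
    parts = np.array_split(A, blocks, axis=0)   # row blocks (one per "processor")
    Rs = [np.linalg.qr(p)[1] for p in parts]    # local n-by-n triangles
    return np.linalg.qr(np.vstack(Rs))[1]       # combine: QR of a small matrix

A = np.random.default_rng(3).standard_normal((1000, 6))
R = tsqr_r(A)
R_ref = np.linalg.qr(A)[1]
```

Both triangles satisfy Rᵀ R = Aᵀ A, so they agree up to the sign of each row, which is why only the small R factors ever need to be communicated.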

Joint 3D Localization and Classification of Space Debris using a Multispectral Rotating Point Spread Function
We consider the problem of joint three-dimensional (3D) localization and material classification of unresolved space debris using a multispectral rotating point spread function (RPSF). The use of an RPSF allows one to estimate the 3D locations of point sources from their rotated images acquired by a single 2D sensor array, since the amount of rotation of each source image about its (x, y) location depends on its axial distance z. Using multispectral images, with one RPSF per spectral band, we are able not only to localize the 3D positions of the space debris but also to classify their material composition. We propose a three-stage method for achieving joint localization and classification. In Stage 1, we adopt an optimization scheme for localization in which the spectral signature of each material is assumed to be uniform, which significantly improves efficiency and yields better localization results than possible with a single spectral band. In Stage 2, we estimate the spectral signature and refine the localization result via an alternating approach. In Stage 3, we perform the classification. Both Poisson noise and Gaussian noise models are considered, and the implementation of each is discussed. Numerical tests using multispectral data from NASA show the efficiency of our three-stage approach and illustrate the improvement in point source localization and spectral classification from using multiple bands over a single band.
06/11/2019 ∙ by Chao Wang, et al.

TuckerMPI: A Parallel C++/MPI Software Package for Large-Scale Data Compression via the Tucker Tensor Decomposition
Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, in contrast to most compression methods, which focus on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 terabyte and 6.7 terabyte data sets distributed across 100s of nodes (1000s of MPI processes), achieving compression ratios between 100× and 200,000× (which equates to 99% to 99.999% compression, depending on the desired accuracy) in substantially less time than it would take to even read the same data set from a parallel file system. Moreover, we show that our method also allows for reconstruction of partial or downsampled data on a single node, without a parallel computer, so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single downsampled time step or computing summary statistics.
01/18/2019 ∙ by Grey Ballard, et al.
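The truncated Tucker idea can be sketched serially with a plain higher-order SVD in NumPy. This is illustrative only (function names and the one-pass HOSVD are ours; TuckerMPI's parallel kernels and truncation strategy differ), but it shows how a small core plus factor matrices compress a tensor whose multilinear rank is low:

```python
import numpy as np

def ttm(X, U, n):
    """Mode-n product: multiply tensor X by matrix U along mode n."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

def hosvd(X, ranks):
    """Truncated higher-order SVD: returns core G and orthonormal factors
    Us with X ~ ttm(ttm(ttm(G, U0, 0), U1, 1), U2, 2)."""
    Us = []
    for n, r in enumerate(ranks):
        Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)  # mode-n unfolding
        Us.append(np.linalg.svd(Xn, full_matrices=False)[0][:, :r])
    G = X
    for n, U in enumerate(Us):
        G = ttm(G, U.T, n)  # project each mode onto its leading subspace
    return G, Us

def reconstruct(G, Us):
    X = G
    for n, U in enumerate(Us):
        X = ttm(X, U, n)
    return X

# Build a 10 x 12 x 14 tensor with exact multilinear rank (2, 3, 4);
# truncated HOSVD then recovers it from a much smaller core.
rng = np.random.default_rng(5)
core = rng.random((2, 3, 4))
facs = [np.linalg.qr(rng.random((s, r)))[0]
        for s, r in zip((10, 12, 14), (2, 3, 4))]
X = reconstruct(core, facs)
G, Us = hosvd(X, (2, 3, 4))
rel_err = np.linalg.norm(reconstruct(G, Us) - X) / np.linalg.norm(X)
```

For data with genuine global low-rank structure, the stored core and factors are far smaller than the original tensor, which is the source of the compression ratios quoted above.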

Parallel Nonnegative CP Decomposition of Dense Tensors
The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the matricized-tensor times Khatri-Rao product (MTTKRP) bottleneck subcomputation. The algorithm is computationally efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication-efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.
06/19/2018 ∙ by Grey Ballard, et al.
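The MTTKRP bottleneck kernel has a concise form for a 3-way tensor. This single-node NumPy sketch (names and index convention are ours, not the paper's) verifies an einsum formulation for mode 0 against the explicit unfolding times Khatri-Rao product:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column r is kron(B[:, r], C[:, r])."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def mttkrp_mode0(X, B, C):
    """Mode-0 MTTKRP of a 3-way tensor, computed without ever forming
    the large J*K-by-R Khatri-Rao matrix explicitly."""
    return np.einsum('ijk,jr,kr->ir', X, B, C)

rng = np.random.default_rng(4)
X = rng.random((5, 6, 7))
B, C = rng.random((6, 3)), rng.random((7, 3))

fast = mttkrp_mode0(X, B, C)                  # contracted in place
ref = X.reshape(5, -1) @ khatri_rao(B, C)     # explicit Khatri-Rao matrix
```

Avoiding the explicit Khatri-Rao matrix is exactly what makes MTTKRP the target for parallelization: the contraction touches the whole tensor once per mode, and dimension trees let the alternating method reuse partial contractions across modes.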