### GETF

Geometric expansion of tensor factorization


Boolean tensors have been broadly utilized to represent high-dimensional logical data collected on spatial, temporal, and/or other relational domains. Boolean Tensor Decomposition (BTD) factorizes a binary tensor into the Boolean sum of multiple rank-1 tensors, which is an NP-hard problem. Existing BTD methods have been limited by their high computational cost in applications to large-scale or higher-order tensors. In this work, we present a computationally efficient BTD algorithm, namely Geometric Expansion for all-order Tensor Factorization (GETF), that sequentially identifies the rank-1 basis components of a tensor from a geometric perspective. We conducted rigorous theoretical analysis of the validity and the algorithmic efficiency of GETF in decomposing all-order tensors. Experiments on both synthetic and real-world data demonstrated that GETF significantly improves reconstruction accuracy and the extraction of latent structures, and that it is an order of magnitude faster than other state-of-the-art methods.


A tensor is a multi-dimensional array that can effectively capture complex multidimensional features. A Boolean tensor is a tensor that assumes binary values endowed with the Boolean algebra. Boolean tensors have been widely adopted in many fields, including dynamic networks, knowledge graphs, recommendation systems, and spatial-temporal data [wan2019mebf, rukat2018probabilistic, hore2016tensor, zhou2013tensor, bi2018multilayer]. Tensor decomposition is a powerful tool for extracting meaningful latent structures in data, for which the popular CANDECOMP/PARAFAC (CP) decomposition is a generalization of the matrix singular value decomposition to tensors [carroll1970analysis]. However, these algorithms are not directly usable for Boolean tensors. In this study, we focus on Boolean tensor decomposition (BTD) under a framework similar to the CP decomposition.

As illustrated in Figure 1, BTD factorizes a binary tensor $\mathcal{X}$ as the Boolean sum of multiple rank-1 tensors. When the error distribution of the data is hard to model, BTD applied to binarized data can retrieve more desirable patterns with better interpretation than regular tensor decomposition [miettinen2009matrix, rukat2017bayesian]. This is probably due to the robustness of the logic representation of BTD. BTD is an NP-hard problem [miettinen2009matrix]. Existing BTD methods suffer from low efficiency due to high space/time complexity; in particular, most BTD algorithms adopt a least-squares updating approach with substantially high computational cost [miettinen2010sparse, park2017fast]. This has hindered their application both to large-scale datasets, such as social network or genomics data, and to tensors of high order.

We propose an efficient BTD algorithm motivated by the geometric underpinning of rank-1 tensor bases, namely GETF (Geometric Expansion for all-order Tensor Factorization). To the best of our knowledge, GETF is the first algorithm that can efficiently deal with all-order Boolean tensor decomposition with an $O(n)$ complexity, where $n$ represents the total number of entries in a tensor. Supported by rigorous theoretical analysis, GETF solves the BTD problem by sequentially identifying the fibers that most likely coincide with a rank-1 tensor component. Our experiments on synthetic and real-world data validated the high accuracy of GETF and its drastically improved efficiency compared with existing methods, in addition to its potential utility on large-scale or high-order data, such as complex relational or spatial-temporal data. The key contributions of this study include: (1) our proposed GETF is the first method capable of all-order Boolean tensor decomposition; (2) GETF has substantially increased accuracy in identifying true rank-1 patterns, with less than a tenth of the computational cost of state-of-the-art methods; (3) we provide thorough theoretical foundations for the geometric properties of the BTD problem.

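Notations in this study follow those in [kolda2009tensor]. We denote the order of a tensor as $k$, which is also called ways or modes. Scalar value, vector, matrix, and higher-order tensor are represented as lowercase character $x$, bold lowercase character $\mathbf{x}$, uppercase character $X$, and Euler script $\mathcal{X}$, respectively. A superscript with the mark $\times$ indicates the size and dimension of a vector, matrix, or tensor, while a subscript specifies an entry. Specifically, a $k$-order tensor is denoted as $\mathcal{X}^{m_1 \times m_2 \times \cdots \times m_k}$ and the entry at position $(i_1, i_2, \ldots, i_k)$ is represented as $\mathcal{X}_{i_1 i_2 \ldots i_k}$. For a 3-order tensor, we denote its fibers as $\mathcal{X}_{: i_2 i_3}$, $\mathcal{X}_{i_1 : i_3}$, or $\mathcal{X}_{i_1 i_2 :}$ and its slices as $\mathcal{X}_{i_1 : :}$, $\mathcal{X}_{: i_2 :}$, $\mathcal{X}_{: : i_3}$. For a $k$-order tensor, we denote its mode-$p$ fiber as $\mathcal{X}_{i_1 \ldots i_{p-1} : i_{p+1} \ldots i_k}$, with all indices fixed except for $i_p$. $\|\mathcal{X}\|$ represents the norm of a tensor, and $|\mathcal{X}|$ the $L_1$ norm in particular. The basic Boolean operations include $\wedge$ (and), $\vee$ (or), and $\neg$ (not). Boolean entry-wise sum, subtraction, and product of two matrices are denoted as $A \oplus B = A \vee B$, $A \ominus B = (A \wedge \neg B) \vee (\neg A \wedge B)$, and $A \circledast B = A \wedge B$. The outer Boolean product in this paper is considered as the addition of rank-1 tensors, which in general follows the scope of the canonical polyadic (CP) decomposition [carroll1970analysis]. Specifically, a 3-order rank-1 tensor can be represented as the Boolean outer product of three vectors, i.e., $\mathcal{X}^{m_1 \times m_2 \times m_3} = \mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c}$. Similarly, a $k$-order tensor $\mathcal{X}$ of rank $l$ is the outer product of $A^{1}, \ldots, A^{k}$, i.e., $\mathcal{X} = A^{1} \otimes A^{2} \otimes \cdots \otimes A^{k}$ and $\mathcal{X}_{i_1 i_2 \ldots i_k} = \bigvee_{j=1}^{l} \bigwedge_{i=1}^{k} A^{i}_{i_i j}$, where $A^{1}_{:j} \otimes \cdots \otimes A^{k}_{:j}$, $j = 1, \ldots, l$, represents the rank-1 tensor components of a rank-$l$ CP decomposition of $\mathcal{X}$. In this paper, we denote $A^{i}$ as the pattern matrix of the $i$th order of $\mathcal{X}$, its $j$th column $A^{i}_{:j}$ as the $j$th pattern fiber of the $i$th order, and $A^{1}_{:j} \otimes \cdots \otimes A^{k}_{:j}$ as the $j$-th rank-1 tensor pattern.

As illustrated in Figure 1, for a binary $k$-order tensor $\mathcal{X} \in \{0,1\}^{m_1 \times m_2 \times \cdots \times m_k}$ and a convergence criterion parameter, the binary tensor decomposition problem is to identify low-rank binary pattern matrices $A^{1*}, \ldots, A^{k*}$, the outer product of which best fits $\mathcal{X}$, where $A^{1*}, \ldots, A^{k*}$ are matrices of $l$ columns. In other words,

$$(A^{1*}, \ldots, A^{k*}) = \arg\min_{A^{1}, \ldots, A^{k}} \gamma(A^{1}, \ldots, A^{k}; \mathcal{X}).$$

Here $\gamma$ is the cost function. In general, $\gamma$ is defined to be the reconstruction error $\|\mathcal{X} \ominus (A^{1} \otimes \cdots \otimes A^{k})\|_{L_p}$, and $p$ is usually set to be 1.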
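The Boolean CP model and its reconstruction error can be illustrated with a minimal NumPy sketch. The function names `boolean_cp_reconstruct` and `reconstruction_error` are hypothetical, and the error is assumed to be the count of mismatched entries (an $L_1$-type error on the entry-wise XOR); this is an illustration of the notation, not the authors' implementation.

```python
from functools import reduce

import numpy as np


def boolean_cp_reconstruct(factors):
    """Boolean sum (OR) of rank-1 tensors built from binary pattern matrices.

    factors: list of k binary matrices; the i-th has shape (m_i, l).
    """
    factors = [np.asarray(f, dtype=bool) for f in factors]
    rank = factors[0].shape[1]
    shape = tuple(f.shape[0] for f in factors)
    recon = np.zeros(shape, dtype=bool)
    for j in range(rank):
        # Boolean outer product: AND across the j-th column of every mode.
        recon |= reduce(np.logical_and.outer, [f[:, j] for f in factors])
    return recon


def reconstruction_error(x, factors):
    """Number of entries where x and the Boolean reconstruction disagree."""
    x = np.asarray(x, dtype=bool)
    return int(np.count_nonzero(x ^ boolean_cp_reconstruct(factors)))


# A rank-2, 3-order example with pattern matrices of two columns each.
A1 = np.array([[1, 0], [1, 0], [0, 1]])
A2 = np.array([[1, 0], [0, 1]])
A3 = np.array([[1, 1], [0, 1]])
X = boolean_cp_reconstruct([A1, A2, A3])
print(X.shape, int(X.sum()), reconstruction_error(X, [A1, A2, A3]))
```

Because the sum is Boolean, overlapping rank-1 components never produce entries larger than 1, which is the main difference from the real-valued CP reconstruction.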

In order of difficulty, Boolean tensor decomposition consists of three major tasks: Boolean matrix factorization (BMF, $k=2$) [stockmeyer1975set], three-way Boolean tensor decomposition (BTD, $k=3$), and higher-order Boolean tensor decomposition (HBTD, $k>3$) [leenen1999indclas]. All of them are NP-hard [miettinen2009matrix]. Numerous heuristic solutions for the BMF and BTD problems have been developed in the past two decades [miettinen2010sparse, miettinen2008discrete, miettinen2011boolean, miettinen2014mdl4bmf, erdos2013walk, karaev2015getting, lucchese2010mining, lucchese2013unifying].

For BMF, the ASSO algorithm is the first heuristic BMF approach; it finds binary patterns embedded within the row-wise correlation matrix [miettinen2008discrete]. PANDA [lucchese2010mining] instead sequentially retrieves the significant patterns from the current (residual) binary matrix amid noise. Recently, BMF algorithms using a Bayesian framework have been proposed [rukat2017bayesian]. The latent probability distribution adopted by Message Passing (MP) achieved top performance among the state-of-the-art methods for BMF [ravanbakhsh2016boolean].

For BTD, Miettinen et al. thoroughly defined the BTD problem ($k=3$) in 2011 [miettinen2011boolean] and proposed the use of least-squares updates as a heuristic solution. To address the scalability issue of the least-squares update, they later developed Walk'N'Merge, which applies random walks over a graph to identify dense blocks as proxies of rank-1 tensors [erdos2013walk]. Despite the increased scalability, Walk'N'Merge tends to pick up many small patterns, the addition of which does not necessarily decrease the loss function by much. The DBTF algorithm introduced by Park et al. is a parallel distributed implementation of the alternating least-squares update based on the Khatri-Rao matrix product [park2017fast]. Though DBTF reduces the high computational cost, its space complexity increases exponentially with the tensor order due to the Khatri-Rao product operation. Recently, Rukat et al. proposed a probabilistic solution to BTD, called Logistical Or Machine (LOM), with improved fitting accuracy, robustness to noise, and acceptable computational complexity [rukat2018probabilistic]. However, the high number of iterations it takes for the likelihood function to converge makes LOM prohibitive for large data analysis. Most importantly, to the best of our knowledge, none of the existing algorithms is designed to handle the HBTD problem for higher-order tensors.

GETF identifies the rank-1 patterns sequentially: it first extracts one pattern from the tensor, and the subsequent patterns are extracted sequentially from the residual tensor after removing the preceding patterns. We first derive the theoretical foundation of GETF. We show that the geometric property of the largest rank-1 pattern in a binary matrix developed in [wan2019mebf] can be naturally extended to higher-order tensors. We demonstrate that the true pattern fiber of the largest pattern can be effectively distinguished from fibers of overlapped patterns or errors by reordering the tensor to maximize its overlap with a left-triangular-like tensor. Based on this idea, the most likely pattern fibers can be directly identified by a newly developed geometric folding approach that circumvents heuristic greedy searching or alternating least-squares based optimization.

We first give the necessary definitions of the slice, reorder, and sum operations on a $k$-order tensor, followed by a theoretical analysis of the properties of a left-triangular-like (LTL) tensor.

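Definition 1 ($p$-order slice). The $p$-order slice of a tensor $\mathcal{X}^{m_1 \times \cdots \times m_k}$ indexed by $P$ is defined by $\mathcal{X}_{i_1 \ldots i_k}$, where $i_j$ is a fixed value if $j \notin P$, and $i_j$ is unfixed if $j \in P$; here $P \subset \{1, \ldots, k\}$ and $|P| = p$. Specifically, we denote a $p$-order slice of $\mathcal{X}$ with the index set $P$ unfixed as $\mathcal{X}_{P, \bar{I}}$ or $\mathcal{X}_{P}$, in which $P$ is the unfixed index set and $\bar{I}$ are the fixed indices.

Definition 2 (Index Reordering Transformation, IRT). The index reordering transformation (IRT) transforms a tensor $\mathcal{X}^{m_1 \times \cdots \times m_k}$ to $\tilde{\mathcal{X}} = \mathcal{X}_{P_1, P_2, \ldots, P_k}$, where $P_1, \ldots, P_k$ are any permutation of the index sets $\{1, \ldots, m_1\}, \ldots, \{1, \ldots, m_k\}$.

Definition 3 (Tensor slice sum). The tensor slice sum of a $k$-order tensor $\mathcal{X}$ with respect to the index set $P$ is defined as $\mathrm{Xsum}(\mathcal{X}, P) \triangleq \sum_{i_j,\, j \in P} \mathcal{X}_{i_1 \ldots i_k}$. $\mathrm{Xsum}(\mathcal{X}, P)$ results in a $(k - |P|)$-order tensor.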

Definition 4 (p-left-triangular-like, p-LTL). A $k$-order tensor is called p-LTL, if any of its -order slice, , and , and .

Definition 5 (flat 2-LTL). A $k$-order 2-LTL tensor is called flat 2-LTL within an error range , if any of its -order slice, , and , and .

Definition 5 indicates that the tensor slice sum over any such slice of a flat 2-LTL tensor is close to a linear function, with the largest error less than the given error range. Figure 2A,C illustrate two examples: a flat 2-LTL matrix and a flat 2-LTL 3-order tensor. By definition, the non-right-angle side of a flat 2-LTL $k$-order tensor is close to a $(k-1)$-dimensional plane, which is called the k-1 dimension plane of the flat 2-LTL tensor in the rest of this paper.
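The tensor slice sum of Definition 3 maps directly onto a summation over a set of array axes. The sketch below is illustrative only: `tensor_slice_sum` is a hypothetical name, modes are 0-based, and the small staircase matrix is a made-up example of a left-triangular-like shape whose slice sums fall on a line.

```python
import numpy as np


def tensor_slice_sum(x, index_set):
    """Sum a tensor over the modes in index_set (0-based).

    The result has x.ndim - len(index_set) remaining modes.
    """
    axes = tuple(sorted(set(index_set)))
    return np.asarray(x, dtype=np.int64).sum(axis=axes)


# A left-triangular-like binary matrix: every row is a prefix of ones.
M = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])
print(tensor_slice_sum(M, {1}))  # row sums: [3 2 1], a linear decrease

# Summing a 3-order tensor over two modes leaves a 1-order tensor.
T = np.ones((4, 3, 2), dtype=int)
print(tensor_slice_sum(T, {1, 2}).shape)  # (4,)
```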

Lemma 1. Assume $\mathcal{X}$ is a $k$-order flat 2-LTL tensor and has no zero fibers. Then the largest rank-1 subarray in $\mathcal{X}$ is seeded where one of the pattern fibers is parallel to the fiber anchored on the segmenting point along the sides of the right angle.

Lemma 2 (Geometric perspective in seeding the largest rank-1 pattern). For a sufficiently sparse $k$-order tensor and a given tensor size threshold, if its largest rank-1 pattern tensor is larger than the threshold, the IRT that reorders the tensor into a (k-1)-LTL tensor reorders the largest rank-1 pattern into a consecutive block, which maximizes the size of the connected solid shape overlapped with the k-1 dimension plane over a flat 2-LTL tensor larger than the threshold.

Lemma 3. If a $k$-order tensor can be transformed into a p-LTL tensor by IRT, the p-LTL tensor is unique, p=2,…,k-1.

Lemma 4. If a $k$-order tensor is p-LTL, then it is x-LTL for all x=p,p+1,…,k.

Detailed proofs of Lemmas 1-4 are given in the APPENDIX. Lemmas 1 and 2 reflect our geometric perspective in identifying the largest rank-1 pattern and seeding the most likely pattern fibers. Specifically, Lemma 1 suggests the optimal position of the fiber that is most likely the pattern fiber of the largest rank-1 pattern tensor under a flat 2-LTL tensor. Figure 2B,D illustrate the position (yellow dashed lines) of the most likely pattern fibers in the flat 2-LTL matrix and 3-order tensor. It is noteworthy that a (k-1)-LTL tensor must exist for a $k$-order tensor; it can be derived simply by reordering the indices of each tensor order in decreasing order of their tensor slice sums. However, not every $k$-order tensor can be transformed into a 2-LTL tensor via IRT. A (k-1)-LTL tensor with only one rank-1 pattern tensor is 2-LTL. Intuitively, the bottom-left corner of a $k$-order (k-1)-LTL tensor of the largest rank-1 pattern is also 2-LTL (Figure 2D). However, the existence of multiple rank-1 patterns, overlaps among patterns, and errors limit the 2-LTL property of the bottom-left corner of its (k-1)-LTL tensor. Lemma 2 suggests that the indices of the largest rank-1 pattern form the largest overlap between the (k-1)-LTL IRT and the k-1 dimension plane over a flat 2-LTL tensor. Based on this property, the largest rank-1 pattern and its most likely fiber can be seeded without heuristic greedy search or likelihood optimization, which substantially improves the computational efficiency. Lemmas 3 and 4 suggest that the (k-1)-LTL tensor is the IRT of the tensor that is closest to a 2-LTL tensor. Hence, how close the intersection between a (k-1)-LTL tensor and a 2-LTL sub-tensor is to a 2-LTL tensor reflects whether the optimal pattern fiber position derived in Lemma 1 fits the 2-LTL sub-tensor region of the (k-1)-LTL tensor.
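The reordering of each tensor order in decreasing order can be sketched as an index permutation per mode. The sketch below assumes that the sorting key for an index of a mode is the sum of the tensor over all other modes; `reorder_by_marginals` is a hypothetical name, and this is an illustration of the idea rather than the authors' IRT routine.

```python
import numpy as np


def reorder_by_marginals(x):
    """Permute every mode so that its marginal sums are non-increasing.

    Returns the reordered tensor and the list of permutations, one per mode.
    """
    x = np.asarray(x)
    perms = []
    for axis in range(x.ndim):
        other = tuple(a for a in range(x.ndim) if a != axis)
        marginal = x.sum(axis=other, dtype=np.int64)
        # Permuting one mode does not change the marginals of the others,
        # so all permutations can be computed on the original tensor.
        perms.append(np.argsort(-marginal, kind="stable"))
    return x[np.ix_(*perms)], perms


rng = np.random.default_rng(0)
X = rng.random((6, 5, 4)) < 0.3
X_reordered, perms = reorder_by_marginals(X)
print(X_reordered.sum(axis=(1, 2)))  # non-increasing along mode 0
```

Keeping the permutations makes it possible to map any pattern found in the reordered tensor back to the original index order.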

Based on the geometric property of the largest rank-1 pattern and its most likely pattern fibers, we developed GETF, an efficient BTD and HBTD algorithm that iteratively reorders the to-be-decomposed tensor into an LTL tensor and identifies the largest rank-1 pattern. The main algorithm of GETF iterates over the following five steps. Step 1: For a given tensor $\mathcal{X}$, in each iteration, GETF first reorders the indices of the current tensor into a (k-1)-LTL tensor by IRT (Figure 3A,D). Step 2: GETF uses the 2_LTL_projection algorithm to identify the flat 2-LTL tensor that maximizes the overlapped region between its k-1 dimension plane and the current (k-1)-LTL tensor (Figure 3B,E). Step 3: A Pattern_fiber_finding algorithm is applied to identify the most likely pattern fiber of the overlap region of the 2-LTL tensor and the (k-1)-LTL tensor, i.e., of the largest rank-1 pattern (Figure 4). Step 4: A Geometric_folding algorithm is applied to reconstruct, from the identified pattern fiber, the rank-1 tensor component that best fits the current to-be-decomposed tensor (Figure 5). Step 5: The identified rank-1 tensor component is removed from the current to-be-decomposed tensor (Figure 3C,F). The inputs of GETF include the to-be-decomposed tensor $\mathcal{X}$, a noise tolerance threshold parameter, a convergence criterion, and a pattern fiber searching indicator.
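The outer loop of this sequential scheme (extract a rank-1 component, remove it from the residual, repeat) can be sketched independently of how each pattern is found. In the sketch below, Steps 1-4 are abstracted into a caller-supplied `find_pattern` function; that interface, the names `sequential_decompose` and `single_entry_finder`, and the stopping rule based on `min_gain` are all assumptions for illustration. The toy finder is a deliberately trivial stand-in, not GETF's pattern search.

```python
from functools import reduce

import numpy as np


def rank1_tensor(fibers):
    """Boolean outer product of one binary pattern fiber per mode."""
    return reduce(np.logical_and.outer,
                  [np.asarray(f, dtype=bool) for f in fibers])


def sequential_decompose(x, find_pattern, max_patterns=10, min_gain=1):
    """Repeatedly extract a rank-1 pattern and remove it from the residual."""
    residual = np.asarray(x, dtype=bool).copy()
    patterns = []
    for _ in range(max_patterns):
        fibers = find_pattern(residual)
        if fibers is None:
            break
        component = rank1_tensor(fibers)
        # Assumed stopping rule: stop once a pattern explains too few ones.
        if np.count_nonzero(component & residual) < min_gain:
            break
        patterns.append(fibers)
        residual &= ~component  # Step 5: remove the identified component
    return patterns, residual


def single_entry_finder(residual):
    """Toy stand-in: return the rank-1 tensor covering one remaining entry."""
    idx = np.argwhere(residual)
    if idx.size == 0:
        return None
    fibers = []
    for axis, i in enumerate(idx[0]):
        f = np.zeros(residual.shape[axis], dtype=bool)
        f[i] = True
        fibers.append(f)
    return fibers


X = np.zeros((2, 3, 2), dtype=bool)
X[0, 0, 0] = X[1, 2, 1] = X[0, 1, 1] = True
patterns, residual = sequential_decompose(X, single_entry_finder)
print(len(patterns), bool(residual.any()))
```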

Details of GETF and its sub-algorithms are given in the APPENDIX. In Algorithm 1, o represents a direction of geometric folding, which is a permutation of $\{1, \ldots, k\}$. The 2_LTL_projection utilizes a projection function and a scoring function to identify the flat 2-LTL tensor that maximizes the solid overlapped region between its k-1 dimension plane and a (k-1)-LTL tensor. The Pattern_fiber_finding and Geometric_folding algorithms are described below. Note that there are $k$ directions of pattern fibers and $k!$ combinations of the orders in identifying them from a $k$-order tensor or reconstructing a rank-1 pattern from them. Empirically, to avoid duplicated computations, we found that conducting geometric folding $k$ times is sufficient to identify the fibers and reconstruct the suboptimal rank-1 pattern. GETF also provides options to set the number of rounds and the noise tolerance level of geometric folding in expanding a pattern fiber by adjusting the corresponding input parameters.

The Pattern_fiber_finding algorithm is developed based on Lemma 1. Its inputs include a $k$-order tensor and a direction vector. Even if the input is the entry-wise product of a flat 2-LTL tensor and the largest rank-1 pattern in a (k-1)-LTL tensor, it may still not be 2-LTL due to the existence of errors. We propose a recursive algorithm that repeatedly reorders one order of the input tensor and reports the coordinate of the pattern fiber on this order (see details in the APPENDIX). The output is the position of the pattern fiber.

Figure 4 illustrates the pattern fiber finding approach for a 3-order flat 2-LTL tensor. To identify the coordinates of the yellow-colored pattern fiber with the unfixed index of the 1st order (Figure 4A), its coordinate of the 2nd order is anchored on the 1/3 segmenting point (Figure 4B), and its coordinate of the 3rd order is anchored on the 1/2 segmenting point (Figure 4C).

The geometric folding approach reconstructs the best-fitting rank-1 tensor pattern from the pattern fiber identified by the Pattern_fiber_finding algorithm (see details in the APPENDIX). For a $k$-order tensor and the identified position of the pattern fiber (Figure 5A), the algorithm computes the inner product between the pattern fiber and each fiber parallel to it, generating a new $(k-1)$-order tensor (Figure 5B). This new tensor is then discretized based on a user-defined noise tolerance level, generating a new binary $(k-1)$-order tensor (Figure 5C). This procedure is called geometric folding of a $k$-order tensor into a $(k-1)$-order tensor based on the pattern fiber. It is conducted iteratively to fold the $k$-way tensor into a 2-dimensional matrix with $k-2$ rounds of Pattern_fiber_finding and Geometric_folding, identifying $k-2$ pattern fibers. The pattern fibers of the final 2-dimensional matrix are identified as a BMF problem by using MEBF [wan2019mebf]. The output of Geometric_folding is the set of pattern fibers of a rank-1 tensor (Figure 5E).
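A single folding step can be sketched with a tensor contraction followed by thresholding. The sketch below is a hedged illustration: `fold_along_mode` is a hypothetical name, modes are 0-based, and the discretization rule (keep an entry when its inner product reaches a fraction `tol` of the number of ones in the pattern fiber) is an assumed reading of the noise tolerance level, not the rule from the authors' code.

```python
import numpy as np


def fold_along_mode(x, mode, position, tol):
    """Fold a binary k-order tensor into a binary (k-1)-order tensor.

    mode:     the mode along which the pattern fiber runs.
    position: indices of the fiber in the remaining modes, in original order.
    tol:      assumed noise tolerance, as a fraction of the fiber's ones.
    """
    xm = np.moveaxis(np.asarray(x, dtype=np.int64), mode, 0)
    fiber = xm[(slice(None),) + tuple(position)]
    # Inner product of the pattern fiber with every fiber parallel to it.
    overlap = np.tensordot(fiber, xm, axes=(0, 0))
    size = int(fiber.sum())
    if size == 0:
        folded = np.zeros(overlap.shape, dtype=bool)
    else:
        folded = overlap >= tol * size
    return fiber.astype(bool), folded


# One rank-1 pattern a (x) b (x) c with a single missing entry as noise.
a = np.array([1, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 0], dtype=bool)
c = np.array([1, 0, 1], dtype=bool)
X = np.logical_and.outer(np.logical_and.outer(a, b), c)
X[0, 1, 0] = False
fiber, folded = fold_along_mode(X, mode=0, position=(0, 0), tol=0.6)
print(fiber.astype(int), folded.astype(int), sep="\n")
```

With `tol=0.6` the folded matrix still equals the outer product of `b` and `c` despite the missing entry; a stricter tolerance would drop the noisy position.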

Assume a $k$-order tensor has $n$ entries. The computation of 2_LTL_projection is fixed based on its screening range, which is smaller than $O(n)$. The computation of each Pattern_fiber_finding is $O(n)$. Geometric_folding is a loop algorithm consisting of additions and Pattern_fiber_finding, and folding a $k$-order tensor takes $O(n)$ computations. GETF conducts Geometric_folding $k$ times in each iteration to extract the suboptimal rank-1 tensor, so the overall computing cost of each iteration is $O(n)$. Hence GETF is an $O(n)$ complexity algorithm.

Lemmas 1, 3, and 4 are mathematically rigorous, while Lemma 2 is relatively descriptive because the errors and the level of overlap among pattern tensors cannot be generally formulated, especially in a high-order tensor. However, our derivations in the APPENDIX show that the geometric property described in Lemma 2 holds for most tensors whose pattern tensors are not heavily overlapped. The advantage of GETF is significant. The computational costs of the IRT and of the identification of the flat 2-LTL tensor that best crosses the largest pattern are both $O(n)$, where $n$ is the tensor size. The property of the position of the most likely pattern fiber enables circumventing heuristic greedy search or optimization for seeding the largest rank-1 pattern. Because the algorithm is heuristic, we focused the evaluation of its performance and robustness on an extensive set of synthetic data, to demonstrate that GETF is robust for high-order tensor decomposition with different levels of overlapped patterns and errors, followed by applications to real-world datasets.

We simulated 4 scenarios for each of the tensor orders $k = 2, 3, 4, 5$, which correspond to BMF, BTD, 4-order HBTD, and 5-order HBTD: (1) low-density tensor without error, (2) low-density tensor with error, (3) high-density tensor without error, and (4) high-density tensor with error. The detailed experiment setup is listed in the APPENDIX. We compared GETF with MP on the BMF settings and with LOM on the BTD settings, which, according to recent reviews, are the best-performing algorithms for the BMF and BTD problems, respectively [rukat2017bayesian, ravanbakhsh2016boolean]. The evaluation focuses on two metrics, time consumption and reconstruction error. For 4-order and 5-order HBTD, we evaluated GETF only, as there is no competing algorithm.

GETF significantly outperformed MP in reconstruction error (Figure 6A,B) and time consumption (Figure 6C) in all four scenarios. The same holds in comparison with LOM, except for the high-density, high-noise case, where GETF and LOM performed comparably in terms of reconstruction error (Figure 6G,H,I). We also evaluated each algorithm on different data scales in the supplementary materials, where GETF maintains better performance while being more than 10 times faster. Figure 6D-F,J-L show the capability of GETF in decomposing high-order tensor data. Notably, the reconstruction error curve of GETF flattened after reaching the true number of components (Figure 6A,B,D,E,G,H,J,K), which is 5, suggesting its high accuracy in identifying the true number of patterns. The error bars in Figure 6C,F,I,L stand for the standard deviation of time consumption. Importantly, as the tensor order increases, the tensor size increases exponentially. The high memory cost remains a challenge for higher-order tensors, for which an O(n) algorithm like GETF is extremely desirable. GETF showed consistent performance in the scenarios with and without noise. For a 5-way tensor, GETF completed the task in less than 1 minute. Overall, our experiments on synthetic datasets demonstrated the efficiency and robustness of GETF for data with different tensor orders, data sizes, signal densities, and noise levels.

We applied GETF to two real-world datasets, the Chicago crime record data (downloaded on March 1st, 2020 from https://data.cityofchicago.org/Public-Safety) and breast cancer spatial transcriptomics data (retrieved from https://www.spatialresearch.org/resources-published-datasets/doi-10-1126science-aaf2403/), which represent two scenarios with relatively lower and higher noise, respectively.

We retrieved the crime records in Chicago from 2001 to 2019 and organized them into a 4D tensor, with the four dimensions representing 436 regions, 365 dates, 19 years, and two crime categories (severe and non-severe), respectively, i.e., $\mathcal{X} \in \{0,1\}^{436 \times 365 \times 19 \times 2}$. An entry in the tensor has value 1 if a crime of the given category occurred in the region on the date of the year. We first benchmarked the performance of GETF and LOM on a 3D slice of the tensor. GETF showed a clear advantage over LOM, with a faster decline in reconstruction error. GETF plateaued after the first two patterns, while LOM needed more than eight (Figure 7B). We then applied GETF alone to the 4D tensor and used the top two patterns to reconstruct the original tensor. To look for the crime date pattern, the crime index of a region is defined as the total number of days in a year with crime occurrences in the region. We show that the reconstructed tensor is able to denoise the data and tease out the date patterns. As visualized in Figure 7C, red indicates regions of high crime index, and blue indicates regions of low crime index. Clearly, the GETF reconstructed tensor is able to distinguish the two groups of regions, whereas such a clear separation is not possible on the original tensor (Figure 7D). Next, we examined the validity of the two groups of regions with an external factor, regional crime counts, defined as the total number of crimes from 2001 to 2019 in a region. Figure 7E shows that the regions with a higher crime index according to GETF indeed correspond to regions with higher regional crime counts, and vice versa. In summary, we show that GETF is able to reveal the overall crime patterns by denoising the original tensor.
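The crime index defined above can be computed from a binary tensor with two reductions. The sketch below assumes the mode order (region, date, year, category) and assumes that a day counts once when a crime of either category occurred; `crime_index` is a hypothetical name and the tiny tensor is made-up data for illustration only.

```python
import numpy as np


def crime_index(x):
    """Days per year with at least one crime, for every region.

    x: binary tensor of shape (regions, dates, years, categories).
    Returns an integer array of shape (regions, years).
    """
    x = np.asarray(x, dtype=bool)
    any_crime = x.any(axis=3)      # collapse the crime categories
    return any_crime.sum(axis=1)   # count the days within each year


# Toy tensor: 2 regions, 4 dates, 2 years, 2 categories.
X = np.zeros((2, 4, 2, 2), dtype=bool)
X[0, 0, 0, 0] = X[0, 0, 0, 1] = True   # same day, both categories
X[0, 2, 0, 1] = True
X[1, 3, 1, 0] = True
print(crime_index(X))
```

Applying the same function to the original tensor and to a reconstruction from the top patterns gives two index maps that can be compared region by region.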

The breast cancer spatial transcriptomics dataset [staahl2016visualization, wu2019deep], shown in Figure 7F, was collected on 3D coordinates with 1020 cell positions, each of which has expression values of 13360 genes. The tensor was first binarized: an entry takes value 1 if the expression of the gene is larger than zero. We again benchmarked the performance of GETF and LOM on a 3D slice of the tensor. LOM failed to generate any useful information, as seen from its non-decreasing reconstruction error, possibly because of the high noise of the transcriptomics data. GETF, on the other hand, managed to derive patterns gradually (Figure 7I). We then applied GETF alone to the 4D tensor and, among the top 10 patterns, analyzed the two most extreme ones: the sparsest (red) and the densest (blue) (Figure 7F). The sparse pattern has 24 cell positions all expressing 232 genes, and the dense pattern has 90 cell positions expressing 40 genes. A lower-dimensional embedding of the 114 cells using UMAP [mcinnes2018umap] clearly showed them to be two distinct clusters (Figure 7J). We also conducted functional annotation using gene ontology enrichment analysis for the genes of the two patterns. Figure 7K,L show the enrichment significance of the top 5 pathways enriched by the genes in each pattern, assessed by the hypergeometric test. The results imply that genes in the densest pattern maintain the vibrancy of the cancer by showing strong activities in transcription and translation, while genes in the sparsest pattern maintain the tissue structure and suppress the anti-tumor immune effect. Our analysis demonstrated that GETF is able to reveal the complicated but integrated spatial structure of breast cancer tissue with different functionalities.

In this paper, we proposed GETF as the first efficient method for the all-way Boolean tensor decomposition problem. We provided rigorous theoretical analysis on the validity of GETF and conducted experiments on both synthetic and real-world datasets to demonstrate its effectiveness and computational efficiency. In the future, to enable the integration of prior knowledge, we plan to enhance GETF with constrained optimization techniques and we believe it can be beneficial for broader applications that desire a better geometric interpretation of the hidden structures.

GETF is a Boolean tensor factorization algorithm, which provides a solution to a fundamental mathematical problem. Hence we do not expect it to have a significant negative impact on society. The structure of binary data naturally encodes the structure of subspace clusters in the data. We consider that the efficient BTD and HBTD capability of GETF enables the seeding of patterns for subspace clustering identification or disentangled representation learning for data with unknown subspace structure, such as recommending different item classes to customers with unknown groups, or biomedical data of different patient classes. As we have demonstrated, the high computational efficiency of GETF grants the capability to analyze large or high-order tensor data; another field that can potentially benefit from GETF is inference on spatial-temporal data collected from mobile sensors. The high efficiency of GETF enables a possible implementation on smartphones for real-time inference on the data collected from the phones or other multi-modal personal wearable sensors.
