1 Introduction
Boolean matrix factorization (BMF), also known as Boolean matrix decomposition, is a powerful and widely used data mining tool. Like classical matrix factorization methods, e.g. nonnegative matrix factorization (NNMF) or singular value decomposition (SVD), BMF provides a different description (see Section 3.2) of Boolean data via new, more fundamental variables called factors. In BMF, a given input data matrix is approximated by a product of the so-called object-factor and factor-attribute matrices. All matrices contain zeros and ones only. The quality of the factorization, i.e. the quality of the factors themselves, is usually measured by two standard BMF measures: the number of factors and the coverage (how large a portion of the data is described by the factors, see Section 3). Both can be easily implemented in an arbitrary BMF algorithm; in fact, every successful BMF algorithm already utilizes them. Moreover, both are very important in the evaluation of the factorization quality [1]. On the other hand, other aspects of the quality of factors, e.g. interpretability, are often neglected in factor evaluation, although they are also an important part of matrix factorization.
By now, various approaches to assessing the quality of factors have been developed [1, 12]. One of the most fundamental, but surprisingly rarely used, is based on the well-known minimum description length (MDL) principle. In terms of MDL, the best factorization is the one with the minimal description. Due to the MDL principle, such a factorization is useful and easily interpretable. Nevertheless, it has been shown many times (see, e.g. [1, 2]) that mixing MDL and BMF produces poor results with respect to the standard BMF error measures (the number of factors and the coverage). More details are provided in Section 3.
Recent results [13] in the field of formal concept analysis (FCA), which is related to BMF (see Section 3.3), involving MDL motivate us to revisit the use of MDL in BMF.
We propose a new heuristic BMF algorithm for from-below matrix factorization that outperforms existing state-of-the-art algorithms and produces very good results w.r.t. the standard BMF measures. The algorithm utilizes formal concept analysis and the MDL principle. Additionally, we present an extensive experimental evaluation of the factors delivered by the proposed algorithm and compare it with several existing algorithms.
The rest of the paper is organized as follows. Section 2 provides a brief overview of the related work. Section 3 presents the notation used in the paper, a short introduction to BMF and MDL, and the background of the paper. Section 4 describes the design of our algorithm. The algorithm is experimentally evaluated in Section 5. Section 6 concludes the paper and outlines future research directions.
2 Related Work
In the last decade, many BMF methods have been developed [2, 3, 11, 12, 14, 19]. It was shown in [18] that applying existing non-Boolean methods (e.g. NNMF, SVD) to Boolean data is inappropriate, especially from the interpretation standpoint.
A good overview of BMF and related topics can be found, e.g., in [1, 2, 14]. In general, BMF and BMF algorithms are addressed in various papers involving formal concept analysis [3, 8], role mining [4], binary databases [6], or bipartite graphs [16].
In many applications of BMF, instead of a general Boolean factorization (which can be computed, for instance, by the well-known Asso algorithm), only a certain class of factorizations, the so-called from-below matrix factorization [2], is considered (see Section 3).
In recent years, the minimum description length principle [7] has been applied in BMF. It has been used mostly to solve the model order selection problem [15], i.e. the separation of global structure from noise, or as a factor selection criterion in BMF algorithms, e.g. in the state-of-the-art algorithm PaNDa+ [12] (an improved and generalized version of the PaNDa algorithm [11]). The Hyper algorithm [19] can be considered a special case of applying MDL in BMF: its objective is to minimize the description of the factors rather than the overall description length (for more details see [12]).
Another related work is [13], where a set of formal concepts with MDL is considered for the classification task. Our algorithm can be used for similar tasks. Unlike [13], our algorithm does not require computing the whole set of formal concepts, which makes it applicable in practice. Moreover, we use a different approach to measuring MDL.
This paper is, to the best of the authors' knowledge, the first to address from-below decomposition based on MDL.
3 Background and Basic Definitions
3.1 Notation
Throughout the paper we use matrix terminology and, where convenient, relational terminology. Matrices are denoted by upper-case bold letters ($\mathbf{A}$). $\mathbf{A}_{ij}$ denotes the entry corresponding to the row $i$ and the column $j$ of $\mathbf{A}$. The set of all $m \times n$ Boolean (binary) matrices is denoted by $\{0,1\}^{m \times n}$. The number of 1s in a Boolean matrix $\mathbf{A}$ is denoted by $\|\mathbf{A}\|$, i.e. $\|\mathbf{A}\| = \sum_{i,j} \mathbf{A}_{ij}$.
We interpret input data primarily as an object-attribute incidence matrix, i.e. a relation between the set $X$ of objects and the set $Y$ of attributes. That is, the entry $\mathbf{I}_{ij}$ is either $1$ or $0$, indicating that the object $i$ does or does not have the attribute $j$.
If $\mathbf{A} \in \{0,1\}^{m \times n}$ and $\mathbf{B} \in \{0,1\}^{m \times n}$, we have the following element-wise matrix operations. The Boolean sum $\mathbf{A} \oplus \mathbf{B}$ is the normal matrix sum where $1 + 1 = 1$. The Boolean subtraction $\mathbf{A} \ominus \mathbf{B}$ is the normal matrix subtraction where $0 - 1 = 0$.
3.2 Boolean Matrix Factorization
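A general aim in BMF is to find, for a given Boolean matrix $\mathbf{I} \in \{0,1\}^{m \times n}$, matrices $\mathbf{A} \in \{0,1\}^{m \times k}$ and $\mathbf{B} \in \{0,1\}^{k \times n}$ for which
$$\mathbf{I} \approx \mathbf{A} \circ \mathbf{B} \tag{1}$$
where $\circ$ is Boolean matrix multiplication, i.e. $(\mathbf{A} \circ \mathbf{B})_{ij} = \max_{l=1}^{k} \min(\mathbf{A}_{il}, \mathbf{B}_{lj})$, and $\approx$ represents approximate equality assessed by $\|\cdot\|$. The corresponding metric $E$ is defined for matrices $\mathbf{I} \in \{0,1\}^{m \times n}$, $\mathbf{A} \in \{0,1\}^{m \times k}$, and $\mathbf{B} \in \{0,1\}^{k \times n}$ by
$$E(\mathbf{I}, \mathbf{A} \circ \mathbf{B}) = \sum_{i=1}^{m} \sum_{j=1}^{n} \left| \mathbf{I}_{ij} - (\mathbf{A} \circ \mathbf{B})_{ij} \right| \tag{2}$$
A decomposition of $\mathbf{I}$ into $\mathbf{A} \circ \mathbf{B}$ may be interpreted as a discovery of $k$ factors that exactly or approximately explain the data: interpreting $\mathbf{I}$, $\mathbf{A}$, and $\mathbf{B}$ as the object-attribute, object-factor, and factor-attribute matrices, the model (1) has the following interpretation: the object $i$ has the attribute $j$, i.e. $\mathbf{I}_{ij} = 1$, if and only if there exists a factor $l$ such that $l$ applies to $i$ and $j$ is one of the particular manifestations of $l$.
Note also an important geometric view of BMF: a decomposition $\mathbf{I} \approx \mathbf{A} \circ \mathbf{B}$ with $k$ factors represents a coverage of the 1s in $\mathbf{I}$ by $k$ rectangular areas in $\mathbf{I}$ full of 1s; the $l$th rectangle is the cross-product (outer Boolean product) of the $l$th column in $\mathbf{A}$ and the $l$th row in $\mathbf{B}$. For more details see, e.g. [9].
If the rectangular areas cover only nonzero elements of the matrix $\mathbf{I}$, then $\mathbf{A} \circ \mathbf{B}$ is called a from-below matrix decomposition [2]. An example of the from-below BMF follows.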
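The definitions of this section can be made concrete with a minimal Python sketch. It assumes matrices are stored as lists of 0/1 rows; the helper names (`boolean_product`, `error`, `is_from_below`) and the toy data are hypothetical and do not come from the paper.

```python
from typing import List

Matrix = List[List[int]]


def boolean_product(A: Matrix, B: Matrix) -> Matrix:
    """Boolean matrix product: (A o B)_ij = max_l min(A_il, B_lj)."""
    k = len(B)
    n = len(B[0]) if B else 0
    return [[int(any(row[l] and B[l][j] for l in range(k))) for j in range(n)]
            for row in A]


def error(I: Matrix, P: Matrix) -> int:
    """Metric (2): number of entries in which I and P differ."""
    return sum(abs(a - b) for ri, rp in zip(I, P) for a, b in zip(ri, rp))


def is_from_below(I: Matrix, P: Matrix) -> bool:
    """True if P never puts a 1 where I has a 0."""
    return all(p <= i for ri, rp in zip(I, P) for i, p in zip(ri, rp))


# Toy 3 x 4 input and a factorization with k = 2 factors (illustrative only).
I = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]]
A = [[1, 0],
     [1, 0],
     [0, 1]]
B = [[1, 1, 0, 0],
     [0, 0, 1, 1]]

P = boolean_product(A, B)
print(P, error(I, P), is_from_below(I, P))
```

In this toy case the two rectangles leave a single 1 of the input uncovered and cover no 0, so the factorization is approximate and from-below.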
3.3 BMF with Help of Formal Concept Analysis
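Formal concept analysis (FCA) [5] provides a basic framework for dealing with factors. The main notion of FCA is a formal context, which is usually represented as a Boolean matrix. It is defined as a triple $\langle X, Y, I \rangle$, where $X$ is a nonempty set of objects, $Y$ is a nonempty set of attributes, and $I \subseteq X \times Y$ is a binary relation between $X$ and $Y$. Hence the formal context with $m$ objects and $n$ attributes is a Boolean matrix $\mathbf{I} \in \{0,1\}^{m \times n}$.
To every Boolean matrix $\mathbf{I} \in \{0,1\}^{m \times n}$, one may associate a pair $\langle {}^{\uparrow}, {}^{\downarrow} \rangle$ of operators (well known in FCA as the arrow operators) assigning to sets $C \subseteq X = \{1, \dots, m\}$ and $D \subseteq Y = \{1, \dots, n\}$ the sets $C^{\uparrow} \subseteq Y$ and $D^{\downarrow} \subseteq X$ defined by
$$C^{\uparrow} = \{ j \in Y \mid \forall i \in C : \mathbf{I}_{ij} = 1 \}, \qquad D^{\downarrow} = \{ i \in X \mid \forall j \in D : \mathbf{I}_{ij} = 1 \},$$
where $C^{\uparrow}$ is the set of all attributes (columns) shared by all objects (rows) in $C$ and $D^{\downarrow}$ is the set of all objects sharing all attributes in $D$.
A pair $\langle C, D \rangle$ for which $C^{\uparrow} = D$ and $D^{\downarrow} = C$ is called a formal concept. $C$ and $D$ are called the extent and the intent of the formal concept $\langle C, D \rangle$, respectively. The concepts are partially ordered as follows: $\langle C_1, D_1 \rangle \le \langle C_2, D_2 \rangle$ iff $C_1 \subseteq C_2$ (or, equivalently, $D_2 \subseteq D_1$); the pair $\langle C_1, D_1 \rangle$ is a subconcept of $\langle C_2, D_2 \rangle$, while $\langle C_2, D_2 \rangle$ is a superconcept of $\langle C_1, D_1 \rangle$. The set of all formal concepts is denoted by
$$\mathcal{B}(\mathbf{I}) = \{ \langle C, D \rangle \mid C \subseteq X,\ D \subseteq Y,\ C^{\uparrow} = D,\ D^{\downarrow} = C \}.$$
The whole set of partially ordered formal concepts is called the concept lattice of $\mathbf{I}$.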
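The arrow operators are straightforward to compute. The sketch below assumes 0-based row and column indices and a list-of-rows matrix; `up`, `down`, and `is_formal_concept` are hypothetical helper names and the matrix is toy data.

```python
from typing import FrozenSet, Iterable, List

Matrix = List[List[int]]


def up(I: Matrix, C: Iterable[int]) -> FrozenSet[int]:
    """C^up: attributes (columns) shared by all objects (rows) in C."""
    rows = list(C)
    return frozenset(j for j in range(len(I[0])) if all(I[i][j] for i in rows))


def down(I: Matrix, D: Iterable[int]) -> FrozenSet[int]:
    """D^down: objects (rows) that have all attributes (columns) in D."""
    cols = list(D)
    return frozenset(i for i in range(len(I)) if all(I[i][j] for j in cols))


def is_formal_concept(I: Matrix, C: Iterable[int], D: Iterable[int]) -> bool:
    """<C, D> is a formal concept iff C^up = D and D^down = C."""
    C, D = frozenset(C), frozenset(D)
    return up(I, C) == D and down(I, D) == C


# Toy 3 x 3 context (illustrative only).
I = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

extent = down(I, {0})      # objects having attribute 0
intent = up(I, extent)     # attributes shared by those objects
print(sorted(extent), sorted(intent), is_formal_concept(I, extent, intent))
```

Applying `down` and then `up` to any attribute set always yields a formal concept, which is the mechanism used later to extend a candidate by one column.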
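Given a set $\mathcal{F} = \{ \langle C_1, D_1 \rangle, \dots, \langle C_k, D_k \rangle \} \subseteq \mathcal{B}(\mathbf{I})$ (with a fixed indexing of the formal concepts $\langle C_l, D_l \rangle$), $\mathcal{F}$ induces the $m \times k$ and $k \times n$ Boolean matrices $\mathbf{A}_{\mathcal{F}}$ and $\mathbf{B}_{\mathcal{F}}$ by
$$(\mathbf{A}_{\mathcal{F}})_{il} = \begin{cases} 1 & \text{if } i \in C_l, \\ 0 & \text{if } i \notin C_l, \end{cases} \tag{3}$$
and
$$(\mathbf{B}_{\mathcal{F}})_{lj} = \begin{cases} 1 & \text{if } j \in D_l, \\ 0 & \text{if } j \notin D_l, \end{cases} \tag{4}$$
for $l = 1, \dots, k$. That is, the $l$th column of $\mathbf{A}_{\mathcal{F}}$ and the $l$th row of $\mathbf{B}_{\mathcal{F}}$ are the characteristic vectors of $C_l$ and $D_l$, respectively. The set $\mathcal{F}$ is also called a set of factor concepts. Clearly, $\mathbf{A}_{\mathcal{F}} \circ \mathbf{B}_{\mathcal{F}}$ is a from-below matrix decomposition of $\mathbf{I}$.
Example 2.
Let us consider the two factorizations depicted in Figure 2. The first one corresponds to the set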
The second one corresponds to the set
For more details on how formal concept analysis is utilized in BMF and on the advantages of this approach, see the pioneering work [3].
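Equations (3) and (4) translate directly into code. The following sketch builds the two factor matrices from a set of factor concepts and checks that their Boolean product stays below the input. The function names and the toy context are hypothetical, and indices are 0-based.

```python
from typing import FrozenSet, List, Tuple

Matrix = List[List[int]]
Concept = Tuple[FrozenSet[int], FrozenSet[int]]  # (extent C, intent D)


def factor_matrices(F: List[Concept], m: int, n: int) -> Tuple[Matrix, Matrix]:
    """Build A_F (m x k) and B_F (k x n) from the characteristic vectors."""
    A = [[int(i in C) for (C, _) in F] for i in range(m)]
    B = [[int(j in D) for j in range(n)] for (_, D) in F]
    return A, B


def boolean_product(A: Matrix, B: Matrix) -> Matrix:
    """(A o B)_ij = max_l min(A_il, B_lj)."""
    k = len(B)
    n = len(B[0]) if B else 0
    return [[int(any(row[l] and B[l][j] for l in range(k))) for j in range(n)]
            for row in A]


# Toy context and two of its formal concepts (illustrative only).
I = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
F = [(frozenset({0, 1}), frozenset({0, 1})),
     (frozenset({1, 2}), frozenset({1, 2}))]

A, B = factor_matrices(F, m=3, n=3)
P = boolean_product(A, B)
from_below = all(p <= i for ri, rp in zip(I, P) for i, p in zip(ri, rp))
print(A, B, P, from_below)
```

Because every formal concept is a rectangle full of 1s, the product can never cover a 0 of the input; in this toy case the two concepts even reproduce the input exactly.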
3.4 A Brief Introduction to MDL
The minimum description length (MDL) principle, which is a computable version of Kolmogorov complexity [7], is a formalization of the law of parsimony, well known as Occam's razor. In terms of MDL, it is formulated as follows: the best model is the model that ensures the best compression of the given data.
More formally, for a given set of models $\mathcal{M}$ and data $\mathcal{D}$ (in our case represented via a Boolean matrix $\mathbf{I}$), the best model $M \in \mathcal{M}$ is the one that minimizes the following cost function:
$$L(M) + L(\mathcal{D} \mid M) \tag{5}$$
where $L(M)$ is the encoding length of $M$ in bits and $L(\mathcal{D} \mid M)$ is the encoding length in bits of the data encoded with $M$.
In general, we are only interested in the length of the encoding, and not in the coding itself, i.e. we do not have to materialize the codes themselves.
Note that MDL requires the compression to be lossless in order to allow for a fair comparison between different models.
3.5 The Quality of Factorization
The quality of the obtained factorization (1) is usually evaluated via some variant of metric (2). From the BMF perspective there are two basic viewpoints, emphasizing the role of the first factors and the need to account for a prescribed portion of data, respectively. They are known as the discrete basis problem (DBP) and the approximate factorization problem (AFP), see [14] and [2, 3]. Both of them emphasize the coverage of data, i.e. the geometric view of BMF.
In many applications of BMF, the interpretation of factors plays a crucial role. It is then reasonable to emphasize, instead of the coverage of the obtained factorization, different quality measures that assess the interpretability of factors, e.g. MDL.
On the other hand, the geometric view of BMF is very important and an interpretable factorization should reflect it.
In the next section, we propose a novel BMF algorithm based on the well-known GreConD algorithm [3]. The algorithm computes a from-below factorization via minimization of the cost function (5). The results of the experiments show that it preserves a lot of information from the original data w.r.t. the error measure (2).
4 Design of Algorithm
4.1 MDL in Frombelow Matrix Factorization
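For matrices $\mathbf{I} \in \{0,1\}^{m \times n}$, $\mathbf{A} \in \{0,1\}^{m \times k}$, and $\mathbf{B} \in \{0,1\}^{k \times n}$, where $\mathbf{A} \circ \mathbf{B}$ is a from-below decomposition of $\mathbf{I}$, we define an error matrix $\mathbf{E}$ as follows:
$$\mathbf{E} = \mathbf{I} \ominus (\mathbf{A} \circ \mathbf{B}).$$
One may observe that the matrix $\mathbf{E}$ is directly tied to metric (2), i.e. $\|\mathbf{E}\| = E(\mathbf{I}, \mathbf{A} \circ \mathbf{B})$. Hence, to provide a lossless compression of $\mathbf{I}$ it is sufficient to encode the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{E}$, i.e. the MDL cost function (5) has the following form
$$L(\mathbf{A}) + L(\mathbf{B}) + L(\mathbf{E}) \tag{6}$$
According to the MDL principle, the best factorization of $\mathbf{I}$ minimizes function (6). In the following we explain how to compute the length of the encoding of the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{E}$ in bits. We use an approach similar to that of [15], modified for the from-below matrix factorization.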
More precisely, to use optimal prefix codes we need to encode both the dimensions of the matrices and the matrices themselves.
For the sake of simplicity, the dimensions may be encoded with block encoding.
To avoid introducing dependencies between factors, the matrices are encoded per factor, i.e. we encode $\mathbf{A}$ per column and $\mathbf{B}$ per row.
In order to use an optimal prefix code, we first need to encode the probability of encountering a 1 in a particular column or row, respectively, i.e. we need $\log m$ bits for each extent in the set $\mathcal{F}$ and $\log n$ bits for each intent in the set $\mathcal{F}$.
For simplicity, the extent and intent of a factor concept $\langle C, D \rangle$ can be seen as characteristic vectors, i.e. $C \in \{0,1\}^{m}$ and $D \in \{0,1\}^{n}$. We need to encode all ones and zeros. The length of the optimal code is determined by Shannon entropy, which gives the number of bits required for encoding the matrices $\mathbf{A}$ and $\mathbf{B}$.
In a similar way we can compute the number of bits required for encoding the matrix $\mathbf{E}$.
Note that we can encode the matrix $\mathbf{E}$ element by element without any influence, because its elements are clearly independent.
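The encoding described in this section can be sketched numerically. The code below is one plausible reading rather than the authors' exact formulas: it assumes $\log_2 m$ (resp. $\log_2 n$) bits for the number of 1s in each extent (resp. intent), an entropy-based code for the individual entries, an analogous code with $\log_2(mn)$ bits for $\mathbf{E}$, and it leaves out the cost of the matrix dimensions. The names `entropy_bits` and `mdl_cost` are hypothetical.

```python
from math import log2
from typing import FrozenSet, List, Tuple

Matrix = List[List[int]]
Concept = Tuple[FrozenSet[int], FrozenSet[int]]  # (extent C, intent D)


def entropy_bits(ones: int, length: int) -> float:
    """Shannon-optimal code length of a 0/1 vector with the given number of 1s."""
    if ones == 0 or ones == length:
        return 0.0
    p = ones / length
    return -ones * log2(p) - (length - ones) * log2(1 - p)


def mdl_cost(I: Matrix, F: List[Concept]) -> float:
    """Assumed form of cost (6): L(A_F) + L(B_F) + L(E), dimensions omitted."""
    m, n = len(I), len(I[0])
    cost = 0.0
    for C, D in F:
        cost += log2(m) + entropy_bits(len(C), m)  # one column of A_F
        cost += log2(n) + entropy_bits(len(D), n)  # one row of B_F
    covered = {(i, j) for C, D in F for i in C for j in D}
    # E holds the 1s of I that no factor covers (from-below setting).
    ones_in_E = sum(I[i][j] for i in range(m) for j in range(n)
                    if (i, j) not in covered)
    cost += log2(m * n) + entropy_bits(ones_in_E, m * n)
    return cost


# Toy 8 x 8 input with a single 4 x 4 block of 1s (illustrative only).
I = [[1 if i < 4 and j < 4 else 0 for j in range(8)] for i in range(8)]
block = (frozenset(range(4)), frozenset(range(4)))

print(mdl_cost(I, []), mdl_cost(I, [block]))
```

Under these assumptions, describing the block by one factor is much cheaper than encoding all of its 1s in the error matrix, which is the behaviour the MDL criterion is meant to reward.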
4.2 Algorithm
In this section we propose a BMF algorithm, called MDLGreConD (an abbreviation of Minimum Description Length Greedy Concept on Demand), that uses the MDL cost function described above. The algorithm is a modified version of the GreConD (Greedy Concept on Demand) algorithm [3], with which it shares a similar search strategy. GreConD is one of the most successful from-below matrix decomposition algorithms (see e.g. [1]).
Pseudocode of MDLGreConD is depicted in Algorithm 1. The algorithm works as follows.
The algorithm computes a candidate $\langle C, D \rangle$ for a factor concept that minimizes the cost function (6), whose current value is stored in the variable total_cost. This is done by searching for a promising column $j$ that is not included in $D$ (lines 8–21). Note that adding $j$ to $D$ is realized via the $\downarrow$ and $\uparrow$ operators mentioned in Section 3.3. Only the best column is considered (lines 16–20). If a new column is added to $D$, i.e. $\langle C, D \rangle$ is changed, the modified $\langle C, D \rangle$ is used as a new candidate and another promising column is searched for. If there is no column that reduces the cost function (line 6), the already computed candidate is added to the output set $\mathcal{F}$ of factor concepts. The algorithm ends if there is no candidate that allows for a reduction of the cost function.
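The search strategy described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the cost function is an assumed entropy-based stand-in for (6) without the dimension cost, and the names `bits`, `cost`, and `greedy_mdl_factorization` are hypothetical.

```python
from math import log2


def down(I, D):
    """Objects (rows) having all attributes in D."""
    return frozenset(i for i in range(len(I)) if all(I[i][j] for j in D))


def up(I, C):
    """Attributes (columns) shared by all objects in C."""
    return frozenset(j for j in range(len(I[0])) if all(I[i][j] for i in C))


def bits(ones, length):
    """Entropy-based code length of a 0/1 vector."""
    if ones in (0, length):
        return 0.0
    p = ones / length
    return -ones * log2(p) - (length - ones) * log2(1 - p)


def cost(I, F):
    """Assumed stand-in for cost (6); F must contain rectangles full of 1s."""
    m, n = len(I), len(I[0])
    covered = {(i, j) for C, D in F for i in C for j in D}
    uncovered = sum(map(sum, I)) - len(covered)
    model = sum(log2(m) + bits(len(C), m) + log2(n) + bits(len(D), n)
                for C, D in F)
    return model + log2(m * n) + bits(uncovered, m * n)


def greedy_mdl_factorization(I):
    n = len(I[0])
    F, total_cost = [], cost(I, [])
    while True:
        D = frozenset()
        candidate, candidate_cost = None, total_cost
        while True:  # grow the candidate one promising column at a time
            best, best_cost = None, candidate_cost
            for j in range(n):
                if j in D:
                    continue
                C2 = down(I, D | {j})
                if not C2:
                    continue
                D2 = up(I, C2)  # closing keeps <C2, D2> a formal concept
                c = cost(I, F + [(C2, D2)])
                if c < best_cost:
                    best, best_cost = (C2, D2), c
            if best is None:
                break
            candidate, candidate_cost = best, best_cost
            D = best[1]
        if candidate is None:  # no candidate reduces the cost any more
            return F, total_cost
        F.append(candidate)
        total_cost = candidate_cost


# Toy 8 x 8 input with a single 4 x 4 block of 1s (illustrative only).
I = [[1 if i < 4 and j < 4 else 0 for j in range(8)] for i in range(8)]
F, total_cost = greedy_mdl_factorization(I)
print(F, total_cost)
```

Each accepted factor must strictly lower the assumed cost, so the loop stops as soon as adding any further concept would make the description longer.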
4.3 Computational Complexity
The Boolean matrix factorization problem is NP-hard [17], as is the computation of a factorization that minimizes the cost function (6). The proposed algorithm is therefore heuristic. One may easily derive an exact algorithm with exponential time complexity, but such an algorithm is inapplicable in practice.
We do not provide a time complexity analysis, since time complexity is not the main concern of Boolean matrix factorization. The presented algorithm is only slightly slower than the GreConD algorithm, which is probably the fastest BMF algorithm (see e.g. [2]). Both are able to factorize all the data presented in Section 5 in the order of seconds on an ordinary PC.
5 Experimental Evaluation
In this section, the results of an experimental comparison of MDLGreConD with other BMF algorithms are presented.
5.1 Datasets
We use six different real-world datasets, namely Breast, Ecoli, Iris, and Mushroom from the UCI repository [10], and Domino and Emea from [4]. The characteristics of the datasets are shown in Table 1. All of them are well known and widely used as benchmark datasets in BMF.
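| dataset | size | dens. | no. of formal concepts |
|---|---|---|---|
| Breast | | | |
| Domino | | | |
| Ecoli | | | |
| Emea | | | |
| Iris | | | |
| Mushroom | | | |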
5.2 Algorithms
The GreConD algorithm [3] is based on an “on demand” greedy search for formal concepts of $\mathbf{I}$. It is designed to compute an exact from-below factorization. Instead of going through all formal concepts, which are the candidates for factor concepts, it constructs the factor concepts by sequentially adding “promising columns” to a candidate $\langle C, D \rangle$ for a factor concept. More formally, a new column $j$ that minimizes the error (2) of the resulting factorization is added to $D$. This is repeated until no such column exists; the candidate $\langle C, D \rangle$ is then added to the set $\mathcal{F}$. The algorithm ends if the error $E(\mathbf{I}, \mathbf{A}_{\mathcal{F}} \circ \mathbf{B}_{\mathcal{F}})$ is smaller than the prescribed parameter $\varepsilon$ or the prescribed number of factors is reached. For more details see [2]. Note that usually $\varepsilon = 0$, i.e. the whole matrix is covered by factors. This setting was adopted in our experiments.
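The greedy strategy of GreConD can be sketched in a few lines. The version below follows the usual description of the algorithm for the exact setting (all 1s covered): a column is promising if, after closing the candidate with the arrow operators, the resulting rectangle covers more still-uncovered 1s than before. The function names and the toy context are hypothetical, and indices are 0-based.

```python
def down(I, D):
    """Objects (rows) having all attributes in D."""
    return frozenset(i for i in range(len(I)) if all(I[i][j] for j in D))


def up(I, C):
    """Attributes (columns) shared by all objects in C."""
    return frozenset(j for j in range(len(I[0])) if all(I[i][j] for i in C))


def grecond(I):
    """Greedy from-below factorization that covers every 1 of I."""
    m, n = len(I), len(I[0])
    uncovered = {(i, j) for i in range(m) for j in range(n) if I[i][j]}
    factors = []
    while uncovered:
        D, gain = frozenset(), 0
        while True:
            best, best_gain = None, gain
            for j in range(n):
                if j in D:
                    continue
                C2 = down(I, D | {j})
                D2 = up(I, C2)
                # number of still-uncovered 1s inside the rectangle C2 x D2
                g = sum((i, k) in uncovered for i in C2 for k in D2)
                if g > best_gain:
                    best, best_gain = D2, g
            if best is None:
                break
            D, gain = best, best_gain
        C = down(I, D)
        factors.append((C, D))
        uncovered -= {(i, j) for i in C for j in D}
    return factors


# Toy 3 x 3 context (illustrative only).
I = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
factors = grecond(I)
print(factors)
```

Maximizing the number of newly covered 1s is the same as minimizing the remaining error (2) in the from-below setting, since no 0 of the input is ever covered.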
PaNDa+ [12] is an algorithmic framework based on the PaNDa algorithm [11]. The algorithm aims to extract a set $\mathcal{F}$ of pairs $\langle C, D \rangle$ that minimizes the cost function:
$$\sum_{\langle C, D \rangle \in \mathcal{F}} \left( |C| + |D| \right) + E(\mathbf{I}, \mathbf{A}_{\mathcal{F}} \circ \mathbf{B}_{\mathcal{F}}).$$
Every $\langle C, D \rangle$ in $\mathcal{F}$ is computed in two stages. In the first stage the core of $\langle C, D \rangle$ is computed; in the second stage the core is extended. A core is a rectangle, not necessarily a formal concept, contained in $\mathbf{I}$, and it is computed by adding columns from a sorted list. The extension to $\langle C, D \rangle$ is performed by adding columns and rows to the core as long as such an addition reduces the cost. Note that PaNDa does not produce a from-below factorization. The computation of PaNDa is driven by several parameters (see [12]). All of them are tuned for each dataset. The best obtained results are reported.
The Hyper algorithm [19] aims to extract a set $\mathcal{F}$ of pairs $\langle C, D \rangle$ that minimizes the cost function, which is defined as follows:
$$\sum_{\langle C, D \rangle \in \mathcal{F}} \left( |C| + |D| \right).$$
As candidates for factors, the set of all formal concepts $\mathcal{B}(\mathbf{I})$ together with all single-attribute rectangles in the data are considered. Each candidate is divided into a set of single-row rectangles that are sorted according to the number of uncovered elements in $\mathbf{I}$. Then the algorithm tries to add the single-row rectangles back to the candidate, as long as the above-mentioned cost function decreases. After this, the algorithm in each iteration selects the concept from the modified set of candidates that minimizes the cost function. The Hyper algorithm produces a from-below factorization. The size of $\mathcal{B}(\mathbf{I})$ can be exponentially large. In such a case Hyper has exponential time complexity. To reduce the computational cost, the authors of [19] propose to use only frequent formal concepts (the frequency is an additional parameter of the algorithm). Our experiments show that the frequency strongly affects the performance of the algorithm. In our experiments we use the whole set of formal concepts $\mathcal{B}(\mathbf{I})$ (for the set sizes see the last column of Table 1).
5.3 Evaluation
In our experiments we compare MDLGreConD algorithm with GreConD, Hyper and PaNDa. We study factors themselves and how well they cover the analyzed datasets.
5.3.1 The number of factors
One of the main characteristics of BMF algorithms is the number of factors they produce. We measure not only the total number of factors, but also how many nontrivial factors are computed. By trivial factors we mean the single-attribute ones. The results are shown in Table 2.
As can be seen from the table, PaNDa tends to produce only a few factors (w.r.t. the number of attributes, see Table 1).
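| dataset | algorithm | no. of nontrivial factors | no. of trivial factors |
|---|---|---|---|
| Breast | GreConD | 15 | 4 |
| | PaNDa | 4 | 0 |
| | Hyper | 36 | 0 |
| | MDLGreConD | 6 | 1 |
| Domino | GreConD | 13 | 8 |
| | PaNDa | 3 | 0 |
| | Hyper | 10 | 132 |
| | MDLGreConD | 7 | 3 |
| Ecoli | GreConD | 38 | 3 |
| | PaNDa | 6 | 0 |
| | Hyper | 35 | 30 |
| | MDLGreConD | 8 | 1 |
| Emea | GreConD | 9 | 33 |
| | PaNDa | 3 | 0 |
| | Hyper | 3 | 35 |
| | MDLGreConD | 7 | 2 |
| Iris | GreConD | 8 | 12 |
| | PaNDa | 8 | 0 |
| | Hyper | 13 | 15 |
| | MDLGreConD | 7 | 0 |
| Mushroom | GreConD | 98 | 3 |
| | PaNDa | 8 | 0 |
| | Hyper | 89 | 2 |
| | MDLGreConD | 50 | 0 |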
Hyper returns a number of factors that is close to the number of attributes. Moreover, more than half of them are trivial. This holds for all datasets with the exception of the Breast and Mushroom data.
On average (see Figure 3), the number of nontrivial factors of GreConD is better than that of the Hyper algorithm. MDLGreConD generates a small set of factors, most of which are nontrivial. PaNDa tends to produce the smallest number of factors, all of which are nontrivial.
However, considering only the number of factors might be insufficient, since usually one wants to find not just the smallest number of factors, but a set of factors that captures (covers) a large part of the data. Next, we show how well the algorithms capture the analyzed data.
5.3.2 Data coverage
Another important characteristic of factors is how much information from the analyzed dataset they retain. We measure it by the coverage rate. We differentiate between data coverage and object coverage. Data coverage measures the rate of “crosses” covered by factors in the dataset; this is a standard measure in BMF, see e.g. [1]. However, data coverage might be an inappropriate measure in cases where a dataset contains a lot of redundant attributes. Taking these cases into consideration, we also measure the object coverage rate, i.e. how many objects are covered by at least one factor. The following example explains how the coverage measures are computed.
Example 3.
The factor set of the first factorization (Figure 2) covers almost all crosses in the data, while the second set covers around half of the crosses. The coverings for both of them are given below. The crosses covered by one factor are light gray; the crosses covered by more factors are colored a darker gray.
[Two tables over objects 1–8 (rows) and attributes a–h (columns), one for each factorization, with the covered crosses shaded.]
Note that both factor sets cover all objects, i.e. every row in the dataset has at least one colored cross; thus the object coverage rate is equal to 1 for both factorizations.
For the first factorization, the cross coverage rate is . In the case of the second factorization, the cross coverage rate is . Obviously, the bigger value is better.
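Both coverage measures are simple ratios. A minimal sketch, assuming factors are given as (extent, intent) pairs of 0-based index sets and that an object counts as covered when at least one of its crosses lies in some factor; the function names and the toy data are hypothetical.

```python
def data_coverage(I, F):
    """Fraction of the crosses (1s) of I that lie in at least one factor."""
    ones = {(i, j) for i, row in enumerate(I) for j, v in enumerate(row) if v}
    covered = {(i, j) for C, D in F for i in C for j in D} & ones
    return len(covered) / len(ones)


def object_coverage(I, F):
    """Fraction of objects (rows) with at least one covered cross."""
    covered_rows = {i for C, D in F if D for i in C}
    return len(covered_rows) / len(I)


# Toy 3 x 3 context and a single factor (illustrative only).
I = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
F = [(frozenset({0, 1}), frozenset({0, 1}))]

print(data_coverage(I, F), object_coverage(I, F))
```

Here the factor covers 4 of the 7 crosses and touches 2 of the 3 objects, which illustrates how the two rates can differ for the same factor set.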
Average values of the data coverage and object coverage rates over all datasets, as well as the minimal and maximal values and quantiles, are shown in Figures 5 and 6, respectively. The average data coverage rate of nontrivial factors of MDLGreConD is slightly lower than the analogous measure for GreConD and Hyper. It is important to note that MDLGreConD provides more stable results; in other words, its data coverage rate does not depend much on the dataset, while for the Hyper algorithm the data coverage rate ranges from 0.2 to 1.0. PaNDa covers slightly more than half of the data with a small set of factors. Moreover, if we take into account the results regarding the number of factors from Section 5.3.1, MDLGreConD outperforms all remaining algorithms: it provides a large coverage with a smaller number of factors.
Regarding the object coverage rate, all the algorithms have similar performance; however, a large number of nontrivial factors in Hyper ensures its high coverage rate for all chosen datasets.
5.3.3 Redundancy of factors
An important characteristic of a factor set is redundancy. A factor set is redundant if it contains repetitive information, i.e. if some factors overlap. We measure redundancy by the overlapping rate (see Example 4), i.e. how many times, on average, a covered cross is covered by the factors.
Example 4.
For the factor sets from Figure 2 the average overlapping rate is computed as follows. We sum the areas $|C| \cdot |D|$ of the factors $\langle C, D \rangle$ in the factor set. In the case of the first factorization, the total area is 43 and the number of covered crosses is 35; thus, the average overlapping rate is $43/35 \approx 1.23$. The second factorization is without overlapped crosses, thus its average overlapping rate is 1.
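The overlapping rate is the total area of the factors divided by the number of distinct crosses they cover. A minimal sketch with a hypothetical function name and toy factors (0-based index sets, not the factors of Figure 2):

```python
def overlapping_rate(F):
    """Total area of all factors divided by the number of covered crosses."""
    total_area = sum(len(C) * len(D) for C, D in F)
    covered = {(i, j) for C, D in F for i in C for j in D}
    return total_area / len(covered)


# Two 2 x 2 factors sharing exactly one cross (illustrative only).
F = [(frozenset({0, 1}), frozenset({0, 1})),
     (frozenset({1, 2}), frozenset({1, 2}))]
print(overlapping_rate(F))

# The figures quoted in Example 4: total area 43, covered crosses 35.
print(43 / 35)
```

A rate of 1 means that no cross is covered twice; the toy factors give 8/7 because one of the seven covered crosses belongs to both rectangles.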
Average values of the overlapping rate are shown in Figure 7. Our experiments show that factor sets with minimal redundancy are produced by the Hyper algorithm. This can be explained by the previous experiments (see Section 5.3.1), which showed that the Hyper algorithm tends to produce a large number of trivial factors. PaNDa tends to produce a very small number of factors with a low coverage rate. As one may clearly observe, GreConD produces factorizations with the largest overlapping rate. MDLGreConD generates a nonredundant set.
5.4 Discussion
Let us summarize the experimental evaluation. GreConD and Hyper are both able to explain the whole data. However, the quality of the factorizations they produce is lower than that of MDLGreConD. More precisely, Hyper produces a large number of trivial factors. GreConD produces fewer trivial factors, but with a lot of overlap between them.
The quality of the factorization obtained via the PaNDa algorithm is low as well. The factors delivered by PaNDa cover only a small part of the input data.
According to the experimental evaluation, the MDLGreConD algorithm provides a factor set with well-balanced characteristics. The number of factors is reasonably small, and the factors themselves explain a large portion of the data and are not redundant.
6 Conclusions
In this paper, an MDL-based from-below factorization algorithm that utilizes formal concept analysis has been proposed. It produces a small subset of formal concepts with a low information loss rate.
The proposed algorithm does not require computing the whole set of formal concepts, which makes it applicable in practice. Moreover, it computes factor sets that have better overall characteristics than factor sets computed by the existing BMF algorithms. The factor sets generated by MDLGreConD are small, contain few single-attribute factors, and have a high coverage with a low overlapping rate.
An important direction of future work is the application of the proposed method in supervised settings, i.e. to classification tasks.
References
[1] Radim Belohlavek, Jan Outrata, and Martin Trnecka. Toward quality assessment of Boolean matrix factorizations. Inf. Sci., 459:71–85, 2018.
[2] Radim Belohlavek and Martin Trnecka. From-below approximations in Boolean matrix factorization: Geometry and new algorithm. J. Comput. Syst. Sci., 81(8):1678–1697, 2015.
[3] Radim Belohlavek and Vilem Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci., 76(1):3–20, 2010.
[4] Alina Ene, William G. Horne, Nikola Milosavljevic, Prasad Rao, Robert Schreiber, and Robert Endre Tarjan. Fast exact and heuristic methods for role minimization problems. In Indrakshi Ray and Ninghui Li, editors, 13th ACM Symposium on Access Control Models and Technologies, SACMAT 2008, Estes Park, CO, USA, June 11–13, 2008, Proceedings, pages 1–10. ACM, 2008.
[5] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, Berlin, Heidelberg, 1999.
[6] Floris Geerts, Bart Goethals, and Taneli Mielikäinen. Tiling databases. In Einoshin Suzuki and Setsuo Arikawa, editors, Discovery Science, 7th International Conference, DS 2004, Padova, Italy, October 2–5, 2004, Proceedings, volume 3245 of Lecture Notes in Computer Science, pages 278–289. Springer, 2004.
[7] Peter D. Grünwald. The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[8] Dmitry I. Ignatov, Elena Nenova, Natalia Konstantinova, and Andrey V. Konstantinov. Boolean matrix factorisation for collaborative filtering: An FCA-based approach. In Gennady Agre, Pascal Hitzler, Adila Alfa Krisnadhi, and Sergei O. Kuznetsov, editors, Artificial Intelligence: Methodology, Systems, and Applications, 16th International Conference, AIMSA 2014, Varna, Bulgaria, September 11–13, 2014, Proceedings, volume 8722 of Lecture Notes in Computer Science, pages 47–58. Springer, 2014.
[9] Ki Hang Kim. Boolean matrix theory and applications, volume 70. Dekker, 1982.
[10] M. Lichman. UCI machine learning repository, 2013.
[11] Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Mining top-k patterns from binary datasets in presence of noise. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 – May 1, 2010, Columbus, Ohio, USA, pages 165–176. SIAM, 2010.
[12] Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. A unifying framework for mining approximate top-k binary patterns. IEEE Trans. Knowl. Data Eng., 26(12):2900–2913, 2014.
[13] Tatiana P. Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli. A first study on what MDL can do for FCA. In Dmitry I. Ignatov and Lhouari Nourine, editors, Proceedings of the Fourteenth International Conference on Concept Lattices and Their Applications, CLA 2018, Olomouc, Czech Republic, June 12–14, 2018, volume 2123 of CEUR Workshop Proceedings, pages 25–36. CEUR-WS.org, 2018.
[14] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348–1362, 2008.
[15] Pauli Miettinen and Jilles Vreeken. Model order selection for Boolean matrix factorization. In Chid Apté, Joydeep Ghosh, and Padhraic Smyth, editors, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21–24, 2011, pages 51–59. ACM, 2011.
[16] S. D. Monson, S. Pullman, and R. Rees. A survey of clique and biclique coverings and factorizations of (0,1)-matrices. In Bulletin of the ICA, 14, pages 17–86, 1995.
[17] L. J. Stockmeyer. The Set Basis Problem is NP-complete. Research reports. IBM Thomas J. Watson Research Division, 1975.
[18] Nikolaj Tatti, Taneli Mielikäinen, Aristides Gionis, and Heikki Mannila. What is the dimension of your binary data? In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18–22 December 2006, Hong Kong, China, pages 603–612. IEEE Computer Society, 2006.
[19] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Summarizing transactional databases with overlapped hyperrectangles. Data Min. Knowl. Discov., 23(2):215–251, 2011.