Tree structured sparse coding on cubes

01/16/2013 ∙ by Arthur Szlam, et al. ∙ CUNY Law School

A brief description of tree structured sparse coding on the binary cube.



1 The construction on the cube

1.1 Setup

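We are given $n$ data points in the $d$-dimensional binary cube $B^d$, written as the binary matrix $X$. Our goal is to decompose $X$ as a tree of subcubes and “subcube corrections”. A $q$-dimensional subcube $C_{c,I}$ of $B^d$ is determined by a point $c \in B^d$, along with a set $I$ of $d - q$ restricted indices. The cube consists of the points $b \in B^d$ such that $b_i = c_i$ for all $i \in I$, that is

$$C_{c,I} = \{\, b \in B^d : b_i = c_i \ \text{for all } i \in I \,\}.$$

The unrestricted indices can take on either value.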
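As a concrete illustration of the definition, here is a minimal sketch in Python. It assumes the binary values are encoded as 0 and 1 and that indices are 0-based; the function names `in_subcube` and `subcube_points` are hypothetical and chosen only for this example.

```python
from itertools import product

def in_subcube(b, c, restricted):
    """True if b agrees with the reference point c on every restricted index."""
    return all(b[i] == c[i] for i in restricted)

def subcube_points(c, restricted):
    """Enumerate every point of the subcube (only sensible for tiny examples)."""
    fixed = set(restricted)
    free = [i for i in range(len(c)) if i not in fixed]
    points = []
    # Each unrestricted index can take on either value (assumed to be 0 or 1).
    for values in product((0, 1), repeat=len(free)):
        b = list(c)
        for i, v in zip(free, values):
            b[i] = v
        points.append(b)
    return points

c = [1, 0, 1, 1]
restricted = [0, 2]          # two restricted indices, so two free coordinates
cube = subcube_points(c, restricted)
```

With two free coordinates the enumerated cube has four points, all of which agree with `c` on indices 0 and 2.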

1.2 The construction

Here I will describe a simple version of the construction where each node in the tree corresponds to a subcube of the same dimension , and a hard binary clustering is used at each stage. Suppose our tree has depth . Then the construction consists of

  1. A tree structured clustering of into sets at depth (scale) such that

  2. and cluster representatives (that is -dimensional subcubes)

    such that the restricted sets have the property that if is an ancestor of ,


    for all

Here each is a vector in ; the complete set of roughly corresponds to from before. However, note that each has precisely entries that actually matter; moreover, because of the nested equalities, the leaf nodes carry all the information on the branch. This is not to say that the tree structure is unimportant or unused: the leaf nodes have to share coordinates. However, once the full construction is specified, the leaf representatives are all that is necessary to code a data point.
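To make the last remark concrete, the following sketch codes a data point using leaf representatives only. It is an illustration under stated assumptions, not the paper's implementation: binary values are taken to be 0 and 1, each leaf cube is stored as a dictionary mapping its restricted indices to their fixed values, and the names `hamming_to_cube` and `code_point` are hypothetical.

```python
def hamming_to_cube(x, fixed):
    """Number of restricted coordinates on which x disagrees with the cube."""
    return sum(1 for i, v in fixed.items() if x[i] != v)

def code_point(x, leaves):
    """Pick the closest leaf cube and project x onto it.

    Returns (leaf index, projected point). Free coordinates of x are kept;
    restricted coordinates are overwritten by the leaf's fixed values.
    """
    best = min(range(len(leaves)), key=lambda k: hamming_to_cube(x, leaves[k]))
    recon = list(x)
    for i, v in leaves[best].items():
        recon[i] = v
    return best, recon

# Two sibling leaves: they share coordinate 0 (inherited from a common
# ancestor) and differ on coordinates 1 and 2.
leaves = [{0: 1, 1: 0, 2: 0}, {0: 1, 1: 1, 2: 1}]
leaf_index, projected = code_point([0, 1, 1, 1], leaves)
```

In this toy example the two leaves agree on the coordinate they inherit from their common ancestor, which is the sharing constraint the tree imposes.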

1.3 Algorithms

We can build the partitions and representatives starting from the root and descending the tree as follows. First, find the best-fit subcube of the prescribed dimension for the whole data set. This is given by a coordinate-wise mode; the free coordinates are the ones with the largest average discrepancy from their modes. Remove the fixed coordinates from consideration. Cluster the reduced data, in which only the free coordinates remain, using $k$-means with $k = 2$; on each cluster, find the best-fit cube. Continue to the leaves.
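One step of this descent can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: binary values are encoded as 0 and 1, ties in the mode go to 1, and the 2-means step uses a deterministic farthest-point initialization instead of a randomized one. The names `best_fit_subcube` and `two_means` are hypothetical.

```python
def best_fit_subcube(points, q):
    """Return (fixed, free): fixed maps each restricted index to its mode,
    free lists the q coordinates that disagree with their mode most often."""
    n, d = len(points), len(points[0])
    ones = [sum(p[i] for p in points) for i in range(d)]
    # Coordinate-wise mode (ties go to 1) and average discrepancy from it.
    mode = [1 if 2 * ones[i] >= n else 0 for i in range(d)]
    disc = [min(ones[i], n - ones[i]) / n for i in range(d)]
    # Most variable coordinates first; sorted() is stable, so ties keep index order.
    order = sorted(range(d), key=lambda i: disc[i], reverse=True)
    free = sorted(order[:q])
    fixed = {i: mode[i] for i in order[q:]}
    return fixed, free

def two_means(points, coords, iters=10):
    """Hard binary clustering on the listed coordinates (Lloyd iterations)."""
    def dist(p, center):
        return sum((p[i] - center[i]) ** 2 for i in coords)
    # Assumed initialization: first point, then the point farthest from it.
    c0 = {i: float(points[0][i]) for i in coords}
    far = max(points, key=lambda p: dist(p, c0))
    centers = [c0, {i: float(far[i]) for i in coords}]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = {i: sum(p[i] for p in members) / len(members)
                              for i in coords}
    return labels

data = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1],
        [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
fixed, free = best_fit_subcube(data, q=2)   # root cube
labels = two_means(data, free)              # split on the free coordinates
```

Repeating the same two calls on each cluster, restricted to the coordinates that are still free, would continue the descent toward the leaves.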

1.3.1 Refinement

The partition and the cluster representatives can be updated with a Lloyd-type alternation. With the partition fixed, loop through the nodes from the root of the tree, finding the best subcubes at each scale for the current partition. Now update the partition so that each data point is sent to its best-fit leaf cube.
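The alternation can be sketched on a flat set of leaf cubes. This is a deliberate simplification and an assumption of the example: the restricted index set of each leaf is held fixed, only the fixed values are re-estimated, and the constraint that leaves share the coordinates of their common ancestors is not enforced. Binary values are assumed to be 0 and 1, and the names `assign`, `refit_leaf` and `refine` are hypothetical.

```python
def assign(points, leaves):
    """Send each point to the leaf cube it disagrees with on the fewest
    restricted coordinates (ties go to the lower leaf index)."""
    def dist(x, fixed):
        return sum(1 for i, v in fixed.items() if x[i] != v)
    return [min(range(len(leaves)), key=lambda k: dist(x, leaves[k]))
            for x in points]

def refit_leaf(members, restricted):
    """Best fixed values for the given restricted indices: coordinate-wise mode."""
    n = len(members)
    return {i: 1 if 2 * sum(p[i] for p in members) >= n else 0
            for i in restricted}

def refine(points, leaves, iters=5):
    """Alternate between refitting the cubes and updating the partition."""
    labels = assign(points, leaves)
    for _ in range(iters):
        new_leaves = []
        for k, fixed in enumerate(leaves):
            members = [p for p, l in zip(points, labels) if l == k]
            # An empty cluster keeps its previous representative.
            new_leaves.append(refit_leaf(members, list(fixed)) if members else fixed)
        leaves = new_leaves
        labels = assign(points, leaves)
    return leaves, labels

points = [[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0]]
start = [{0: 0, 1: 1}, {0: 1, 1: 0}]      # deliberately poor starting cubes
final_leaves, final_labels = refine(points, start)
```

A faithful version would refit the cubes scale by scale from the root, so that the coordinates fixed at an ancestor stay identical in all of its descendants.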

1.3.2 Adaptive , , etc.

In [1], one of the important points is that many of the model parameters, including the , , and the number of clusters could be determined in a principled way. While some of their analysis may carry over to this setting, this has not yet been done. However, instead of fixing the dimension of the subcubes, we can fix a percentage of the energy to be kept at each level, and choose the number of free coordinates accordingly.
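One possible reading of the energy rule is sketched below. The interpretation is an assumption of this example, not something the text specifies: "energy" is taken to be the total average discrepancy of the coordinates from their modes, and the free coordinates are the smallest set of most variable coordinates whose discrepancies reach the requested fraction of that total. The name `choose_free_coordinates` and the parameter `keep` are hypothetical.

```python
def choose_free_coordinates(disc, keep):
    """Pick the most variable coordinates until a fraction `keep` of the
    total discrepancy (the assumed notion of energy) is covered.

    disc: average discrepancy from the mode, one value per coordinate.
    keep: fraction of the total to retain, between 0 and 1.
    """
    total = sum(disc)
    if total == 0:
        return []          # every coordinate is constant: nothing to free
    order = sorted(range(len(disc)), key=lambda i: disc[i], reverse=True)
    free, covered = [], 0.0
    for i in order:
        if covered >= keep * total:
            break
        free.append(i)
        covered += disc[i]
    return sorted(free)

disc = [0.0, 0.5, 0.25, 0.25]
half = choose_free_coordinates(disc, keep=0.5)
most = choose_free_coordinates(disc, keep=0.75)
```

With this rule the number of free coordinates adapts to each node: a tight cluster frees few coordinates, a diffuse one frees many.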

2 Experiments

We binarize the MNIST training data by thresholding to obtain the binary data matrix. Here and . Replace of the entries in with noise sampled uniformly from , and train a tree structured cube dictionary with and depth . The subdivision scheme used to generate the multiscale clustering is $k$-means initialized via randomized farthest insertion [2]; since the initialization is randomized, we can cycle spin over the dictionaries [5] to obtain many different reconstructions to average over. In this experiment the reconstruction was performed 50 times for the same noise realization. The results are visualized below.

Figure 1: Results of denoising using the tree structured coding. The top left image is the first 64 binarized MNIST digits after replacing of the data matrix with uniform noise. The top right image is recovered using a binary tree of depth and , and 100 cycle spins; the output is non-binary because the final result is the average over the random clustering initializations (with the same noise realization, of course). The bottom left image is recovered using robust PCA [4], for comparison. The bottom right is the true binary data.
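The corruption and averaging steps of the experiment can be sketched as follows. This is an illustration under assumptions: binary values are encoded as 0 and 1, each entry is selected for replacement independently with probability `frac` (the text does not say how entries are chosen), and the reconstructions being averaged are supplied by the caller. The names `corrupt` and `average_reconstructions` are hypothetical.

```python
import random

def corrupt(points, frac, seed=0):
    """Replace roughly a fraction `frac` of the entries with uniform random bits.

    A replaced entry may by chance keep its original value.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 1) if rng.random() < frac else v for v in p]
            for p in points]

def average_reconstructions(recons):
    """Entrywise mean over several reconstructions of the same data set.

    recons: list of runs, each run a list of points of equal length.
    The result is generally non-binary, with values between 0 and 1.
    """
    runs = len(recons)
    n_points = len(recons[0])
    dim = len(recons[0][0])
    return [[sum(r[j][i] for r in recons) / runs for i in range(dim)]
            for j in range(n_points)]

clean = [[0, 1, 1, 0], [1, 1, 0, 0]]
noisy = corrupt(clean, frac=0.5, seed=1)
averaged = average_reconstructions([[[0, 1], [1, 1]], [[1, 1], [1, 0]]])
```

In the experiment each run would come from a dictionary trained with a different random clustering initialization on the same noisy data.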


  • [1] W. Allard, G. Chen, and M. Maggioni. Multiscale geometric methods for data sets II: Geometric multi-resolution analysis. to appear in Applied and Computational Harmonic Analysis.
  • [2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
  • [3] Richard G. Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-Based Compressive Sensing. Dec 2009.
  • [4] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011.
  • [5] R. R. Coifman and D. L. Donoho. Translation-invariant de-noising. Technical report, Department of Statistics, 1995.
  • [6] G. David and S. Semmes. Singular integrals and rectifiable sets in $\mathbb{R}^n$: au-delà des graphes Lipschitziens. Astérisque, 193:1–145, 1991.
  • [7] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 433–440, New York, NY, USA, 2009. ACM.
  • [8] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning (ICML), 2010.
  • [9] P. W. Jones. Rectifiable sets and the traveling salesman problem. Invent Math, 102(1):1–15, 1990.
  • [10] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, pages 543–550, 2010.
  • [11] Gilad Lerman. Quantifying curvelike structures of measures by using Jones quantities. Comm. Pure Appl. Math., 56(9):1294–1365, 2003.