Suppose you have an unlabeled repository of MRI, CT, and PET scans of brain regions corresponding to different patients from different stages of the diagnostic process. You wish to sort them into clusters corresponding to individual patients and align the images within each cluster. In this work, we address this exact problem of joint clustering and registration of images using novel multivariate information functionals.
Image registration is the task of geometrically aligning two or more images of the same scene taken at different points in time, from different viewpoints, or by different imaging devices. It is a crucial step in most image processing tasks where a comparative study of the different images is required such as medical diagnosis, target detection, image fusion, change detection, and multimodal image restoration. In such applications it is also essential to classify images of different scenes prior to registering images of the same kind. Thus clustering images according to the scene is also critical to computer vision problems such as object tracking, face tagging, cryo-electron microscopy, and remote sensing.
Different digital images of the same scene can appear significantly different from each other, e.g., consider images of a scene that are negatives of each other. Such factors tend to make clustering and registering images harder. Further, such meta-data about the digital images is often not available a priori. This emphasizes the need for universality, i.e., the design of reliable clustering and registration algorithms that work without the specific knowledge of the priors or channel models that govern the image generation.
Image clustering and registration have often been dealt with separately. However, it is easy to see that clustering registered images and registering images within clusters are both relatively easier. Here we emphasize that the two problems are not separate and define universal, reliable, joint clustering and registration algorithms.
I-a Prior Work
There is rich literature on clustering and registration; we describe a non-exhaustive listing of relevant prior work.
Supervised learning for classification has recently gained prominence through deep convolutional neural networks and other machine learning algorithms. However, these methods require vast amounts of costly labeled training data. Thus, unsupervised image classification is of interest.
Unsupervised clustering of objects has been studied under numerous optimality and similarity criteria . The -means algorithm and its generalization to Bregman divergences  are some popular distance-based methods. Popular techniques for unsupervised image clustering include affinity propagation 5]6], and orthogonal subspace projection . We focus on information-based clustering algorithms [8, 9, 10], owing to the ubiquitous nature of information functionals in universal information processing. Universal clustering has been studied in the communication and crowdsourcing [11, 12].
Separate from clustering, multi-image registration has been studied extensively . Prominent region-based registration methods include maximum likelihood (ML) , minimum KL divergence , correlation detection , and maximum mutual information (MMI) [17, 18]. Feature-based techniques have also been explored .
Lower bounds on mean squared error for image registration in the presence of additive noise using Ziv-Zakai and Cramer-Rao bounds have been explored recently ,. The MMI decoder was originally developed in universal communication . Deterministic reasons for its effectiveness in image registration have been identified . Correctness has been established through information-theoretic arguments .
A problem closely related to image registration is multireference alignment. There the aim is to denoise a signal from noisy, circularly-translated versions of itself, under Gaussian or binary noise models [25, 26]. Versions of this problem have been considered for image denoising under Gaussian noise . Unlike denoising, we consider the task of registration alone but for a wider class of noise models under a universal setting.
I-B Our Contributions
While MMI has been found to perform well in numerous empirical studies, concrete theoretical guarantees are still lacking. In this work, we extend the framework of universal delay estimation to derive universal asymptotic optimality guarantees for MMI in registering two images under the Hamming loss, under mild assumptions on the image models.
Even though the MMI method is universally asymptotically optimal for registering two images, we show that the method is strictly suboptimal in multi-image registration. We define the max multiinformation (MM) image registration algorithm that uses the multiinformation functional in place of pairwise MMI. We prove that the method is universal and asymptotically optimal using type counting arguments.
Then, we consider the task of joint clustering and registration. We define novel multivariate information functionals to characterize dependence in a collection of images. Under a variety of clustering criteria, we define algorithms using these functionals to perform joint clustering and registration and prove consistency of the methods.
Applications such as cryo-electron microscopy handle a large number of images of several molecular conformations. The task of clustering and registration is critical. With such motivation, we revisit joint clustering and registration under the constraint of limited resolution of images. We define blockwise clustering and registration algorithms, further showing they are order-optimal in the scaling of resolution with number of pixels in the system.
We now formulate the joint image clustering and registration problem and define the model of images we work with.
Ii-a Image and Noise
Consider a simple model of images, where each image is a collection of pixels drawn independently and identically from an unknown prior distribution defined on finite space of pixel values . Since the pixels are i.i.d. , we represent the original scene of the image by an
-dimensional random vector,. More specifically, the scene is drawn according to .
Consider a finite collection of distinct scenes (drawn i.i.d. according to the prior) . This set can be interpreted as a collection of different scenes. Each image may be viewed as a noisy depiction of an underlying scene.
Consider a collection, , of scenes drawn from , with each scene chosen independently and identically according to the pmf , so image corresponds to a depiction of the scene , for .
We model images corresponding to this collection of scenes as noisy versions of the underlying scenes drawn as follows
where is the inclusion-wise maximal subset such that for all . That is, images corresponding to the same scene are jointly corrupted by a discrete memoryless channel, while the images corresponding to different scenes are independent conditioned on the scene. Here we assume . The system is depicted in Fig. 1.
Consider any two images generated as above. Let be the conditional distribution that a pixel in the second image is , given the corresponding pixel in the first image is . Let
where are i.i.d. copies of pixel values generated according to the marginal distribution of the first image. Note that if , then with sufficient samples of the image, one can easily identify corresponding pixels in the copy. Hence to avoid the trivial case, we presume there exist finite positive constants such that, for any two images,
Ii-B Image Transformations
Corrupted images are also subject to independent rigid-body transformations such as rotation and translation. Since images are vectors of length , transformations are represented by permutations of . Let be the set of all allowable transformations. We assume is known.
Let be the transformation of image . Then, the final image is , for all . Image transformed by is depicted interchangeably as .
We assume forms a commutative algebra over the composition operator . More specifically,
for , ;
there exists unique s.t. , for all ;
for any , there exists a unique inverse , s.t. .
The number of distinct rigid transformations of images with pixels on the -lattice is polynomial in , i.e., for some .
A permutation cycle, , is a subset of permutation , such that , for all and .
It is clear from the pigeonhole principle that any permutation is composed of at least one permutation cycle. Let the number of permutation cycles of a permutation be .
Identity block of permutation is the inclusion-wise maximal subset such that , for all .
A permutation is simple if , .
Any two permutations are said to be non-overlapping if for all .
Let be chosen uniformly at random from the set of all permutations of . Then for any constants ,
First, we observe that the number of permutations that have an identity block of size at least is given by
Thus, from Stirling’s approximation,
Lengths and number of cycles in a random permutation may be analyzed as detailed in . In particular, we note that for a random permutation , . Using Markov’s inequality, the result follows.
Following Lem. 1, we assume that for any , , i.e., the number of permutation cycles does not grow very fast. Further, let for any .
Ii-C Performance Metrics
We now introduce formal metrics to quantify performance of the joint clustering and registration algorithms.
A clustering of images is a partition of . The sets of a partition are referred to as clusters. The clustering is said to be correct if
Let be the set of all partitions of . For a given collection of scenes, we represent the correct clustering by .
A partition is finer than , , if . Similarly, a partition is denser than , , if .
The correct registration of an image transformed by is .
A universal clustering and registration algorithm is a sequence of functions that are designed in the absence of knowledge of , , and . Here the index corresponds to the number of pixels in each image.
We focus on the -loss function to quantify performance.
The error probability
error probabilityof an algorithm that outputs , is
Alg. is asymptotically consistent if , and is exponentially consistent if .
The error exponent of an algorithm is
We use to denote when clear from context.
Iii Registration of two images
We first consider the problem of registering two images, i.e., , . Thus the problem reduces to registering an image obtained as a result of transforming the output of an equivalent discrete memoryless channel , given input image (reference) . The channel model is depicted in Fig. 2.
This problem has been well-studied in practice and a heuristic method used is the MMI method defined as
where is the mutual information of the empirical distribution where is the indicator function. Note that the MMI method is universal.
Since transformations are chosen uniformly at random, the maximum likelihood (ML) estimate is Bayes optimal:
We first show that the MMI method, and consequently the ML method, are exponentially consistent. We then show that the error exponent of MMI matches that of ML.
Iii-a Exponential Consistency
The empirical mutual information of i.i.d. samples is asymptotically exponentially consistent.
Let be random samples drawn i.i.d. according to the joint distribution
be random samples drawn i.i.d. according to the joint distributiondefined on the finite space . For fixed alphabet sizes, , the ML estimate of entropy and mutual information are asymptotically exponentially consistent and satisfy
where , .
The proof is given in [31, Lem. 7].
MMI and ML are exponentially consistent.
Let and let the correct registration be . Then,
Thus MMI is exponentially consistent as . Finally, and thus, the ML estimate is also exponentially consistent.
Thm. 2 implies there exists such that
Iii-B Whittle’s Law and Markov Types
We now summarize a few results on the number of types and Markov types which are eventually used to analyze the error exponent of image registration.
Consider a sequence . The empirical distribution of is the type of the sequence. Let
be a dummy random variable. Letbe the set of all sequences of length , of type . The number of possible types of sequences of length is polynomial in , i.e., .
The number of sequences of length , of type , is
From bounds on multinomial coefficients, the number of sequences of length and type is bounded as 
Consider a Markov chain defined on the space. Given a sequence of samples from we can compute the matrix of transition counts, where corresponds to the number of transitions from state to state . By Whittle’s formula , the number of sequences with , with and is
where corresponds to the th cofactor of the matrix with
The first-order Markov type of a sequence is defined as the empirical distribution , given by
Here we assume that the sequence is cyclic with period , i.e., for any . Let . Then, from (14), the set of sequences of type , , satisfies
From the definition of , we can bound the trace of the cofactor matrix of as
Again using the bounds on multinomial coefficients, we have
The joint first-order Markov type of a pair of sequences , is the empirical distribution
Then given , the set of conditional first-order Markov type sequences, satisfies 
Let be any two non-overlapping permutations and let be the corresponding permutations of . Let be a simple permutation. Then, for every , there exists , such that
Since permutations are non-overlapping, there is a bijection from to , where . Specifically, consider the permutation defined iteratively as , with . Then, for any , the sequence . Further, this map is invertible and so the sets are of equal size.
Result for conditional types follows mutatis mutandis.
Let . For any ,
Let and . For let the length of permutation cycle of be for . Further, . Let be the identity block of and let . Then we have the decomposition
for all . Here, is the first-order Markov type defined on the th permutation cycle of and is the zeroth-order Markov type corresponding to the identity block of .
From (17), we see that given a valid decomposition of types , the number of sequences can be computed as a product of the number of subsequences of each type, i.e.,
where the sum is over all valid decompositions in (17).
Additionally, from Lem. 2 we know the number of valid subsequences of each type. Let be the marginal corresponding to the first-order Markov type .
Thus, we upper bound the size of the set as
the maximum taken over all valid decompositions in (17). Here, (18) follows from (13) and (15), and (19) follows since the total number of possible types is polynomial in the length of the sequence. Since ,
Now, let , for all , and . And let . Since , using Jensen’s inequality, we have
Now, for large , consider for some . Any other cycle of smaller length contributes to the exponent due to Lem. 1. One valid decomposition of (17) is to have , for all . However, the lengths of the subsequences are different and may not be a valid type of the corresponding length. Nevertheless, for each , there exists a type such that , where is the total variational distance. Further, entropy is continuous in the distribution  and satisfies
This in turn indicates that
Similar decomposition follows for conditional types as well.
Iii-C Error Analysis
We are interested in the error exponent of MMI-based image registration, in comparison to ML. We first note that the error exponent of the problem is characterized by the pair of transformations that are the hardest to compare.
Define as the binary hypothesis testing problem corresponding to image registration when the allowed transformations are only . Let be the corresponding error probability and error exponent.
Let be an asymptotically exponentially consistent estimator. Then,
Let be the estimate output by and be the correct registration. We first upper bound the error probability as
where (21) follows from the union bound.
Additionally, we have
Finally, since , the result follows.
Thus, it suffices to consider the binary hypothesis tests to study the error exponent of image registration.
Let . Then,
Probabilities of i.i.d. sequences are defined by their joint type. Thus, we have
where the summation in (23) is over the set of all joint types of sequences of length and is the number of sequences of length with joint type such that the MMI algorithm makes a decision error.
If a sequence is in error, then all sequences in are in error as is drawn from an i.i.d. source. Now, we can decompose the number of sequences under error as follows
where the sum is taken over such that there is a decision error. The final line follows from the fact that given the joint first-order Markov type, the size of the conditional type is independent of the exact sequence .
Finally, we have
The result then follows from the forthcoming Lem. 5.
Observe that for images with i.i.d. pixels, MMI is the same as minimizing the joint entropy, i.e.,
Thus, we can see that using MMI for image registration is not only universal, but also asymptotically optimal. We next study the problem of image registration for multiple images.
Iv Multi-image Registration
Having universally registered two images, we now consider aligning multiple copies of the same image. For simplicity, let us consider aligning three images; ; results can directly be extended to any finite number of images. Let be the source image and be the noisy, transformed versions to be aligned as shown in Fig. 3.
Here, the ML estimates are
Iv-a MMI is not optimal
We know MMI is asymptotically optimal at aligning two images. Is pairwise MMI, i.e.
optimal for multi-image registration? We show pairwise MMI is suboptimal even though individual transformations are chosen independently and uniformly from .
There exists channel and prior, such that pairwise MMI is suboptimal for multi-image registration.
Let . Consider physically degraded images obtained as outputs of the channel . This is depicted in Fig. 4.
Naturally, the ML estimate of the transformations is obtained by registering image to and subsequently registering to , instead of registering each of the images pairwise to . That is, from (27)
It is evident that is estimated based on the estimate of .
Let be the error exponent of for the physically-degraded channel. Since the ML estimate involves registration of to and to , the error exponent is .
Let be the error exponent of MMI. Then, error exponent of pairwise MMI is . We know MMI is asymptotically optimal for two image registration and so