
Neural Implicit Dictionary via Mixture-of-Expert Training

07/08/2022
by   Peihao Wang, et al.

Representing visual signals by coordinate-based deep fully-connected networks has been shown more advantageous than discrete grid-based representations in fitting complex details and solving inverse problems. However, acquiring such a continuous Implicit Neural Representation (INR) requires tedious per-scene training on a large number of signal measurements, which limits its practicality. In this paper, we present a generic INR framework that achieves both data and training efficiency by learning a Neural Implicit Dictionary (NID) from a data collection and representing an INR as a functional combination of basis functions sampled from the dictionary. Our NID assembles a group of coordinate-based subnetworks which are tuned to span the desired function space. After training, one can instantly and robustly acquire an unseen scene representation by solving for the coding coefficients. To optimize a large group of networks in parallel, we borrow the idea of Mixture-of-Experts (MoE) and train our network with a sparse gating mechanism. Our experiments show that NID can improve the reconstruction of 2D images or 3D scenes by 2 orders of magnitude faster with up to 98% less input data. We further demonstrate applications of NID in image inpainting and occlusion removal, which are considered challenging with vanilla INR. Our codes are available at https://github.com/VITA-Group/Neural-Implicit-Dict.



1 Introduction

Implicit Neural Representations (INRs) have recently demonstrated remarkable performance in representing multimedia signals in computer vision and graphics (Park et al., 2019; Mescheder et al., 2019; Saito et al., 2019; Chen et al., 2021c; Sitzmann et al., 2020b; Tancik et al., 2020; Mildenhall et al., 2020). In contrast to classical discrete representations, where real-world signals are sampled and vectorized before processing, an INR directly parameterizes the continuous mapping between coordinates and signal values using a deep fully-connected network (also known as a multi-layer perceptron, or MLP). This continuous parameterization makes it possible to represent more complex and flexible scenes, without being limited by grid extents and resolution, in a more compact and memory-efficient way.

However, one significant drawback of this approach is that acquiring an INR usually requires tedious per-scene training of neural networks on dense measurements, which limits its practicality. Yu et al. (2021); Wang et al. (2021); Chen et al. (2021a) generalize Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) across various scenes by projecting image features to a 3D volumetric proxy and then rendering the feature volume to generate novel views. To speed up INR training, Sitzmann et al. (2020a); Tancik et al. (2021) apply meta-learning algorithms to learn initial weight parameters for the MLP based on the underlying class of signals being represented. However, this line of work is either hard to extend beyond the NeRF scenario or incapable of producing high-fidelity results under insufficient supervision.

In this paper, we design a unified INR framework that simultaneously achieves optimization and data efficiency. We view reconstructing an INR from few-shot measurements as solving an underdetermined system. Inspired by compressed sensing techniques (Donoho, 2006), we represent every neural implicit function as a linear combination of a function basis sampled from an over-complete Neural Implicit Dictionary (NID). Unlike a conventional basis represented as a wide matrix, an NID is parameterized by a group of small neural networks that act as a continuous function basis spanning the entire target function space. The NID is shared across different scenes, while the sparse codes are specific to each scene. We first acquire the NID “offline” by jointly optimizing it with per-scene codings across a class of instances in a training set. When transferring to unseen scenarios, we re-use the NID and only solve for the scene-specific coding coefficients “online”.

To effectively scale to thousands of subnetworks inside our dictionary, we employ the Mixture-of-Expert (MoE) training for NID learning (Shazeer et al., 2017). We model each function basis in our dictionary as an expert subnetwork and the coding coefficients as its gating state. During each feed-forward, we utilize a routing module to generate sparsely coded gates, i.e., activating a handful of basis experts and linearly combining their responses. Training with MoE also “kills two birds with one stone” by constructing transferable dictionaries and avoiding extra computational overheads.

Our contributions can be summarized as follows:

  • We propose a novel data-driven framework to learn a Neural Implicit Dictionary (NID) that can transfer across scenes, to both accelerate per-scene neural encoding and boost their performance.

  • NID is parameterized by a group of small neural networks that act as a continuous function basis spanning the neural implicit function space. The dictionary learning is efficiently accomplished via MoE training.

  • We conduct extensive experiments to validate the effectiveness of NID. For training efficiency, we show that our approach achieves up to 100× faster convergence on image regression tasks. For data efficiency, our NID can reconstruct a signed distance function with 98% fewer point samples, and optimize a CT image with 90% fewer views. We also demonstrate further practical applications of NID, including image inpainting, medical image recovery, and transient object detection in surveillance videos.

2 Preliminaries

Compressed Sensing in Inverse Imaging.

Compressed sensing and dictionary learning are widely applied in inverse imaging problems (Lustig et al., 2008; Metzler et al., 2016; Fan et al., 2018). In classical signal processing, signals are discretized and represented by vectors. A common goal is to reconstruct a signal (or digital image) $x \in \mathbb{R}^n$ from measurements $y \in \mathbb{R}^m$, which are formed by linearly transforming the underlying signal plus noise: $y = Ax + \epsilon$. However, this system is often highly ill-posed, i.e., the number of measurements is much smaller than the number of unknowns ($m \ll n$), which makes the inverse problem rather challenging. Compressed sensing (Candès et al., 2006; Donoho, 2006) provides an efficient approach to solving this underdetermined linear system by assuming signals are compressible and representing them in terms of a few vectors inside a group of spanning vectors $D = [d_1, \dots, d_K] \in \mathbb{R}^{n \times K}$. Then we can reconstruct $x = D\alpha$ through the following optimization objective:

$$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|y - AD\alpha\|_2 \leq \delta, \tag{1}$$

where $\alpha$ is known as the sparse code coefficient, and $\delta$ is a bound on the noise level. One often replaces the $\ell_0$ semi-norm with $\ell_1$ to obtain a convex objective. The spanning vectors can be chosen from orthonormal bases or, more often than not, over-complete dictionaries ($K > n$) (Kreutz-Delgado et al., 2003; Tošić and Frossard, 2011; Aharon et al., 2006; Chen and Needell, 2016). Rather than a flat set of spanning vectors, Chan et al. (2015); Tariyal et al. (2016); Papyan et al. (2017) proposed hierarchical dictionaries implemented by neural network layers.
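The $\ell_1$-relaxed form of Equation 1 can be solved with iterative shrinkage-thresholding (ISTA). The sketch below is illustrative only: the random dictionary `D`, measurement matrix `A`, and the penalty weight `lam` are stand-in assumptions, not quantities from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(y, A, D, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||y - A D a||_2^2 + lam * ||a||_1 over the code a."""
    M = A @ D                                  # effective sensing matrix
    L = np.linalg.norm(M, 2) ** 2              # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ a - y)               # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
n, m, K = 64, 32, 128                          # signal dim, measurements, dictionary size
D = rng.standard_normal((n, K)) / np.sqrt(n)   # over-complete dictionary (K > n)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # under-determined measurement map (m < n)
a_true = np.zeros(K); a_true[[3, 40, 90]] = [1.5, -2.0, 1.0]   # sparse ground truth
y = A @ (D @ a_true)
a_hat = ista(y, A, D)
print(np.count_nonzero(np.abs(a_hat) > 1e-3))  # the recovered code stays sparse
```

The soft-thresholding step is what enforces sparsity; swapping it for plain least squares would spread energy over all atoms of the dictionary.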

Implicit Neural Representation.

Implicit Neural Representation (INR) in computer vision and graphics replaces traditional discrete representations of multimedia objects with continuous functions parameterized by multi-layer perceptrons (MLPs) (Tancik et al., 2020; Sitzmann et al., 2020b). Since this representation is amenable to gradient-based optimization, prior works managed to apply coordinate-based MLPs to many inverse problems in computational photography (Park et al., 2019; Mescheder et al., 2019; Mildenhall et al., 2020; Chen et al., 2021c, b; Sitzmann et al., 2021; Fan et al., 2022; Attal et al., 2021b; Shen et al., 2021) and scientific computing (Han et al., 2018; Li et al., 2020; Zhong et al., 2021). Formally, we denote an INR inside a function space $\mathcal{F}$ by $f_\theta: \mathbb{R}^d \to \mathbb{R}^c$, which continuously maps $d$-dimensional spatio-temporal coordinates (say $d = 2$ for images) to the value space (say pixel intensity). Consider a measurement functional $\mathcal{M}$; we intend to find the network weights $\theta$ such that:

$$\mathcal{M}(f_\theta; \omega) = y, \tag{2}$$

where $\omega$ records the measurement settings. For instance, in computed tomography (CT), $\mathcal{M}$ is the volumetric projection integral and $\omega$ specifies the ray parameterization. When solving ordinary differential equations, $\mathcal{M}$ applies an operator combining derivatives of $f_\theta$ in the interior of a compact domain, together with a constant boundary condition on its boundary (Sitzmann et al., 2020b).
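As a concrete instance, the sketch below builds a tiny random-weight SIREN-style coordinate network and applies the simplest measurement functional, direct point evaluation $\mathcal{M}(f; \omega) = f(\omega)$. The layer sizes and the frequency scale of 30 are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-hidden-layer coordinate MLP with sinusoidal activations (SIREN-style).
W1 = rng.uniform(-1, 1, (2, 64)) * 30.0   # first layer samples frequencies
b1 = rng.uniform(-np.pi, np.pi, 64)
W2 = rng.standard_normal((64, 64)) / 8.0
b2 = np.zeros(64)
W3 = rng.standard_normal((64, 3)) / 8.0   # output: an RGB value per coordinate

def f_theta(coords):
    """INR: map (N, 2) spatial coordinates to (N, 3) signal values."""
    h = np.sin(coords @ W1 + b1)
    h = np.sin(h @ W2 + b2)
    return h @ W3

# Simplest measurement functional: evaluate f at the observed coordinates omega.
omega = rng.uniform(-1, 1, (100, 2))      # d = 2 coordinates, e.g. image pixels
y = f_theta(omega)                        # "measurements" under M(f; omega) = f(omega)
print(y.shape)                            # (100, 3)
```

Training an INR amounts to fitting `W1, b1, ...` so that these evaluations match the observed signal; richer functionals such as the CT projection integral replace the plain evaluation.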

Mixture-of-Expert Training.

Shazeer et al. (2017) proposed outrageously wide neural networks with dynamic routing to achieve larger model capacity and higher data parallelism. Their approach introduces a Mixture-of-Experts (MoE) layer with a number of expert subnetworks and trains a gating network to select a sparse combination of the experts to process each input. Let us denote by $G(x)$ and $f_i(x)$ the output of the gating network and the output of the $i$-th expert network for a given input $x$. The output of the MoE module can be written as:

$$y = \sum_{i=1}^{E} G_i(x) \, f_i(x), \tag{3}$$

where $E$ is the number of experts and $G_i(x)$ is the $i$-th gating weight. In Shazeer et al. (2017), computation is saved based on the sparsity of $G(x)$. The common sparsification strategy is called noisy top-$k$ gating, which can be formulated as:

$$H(x) = x W_g + \epsilon \odot \mathrm{softplus}(x W_{\text{noise}}), \quad \epsilon \sim \mathcal{N}(0, I), \tag{4}$$
$$G(x) = \mathrm{Normalize}\big(\mathrm{TopK}(H(x), k)\big), \tag{5}$$

where $H$ synthesizes the raw gating activations, $\mathrm{TopK}$ masks out the $E - k$ smallest elements, and $\mathrm{Normalize}$ scales the magnitude of the remaining weights to a constant, which can be chosen from softmax or $\ell_1$-norm normalization.
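A minimal NumPy sketch of noisy top-$k$ gating follows; the weight shapes, noise scale, and the choice of $\ell_1$ normalization are assumptions consistent with the text, not the reference implementation.

```python
import numpy as np

def noisy_top_k_gate(x, W_g, W_noise, k, rng):
    """Sparse gating: keep the k largest noisy activations, zero out the rest."""
    noise_scale = np.log1p(np.exp(x @ W_noise))          # softplus, as in Eq. (4)
    h = x @ W_g + rng.standard_normal(W_g.shape[1]) * noise_scale
    gate = np.zeros_like(h)
    top = np.argsort(h)[-k:]                             # indices of k largest entries
    gate[top] = h[top]
    return gate / np.sum(np.abs(gate))                   # l1-normalize the survivors

rng = np.random.default_rng(0)
d_in, n_experts, k = 16, 32, 4
W_g = rng.standard_normal((d_in, n_experts))
W_noise = rng.standard_normal((d_in, n_experts)) * 0.01
x = rng.standard_normal(d_in)

g = noisy_top_k_gate(x, W_g, W_noise, k, rng)
print(np.count_nonzero(g))                               # exactly k experts are active
```

Only the `k` experts with non-zero gates need to be evaluated, which is where the computational saving comes from.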

3 Neural Implicit Dictionary Learning

As we discussed before, inverse imaging problems are often ill-posed, and this is also true for Implicit Neural Representation (INR). Moreover, training an INR network is time-consuming. How to kill two birds with one stone by efficiently and robustly acquiring an INR from few-shot observations remains uninvestigated. In this section, we answer this question by presenting our approach, Neural Implicit Dictionary (NID), which is learned from data collections a priori and can be re-used to quickly fit an INR. We first reinterpret the two-layer SIREN (Sitzmann et al., 2020b) and point out the limitations of the current design. Then we elaborate on our proposed model and the techniques to improve its generalizability and stability.

3.1 Motivation by Two-Layer SIREN

Common INR architectures are pure multi-layer perceptrons (MLPs) with periodic activation functions. Fourier Feature Mapping (FFM) (Tancik et al., 2020) places a sinusoidal transformation after the first linear layer, while the Sinusoidal Representation Network (SIREN) (Sitzmann et al., 2020b) replaces every nonlinear activation with a sinusoidal function. For the sake of simplicity, we only consider two-layer INR architectures, which unify the formulation of FFM and SIREN. To be consistent with the notation in Section 2, let us denote the INR by a function $f_\theta$, which can be formulated as:

$$\gamma(x) = \sin(\Omega x + \varphi), \tag{6}$$
$$f_\theta(x) = W \gamma(x) + b, \tag{7}$$

where $W$, $b$, $\Omega$, and $\varphi$ are all network parameters, and the mapping $\gamma$ (cf. Equation 6) is called the positional embedding (Mildenhall et al., 2020; Zhong et al., 2021). After simple rewriting with the angle-addition identity, we can obtain:

$$f_\theta(x) = \sum_i W_i \big[ \cos(\varphi_i) \sin(\omega_i^\top x) + \sin(\varphi_i) \cos(\omega_i^\top x) \big] + b, \tag{8}$$
$$\approx \int \hat{f}(\omega) \big[ \cos(\omega^\top x) + \sin(\omega^\top x) \big] \, \mathrm{d}\omega, \tag{9}$$

from which we discover that Equations 6-7 can be considered an approximation of the inverse Hartley (Fourier) transform (cf. Equation 9). The weights of the first SIREN layer sample frequency bands in the Fourier domain, and passing coordinates through sinusoidal activation functions maps spatial positions onto cosine-sine wavelets. Training a two-layer SIREN then amounts to finding the optimal frequency supports and fitting the coefficients of the Hartley transform.
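The rewriting rests on the angle-addition identity $\sin(\omega x + \varphi) = \cos\varphi \sin(\omega x) + \sin\varphi \cos(\omega x)$: each SIREN neuron is a fixed mixture of a sine and a cosine wavelet at its frequency. A quick numerical check (frequency and phase values are arbitrary):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 1000)
omega, phi = 7.3, 0.9          # one neuron's frequency and phase (arbitrary values)

neuron = np.sin(omega * x + phi)                                   # SIREN unit
wavelets = np.cos(phi) * np.sin(omega * x) + np.sin(phi) * np.cos(omega * x)

print(np.max(np.abs(neuron - wavelets)))   # agrees to machine precision
```

So the second-layer weights of a two-layer SIREN play exactly the role of Hartley-transform coefficients over the frequencies sampled by the first layer.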

Although trigonometric polynomials are dense in the space of continuous functions, cosine-sine waves may not always be desirable, since approximating functions at arbitrary precision with finitely many neurons can be infeasible. In fact, other bases, such as the Gegenbauer basis (Feng and Varshney, 2021) and the Plücker embedding (Attal et al., 2021a), have proven useful in different tasks. However, we argue that since handcrafted bases are agnostic to the data distribution, they cannot express intrinsic information about the data and thus may generalize poorly across various scenes. This forces per-scene training to re-select the frequency supports and re-fit the Fourier coefficients. Moreover, when observations are scarce, a sinusoidal basis can also result in severe over-fitting during reconstruction (Sutherland and Schneider, 2015).

Figure 1: Illustration of our NID pipeline. The blue experts are activated while grey ones are ignored.

3.2 Learning Implicit Function Basis

Having reasoned about why current INR architectures generalize badly and demand massive measurements, we introduce the philosophy of sparse dictionary representation (Kreutz-Delgado et al., 2003; Tošić and Frossard, 2011; Aharon et al., 2006) into INR. A dictionary contains an over-complete group of basis elements that spans the signal space. In contrast to handcrafted bases or wavelets, dictionaries are usually learned from a data collection. Since a learned dictionary is aware of the distribution of the underlying signals to be represented, expressing signals with it enjoys higher sparsity, robustness, and generalization power.

Even though dictionary learning algorithms are well established (Aharon et al., 2006), it is far from trivial to design dictionaries amenable to INR on a continuous domain. Formally, we want to obtain a set of continuous maps $\{g_j: \mathbb{R}^d \to \mathbb{R}^c\}_{j=1}^{K}$ such that for every signal $f$ inside our target signal space $\mathcal{F}$, there exists a sparse coding $\alpha \in \mathbb{R}^K$ that can express the signal:

$$f(x) = \sum_{j=1}^{K} \alpha_j \, g_j(x), \tag{10}$$

where $K$ is the size of the dictionary, and $\alpha$ satisfies $\|\alpha\|_0 \leq s$ for some sparsity level $s \ll K$. We parameterize each component in the dictionary with a small coordinate-based network $g_j = g_{\theta_j}$, where $\theta_j$ denotes the network weights of the $j$-th element. We call this group of basis functions a Neural Implicit Dictionary (NID).

We adopt an end-to-end optimization scheme to learn the NID. During the training stage, we jointly optimize the subnetworks inside the NID and the sparse coding assigned to each instance. Suppose we own a data collection with measurements captured from $N$ multimedia instances to be represented (say images or geometries of objects): $\{\{(\omega_j^{(i)}, y_j^{(i)})\}_{j=1}^{M_i}\}_{i=1}^{N}$, where $\omega_j^{(i)}$ are the observation parameters (say coordinates on a 2D lattice for images), $y_j^{(i)}$ are the measured observations (say the corresponding RGB colors), and $M_i$ denotes the number of observations for the $i$-th instance. Then we optimize the following objective on the training dataset:

$$\min_{\{\theta_k\}, \{\alpha^{(i)}\}} \sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathcal{L}\Big(\mathcal{M}\big(f^{(i)}; \omega_j^{(i)}\big), y_j^{(i)}\Big) + \lambda R\big(\alpha^{(i)}\big), \tag{11}$$

where $f^{(i)} = \sum_k \alpha_k^{(i)} g_{\theta_k}$ is the INR of the $i$-th instance, $\mathcal{M}$ is a functional measuring the function with respect to a group of parameters $\omega$, $\mathcal{L}$ is the loss function dependent on downstream tasks, and $R$ places a regularization on the sparse coding; $\lambda$ is fixed in our experiments. Besides the sparsity penalty, we also consider joint prior distributions among all codings, which will be discussed in Section 3.3. When transferring to unseen scenes, we fix the NID basis and only compute the corresponding sparse coding to minimize the objective in Equation 11.
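The "online" transfer step can be sketched as plain gradient descent on the coding $\alpha$ with the dictionary frozen. In the toy below, random 1-D sinusoids stand in for trained experts, and the $\ell_1$ weight and learning rate are assumed values; it is an illustration of the objective, not the paper's optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_obs = 64, 200

# Frozen "dictionary": K fixed basis functions g_j evaluated at observed coords.
coords = rng.uniform(-1, 1, n_obs)
freqs = rng.uniform(1, 20, K)
phases = rng.uniform(-np.pi, np.pi, K)
G = np.sin(np.outer(coords, freqs) + phases)        # G[i, j] = g_j(x_i)

# Unseen "scene": generated by a few dictionary atoms plus noise.
alpha_true = np.zeros(K); alpha_true[[5, 20, 50]] = [2.0, -1.0, 0.5]
y = G @ alpha_true + 0.01 * rng.standard_normal(n_obs)

# Solve only for the code: min_a ||G a - y||^2 + lam * ||a||_1.
alpha, lam, lr = np.zeros(K), 0.01, 1e-3
for _ in range(2000):
    grad = G.T @ (G @ alpha - y) + lam * np.sign(alpha)   # l1 subgradient
    alpha -= lr * grad

print(np.linalg.norm(G @ alpha - y) / np.linalg.norm(y))  # small relative error
```

Because only the $K$-dimensional code is free, this fit converges far faster than re-training a full network's weights, which is the source of the "online" speed-up.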

3.3 Training Thousands of Subnetworks with Mixture-of-Expert Layer

Directly invoking thousands of networks causes inefficiency and redundancy due to sample-dependent sparsity. Moreover, this brute-force computational strategy fails to properly exploit the parallelism of modern computing architectures. As introduced in Section 2, the Mixture-of-Experts (MoE) training system (Shazeer et al., 2017; He et al., 2021) provides a conditional computation mechanism that achieves stable and parallel training of outrageously large networks. We notice that the MoE layer and NID share an intrinsic similarity in their underlying computation paradigm. Therefore, we propose to leverage an MoE layer to represent an NID accommodating thousands of implicit basis functions. Specifically, each element in the NID is an expert network in the MoE layer, and the sparse coding encodes the gating states. Below we elaborate on the implementation details of the MoE-based NID layer part by part:

Expert Networks.

Each expert network is a small SIREN (Sitzmann et al., 2020b) or FFM (Tancik et al., 2020) network. To downsize the whole MoE layer, we share the positional embedding and the first 4 layers among all expert networks, and then append two independent layers for each expert. We note that this design lets experts share early-stage features and adjust their coherence.

Gating Networks.

The generated gating is used as the sparse coding of an INR instance. We provide two alternatives to obtain the gating values: 1) We employ an encoder network as the gating function to map the (partially) observed measurements to pre-sparsified weights. For grid-like modalities, we utilize convolutional neural networks (CNNs) (He et al., 2016; Liu et al., 2018; Gordon et al., 2019); for unstructured point modalities, we adopt set encoders (Zaheer et al., 2017; Qi et al., 2017a, b). 2) We can also leverage a lookup table (Bojanowski et al., 2017) where each scene is assigned a trainable embedding jointly optimized with the expert networks. After computing the raw gating weights, we sparsify the gates as described in Section 2. Different from Shazeer et al. (2017), we do not apply softmax normalization to the gating logits. Instead, we sort the gating weights by their absolute values and normalize the surviving weights by their $\ell_1$ norm. Comparing the two gating functions: encoder-based gating networks save parameters and support instant inference without re-fitting the sparse coding, whereas headless embeddings offer better training efficiency and convergence.


Methods | PSNR (↑) | SSIM (↑) | LPIPS (↓) | # Params | FLOPs | Throughput
FFM (Tancik et al., 2020) | 22.60 | 0.636 | 0.244 | 147.8 | 20.87 | 0.479
SIREN (Sitzmann et al., 2020b) | 26.11 | 0.758 | 0.379 | 66.56 | 4.217 | 0.540
Meta + 5 steps (Tancik et al., 2021) | 23.92 | 0.583 | 0.322 | 66.69 | 4.217 | 0.536
Meta + 10 steps (Tancik et al., 2021) | 29.64 | 0.651 | 0.182 | 66.69 | 4.217 | 0.536
NID + init. (k=128) | 28.75 | 0.892 | 0.061 | 8.972 | 23.30 | 30.37
NID + 5 steps (k=128) | 33.57 | 0.941 | 0.027 | 8.972 | 23.30 | 30.37
NID + 10 steps (k=128) | 35.10 | 0.954 | 0.021 | 8.972 | 23.30 | 30.37
NID + init. (k=256) | 30.26 | 0.919 | 0.045 | 8.972 | 29.55 | 21.23
NID + 5 steps (k=256) | 35.09 | 0.960 | 0.019 | 8.972 | 29.55 | 21.23
NID + 10 steps (k=256) | 37.75 | 0.971 | 0.012 | 8.972 | 29.55 | 21.23
Table 1: Performance of NID compared with FFM, SIREN, and Meta on the CelebA dataset; k denotes the number of active experts. ↑: the higher the better; ↓: the lower the better. # Params is in megabytes, FLOPs in giga-FLOPs, and throughput in images/s.

Patch-wise Dictionary.

It is implausible to construct an over-complete dictionary to represent entire signals. We adopt the workaround of Reiser et al. (2021); Turki et al. (2021) by partitioning the coordinate space into regular, overlapping patches and assigning a separate NID to each block. We implement this by setting up multiple MoE layers and dispatching the coordinate inputs to the MoE responsible for the region in which they are located.
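The dispatch step amounts to bucketing the coordinate space; a minimal sketch, assuming a regular (non-overlapping) 4×4 grid over the unit square purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches = 4                      # split each axis into 4 blocks -> 16 MoE layers

def patch_index(coords):
    """Map (N, 2) coords in [0, 1)^2 to a flat patch id in [0, n_patches**2)."""
    cell = np.clip((coords * n_patches).astype(int), 0, n_patches - 1)
    return cell[:, 0] * n_patches + cell[:, 1]

coords = rng.uniform(0, 1, (1000, 2))
pid = patch_index(coords)

# Route each coordinate batch to the MoE layer owning its patch.
for p in range(n_patches ** 2):
    batch = coords[pid == p]       # this sub-batch would be fed to MoE layer p
print(pid.min(), pid.max())        # ids stay within the 16 available layers
```

With the overlapping patches described above, a coordinate near a boundary would map to several patch ids and its outputs would be blended; the single-id version here just shows the routing mechanics.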

Utilization Balancing and Warm-Up.

It has been observed that the gating network tends to converge to a self-reinforcing imbalanced state, where it always produces large weights for the same few experts (Shazeer et al., 2017). To tackle this problem, we place a regularization on the Coefficient of Variation (CV) of the sparse codings, following Bengio et al. (2015); Shazeer et al. (2017). The CV penalty is defined as:

$$\mathrm{Importance}(X) = \sum_{x \in X} |G(x)|, \tag{12}$$
$$\mathcal{L}_{CV} = \mathrm{CV}\big(\mathrm{Importance}(X)\big)^2 = \frac{\mathrm{Var}\big(\mathrm{Importance}(X)\big)}{\mathrm{Mean}\big(\mathrm{Importance}(X)\big)^2}, \tag{13}$$

where $X$ is a batch of inputs. Evaluating this regularization over the whole training set is infeasible; instead, we estimate and minimize this loss per batch. We also find that hard sparsification stops gradient back-propagation, which leaves the gating states stationary at their initial values. To address this side-effect, we first abandon hard thresholding and train the MoE layer with an $\ell_1$ penalty on the codings for several epochs, and enable sparsification afterwards.
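The per-batch CV penalty is straightforward to estimate; a sketch with synthetic batches of codings (the batch sizes and values are illustrative):

```python
import numpy as np

def cv_penalty(codes):
    """Squared coefficient of variation of per-expert importance over a batch.

    codes: (batch, n_experts) gating weights. An expert's importance is the
    sum of absolute gate values it receives across the batch; the penalty is
    (std / mean)^2 of the importance vector, small when usage is balanced.
    """
    importance = np.abs(codes).sum(axis=0)          # per-expert total usage
    return (importance.std() / importance.mean()) ** 2

rng = np.random.default_rng(0)
balanced = rng.uniform(0.4, 0.6, (32, 8))           # every expert used evenly
collapsed = np.zeros((32, 8)); collapsed[:, 0] = 1  # all weight on one expert

print(cv_penalty(balanced) < cv_penalty(collapsed))  # True: collapse is penalized
```

Minimizing this quantity pushes the gating network away from the self-reinforcing state where a few experts absorb all the traffic.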

Figure 2: A closer look at the early training stages of FFM, SIREN, Meta, and NID, respectively.

4 Experiments and Applications

In this section, we demonstrate the promise of NID by showing several applications in scene representation.

4.1 Instant Image Regression

A prototypical example of INR is to regress a 2D image with an MLP that takes in coordinates on a 2D lattice and is supervised with RGB colors. Given an image $I$, our goal is to approximate the mapping $f_\theta(x) \approx I[x]$ by optimizing $\theta$ over every pixel coordinate $x$ on the lattice. In the conventional training scheme, each image is encoded into a dedicated network after thousands of iterations. Instead, we intend to use NID to instantly acquire such an INR without training, or with only a few steps of gradient descent.

Figure 3: Visualization of foreground-background decomposition results for surveillance video via principal component pursuit with NID.

Experimental Settings.

We train our NID on the CelebA face dataset (Liu et al., 2015), where each image is cropped to a fixed resolution. Our NID contains 4096 experts, each of which shares a 4-layer backbone with 256 hidden dimensions and owns a separate 32-dimensional output layer. We adopt 4 residual convolutional blocks (He et al., 2016) as the gating network. During training, the gating network is tuned together with the dictionary. The NID is warmed up for 10 epochs, after which only the top-128 experts are kept for each input for 5000 epochs. At the inference stage, we let the gating network directly output the sparse coding of the test image. To further improve precision, we use this output as an initialization and run gradient descent to further optimize the sparse coding with the dictionary fixed. We compare our method with FFM (Tancik et al., 2020), SIREN (Sitzmann et al., 2020b), and Meta (Tancik et al., 2021). In Table 1, we report the overall PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018) of these four models on the test set (500 images) under a limited-training-step setting, where FFM and SIREN are only trained for 100 steps. We also present inference-time metrics in Table 1, including the number of parameters needed to represent 500 images, the FLOPs to render a single image, and the measured throughput in images rendered per second. In Figure 2, we zoom into the initialization and early training stages of each model.

Results.

Results in Table 1 show that the denser NID achieves the best performance among all compared models even without subsequent optimization steps. A relatively sparser NID can also surpass both FFM and SIREN (trained for 100 steps) using only the initially inferred coding. Compared with the meta-learning based method, our model outperforms it by a significant margin within the same number of optimization steps. We note that since NID only tunes the coding vector, both its computation and convergence are much faster than meta-learning approaches, which fine-tune the parameters of the whole network. Figure 2 illustrates that the initial sparse coding inferred from the gating network is enough to produce high-accuracy reconstructed images. With 3 more gradient descent steps (which usually take 5 seconds), it reaches the quality of well-tuned per-scene INR training (which takes 10 minutes). We argue that although meta-learning is able to find a reasonable starting point, the subsequent optimization is sensitive to saddle points, where the represented images are fuzzy and noisy. In regard to model efficiency, our NID is 8 times more compact than the single-MLP representation, as NID shares the dictionary among all samples and only needs to additionally record a small gating network. Moreover, our MoE implementation yields a significant throughput gain, as it makes inference highly parallelizable. We point out that meta-learning can only provide an initialization: to represent all test images, one still has to save all the dense parameters separately. Compared horizontally, the denser NID is more expressive than the sparser one, though it sacrifices efficiency.

4.2 Facial Image Inpainting.


Figure 4: Qualitative results of inpainting image from corruptions with NID.

Image inpainting recovers images corrupted by occlusion. Previous works (Liu et al., 2018; Yu et al., 2019) only establish algorithms on discrete representations; in this section, we demonstrate image inpainting directly on a continuous INR. Given a corrupted image, we remove outliers by projecting it onto a low-dimensional linear (function) subspace spanned by components of a dictionary. We achieve this by representing the corrupted image as a linear combination of a pre-trained NID while simultaneously enforcing the sparsity of this combination. Specifically, we fix the dictionary in Equation 11 and choose the $\ell_1$ norm as the loss function (Candès et al., 2011):

$$\min_{\alpha} \sum_{j} \Big\| \sum_{k} \alpha_k \, g_{\theta_k}(x_j) - y_j \Big\|_1 + \lambda \|\alpha\|_1, \tag{14}$$

where we assume the noise is sparsely distributed over the image.

Experimental Settings.

We corrupt images by randomly pasting a color patch. To recover the images, we reuse the dictionary trained on the CelebA dataset in Section 4.1. However, we do not leverage the gating network to synthesize the sparse coding; instead, we directly optimize a randomly initialized coding to minimize Equation 14. Our baselines include SIREN and Meta (Tancik et al., 2021). We change their loss functions to the $\ell_1$ norm for consistency. To inpaint with Meta, we start from its learned initialization and optimize for two steps towards the objective.

Results.

The inpainting results are presented in Figure 4. Our findings are: 1) SIREN overfits all given signals, as it does not rely on any image prior. 2) The meta-learning based approach implicitly poses a prior by initializing the network near a desirable optimum; however, our experiment shows that the learned initialization is tied to a particular data distribution. When noise is added, Meta becomes unstable and converges to a trivial solution. 3) Our NID displays stronger robustness by accurately locating and removing the occlusion pattern.

4.3 Self-Supervised Surveillance Video Analysis

In this section, we establish a self-supervised algorithm that decomposes the foreground and background of surveillance videos based on NID. Given a set of video frames $\{I_t\}$, our goal is to find a continuous mapping $f(x, t)$ representing the clip that can be decomposed as $f(x, t) = b(x) + n(x, t)$, where $b$ is the background and $n$ captures transient objects (e.g., pedestrians). We borrow the idea of Robust Principal Component Analysis (RPCA) (Candès et al., 2011; Ji et al., 2010), where the background is assumed to be “low-rank” and the noise is assumed to be sparse. Although well established for discrete representations, modeling “low-rankness” in the continuous domain remains elusive. We achieve this by assuming that all time slices are largely represented by the same group of experts, i.e., the non-zero elements of the sparse codings concentrate on a few indices, and the coding weights follow a decaying distribution. Mathematically, we first rewrite $f$ by decoupling spatial coordinates and time: $f(x, t) = \sum_j \alpha_j(t) \, g_{\theta_j}(x)$, where every time slice shares the same dictionary, and the sparse coding depends on the timestamp. Then we minimize:

$$\min_{\alpha} \sum_{t} \sum_{x} \mathcal{L}\big(f(x, t), I_t(x)\big) + \lambda \sum_{j} e^{\beta j} |\alpha_j(t)|, \tag{15}$$

where the second term penalizes the sparsity of $\alpha$ according to an exponentially increasing curve (controlled by $\beta$): the larger the index $j$, the more sparsity is enforced. As a consequence, every time slice is largely approximated by the first few components of the NID, which simulates the nature of a “low-rank” representation for continuous functions.
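The exponentially weighted penalty can be evaluated directly: codes concentrated on the leading atoms are cheap, while the same mass placed on high-index atoms is charged exponentially more ($\beta$ below is an assumed value).

```python
import numpy as np

def low_rank_penalty(alpha_t, beta=0.05):
    """Second term of the objective: sum_j exp(beta * j) * |alpha_j(t)|.

    Larger indices j pay exponentially more, pushing every time slice to
    reuse the same leading dictionary atoms -- a continuous "low-rank" prior.
    """
    j = np.arange(alpha_t.shape[-1])
    return np.sum(np.exp(beta * j) * np.abs(alpha_t), axis=-1)

K = 100
front = np.zeros(K); front[:5] = 1.0     # mass on leading atoms (background-like)
back = np.zeros(K); back[-5:] = 1.0      # same mass on trailing atoms

print(low_rank_penalty(front) < low_rank_penalty(back))  # True
```

Under this penalty, the static background gravitates to the cheap leading atoms shared by all frames, while transient objects must pay for the expensive tail, mirroring RPCA's low-rank-plus-sparse split.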

Results.

We test the above algorithm on the BMC-Real dataset (Vacavant et al., 2012). In our implementation, the time-dependent coding $\alpha(t)$ is also parameterized by another MLP, and the decay rate $\beta$ is fixed. Our qualitative results are presented in Figure 3. We verify that our algorithm decomposes the background and foreground correctly by imitating the behavior of RPCA. This application further demonstrates the potential of NID in combination with subspace learning techniques.

Methods | PSNR / SSIM (128 views) | PSNR / SSIM (16 views) | PSNR / SSIM (8 views)
FFM (Tancik et al., 2020) | 22.81 / 0.845 | 15.22 / 0.122 | 13.58 / 0.095
SIREN (Sitzmann et al., 2020b) | 24.32 / 0.891 | 18.48 / 0.510 | 17.26 / 0.483
Meta (Tancik et al., 2021) | 32.70 / 0.948 | 21.39 / 0.822 | 18.28 / 0.574
NID (k=128) | 36.56 / 0.939 | 24.48 / 0.818 | 16.24 / 0.619
NID (k=256) | 37.49 / 0.944 | 26.32 / 0.829 | 16.77 / 0.636
Table 2: Quantitative results of CT reconstruction compared with FFM, SIREN, and Meta; k denotes the number of active experts. (PSNR in dB)
Figure 5: Qualitative results of CT reconstruction from sparse measurements.

4.4 Computed Tomography Reconstruction

Computed tomography (CT) is a widely used medical imaging technique that captures projective measurements of the volumetric density of body tissue. The image formation can be formulated as:

$$y(r, \theta) = \int f(x) \, \delta\big(\langle x, n_\theta \rangle - r\big) \, \mathrm{d}x, \tag{16}$$

where $r$ is the location on the image plane, $\theta$ is the viewing angle with unit normal $n_\theta$, and $\delta$ is the Dirac delta function. Due to the limited number of measurements, reconstructing $f$ by inverting this integral is often ill-posed. We propose to shrink the solution space by using NID as a regularization.

Experimental Settings.

We conduct experiments on the Shepp-Logan phantom dataset (Shepp and Logan, 1974) with 2048 randomly generated CTs. We first train an NID over 1k CT images, with 1024 experts in total, of which each CT selects 128/256 experts. In the CT scenario, a lookup table is chosen as our gating network. Afterwards, we randomly sample 128 viewing angles and synthesize 2D integral projections of a bundle of 128 parallel rays from these angles as the measurements. To test the effectiveness of our method under a limited number of observations, we downsample the 128 views to 12.5% (16 views) and 6.25% (8 views), respectively. Again, we choose FFM (Tancik et al., 2020), SIREN (Sitzmann et al., 2020b), and Meta (Tancik et al., 2021) as our baselines.

Results.

The quantitative results are listed in Table 2. When the sampled views are sufficient, NID achieves the highest PSNR; when views are reduced, NID retains the advantage in SSIM. We also plot qualitative results in Figure 5. We find that our NID regularizes the reconstructions to be smooth and shape-consistent, which leads to fewer missing-wedge artifacts.

4.5 Shape Representation from Point Clouds

Recent works (Park et al., 2019; Sitzmann et al., 2020a, b; Gropp et al., 2020) convert point clouds to continuous surface representations by directly regressing a Signed Distance Function (SDF) parameterized by MLPs. Suppose $f$ is our target SDF. Given a set of points $P = \{p_i\}$, we fit $f$ by solving an integral equation of the form (Park et al., 2019):

$$\min_{\theta} \int_{P} |f_\theta(x)| \, \mathrm{d}x + \lambda \int_{\Omega} \big| f_\theta(x) - d(x, P) \big| \, \mathrm{d}x, \tag{17}$$

where $d(x, P)$ denotes the signed shortest distance from point $x$ to the point set $P$. During optimization, we evaluate the first integral by sampling inside the given point cloud and the second term by uniformly sampling over the whole space $\Omega$. Tackling this integral with sparsely sampled points around the surface is challenging (Park et al., 2019). As before, we introduce NID to learn a-priori SDF basis functions from data and then leverage them to regularize the solution.
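The two sampled terms of the objective can be sketched as a Monte-Carlo loss against a point cloud. Here the unsigned nearest-neighbor distance stands in for the signed distance (recovering the sign needs normals, which we omit), and the unit-sphere cloud is a toy assumption, so this is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "point cloud": samples on the unit sphere, whose exact SDF is |x| - 1.
pts = rng.standard_normal((2000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

def dist_to_cloud(x):
    """Unsigned nearest-neighbor distance from query points x to the cloud."""
    d = np.linalg.norm(x[:, None, :] - pts[None, :, :], axis=-1)
    return d.min(axis=1)

def sdf_loss(f, n_uniform=500):
    """First term: |f| on surface samples; second: match distances off-surface."""
    on_surface = np.mean(np.abs(f(pts)))
    x = rng.uniform(-1.5, 1.5, (n_uniform, 3))
    off_surface = np.mean(np.abs(np.abs(f(x)) - dist_to_cloud(x)))
    return on_surface + off_surface

true_sdf = lambda x: np.linalg.norm(x, axis=1) - 1.0
bad_sdf = lambda x: np.linalg.norm(x, axis=1)          # off by a constant

print(sdf_loss(true_sdf) < sdf_loss(bad_sdf))           # True
```

With NID, `f` would be the code-weighted sum of SDF basis experts, and only the code is optimized against this loss at transfer time.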

Methods | CD (↓) / NC (↑), 500k points | CD / NC, 50k points | CD / NC, 10k points
SIREN (Sitzmann et al., 2020b) | 0.051 / 0.962 | 0.163 / 0.801 | 1.304 / 0.169
IGR (Gropp et al., 2020) | 0.062 / 0.927 | 0.170 / 0.812 | 0.961 / 0.676
DeepSDF (Park et al., 2019) | 0.059 / 0.925 | 0.121 / 0.856 | 2.751 / 0.194
MetaSDF (Sitzmann et al., 2020a) | 0.067 / 0.884 | 0.097 / 0.878 | 0.132 / 0.755
ConvONet (Peng et al., 2020) | 0.052 / 0.938 | 0.082 / 0.914 | 0.133 / 0.845
NID (k=128) | 0.058 / 0.940 | 0.067 / 0.948 | 0.093 / 0.921
NID (k=256) | 0.053 / 0.956 | 0.063 / 0.952 | 0.088 / 0.945
Table 3: Quantitative results of SDF reconstruction compared with SIREN, IGR, DeepSDF, MetaSDF, and ConvONet; k denotes the number of active experts. CD is short for Chamfer Distance (scaled for readability), NC means Normal Consistency. ↑: the higher the better; ↓: the lower the better.

Experimental Settings.

Our SDF experiments are conducted on the ShapeNet (Chang et al., 2015) dataset, from which we pick the chair category for demonstration. To guarantee that meshes are watertight, we run the toolkit provided by Huang et al. (2018) to convert the whole dataset. We split the chair category following Choy et al. (2016) and fit our NID over the training set. The total number of experts is 4096, and after 20 warm-up epochs, only 128/256 experts are preserved for each sample. We choose a look-up table as our gating network. At inference time, we sample 500k, 50k, and 10k point clouds, respectively, from the test surfaces, then optimize the objective in Equation 17 to obtain the regressed SDF represented by our NID. In addition to SIREN and IGR (Gropp et al., 2020), we choose DeepSDF (Park et al., 2019), MetaSDF (Sitzmann et al., 2020a), and ConvONet (Peng et al., 2020) as our baselines. Our evaluation metrics are Chamfer distance (the average minimal pairwise distance) and normal consistency (agreement of the angles between corresponding normals).
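The two evaluation metrics can be computed from nearest-neighbour matches between the reconstructed and ground-truth point sets. The sketch below shows one common convention (mean absolute distance summed over both directions for CD, mean absolute cosine for NC); exact definitions and scaling vary between papers, so this is an assumption rather than the paper's evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p, q):
    """Symmetric Chamfer distance: mean nearest-neighbour distance
    from p to q plus from q to p."""
    d_pq, _ = cKDTree(q).query(p)  # for each point in p, distance to nearest in q
    d_qp, _ = cKDTree(p).query(q)
    return d_pq.mean() + d_qp.mean()

def normal_consistency(p, normals_p, q, normals_q):
    """Mean |cosine| between each point's unit normal and the unit normal
    of its nearest neighbour in the other set."""
    _, idx = cKDTree(q).query(p)
    return np.abs((normals_p * normals_q[idx]).sum(axis=1)).mean()
```

With these conventions, lower CD and higher NC indicate a better reconstruction, matching the ↓/↑ arrows in Table 3.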

Figure 6: Qualitative results of SDF reconstruction from sparse point clouds.

Results.

We put our numerical results in Table 3, from which we can see that our NID is more robust to smaller numbers of points. While the performance of other methods drops quickly, the CD metric of NID stays below 0.1 and NC stays above 0.9. We also provide qualitative illustrations in Figure 6. We conclude that, thanks to the constraint of our NID, the SDF does not collapse in regions where observations are missing. DeepSDF and ConvONet rely on a latent feature space to decode geometries, which shows potential in regularizing geometries. However, the superiority of our model suggests that our dictionary-based representation is advantageous over conditional implicit representations.

5 Related Work

Generalizable Implicit Neural Representations.

Implicit Neural Representations (INRs) (Tancik et al., 2020; Sitzmann et al., 2020b) notoriously suffer from limited cross-scene generalization capability. Tancik et al. (2021) and Sitzmann et al. (2020a) propose meta-learning based algorithms to better initialize INR weights for fast convergence. Chen et al. (2021c); Park et al. (2019); Chabra et al. (2020); Chibane et al. (2020); Jang and Agapito (2021); Martin-Brualla et al. (2021); Rematas et al. (2021) introduce learnable latent embeddings to encode scene-specific information and condition the INR on the latent code for a generalizable representation. In Sitzmann et al. (2020b), the authors further utilize a hyper-network (Ha et al., 2016) to predict INR weights directly from inputs. Compared with conditional fields or hyper-network based methods, our sparse-coding based NID, which adapts only the last layer, can achieve faster adaptation. The dictionary representation simplifies the mapping between latent spaces to a sparse linear combination over an additive basis, which can be manipulated more interpretably and also contributes to transferability. Last but not least, it is known that imposing sparsity can help overcome noise in ill-posed inverse problems (Donoho, 2006; Candès et al., 2011).
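The claim that an unseen scene can be acquired by solving only the last-layer coding coefficients reduces to a classical sparse-coding problem: with the basis outputs at the query coordinates collected into a matrix, the coefficients minimize a least-squares term plus an L1 penalty. A minimal ISTA solver sketches this; the matrix `Phi`, the function name, and the hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def solve_codes(Phi, y, lam=0.01, n_iter=500):
    """ISTA for  min_a  0.5 * ||Phi @ a - y||^2 + lam * ||a||_1.
    Phi: (n_coords, n_basis) dictionary basis evaluated at query coordinates.
    y:   (n_coords,) observed signal values at those coordinates."""
    L = np.linalg.norm(Phi, 2) ** 2  # Lipschitz constant of the smooth gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = Phi.T @ (Phi @ a - y)    # gradient of the least-squares term
        z = a - g / L                # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a
```

Because only this convex problem is solved at adaptation time, fitting an unseen signal avoids retraining the basis subnetworks themselves.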

Mixture of Experts (MoE).

Mixture of Experts (Jacobs et al., 1991; Jordan and Jacobs, 1994; Chen et al., 1999; Yuksel et al., 2012; Roller et al., 2021) performs conditional computation with a group of parallel sub-models (a.k.a. experts) according to a routing policy (Dua et al., 2021; Roller et al., 2021). Recent advances (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021) improve MoE by adopting a sparse-gating strategy, which activates only a minority of experts by selecting the top candidates according to scores given by the gating network. This brings massive advantages in model capacity, training time, and achieved performance (Shazeer et al., 2017); Fedus et al. (2021) even built language models with trillions of parameters. To stabilize training, Hansen (1999); Lepikhin et al. (2020); Fedus et al. (2021) investigated auxiliary load-balancing losses to balance the selection of experts. Alternatively, Lewis et al. (2021); Clark et al. (2022) encourage balanced routing by solving a linear assignment problem.

6 Conclusion

We propose Neural Implicit Dictionary (NID), learned from a data collection, to represent signals as sparse combinations of the function basis inside it. Unlike traditional dictionaries, our NID contains a continuous function basis parameterized by subnetworks. To train thousands of networks efficiently, we employ a Mixture-of-Experts training strategy. Our NID enjoys higher compactness, robustness, and generalization. Our experiments demonstrate promising applications of NID in instant regression, image inpainting, video decomposition, and reconstruction from sparse observations. Future work may bring in subspace learning theories to analyze NID.

Acknowledgement

Z. W. is in part supported by a US Army Research Office Young Investigator Award (W911NF2010240).

References

  • M. Aharon, M. Elad, and A. Bruckstein (2006) K-svd: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54 (11), pp. 4311–4322. Cited by: §2, §3.2, §3.2.
  • B. Attal, J. Huang, M. Zollhoefer, J. Kopf, and C. Kim (2021a) Learning neural light fields with ray-space embedding networks. arXiv preprint arXiv:2112.01523. Cited by: §3.1.
  • B. Attal, E. Laidlaw, A. Gokaslan, C. Kim, C. Richardt, J. Tompkin, and M. O’Toole (2021b) Törf: time-of-flight radiance fields for dynamic scene view synthesis. Advances in neural information processing systems 34. Cited by: §2.
  • E. Bengio, P. Bacon, J. Pineau, and D. Precup (2015) Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297. Cited by: §3.3.
  • P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam (2017) Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776. Cited by: §3.3.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 1–37. Cited by: §4.2, §4.3, §5.
  • E. J. Candès, J. Romberg, and T. Tao (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52 (2), pp. 489–509. Cited by: §2.
  • R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. Newcombe (2020) Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In European Conference on Computer Vision, pp. 608–625. Cited by: §5.
  • T. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma (2015) PCANet: a simple deep learning baseline for image classification?. IEEE Transactions on Image Processing 24 (12), pp. 5017–5032. Cited by: §2.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: An Information-Rich 3D Model Repository. Technical report Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago. Cited by: §4.5.
  • A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su (2021a) MVSNeRF: fast generalizable radiance field reconstruction from multi-view stereo. arXiv preprint arXiv:2103.15595. Cited by: §1.
  • G. Chen and D. Needell (2016) Compressed sensing and dictionary learning. Finite Frame Theory: A Complete Introduction to Overcompleteness 73, pp. 201. Cited by: §2.
  • H. Chen, B. He, H. Wang, Y. Ren, S. N. Lim, and A. Shrivastava (2021b) Nerv: neural representations for videos. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • K. Chen, L. Xu, and H. Chi (1999) Improved learning algorithms for mixture of experts in multiclass classification. Neural networks 12 (9), pp. 1229–1252. Cited by: §5.
  • Y. Chen, S. Liu, and X. Wang (2021c) Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8628–8638. Cited by: §1, §2, §5.
  • J. Chibane, T. Alldieck, and G. Pons-Moll (2020) Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6970–6981. Cited by: §5.
  • C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3D-r2n2: a unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §4.5.
  • A. Clark, D. d. l. Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud, et al. (2022) Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169. Cited by: §5.
  • D. L. Donoho (2006) Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §1, §2, §5.
  • D. Dua, S. Bhosale, V. Goswami, J. Cross, M. Lewis, and A. Fan (2021) Tricks for training sparse translation models. arXiv preprint arXiv:2110.08246. Cited by: §5.
  • Z. Fan, Y. Jiang, P. Wang, X. Gong, D. Xu, and Z. Wang (2022) Unified implicit neural stylization. arXiv preprint arXiv:2204.01943. Cited by: §2.
  • Z. Fan, L. Sun, X. Ding, Y. Huang, C. Cai, and J. Paisley (2018) A segmentation-aware deep fusion network for compressed sensing mri. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 55–70. Cited by: §2.
  • W. Fedus, B. Zoph, and N. Shazeer (2021) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961. Cited by: §5.
  • B. Y. Feng and A. Varshney (2021) SIGNET: efficient neural representation for light fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14224–14233. Cited by: §3.1.
  • J. Gordon, W. P. Bruinsma, A. Y. Foong, J. Requeima, Y. Dubois, and R. E. Turner (2019) Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556. Cited by: §3.3.
  • A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman (2020) Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099. Cited by: §4.5, §4.5, Table 3.
  • D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §5.
  • J. Han, A. Jentzen, and E. Weinan (2018) Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115 (34), pp. 8505–8510. Cited by: §2.
  • J. V. Hansen (1999) Combining predictors: comparison of five meta machine learning methods. Information Sciences 119 (1-2), pp. 91–105. Cited by: §5.
  • J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang (2021) FastMoE: a fast mixture-of-expert training system. arXiv preprint arXiv:2103.13262. Cited by: §3.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.3, §4.1.
  • J. Huang, H. Su, and L. Guibas (2018) Robust watertight manifold surface generation method for shapenet models. arXiv preprint arXiv:1802.01698. Cited by: §4.5.
  • R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) Adaptive mixtures of local experts. Neural computation 3 (1), pp. 79–87. Cited by: §5.
  • W. Jang and L. Agapito (2021) Codenerf: disentangled neural radiance fields for object categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12949–12958. Cited by: §5.
  • H. Ji, C. Liu, Z. Shen, and Y. Xu (2010) Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1791–1798. Cited by: §4.3.
  • M. I. Jordan and R. A. Jacobs (1994) Hierarchical mixtures of experts and the em algorithm. Neural computation 6 (2), pp. 181–214. Cited by: §5.
  • K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee, and T. J. Sejnowski (2003) Dictionary learning algorithms for sparse representation. Neural computation 15 (2), pp. 349–396. Cited by: §2, §3.2.
  • D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen (2020) Gshard: scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668. Cited by: §5.
  • M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer (2021) Base layers: simplifying training of large, sparse models. In International Conference on Machine Learning, pp. 6265–6274. Cited by: §5.
  • Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar (2020) Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895. Cited by: §2.
  • G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100. Cited by: §3.3, §4.2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §4.1.
  • M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly (2008) Compressed sensing mri. IEEE signal processing magazine 25 (2), pp. 72–82. Cited by: §2.
  • R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) Nerf in the wild: neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219. Cited by: §5.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. Cited by: §1, §2.
  • C. A. Metzler, A. Maleki, and R. G. Baraniuk (2016) From denoising to compressed sensing. IEEE Transactions on Information Theory 62 (9), pp. 5117–5144. Cited by: §2.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: §1, §1, §2, §3.1.
  • V. Papyan, Y. Romano, and M. Elad (2017) Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research 18 (1), pp. 2887–2938. Cited by: §2.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174. Cited by: §1, §2, §4.5, §4.5, Table 3, §5.
  • S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger (2020) Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp. 523–540. Cited by: §4.5, Table 3.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §3.3.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §3.3.
  • C. Reiser, S. Peng, Y. Liao, and A. Geiger (2021) KiloNeRF: speeding up neural radiance fields with thousands of tiny mlps. arXiv preprint arXiv:2103.13744. Cited by: §3.3.
  • K. Rematas, R. Martin-Brualla, and V. Ferrari (2021) Sharf: shape-conditioned radiance fields from a single view. arXiv preprint arXiv:2102.08860. Cited by: §5.
  • S. Roller, S. Sukhbaatar, A. Szlam, and J. E. Weston (2021) Hash layers for large sparse models. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Link Cited by: §5.
  • S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2304–2314. Cited by: §1.
  • N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Cited by: §1, §2, §3.3, §3.3, §3.3, §5.
  • S. Shen, Z. Wang, P. Liu, Z. Pan, R. Li, T. Gao, S. Li, and J. Yu (2021) Non-line-of-sight imaging via neural transient fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • L. A. Shepp and B. F. Logan (1974) The fourier reconstruction of a head section. IEEE Transactions on nuclear science 21 (3), pp. 21–43. Cited by: §4.4.
  • V. Sitzmann, E. R. Chan, R. Tucker, N. Snavely, and G. Wetzstein (2020a) Metasdf: meta-learning signed distance functions. arXiv preprint arXiv:2006.09662. Cited by: §1, §4.5, §4.5, Table 3, §5.
  • V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020b) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33. Cited by: §1, §2, §3.1, §3.3, Table 1, §3, §4.1, §4.4, §4.5, Table 2, Table 3, §5.
  • V. Sitzmann, S. Rezchikov, W. T. Freeman, J. B. Tenenbaum, and F. Durand (2021) Light field networks: neural scene representations with single-evaluation rendering. arXiv preprint arXiv:2106.02634. Cited by: §2.
  • D. J. Sutherland and J. Schneider (2015) On the error of random fourier features. arXiv preprint arXiv:1506.02785. Cited by: §3.1.
  • M. Tancik, B. Mildenhall, T. Wang, D. Schmidt, P. P. Srinivasan, J. T. Barron, and R. Ng (2021) Learned initializations for optimizing coordinate-based neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2846–2855. Cited by: §1, Table 1, §4.1, §4.2, §4.4, Table 2, §5.
  • M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. Cited by: §1, §2, §3.1, §3.3, Table 1, §4.1, §4.4, Table 2, §5.
  • S. Tariyal, A. Majumdar, R. Singh, and M. Vatsa (2016) Deep dictionary learning. IEEE Access 4, pp. 10096–10109. Cited by: §2.
  • I. Tošić and P. Frossard (2011) Dictionary learning. IEEE Signal Processing Magazine 28 (2), pp. 27–38. Cited by: §2, §3.2.
  • H. Turki, D. Ramanan, and M. Satyanarayanan (2021) Mega-nerf: scalable construction of large-scale nerfs for virtual fly-throughs. arXiv preprint arXiv:2112.10703. Cited by: §3.3.
  • A. Vacavant, T. Chateau, A. Wilhelm, and L. Lequievre (2012) A benchmark dataset for outdoor foreground/background extraction. In Asian Conference on Computer Vision, pp. 291–300. Cited by: §4.3.
  • Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser (2021) Ibrnet: learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §4.1.
  • A. Yu, V. Ye, M. Tancik, and A. Kanazawa (2021) Pixelnerf: neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4578–4587. Cited by: §1.
  • J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480. Cited by: §4.2.
  • S. E. Yuksel, J. N. Wilson, and P. D. Gader (2012) Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23 (8), pp. 1177–1193. External Links: Document Cited by: §5.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017) Deep sets. arXiv preprint arXiv:1703.06114. Cited by: §3.3.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §4.1.
  • E. D. Zhong, T. Bepler, B. Berger, and J. H. Davis (2021) CryoDRGN: reconstruction of heterogeneous cryo-em structures using neural networks. Nature Methods 18 (2), pp. 176–185. Cited by: §2, §3.1.