1 Introduction
Implicit Neural Representations (INRs) have recently demonstrated remarkable performance in representing multimedia signals in computer vision and graphics
(Park et al., 2019; Mescheder et al., 2019; Saito et al., 2019; Chen et al., 2021c; Sitzmann et al., 2020b; Tancik et al., 2020; Mildenhall et al., 2020). In contrast to classical discrete representations, where real-world signals are sampled and vectorized before processing, an INR directly parameterizes the continuous mapping between coordinates and signal values using a deep fully-connected network (also known as a multi-layer perceptron, or MLP). This continuous parameterization can represent more complex and flexible scenes in a more compact and memory-efficient way, without being limited by grid extents and resolution.
However, one significant drawback of this approach is that acquiring an INR usually requires tedious per-scene training of neural networks on dense measurements, which limits its practicality. Yu et al. (2021); Wang et al. (2021); Chen et al. (2021a) generalize Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) across various scenes by projecting image features onto a 3D volumetric proxy and then rendering the feature volume to generate novel views. To speed up INR training, Sitzmann et al. (2020a); Tancik et al. (2021) apply meta-learning algorithms to learn initial weight parameters for the MLP based on the underlying class of signals being represented. However, this line of work is either hard to extend beyond the NeRF scenario or incapable of producing high-fidelity results under insufficient supervision.
In this paper, we design a unified INR framework that simultaneously achieves optimization and data efficiency. We think of reconstructing an INR from few-shot measurements as solving an underdetermined system. Inspired by compressed sensing techniques (Donoho, 2006), we represent every neural implicit function as a linear combination of basis functions sampled from an overcomplete Neural Implicit Dictionary (NID). Unlike a conventional basis representation stored as a wide matrix, an NID is parameterized by a group of small neural networks that act as a continuous function basis spanning the entire target function space. The NID is shared across different scenes, while the sparse codes are specific to each scene. We first acquire the NID "offline" by jointly optimizing it with per-scene codings across a class of instances in a training set. When transferring to unseen scenarios, we reuse the NID and only solve for the scene-specific coding coefficients "online".
To effectively scale to thousands of subnetworks inside our dictionary, we employ Mixture-of-Experts (MoE) training for NID learning (Shazeer et al., 2017). We model each basis function in our dictionary as an expert subnetwork and the coding coefficients as its gating state. During each forward pass, we utilize a routing module to generate sparsely coded gates, i.e., activating a handful of basis experts and linearly combining their responses. Training with MoE also "kills two birds with one stone" by constructing transferable dictionaries while avoiding extra computational overhead.
Our contributions can be summarized as follows:

We propose a novel data-driven framework to learn a Neural Implicit Dictionary (NID) that can transfer across scenes, both accelerating per-scene neural encoding and boosting its performance.

An NID is parameterized by a group of small neural networks that act as a continuous function basis spanning the neural implicit function space. The dictionary learning is efficiently accomplished via MoE training.

We conduct extensive experiments to validate the effectiveness of NID. For training efficiency, we show that our approach achieves up to 100× faster convergence on the image regression task. For data efficiency, our NID can reconstruct signed distance functions with 98% fewer point samples and optimize a CT image with 90% fewer views. We also demonstrate more practical applications of NID, including image inpainting, medical image recovery, and transient object detection in surveillance videos.
2 Preliminaries
Compressed Sensing in Inverse Imaging.
Compressed sensing and dictionary learning are widely applied in inverse imaging problems (Lustig et al., 2008; Metzler et al., 2016; Fan et al., 2018). In classical signal processing, signals are discretized and represented by vectors. A common goal is to reconstruct a signal (or digital image) $x \in \mathbb{R}^n$ from measurements $y \in \mathbb{R}^m$, which are formed by linearly transforming the underlying signal plus noise: $y = Ax + \epsilon$. However, this problem is often highly ill-posed, i.e., the number of measurements is much smaller than the number of unknowns ($m \ll n$), which makes the inverse problem rather challenging. Compressed sensing (Candès et al., 2006; Donoho, 2006) provides an efficient approach to solve this underdetermined linear system by assuming signals are compressible and representing them in terms of a few vectors from a group of spanning vectors $D = [d_1, \dots, d_K]$. Then we can reconstruct $x = Dz$ through the following optimization objective:

$$\min_z \|z\|_0 \quad \text{s.t.} \quad \|y - ADz\|_2 \le \epsilon, \tag{1}$$

where $z$ is known as the sparse code coefficient, and $\epsilon$ is a bound on the noise level. One often replaces the $\ell_0$ seminorm with $\ell_1$ to obtain a convex objective. The spanning vectors can be chosen from orthonormal bases or, more often than not, overcomplete dictionaries ($K > n$) (Kreutz-Delgado et al., 2003; Tošić and Frossard, 2011; Aharon et al., 2006; Chen and Needell, 2016). Rather than a flat collection of spanning vectors, Chan et al. (2015); Tariyal et al. (2016); Papyan et al. (2017) proposed hierarchical dictionaries implemented by neural network layers.
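As a concrete, discrete illustration of the objective in Equation 1, the sketch below solves it greedily with Orthogonal Matching Pursuit on a synthetic random dictionary. All sizes, seeds, and data here are illustrative assumptions, not from the paper:

```python
import numpy as np

def omp(D, y, s):
    """Greedy Orthogonal Matching Pursuit: approximately solve
    min ||z||_0  s.t.  y ~= D z  by selecting s atoms."""
    residual = y.copy()
    support = []
    z = np.zeros(D.shape[1])
    for _ in range(s):
        # pick the atom most correlated with the current residual
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # least-squares refit of the coefficients on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        z[:] = 0.0
        z[support] = coef
        residual = y - D @ z
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
z_true = np.zeros(256)
z_true[[3, 100, 200]] = [1.5, -2.0, 0.7]       # a 3-sparse code
y = D @ z_true                                 # noiseless measurements
z_hat = omp(D, y, s=3)
```

Replacing the $\ell_0$ seminorm with $\ell_1$ (as the text notes) would instead lead to convex solvers such as basis pursuit or ISTA.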
Implicit Neural Representation.
Implicit Neural Representation (INR) in computer vision and graphics replaces traditional discrete representations of multimedia objects with continuous functions parameterized by multi-layer perceptrons (MLPs)
(Tancik et al., 2020; Sitzmann et al., 2020b). Since this representation is amenable to gradient-based optimization, prior works have applied coordinate-based MLPs to many inverse problems in computational photography (Park et al., 2019; Mescheder et al., 2019; Mildenhall et al., 2020; Chen et al., 2021c, b; Sitzmann et al., 2021; Fan et al., 2022; Attal et al., 2021b; Shen et al., 2021) and scientific computing (Han et al., 2018; Li et al., 2020; Zhong et al., 2021). Formally, we denote an INR inside a function space $\mathcal{F}$ by $f_\theta: \mathbb{R}^d \to \mathbb{R}^c$, which continuously maps $d$-dimensional spatio-temporal coordinates $x$ (say, with $d = 2$ for images) to the value space (say, pixel intensity). Given a functional $\mathcal{M}$, we intend to find the network weights $\theta$ such that:

$$\mathcal{M}(f_\theta; \omega) = y, \tag{2}$$

where $\omega$ records the measurement settings. For instance, in computed tomography (CT), $\mathcal{M}$ is the volumetric projection integral, and $\omega$ specifies the ray parameterization with the corresponding measured values $y$. When solving ordinary differential equations, $\mathcal{M}$ takes the form $F(x, f_\theta, \nabla f_\theta, \dots) = 0$ if $x \in \Omega$, while $f_\theta(x) = c$ for some constant $c$ if $x \in \partial\Omega$, given a compact set $\Omega$ and an operator $F$ which combines derivatives of $f_\theta$ (Sitzmann et al., 2020b).
MixtureofExpert Training.
Shazeer et al. (2017) proposed outrageously wide neural networks with dynamic routing to achieve larger model capacity and higher data parallelism. Their approach introduces a Mixture-of-Experts (MoE) layer with a number of expert subnetworks and trains a gating network to select a sparse combination of the experts to process each input. Let us denote by $G(x)$ the output of the gating network and by $E_i(x)$ the output of the $i$-th expert network for a given input $x$. The output of the MoE module can be written as:

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \tag{3}$$

where $n$ is the number of experts. In Shazeer et al. (2017), computation is saved based on the sparsity of $G(x)$. The common sparsification strategy is called noisy top-$k$ gating, which can be formulated as:

$$G(x) = \text{Normalize}\big(\text{KeepTopK}(H(x), k)\big), \tag{4}$$
$$H(x)_i = (x W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}\big((x W_{\text{noise}})_i\big), \tag{5}$$

where $H$ synthesizes the raw gating activations, $\text{KeepTopK}$ masks out all but the $k$ largest elements, and $\text{Normalize}$ scales the magnitude of the remaining weights to a constant, which can be chosen as a softmax or a norm normalization.
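Noisy top-k gating (Equations 4-5) can be sketched in a few lines of NumPy. The weight matrices and sizes below are illustrative assumptions, and the normalization here uses a softmax:

```python
import numpy as np

def noisy_topk_gating(x, W_g, W_noise, k, rng):
    """Noisy top-k gating (Eqs. 4-5): add input-dependent noise to the
    logits, keep the k largest, set the rest to -inf, then softmax."""
    clean = x @ W_g
    noise_std = np.log1p(np.exp(x @ W_noise))      # softplus
    h = clean + rng.standard_normal(clean.shape) * noise_std
    topk = np.argsort(h)[-k:]                      # indices of the k largest logits
    masked = np.full_like(h, -np.inf)
    masked[topk] = h[topk]
    e = np.exp(masked - h[topk].max())             # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_experts, k = 16, 32, 4
x = rng.standard_normal(d)
W_g = rng.standard_normal((d, n_experts))
W_noise = rng.standard_normal((d, n_experts)) * 0.01
g = noisy_topk_gating(x, W_g, W_noise, k, rng)     # k nonzero gates summing to 1
```

Only the experts with nonzero gates need to be evaluated, which is where the computational saving comes from.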
3 Neural Implicit Dictionary Learning
As discussed before, inverse imaging problems are often ill-posed, and this holds for Implicit Neural Representation (INR) as well. Moreover, training an INR network is time-consuming. How to kill two birds with one stone by efficiently and robustly acquiring an INR from few-shot observations remains uninvestigated. In this section, we answer this question by presenting our approach, the Neural Implicit Dictionary (NID), which is learned from a data collection a priori and can be reused to quickly fit an INR. We first reinterpret the two-layer SIREN (Sitzmann et al., 2020b) and point out the limitations of the current design. Then we elaborate on our proposed models and the techniques that improve their generalizability and stability.
3.1 Motivation by Two-Layer SIREN
Common INR architectures are pure multi-layer perceptrons (MLPs) with periodic activation functions. Fourier Feature Mapping (FFM)
(Tancik et al., 2020) places a sinusoidal transformation after the first linear layer, while the Sinusoidal Representation Network (SIREN) (Sitzmann et al., 2020b) replaces every nonlinear activation with a sinusoidal function. For the sake of simplicity, we only consider two-layer INR architectures to unify the formulation of FFM and SIREN. To be consistent with the notation in Section 2, let us denote the INR by a function $f_\theta$, which can be formulated as:

$$\gamma(x) = \sin(W_0 x + b_0), \tag{6}$$
$$f_\theta(x) = W_1 \gamma(x) + b_1, \tag{7}$$

where $\theta = \{W_0, b_0, W_1, b_1\}$ are the network parameters, and the mapping $\gamma$ (cf. Equation 6) is called the positional embedding (Mildenhall et al., 2020; Zhong et al., 2021). After simple rewriting, we can obtain:

$$f_\theta(x) = \sum_k a_k \sin(\omega_k^\top x + \varphi_k) + b_1 \tag{8}$$
$$= \sum_k \big[\alpha_k \cos(\omega_k^\top x) + \beta_k \sin(\omega_k^\top x)\big] + b_1, \tag{9}$$
from which we discover that Equations 6-7 can be considered an approximation of the inverse Hartley (Fourier) transform (cf. Equation 9). The weights of the first SIREN layer sample frequency bands in the Fourier domain, and passing coordinates through sinusoidal activation functions maps spatial positions onto cosine-sine wavelets. Training a two-layer SIREN therefore amounts to finding the optimal frequency supports and fitting the coefficients of a Hartley transform.
Although trigonometric polynomials are dense in the space of continuous functions, cosine-sine waves may not always be desirable, as approximating functions at arbitrary precision with finitely many neurons can be infeasible. In fact, some other bases, such as the Gegenbauer basis (Feng and Varshney, 2021) and the Plücker embedding (Attal et al., 2021a), have proven useful in different tasks. However, we argue that since handcrafted bases are agnostic to the data distribution, they cannot express intrinsic information about the data and thus may generalize poorly across scenes. This forces per-scene training to re-select the frequency supports and re-fit the Fourier coefficients. Moreover, when observations are scarce, a sinusoidal basis can also result in severe overfitting in reconstruction (Sutherland and Schneider, 2015).
3.2 Learning Implicit Function Basis
Having reasoned about why current INR architectures generalize poorly and demand large numbers of measurements, we introduce the philosophy of sparse dictionary representation (Kreutz-Delgado et al., 2003; Tošić and Frossard, 2011; Aharon et al., 2006) into INR. A dictionary contains an overcomplete group of basis elements that spans the signal space. In contrast to handcrafted bases or wavelets, dictionaries are usually learned from a data collection. Since a learned dictionary is aware of the distribution of the underlying signals to be represented, expressing signals with it enjoys higher sparsity, robustness, and generalization power.
Even though dictionary learning algorithms are well established (Aharon et al., 2006), it is far from trivial to design dictionaries amenable to INR on a continuous domain. Formally, we want to obtain a set of continuous maps $\{T_i: \mathbb{R}^d \to \mathbb{R}^c\}_{i=1}^N$ such that for every signal $f$ inside our target signal space $\mathcal{F}$, there exists a sparse coding $\alpha \in \mathbb{R}^N$ that can express the signal:

$$f(x) = \sum_{i=1}^{N} \alpha_i T_i(x), \tag{10}$$

where $N$ is the size of the dictionary, and $\alpha$ satisfies $\|\alpha\|_0 \le s$ for some sparsity level $s$. We parameterize each component in the dictionary with a small coordinate-based network $T_i = T_{\theta_i}$, where $\theta_i$ denotes the network weights of the $i$-th element. We call this group of basis functions a Neural Implicit Dictionary (NID).
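A toy sketch of Equation 10: a "dictionary" of tiny sinusoidal expert networks combined by a sparse code, so only the active components are ever evaluated. All weights and sizes are illustrative, not the paper's architecture:

```python
import numpy as np

def make_expert(d_in, width, rng):
    """A tiny SIREN-style basis function T_i: R^d -> R (illustrative weights)."""
    W0 = rng.standard_normal((width, d_in))
    b0 = rng.standard_normal(width)
    W1 = rng.standard_normal(width) / width
    return lambda x: W1 @ np.sin(W0 @ x + b0)

rng = np.random.default_rng(0)
n_experts, d_in = 8, 2
dictionary = [make_expert(d_in, 16, rng) for _ in range(n_experts)]

# a per-scene sparse code: only 3 of the 8 basis functions are active
alpha = np.zeros(n_experts)
alpha[[1, 4, 6]] = [0.5, -1.2, 0.3]

def nid_forward(x):
    """Eq. 10: f(x) = sum_i alpha_i * T_i(x); inactive experts are skipped."""
    return sum(alpha[i] * dictionary[i](x) for i in np.flatnonzero(alpha))

y = nid_forward(np.array([0.2, -0.7]))
```

Fitting a new scene then means optimizing only `alpha` while the dictionary stays frozen, which is the "online" step described above.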
We adopt an end-to-end optimization scheme to learn the NID. During the training stage, we jointly optimize the subnetworks inside the NID and the sparse coding assigned to each instance. Suppose we own a data collection with measurements captured from $S$ multimedia instances to be represented (say, images or geometries of objects): $\{(\omega_k^{(j)}, y_k^{(j)})\}_{k=1}^{K_j}$ for $j = 1, \dots, S$, where $\omega_k^{(j)}$ are the observation parameters (say, coordinates on a 2D lattice for images), $y_k^{(j)}$ are the measured observations (say, the corresponding RGB colors), and $K_j$ denotes the number of observations for the $j$-th instance. Then we optimize the following objective on the training dataset:

$$\min_{\{\theta_i\}, \{\alpha^{(j)}\}} \sum_{j=1}^{S} \Big[ \sum_{k=1}^{K_j} \mathcal{L}\Big(\mathcal{M}\big(f^{(j)}; \omega_k^{(j)}\big),\, y_k^{(j)}\Big) + \lambda R\big(\alpha^{(j)}\big) \Big], \tag{11}$$

where $f^{(j)} = \sum_i \alpha_i^{(j)} T_{\theta_i}$ is the INR of the $j$-th instance, and $\mathcal{M}$ is a functional measuring a function with respect to a group of parameters $\omega$. $\mathcal{L}$ is the loss function, dependent on the downstream task. $R$ places a regularization on the sparse coding; $\lambda$ is fixed in our experiments. Besides the sparsity penalty, we also consider joint prior distributions among all codings, which will be discussed in Section 3.3. When transferring to unseen scenes, we fix the NID basis and only compute the corresponding sparse coding to minimize the objective in Equation 11.
3.3 Training Thousands of Subnetworks with a Mixture-of-Experts Layer
Directly invoking thousands of networks causes inefficiency and redundancy due to sample-dependent sparsity. Moreover, this brute-force computational strategy fails to properly exploit the parallelism of modern computing architectures. As introduced in Section 2, the Mixture-of-Experts (MoE) training system (Shazeer et al., 2017; He et al., 2021) provides a conditional computation mechanism that achieves stable and parallel training of outrageously large networks. We notice that an MoE layer and an NID share the same underlying running paradigm. Therefore, we propose to leverage an MoE layer to represent an NID accommodating thousands of implicit basis functions. Specifically, each element in the NID is an expert network in the MoE layer, and the sparse coding encodes the gating states. Below we elaborate on the implementation details of the MoE-based NID layer part by part:
Expert Networks.
Each expert network is a small SIREN (Sitzmann et al., 2020b) or FFM (Tancik et al., 2020) network. To downsize the whole MoE layer, we share the positional embedding and the first 4 layers among all expert networks, and then append two independent layers for each expert. We note that this design lets experts share early-stage features and adjust their coherence.
Gating Networks.
The generated gating is used as the sparse coding of an INR instance. We provide two alternatives to obtain the gating values: 1) We employ an encoder network as the gating function to map the (partially) observed measurements to the pre-sparsified weights. For grid-like modalities, we utilize convolutional neural networks (CNNs) (He et al., 2016; Liu et al., 2018; Gordon et al., 2019). For unstructured point modalities, we adopt set encoders (Zaheer et al., 2017; Qi et al., 2017a, b). 2) We can also leverage a lookup table (Bojanowski et al., 2017), where each scene is assigned a trainable embedding jointly optimized with the expert networks. After computing the raw gating weights, we apply the sparsification method of Equations 4-5. Different from Shazeer et al. (2017), we do not apply a softmax normalization to the gating logits. Instead, we sort the gating weights by their absolute values and normalize the remaining weights by their $\ell_1$ norm. Comparing the two gating functions above, encoder-based gating networks have the benefit of parameter saving and instant inference without the need to refit the sparse coding, whereas headless embeddings demonstrate more strength in training efficiency and achieve better convergence.

Methods  PSNR (↑)  SSIM (↑)  LPIPS (↓)  # Params  FLOPs  Throughput

FFM (Tancik et al., 2020)  22.60  0.636  0.244  147.8  20.87  0.479 
SIREN (Sitzmann et al., 2020b)  26.11  0.758  0.379  66.56  4.217  0.540 
Meta + 5 steps (Tancik et al., 2021)  23.92  0.583  0.322  66.69  4.217  0.536 
Meta + 10 steps (Tancik et al., 2021)  29.64  0.651  0.182  66.69  4.217  0.536 
NID + init. ()  28.75  0.892  0.061  8.972  23.30  30.37 
NID + 5 steps ()  33.57  0.941  0.027  8.972  23.30  30.37 
NID + 10 steps ()  35.10  0.954  0.021  8.972  23.30  30.37 
NID + init. ()  30.26  0.919  0.045  8.972  29.55  21.23 
NID + 5 steps ()  35.09  0.960  0.019  8.972  29.55  21.23 
NID + 10 steps ()  37.75  0.971  0.012  8.972  29.55  21.23 
Patchwise Dictionary.
It is impractical to construct a single overcomplete dictionary to represent entire signals. We adopt the workaround of Reiser et al. (2021); Turki et al. (2021) by partitioning the coordinate space into regular, overlapping patches and assigning a separate NID to each block. We implement this by setting up multiple MoE layers and dispatching coordinate inputs to the MoE corresponding to the region in which they are located.
Utilization Balancing and Warm-Up.
It has been observed that the gating network tends to converge to a self-reinforcing imbalanced state, where it always produces large weights for the same few experts (Shazeer et al., 2017). To tackle this problem, we pose a regularization on the Coefficient of Variation (CV) of the sparse codings, following Bengio et al. (2015); Shazeer et al. (2017). The CV penalty is defined as:

$$\text{Importance}(X) = \sum_{x \in X} G(x), \tag{12}$$
$$\mathcal{L}_{\text{CV}} = \text{CV}\big(\text{Importance}(X)\big)^2 = \frac{\text{Var}\big(\text{Importance}(X)\big)}{\text{Mean}\big(\text{Importance}(X)\big)^2}. \tag{13}$$

Evaluating this regularization over the whole training set is infeasible; instead, we estimate and minimize this loss per batch. We also find that hard sparsification stops gradient backpropagation, which leaves the gating states stuck at their initial values. To address this side effect, we first drop hard thresholding and train the MoE layer with an $\ell_1$ penalty on the codings for several epochs, and enable sparsification afterwards.
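The batch-wise CV penalty of Equations 12-13 can be sketched as follows; it vanishes for balanced expert usage and grows when a few experts dominate (shapes and data are illustrative):

```python
import numpy as np

def cv_penalty(codes):
    """Load-balancing loss (Eqs. 12-13): squared coefficient of variation
    of per-expert importance, accumulated over a batch of sparse codes."""
    importance = np.abs(codes).sum(axis=0)   # total weight each expert received
    return importance.var() / (importance.mean() ** 2 + 1e-10)

# balanced usage -> near-zero penalty; collapsed usage -> large penalty
balanced = np.ones((32, 8))                  # every expert used equally
collapsed = np.zeros((32, 8))
collapsed[:, 0] = 1.0                        # all mass on the first expert
```

In practice this scalar would simply be added to the training objective with a small weight, as in Shazeer et al. (2017).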
4 Experiments and Applications
In this section, we demonstrate the promise of NID by showing several applications in scene representation.
4.1 Instant Image Regression
A prototypical example of INR is regressing a 2D image with an MLP that takes in coordinates on a 2D lattice and is supervised with RGB colors. Given an image $I$, our goal is to approximate the mapping $I: \mathbb{R}^2 \to \mathbb{R}^3$ by optimizing $f_\theta(x) \approx I(x)$ for every pixel coordinate $x$. In the conventional training scheme, each image is encoded into a dedicated network after thousands of iterations. Instead, we intend to use the NID to acquire such an INR instantly, without training or with only a few steps of gradient descent.
Experimental Settings.
We train our NID on the CelebA face dataset (Liu et al., 2015), where each image is cropped. Our NID contains 4096 experts, which share a 4-layer backbone with 256 hidden dimensions; each expert owns a separate 32-dimensional output layer. We adopt 4 residual convolutional blocks (He et al., 2016) as the gating network. During training, the gating network is tuned together with the dictionary. The NID is warmed up for 10 epochs and then trained for 5000 epochs keeping only the top 128 experts for each input. At the inference stage, we let the gating network directly output the sparse coding of the test image. To further improve precision, we use this output as an initialization and then optimize the sparse coding by gradient descent with the dictionary fixed. We compare our method to FFM (Tancik et al., 2020), SIREN (Sitzmann et al., 2020b), and Meta (Tancik et al., 2021). In Table 1, we report the overall PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018) of these four models on the test set (500 images) under the limited-training-step setting, where FFM and SIREN are only trained for 100 steps. We also present inference-time metrics in Table 1, including the number of parameters needed to represent 500 images, the FLOPs to render a single image, and the measured throughput in images rendered per second. In Figure 2, we zoom into the initialization and early training stages of each model.
Results.
Results in Table 1 show that the denser NID achieves the best performance among all compared models even without subsequent optimization steps. The relatively sparser NID can also surpass both FFM and SIREN (trained for 100 steps) with the initially inferred coding. Compared with the meta-learning based method, our model outperforms it by a significant margin within the same number of optimization steps. We note that since NID only further tunes the coding vector, both computation and convergence are much faster than for meta-learning approaches, which fine-tune the parameters of the whole network. Figure 2 illustrates that the initial sparse coding inferred from the gating network is enough to produce high-accuracy reconstructed images. With 3 more gradient descent steps (which usually take 5 seconds), it reaches the quality of a well-tuned per-scene INR (which takes 10 minutes). We argue that although meta-learning is able to find a reasonable starting point, the subsequent optimization is sensitive to saddle points where the represented images are fuzzy and noisy. In regard to model efficiency, our NID is 8 times more compact than a single-MLP representation, as the NID shares the dictionary among all samples and only needs to additionally record a small gating network. Moreover, our MoE implementation yields a significant throughput gain, as it makes inference highly parallelizable. We point out that meta-learning only provides an initialization; to represent all test images, one has to save all dense parameters separately. Compared horizontally, the denser NID is more expressive than the sparser one, though it sacrifices some efficiency.
4.2 Facial Image Inpainting.
Image inpainting recovers images corrupted by occlusion. Previous works (Liu et al., 2018; Yu et al., 2019) only establish algorithms on discrete representations. In this section, we demonstrate image inpainting directly on a continuous INR. Given a corrupted image $\tilde{I}$, we remove outliers by projecting $\tilde{I}$ onto a low-dimensional linear (function) subspace spanned by components of a dictionary. We achieve this by representing the corrupted image as a linear combination over a pre-trained NID while simultaneously enforcing the sparsity of this combination. Specifically, we fix the dictionary in Equation 11 and choose the $\ell_1$ norm as the loss function (Candès et al., 2011):

$$\min_{\alpha} \sum_{x} \Big\| \sum_i \alpha_i T_{\theta_i}(x) - \tilde{I}(x) \Big\|_1 + \lambda \|\alpha\|_1, \tag{14}$$

where we assume the noise is sparsely distributed over the image.
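A discrete toy analogue of Equation 14: with the dictionary frozen as a matrix, an ℓ1 data term makes the fit robust to sparse corruption. The plain subgradient-descent solver and all sizes below are illustrative assumptions, not the paper's optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 128, 32
D = rng.standard_normal((n, K)) / np.sqrt(n)   # fixed stand-in "dictionary"
alpha_true = rng.standard_normal(K)
clean = D @ alpha_true
corrupted = clean.copy()
# sparse "occlusion": a few pixels pushed far off
corrupted[rng.choice(n, size=10, replace=False)] += 5.0

def l1_fit(D, y, steps=5000, lr=0.005):
    """Minimize sum |D a - y| (the l1 data term of Eq. 14)
    by subgradient descent; outliers get bounded influence."""
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        a -= lr * (D.T @ np.sign(D @ a - y))
    return a

alpha_hat = l1_fit(D, corrupted)
recovered = D @ alpha_hat
```

Because the ℓ1 loss only sees the sign of each residual, the ten corrupted pixels cannot drag the fit the way a squared loss would, so `recovered` lands much closer to `clean` than `corrupted` does.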
Experimental Settings.
We corrupt images by randomly pasting a color patch. To recover the images, we reuse the dictionary trained on the CelebA dataset in Section 4.1. However, we do not leverage the gating network to synthesize the sparse coding; instead, we directly optimize a randomly initialized coding to minimize Equation 14. Our baselines include SIREN and Meta (Tancik et al., 2021), whose loss functions we change to the $\ell_1$ norm for consistency. To inpaint with Meta, we start from its learned initialization and optimize two steps towards the objective.
Results.
The inpainting results are presented in Figure 4. Our findings are: 1) SIREN overfits all given signals, as it does not rely on any image prior. 2) The meta-learning based approach implicitly poses a prior by initializing the network around a desirable optimum. However, our experiment shows that the learned initialization is specific to a certain data distribution: when noise is added, Meta becomes unstable and converges to a trivial solution. 3) Our NID displays stronger robustness by accurately locating and removing the occlusion pattern.
4.3 SelfSupervised Surveillance Video Analysis
In this section, we establish a self-supervised algorithm based on NID that decomposes surveillance videos into foreground and background. Given a set of video frames $\{I_t\}$, our goal is to find a continuous mapping $f(x, t)$ representing the clip that can be decomposed as $f(x, t) = b(x) + e(x, t)$, where $b$ is the background and $e$ captures transient noise (e.g., pedestrians). We borrow the idea of Robust Principal Component Analysis (RPCA)
(Candès et al., 2011; Ji et al., 2010), where the background is assumed to be "low-rank" and the noise is assumed to be sparse. Although well established for discrete representations, modeling "low-rankness" in the continuous domain remains elusive. We achieve it by assuming that all time slices are largely represented by the same group of experts, i.e., the nonzero elements of the sparse codings concentrate on a few indices, and the coding weights follow a decaying distribution. Mathematically, we first rewrite $f$ by decoupling spatial coordinates and time: $f(x, t) = \sum_i \alpha_i(t) T_{\theta_i}(x)$, where every time slice shares the same dictionary and the sparse coding $\alpha(t)$ depends on the timestamp. Then we minimize:

$$\min_{\{\alpha(t)\}} \sum_{t} \sum_{x} \mathcal{L}\big(f(x, t), I_t(x)\big) + \lambda \sum_{i} e^{\gamma i} \, |\alpha_i(t)|, \tag{15}$$

where the second term penalizes the sparsity of $\alpha(t)$ according to an exponentially increasing curve (controlled by $\gamma$): the larger the index $i$ is, the more sparsity is enforced. As a consequence, every time slice is largely approximated by the first few components of the NID, which mimics the nature of a "low-rank" representation for continuous functions.
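The exponentially weighted sparsity term in Equation 15 is straightforward to implement; a sketch (the decay rate and code length are illustrative):

```python
import numpy as np

def lowrank_penalty(alpha_t, gamma=0.1):
    """Second term of Eq. 15: later dictionary components are penalized
    exponentially harder, pushing each time slice to spend most of its
    coding budget on the first few shared experts."""
    weights = np.exp(gamma * np.arange(alpha_t.shape[-1]))
    return (weights * np.abs(alpha_t)).sum(axis=-1)

head = np.zeros(64); head[:4] = 1.0    # mass on the first components
tail = np.zeros(64); tail[-4:] = 1.0   # the same mass on the last components
```

A code concentrated on early components (`head`) pays far less than one of the same total mass on late components (`tail`), which is exactly the "low-rank"-like bias described above.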
Results.
We test the above algorithm on the BMC-Real dataset (Vacavant et al., 2012). In our implementation, the time-dependent coding $\alpha(t)$ is also parameterized by another MLP, and the decay rate $\gamma$ is a fixed constant. Our qualitative results are presented in Figure 3. We verify that our algorithm decomposes background and foreground correctly by imitating the behavior of RPCA. This application further demonstrates the potential of combining NID with subspace learning techniques.
Methods  128 views  16 views  8 views  

PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  
FFM (Tancik et al., 2020)  22.81  0.845  15.22  0.122  13.58  0.095 
SIREN (Sitzmann et al., 2020b)  24.32  0.891  18.48  0.510  17.26  0.483 
Meta (Tancik et al., 2021)  32.70  0.948  21.39  0.822  18.28  0.574 
NID ()  36.56  0.939  24.48  0.818  16.24  0.619 
NID ()  37.49  0.944  26.32  0.829  16.77  0.636 
4.4 Computed Tomography Reconstruction
Computed tomography (CT) is a widely used medical imaging technique that captures projective measurements of the volumetric density of body tissue. The image formation can be formulated as:

$$y(r, \phi) = \int f_\theta(x)\, \delta(x^\top n_\phi - r)\, dx, \tag{16}$$

where $r$ is the location on the image plane, $\phi$ is the viewing angle with unit normal $n_\phi$, and $\delta$ is the Dirac delta function. Due to the limited number of measurements, reconstructing $f_\theta$ by inverting this integral is often ill-posed. We propose to shrink the solution space by using NID as a regularizer.
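A crude discrete stand-in for the projection integral in Equation 16: rotate pixel coordinates by the viewing angle and bin their perpendicular offsets. This is only a sanity-check sketch (the phantom and bin geometry are illustrative), not the paper's measurement pipeline:

```python
import numpy as np

def project(image, angle, n_bins=None):
    """Toy parallel-beam projection: integrate the image along rays at
    `angle` by rotating pixel coordinates and binning their signed offsets
    (a crude discretization of the Dirac-delta integral in Eq. 16)."""
    h, w = image.shape
    n_bins = n_bins or max(h, w)
    ys, xs = np.mgrid[0:h, 0:w]
    # center the pixel grid, then rotate by the viewing angle
    yc, xc = ys - (h - 1) / 2, xs - (w - 1) / 2
    r = xc * np.cos(angle) + yc * np.sin(angle)   # signed ray offset
    half = np.hypot(h, w) / 2
    bins = np.clip(((r + half) / (2 * half) * n_bins).astype(int), 0, n_bins - 1)
    return np.bincount(bins.ravel(), weights=image.ravel(), minlength=n_bins)

rng = np.random.default_rng(0)
phantom = rng.random((32, 32))
sinogram = np.stack([project(phantom, a) for a in np.linspace(0, np.pi, 8)])
```

A useful invariant: every projection, at any angle, sums to the same total mass of the phantom, which the limited-view settings below (16 and 8 angles) subsample.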
Experimental Settings.
We conduct experiments on the Shepp-Logan phantom dataset (Shepp and Logan, 1974) with 2048 randomly generated CTs. We first train an NID on 1k CT images, with 1024 experts in total and 128/256 experts selected per CT. In the CT scenario, a lookup table is chosen as the gating network. Afterwards, we randomly sample 128 viewing angles and synthesize 2D integral projections of a bundle of 128 parallel rays from these angles as the measurements. To test the effectiveness of our method under a limited number of observations, we downsample the 128 views to 12.5% (16 views) and 6.25% (8 views), respectively. Again, we choose FFM (Tancik et al., 2020), SIREN (Sitzmann et al., 2020b), and Meta (Tancik et al., 2021) as baselines.
Results.
The quantitative results are listed in Table 2. Our NID performs strongly on both metrics: when sampled views are sufficient, NID achieves the highest PSNR, and when views are reduced, NID takes the lead in SSIM. We also plot qualitative results in Figure 5. We find that our NID regularizes the reconstructions to be smooth and shape-consistent, which leads to fewer missing-wedge artifacts.
4.5 Shape Representation from Point Clouds
Recent works (Park et al., 2019; Sitzmann et al., 2020a, b; Gropp et al., 2020) convert point clouds to continuous surface representations by directly regressing a Signed Distance Function (SDF) parameterized by an MLP. Suppose $f_\theta$ is our target SDF. Given a set of points $P$, we fit $f_\theta$ by solving an integral equation of the form (Park et al., 2019):

$$\min_\theta \int_{x \in P} |f_\theta(x)|\, dx + \int_{x \in \Omega} \big| f_\theta(x) - d(x, P) \big|\, dx, \tag{17}$$

where $d(x, P)$ denotes the signed shortest distance from point $x$ to the point set $P$. During optimization, we evaluate the first integral by sampling inside the given point cloud and the second term by uniformly sampling over the whole space $\Omega$. Tackling this integral with sparsely sampled points around the surface is challenging (Park et al., 2019). As before, we introduce NID to learn a priori SDF basis functions from data and then leverage them to regularize the solution.
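A Monte-Carlo sketch of the objective in Equation 17, with one simplifying assumption labeled in the code: the unsigned nearest-neighbor distance stands in for the signed distance $d(x, P)$, and the 2D circle "point cloud" is purely illustrative:

```python
import numpy as np

def sdf_loss(f, points, rng, n_space=2048, bound=1.5):
    """Monte-Carlo version of Eq. 17: |f| should vanish on the point
    cloud, and |f| should match the nearest distance to the cloud
    elsewhere (unsigned distance used here as a simplifying assumption)."""
    on_surface = np.abs(f(points)).mean()
    x = rng.uniform(-bound, bound, size=(n_space, points.shape[1]))
    # brute-force nearest-neighbor distance from each sample to the cloud
    d = np.linalg.norm(x[:, None, :] - points[None, :, :], axis=-1).min(axis=1)
    off_surface = np.abs(np.abs(f(x)) - d).mean()
    return on_surface + off_surface

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 512)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # unit-circle "cloud"

true_sdf = lambda x: np.linalg.norm(x, axis=-1) - 1.0      # exact circle SDF
wrong_sdf = lambda x: np.linalg.norm(x, axis=-1) - 0.5     # wrong radius

loss_true = sdf_loss(true_sdf, circle, np.random.default_rng(1))
loss_wrong = sdf_loss(wrong_sdf, circle, np.random.default_rng(1))
```

The loss separates the correct SDF from an incorrect one; in the paper's setting $f$ would instead be the NID combination $\sum_i \alpha_i T_{\theta_i}$ and only $\alpha$ would be optimized.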
Methods  500k points  50k points  10k points  

CD (↓)  NC (↑)  CD (↓)  NC (↑)  CD (↓)  NC (↑)
SIREN (Sitzmann et al., 2020b)  0.051  0.962  0.163  0.801  1.304  0.169 
IGR (Gropp et al., 2020)  0.062  0.927  0.170  0.812  0.961  0.676 
DeepSDF (Park et al., 2019)  0.059  0.925  0.121  0.856  2.751  0.194 
MetaSDF (Sitzmann et al., 2020a)  0.067  0.884  0.097  0.878  0.132  0.755 
ConvONet (Peng et al., 2020)  0.052  0.938  0.082  0.914  0.133  0.845 
NID ()  0.058  0.940  0.067  0.948  0.093  0.921 
NID ()  0.053  0.956  0.063  0.952  0.088  0.945 
Experimental Settings.
Our SDF experiments are conducted on the ShapeNet (Chang et al., 2015) dataset, from which we pick the chair category for demonstration. To guarantee watertight meshes, we run the toolkit provided by Huang et al. (2018) to convert the whole dataset. We split the chair category following Choy et al. (2016) and fit our NID on the training set. The total number of experts is 4096; after 20 warm-up epochs, only 128/256 experts are preserved for each sample. We choose a lookup table as the gating network. At inference time, we sample 500k, 50k, and 10k point clouds, respectively, from the test surfaces, then optimize the objective in Equation 17 with the SDF represented by our NID. In addition to SIREN and IGR (Gropp et al., 2020), we choose DeepSDF (Park et al., 2019), MetaSDF (Sitzmann et al., 2020a), and ConvONet (Peng et al., 2020) as baselines. Our evaluation metrics are Chamfer Distance (CD; the average minimal pairwise distance) and Normal Consistency (NC; the angle between corresponding normals).
Results.
Our numerical results in Table 3 show that NID is more robust to smaller numbers of points. While the performance of other methods drops quickly, the CD of NID stays below 0.1 and its NC stays above 0.9. We also provide qualitative illustrations in Figure 6. We conclude that, thanks to the constraint of our NID, the SDF does not collapse at locations where observations are missing. DeepSDF and ConvONet rely on a latent feature space to decode geometry, which shows potential for regularizing geometry; however, the superiority of our model suggests that our dictionary-based representation is advantageous over conditional implicit representations.
5 Related Work
Generalizable Implicit Neural Representations.
Implicit Neural Representation (INR) (Tancik et al., 2020; Sitzmann et al., 2020b) notoriously suffers from limited cross-scene generalization. Tancik et al. (2021); Sitzmann et al. (2020a) propose meta-learning based algorithms to better initialize INR weights for fast convergence. Chen et al. (2021c); Park et al. (2019); Chabra et al. (2020); Chibane et al. (2020); Jang and Agapito (2021); Martin-Brualla et al. (2021); Rematas et al. (2021) introduce learnable latent embeddings to encode scene-specific information and condition the INR on the latent code for generalizable representation. In Sitzmann et al. (2020b), the authors further utilize a hypernetwork (Ha et al., 2016) to predict INR weights directly from inputs. Compared with conditional fields or hypernetwork-based methods, the sparse-coding-based NID, with just one final layer, achieves faster adaptation. The dictionary representation simplifies the mapping between latent spaces to a sparse linear combination over an additive basis, which can be manipulated more interpretably and also contributes to transferability. Last but not least, it is known that imposing sparsity helps overcome noise in ill-posed inverse problems (Donoho, 2006; Candès et al., 2011).
Mixture of Experts (MoE).
Mixture-of-Experts models (Jacobs et al., 1991; Jordan and Jacobs, 1994; Chen et al., 1999; Yuksel et al., 2012; Roller et al., 2021) perform conditional computation with a group of parallel sub-models (a.k.a. experts) selected according to a routing policy (Dua et al., 2021; Roller et al., 2021). Recent advances (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021) improve MoE by adopting a sparse-gating strategy, which only activates a minority of experts by selecting the top candidates according to scores given by the gating network. This brings massive advantages in model capacity, training time, and achieved performance (Shazeer et al., 2017); Fedus et al. (2021) even built language models with trillions of parameters. To stabilize training, Hansen (1999); Lepikhin et al. (2020); Fedus et al. (2021) investigated auxiliary load-balancing losses to balance the selection of experts. Alternatively, Lewis et al. (2021); Clark et al. (2022) encourage balanced routing by solving a linear assignment problem.
6 Conclusion
We propose the Neural Implicit Dictionary (NID), learned from a data collection, to represent signals as sparse combinations of the function basis it contains. Unlike traditional dictionaries, our NID consists of continuous basis functions parameterized by sub-networks. To train thousands of networks efficiently, we employ a Mixture-of-Experts training strategy. Our NID enjoys higher compactness, robustness, and generalization. Our experiments demonstrate promising applications of NID in instant regression, image inpainting, video decomposition, and reconstruction from sparse observations. Future work may bring in subspace learning theories to analyze NID.
Acknowledgement
Z. W. is in part supported by a US Army Research Office Young Investigator Award (W911NF2010240).
References
 K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (11), pp. 4311–4322.
 Learning neural light fields with ray-space embedding networks. arXiv preprint arXiv:2112.01523.
 TöRF: time-of-flight radiance fields for dynamic scene view synthesis. Advances in Neural Information Processing Systems 34.
 Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297.
 Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776.
 Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 1–37.
 Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2), pp. 489–509.
 Deep local shapes: learning local SDF priors for detailed 3D reconstruction. In European Conference on Computer Vision, pp. 608–625.
 PCANet: a simple deep learning baseline for image classification? IEEE Transactions on Image Processing 24 (12), pp. 5017–5032.
 ShapeNet: an information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University / Princeton University / Toyota Technological Institute at Chicago.
 MVSNeRF: fast generalizable radiance field reconstruction from multi-view stereo. arXiv preprint arXiv:2103.15595.
 Compressed sensing and dictionary learning. Finite Frame Theory: A Complete Introduction to Overcompleteness 73, pp. 201.
 NeRV: neural representations for videos. Advances in Neural Information Processing Systems 34.
 Improved learning algorithms for mixture of experts in multiclass classification. Neural Networks 12 (9), pp. 1229–1252.
 Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8628–8638.
 Implicit functions in feature space for 3D shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6970–6981.
 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV).
 Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169.
 Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289–1306.
 Tricks for training sparse translation models. arXiv preprint arXiv:2110.08246.
 Unified implicit neural stylization. arXiv preprint arXiv:2204.01943.
 A segmentation-aware deep fusion network for compressed sensing MRI. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 55–70.
 Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.
 SIGNET: efficient neural representation for light fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14224–14233.
 Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556.
 Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099.
 HyperNetworks. arXiv preprint arXiv:1609.09106.
 Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115 (34), pp. 8505–8510.
 Combining predictors: comparison of five meta machine learning methods. Information Sciences 119 (1–2), pp. 91–105.
 FastMoE: a fast mixture-of-expert training system. arXiv preprint arXiv:2103.13262.
 Deep residual learning for image recognition. In CVPR, pp. 770–778.
 Robust watertight manifold surface generation method for ShapeNet models. arXiv preprint arXiv:1802.01698.
 Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87.
 CodeNeRF: disentangled neural radiance fields for object categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12949–12958.
 Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1791–1798.
 Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6 (2), pp. 181–214.
 Dictionary learning algorithms for sparse representation. Neural Computation 15 (2), pp. 349–396.
 GShard: scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
 BASE layers: simplifying training of large, sparse models. In International Conference on Machine Learning, pp. 6265–6274.
 Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895.
 Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100.
 Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
 Compressed sensing MRI. IEEE Signal Processing Magazine 25 (2), pp. 72–82.
 NeRF in the wild: neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219.
 Occupancy networks: learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470.
 From denoising to compressed sensing. IEEE Transactions on Information Theory 62 (9), pp. 5117–5144.
 NeRF: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421.
 Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research 18 (1), pp. 2887–2938.
 DeepSDF: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174.
 Convolutional occupancy networks. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III, pp. 523–540.
 PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
 PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413.
 KiloNeRF: speeding up neural radiance fields with thousands of tiny MLPs. arXiv preprint arXiv:2103.13744.
 ShaRF: shape-conditioned radiance fields from a single view. arXiv preprint arXiv:2102.08860.
 Hash layers for large sparse models. In Advances in Neural Information Processing Systems.
 PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2304–2314.
 Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
 Non-line-of-sight imaging via neural transient fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 The Fourier reconstruction of a head section. IEEE Transactions on Nuclear Science 21 (3), pp. 21–43.
 MetaSDF: meta-learning signed distance functions. arXiv preprint arXiv:2006.09662.
 Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33.
 Light field networks: neural scene representations with single-evaluation rendering. arXiv preprint arXiv:2106.02634.
 On the error of random Fourier features. arXiv preprint arXiv:1506.02785.
 Learned initializations for optimizing coordinate-based neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2846–2855.
 Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739.
 Deep dictionary learning. IEEE Access 4, pp. 10096–10109.
 Dictionary learning. IEEE Signal Processing Magazine 28 (2), pp. 27–38.
 Mega-NeRF: scalable construction of large-scale NeRFs for virtual fly-throughs. arXiv preprint arXiv:2112.10703.
 A benchmark dataset for outdoor foreground/background extraction. In Asian Conference on Computer Vision, pp. 291–300.
 IBRNet: learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
 Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
 pixelNeRF: neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4578–4587.
 Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480.
 Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23 (8), pp. 1177–1193.
 Deep Sets. arXiv preprint arXiv:1703.06114.
 The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
 CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 18 (2), pp. 176–185.