
PREF: Phasorial Embedding Fields for Compact Neural Representations

05/26/2022
by   Binbin Huang, et al.

We present a phasorial embedding field (PREF) as a compact representation to facilitate neural signal modeling and reconstruction tasks. Pure multi-layer perceptron (MLP) based neural techniques are biased towards low-frequency signals and have relied on deep layers or Fourier encoding to avoid losing details. PREF instead employs a compact and physically explainable encoding field based on the phasor formulation of the Fourier embedding space. We conduct comprehensive experiments to demonstrate the advantages of PREF over the latest spatial embedding techniques. We then develop a highly efficient frequency learning framework using an approximated inverse Fourier transform scheme for PREF along with a novel Parseval regularizer. Extensive experiments show our efficient and compact frequency-based neural signal processing technique is on par with and even better than the state-of-the-art in 2D image completion, 3D SDF surface regression, and 5D radiance field reconstruction.



1 Introduction

Coordinate-based neural representations have emerged as a compelling alternative for modeling and processing signals (e.g., sound, image, video, geometry, etc.). Instead of using discrete primitives such as pixels or vertices, coordinate-based multi-layer perceptrons (MLPs) represent graphics images llff ; huang2021textrm ; fathony2020multiplicative , shapes SAL ; IGR ; mescheder2018occupancy ; peng2020convolutional , or even radiance fields chen2021mvsnerf ; lin2021barf ; meng2021gnerf ; mildenhall2020nerf ; muller2022instant ; park2021nerfies ; zhang2020nerf++ in terms of continuous functions that are memory efficient and amenable to solving inverse reconstruction problems via training. Earlier coordinate-based MLPs tend to be biased towards low frequencies, whereas more recent implicit neural approaches have adopted a sinusoidal representation to better recover high frequencies, by either transforming the input coordinates to the Fourier basis mildenhall2020nerf ; zhong2021cryodrgn ; yu2022anisotropic , or encoding the sinusoidal nonlinearity via deeper MLP architectures sitzmann2019siren ; fathony2020multiplicative .

Although effective in countering high-frequency losses, such approaches are inefficient in training and inference, a limitation inherent to the MLP-based optimization strategy: brute-force mappings from low-dimension, low-frequency inputs onto high-frequency target functions do not sufficiently consider the underlying characteristics of the mapping function. As a result, they have largely resorted to large MLPs (either wide in hidden dimensions or deep in layers) to reliably learn the corresponding function mapping. The downside of large MLPs is long training time and slow inference speed.

To accelerate training, embedding-based MLPs yu2021plenoxels ; sun2021direct ; muller2022instant ; chen2022tensorf have adopted an inverse optimization strategy. They jointly search for an optimal mapping network and the optimal inputs (i.e., the high-dimension embedding volume). By replacing the input field with a high-dimension, high-frequency embedding volume (whose size is generally determined by the volume resolution and kernel size), they manage to bridge the gap between the low-dimension, low-frequency coordinate inputs and the high-frequency outputs with a much smaller MLP in both width and depth. The success of these approaches is illustrated by their acceleration by orders of magnitude in both training and rendering chen2022tensorf ; muller2022instant ; sun2021direct ; yu2021plenoxels ; yu2021plenoctrees .

However, as their embedding space is also represented in discrete form, state-of-the-art techniques have relied on interpolation for querying high-dimensional embedded features. To maintain high efficiency, the most widely adopted scheme is still linear interpolation. Yet such schemes present several major limitations:

  • The highest recoverable frequency of the embedding fields is determined by the resolution of the volume. To preserve high frequencies, it is critical to discretize the volume at a very fine level. Full frequency preservation hence demands extremely high memory consumption, prohibitive even on the most advanced GPUs.

  • From a signal processing perspective, linear interpolation within the embedding volume not only leads to aliasing but also causes higher-order derivatives to vanish, hindering back-propagation and the overall convergence. Figure 1 shows a typical example.

  • Spatially discretized embedding, compared to its frequency dual, provides limited insight into the signal. In particular, despite a large variety of Fourier signal processing tools for editing and stylization, few are directly applicable to spatial embedding.

To address these limitations, we present PREF, a novel Phasorial Embedding Field that represents the Fourier-transformed embedding with compact, complex-valued phasors. We derive a comprehensive framework using PREF to approximate the Fourier embedding space so that each spatial coordinate’s feature can be represented in terms of the sinusoidal features within a discrete phasor volume. PREF presents several key advantages over previous spatial embedding techniques. First, the nonlinear transform nature of the phasor avoids the vanishing derivative problem in linearly interpolated spatial embedding. Second, PREF is highly compact and does not require high-resolution volumes to preserve high frequencies. In fact, it facilitates easy queries of specific frequencies of the input embedding. Finally, based on Fourier transforms, PREF manages to exploit many existing Fourier signal processing techniques such as differentiation, convolution, and Parseval’s theorem to conduct traditionally expensive operations for the MLP. Overall, PREF provides a new embedding that benefits direct and inverse neural reconstruction on tasks ranging from 2D image completion to 3D point cloud processing and 5D neural radiance field reconstruction.

To summarize, the contributions of our work include:

  • A compact and physically explainable encoding field, called PREF, based on the phasor formulation of the Fourier embedding space. We conduct a comprehensive theoretical analysis and demonstrate the advantages of PREF over previous spatial embedding techniques.

  • We show that PREF provides a robust and compact solution to MLP-based signal processing. It is compact and at the same time effectively preserves high frequency components and therefore details in signal reconstruction.

  • We develop a highly efficient frequency learning framework using an approximated inverse Fourier transform scheme along with a novel Parseval regularizer. Comprehensive experiments demonstrate that PREF outperforms the state-of-the-art techniques in various signal reconstruction tasks in both accuracy and robustness.

Figure 1:

Comparison of spatial encoding and frequency encoding fitting a ground truth image (top left). We show that PREF fits the image well, correctly reflecting its first- and second-order derivatives. In the left plot, we show that PREF outperforms explicit (grid) representations and state-of-the-art implicit neural networks under different activations.

2 Related Work

Our PREF framework is in line with the renewed interest in adopting implicit neural networks to represent continuous signals from low-dimensional inputs. In computer vision and graphics, these include 2D images with pixels Chen_2021_CVPR , 3D surfaces in the form of occupancy fields mescheder2018occupancy ; DVR ; peng2020convolutional or signed distance fields (SDFs) SAL ; IGR ; wang2021neus ; yariv2020multiview ; martel2021acorn ; NGLOD ; yariv2021volume ; wang2021spline , discrete 3D volumes with a density field ji2017surfacenet ; qi2016volumetric , 4D light fields levoy1996light ; wood2000surface , and 5D plenoptic functions for radiance fields mildenhall2020nerf ; zhang2020nerf++ ; oechsle2021unisurf ; liu2020neural ; park2021nerfies ; yariv2020multiview .

Encoding vs. Embedding.

To learn high frequencies, state-of-the-art implicit neural networks have adopted the Fourier encoding scheme, transforming the coordinates with periodic sine and cosine functions, or equivalently under Euler’s formula. Under Fourier encoding, feature optimization through MLPs can be mapped to optimizing complex-valued matrices with complex-valued inputs in the linear layer. For example, Positional Encoding (PE) and Fourier Feature Maps (FFM) both transform spatial coordinates to the Fourier basis at the earlier input layers martin2021nerf ; tancik2020fourfeat ; zhong2021cryodrgn , whereas SIREN sitzmann2019siren embeds the process in the deeper layers by using periodic activation functions.

Improving training and inference efficiency of MLP-based networks has also been explored from the embedding perspective with smart data structures. Various schemes reiser2021kilonerf ; hedman2021baking ; yu2021plenoctrees ; yu2021plenoxels ; sun2021direct ; muller2022instant ; chen2022tensorf replace the deep MLP architecture with voxel-based representations, trading memory for speed. Early approaches bake an MLP into an octree along with thousands of small sub-NeRFs or a 3D texture atlas for real-time rendering reiser2021kilonerf ; hedman2021baking . These approaches rely on a pre-trained or half-trained MLP as a prior and therefore still incur long per-scene training. Plenoxels yu2021plenoxels and DVGO sun2021direct directly optimize the density values on discretized voxels and employ an auxiliary shading function represented by either spherical harmonics (SH) or a shallow MLP network to account for view dependency. They achieve orders-of-magnitude acceleration over the original NeRF in training but incur a very large memory footprint by storing per-voxel features. Moreover, over-parameterization can easily lead to noisy density estimation and subsequently inaccurate surface estimation and rendering. The seminal work of Instant-NGP muller2022instant spatially groups features via embedding with a hash table to achieve unprecedentedly fast training. In a similar vein, TensoRF chen2020tensor employs highly efficient tensor decomposition via vertical projections.

It is worth noting that embedding-based techniques share many similarities with neural inversion inversion , which aims to jointly optimize the inputs and the network weights. In a nutshell, different from traditional feed-forward neural network optimization that attempts to refine network weights, neural inversion seeks to find inputs, often non-unique, that produce the desired output response under a fixed set of weights. Classic examples include latent embedding optimization rusu2018meta in machine learning and GAN inversion karras2019style ; xia2021gan ; Karras2021 that finds the optimal latent code to best match the target image for subsequent editing or stylization. In the context of embedding, the focus is to maintain a learnable high-order input embedding and then compute the feature via interpolation in the embedding space, via schemes as simple as nearest neighbor and no more complicated than linear interpolation. While efficient in feature querying, they share a common limitation of vanishing gradients, that is, a piecewise constant first-order derivative and a zero second-order derivative due to linear interpolation, leading to higher errors than brute-force pure MLP implementations. We show that our PREF representation is both compact and effective in preserving high-frequency details by overcoming the hurdle of vanishing derivatives.

3 Background and Notations

Our goal is to fit a continuous function $f$ parameterized with low-dimensional inputs $\mathbf{x}$. Let $\Phi$ be an MLP with parameters $\theta$, and let $P$ be a complex-valued (phasor) volume of dimensionality $d$. $\mathcal{T}$ represents the inverse Fourier transform. We use the MLP to approximate $f$ as $f(\mathbf{x}) \approx \Phi_{\theta}(g(\mathbf{x}))$, where the embedding $g(\mathbf{x})$ can be computed as $g(\mathbf{x}) = \mathcal{T}(P)(\mathbf{x})$.

Our phasorial embedding field (PREF) resembles existing spatial embedding. However, spatial embedding uses a real-valued volume rather than the complex-valued $P$ and employs local interpolation or hash functions in place of $\mathcal{T}$. Similar to spatial embedding, PREF can also handle any continuous field modeled by spatially-embedded MLPs, e.g., signed distance fields (SDFs) NGLOD ; wang2021neus ; yariv2020multiview or radiance fields (RFs) chen2020tensor ; sun2021direct ; yu2021plenoxels ; mildenhall2020nerf . In fact, as a frequency-based alternative, PREF explicitly associates frequencies with features: each volume entry represents the Fourier coefficients under the corresponding frequency.

Next, we show that under PREF, many neat properties of the Fourier transform translate to the complex phasor volume, facilitating much more efficient optimization and accessible manipulation. For simplicity, we carry out our derivations in 2D while high dimensional extensions can be similarly derived as shown in various applications.

We first present the inverse Fourier transform and, along the way, the associated theorems that we utilize to optimize our phasorial representation. Before proceeding, we explain our notation: $f$ denotes the continuous signal and $P$ its discrete Fourier series, i.e., a phasor volume that translates to the phasorial embedding field $g$.

Inverse Fourier Transform. Let $f(x, y)$ be a 2D continuous band-limited signal; its discrete inverse Fourier transform factorizes the signal into a Fourier series with corresponding coefficients:

$$f(x, y) = \sum_{u}\sum_{v} P[u, v]\, e^{i 2\pi (u x + v y)}. \qquad (1)$$

Recall that $P$ corresponds to an equally-spaced matrix where each entry is a phasor carrying the real and imaginary parts of the corresponding frequency:

$$P[u, v] = a_{u,v} + i\, b_{u,v}. \qquad (2)$$

The resulting phasor volume $P$ can be viewed as a multi-channel version of the Fourier map. Therefore, it inherits several nice properties of the Fourier transform, which makes frequency-domain manipulations efficient.

Theorem 1.

Let $P$ be a phasor volume and $f$ be its inverse Fourier transform, an absolutely continuous differentiable function. Then

$$\frac{\partial f}{\partial x}(x, y) = \sum_{u}\sum_{v} i\, 2\pi u\, P[u, v]\, e^{i 2\pi (u x + v y)}. \qquad (3)$$
Theorem 2.

Let $P$ be a phasor volume and $f$ be its inverse Fourier transform. Then

$$\iint |f(x, y)|^2 \, dx\, dy = \sum_{u}\sum_{v} |P[u, v]|^2. \qquad (4)$$

Beyond these, modeling a signal in a Fourier domain provides various unique properties; for example, the circular convolution theorem indicates that a convolution of two sequences can be obtained as the inverse transform of the product of the individual transforms.

A naive solution is to represent the signal with the standard inverse Fourier transform and a dense phasor volume grid describing the Fourier coefficients, then jointly update the volume grid and the MLP with gradient descent. However, such a solution is inefficient since both model size and computation grow rapidly with $N$, the resolution of the Fourier series. We next introduce a novel modeling, the phasorial embedding field, to resolve the limitations of the naive solution.
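To make the cost of this naive formulation concrete, the sketch below (our own illustration, not the authors' code; names and shapes are assumptions) evaluates features at arbitrary 2D coordinates by a full inverse Fourier sum over a dense phasor grid, which costs on the order of $N^2$ operations per sample and per channel:

import torch

def naive_pref_query(P, xy):
    # P:  [C, N, N] complex Fourier coefficients (a dense 2D phasor volume)
    # xy: [B, 2] query coordinates in [0, 1)^2
    # Returns [B, C] real-valued features via a full inverse Fourier sum.
    C, N, _ = P.shape
    u = torch.arange(N)
    v = torch.arange(N)
    # Phase 2*pi*(u*x + v*y) for every (sample, u, v) triple
    phase = 2 * torch.pi * (xy[:, 0, None, None] * u[None, :, None]
                            + xy[:, 1, None, None] * v[None, None, :])
    basis = torch.exp(1j * phase)                       # [B, N, N]
    feat = torch.einsum('cuv,buv->bc', P, basis)        # O(B * C * N^2) work
    return feat.real

# Example: a 16-channel 128x128 phasor grid queried at 4096 points.
P = torch.randn(16, 128, 128, dtype=torch.complex64)
xy = torch.rand(4096, 2)
features = naive_pref_query(P, xy)                      # [4096, 16]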

4 Phasorial Embedding Fields

Our PREF is a de facto continuous feature field transformed from a multi-channel, multi-dimensional square Fourier (phasor) volume (e.g., an $N \times N$ volume for 2D tasks). In a nutshell, PREF employs the inverse Fourier transform in Eq. 1 to map a spatial coordinate into a multi-channel feature. The mapped result can be fed into an MLP to process task-specific fields such as SDFs or RFs. Figure 2 illustrates this. As each channel is independent, PREF can transform the spatial coordinates and update the phasor volume in parallel. Therefore, we simply omit the channel index in the rest of the derivations for clarity. We start by defining the phasor volume of PREF and subsequently discuss how to efficiently extract features from PREF and then optimize the volume.

Figure 2: Pipeline of our PREF.

4.1 Phasor Volume Decomposition

Brute-force representation of PREF with full frequencies is clearly too expensive in both computation and memory. Because many natural signals are band-limited, and to reduce complexity, we instead use a sparse set of frequencies to encode PREF. The process is equivalent to selectively marking a large portion of the entries in the matrix $P$ as zero. In particular, we set out to factorize $P$ by logarithmic sampling along each dimension. This results in two thin matrices $P_u$ and $P_v$, each keeping only $d$ log-sampled frequencies along one dimension, with $d$ being a small number. Consequently, the transform simplifies to $\mathcal{T}(P) \approx \mathcal{T}(P_u) + \mathcal{T}(P_v)$. The same formulation can also be easily extended to higher-dimensional signals with similar logarithmic sampling along individual dimensions.

Recall that recovering the field corresponds to transforming the full volume with selected entries marked as zero. For clarity, we keep the notation of the full volume $P$ in Sections 4.2 and 4.3. A more detailed discussion on complexity is carried out in Section 4.4.

4.2 IFT Approximation

Recall that full numerical integration of the inverse Fourier transform (i.e., Eq. 1) requires computing an expensive inner product between the entire phasor volume and the frequency-encoded basis. It needs to be computed for every single query point, prohibitively expensive for any practical frequency learning scheme. To make the computation tractable, we observe that if all input coordinates are equally spaced, then the transform simplifies to the discrete Fourier transform (DFT), which can be explicitly evaluated using fast Fourier transform (FFT) methods, e.g., the Cooley–Tukey algorithm. To handle an arbitrary (off-grid) input coordinate, one possible solution is to first compute a map at equally-spaced coordinates by 2D FFT and then perform bilinear interpolation on the resulting on-grid map. However, such a scheme, i.e., computing an entire dense grid with many unused vertices, provides a poor trade-off between complexity and accuracy.

We instead employ both FFT and numerical integration (NI) to achieve high accuracy at relatively low complexity. Specifically, we first perform a 1D FFT along the densely sampled axis to obtain an intermediate map $\tilde{P}_u$, and then evaluate, for a query point $(x, y)$,

$$g_u(x, y) = \sum_{k=0}^{d-1} \Psi\big(\tilde{P}_u, y\big)[k]\; e^{i 2\pi f_k x}, \qquad (5)$$

where $\Psi$ is a linear interpolation operation that interpolates from the intermediate map $\tilde{P}_u$ at $y$, and $f_k$ are the $d$ log-sampled frequencies along the reduced dimension. Note that the length $d$ of the reduced dimension is extremely small (details described in 4.1). Therefore, per-sample numerical integration is very efficient, significantly reducing the training cost. Further, different from interpolation in the spatial domain chen2022tensorf ; muller2022instant ; sun2021direct ; yu2021plenoxels , which results in vanishing high-order gradients, this frequency interpolation scheme benefits from the periodic characteristics of the frequency domain and manages to preserve gradients (see Fig. 1).
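A minimal 2D sketch of this hybrid FFT-plus-NI query follows (our own illustration with assumed shapes and helper names, not the authors' implementation; the 3D version used in practice is sketched in Algorithm 1 of the appendix):

import torch

def pref_query_2d(Pu, freqs, xy):
    # Pu:    [C, d, N] complex thin phasor matrix: d log-sampled frequencies
    #        along x, N dense frequencies along y (assumed layout).
    # freqs: [d] float tensor of retained frequencies along the reduced x axis.
    # xy:    [B, 2] query coordinates in [0, 1)^2.
    C, d, N = Pu.shape
    # 1) 1D inverse FFT along the dense axis -> on-grid intermediate map over y.
    Pu_y = torch.fft.ifft(Pu, dim=-1)                              # [C, d, N]
    # 2) Linear interpolation of the intermediate map at each sample's y.
    yi = xy[:, 1] * (N - 1)
    y0 = yi.floor().long().clamp(0, N - 2)
    w = yi - y0.float()                                            # [B]
    interp = Pu_y[..., y0] * (1 - w) + Pu_y[..., y0 + 1] * w       # [C, d, B]
    # 3) Per-sample numerical integration over the d retained x-frequencies.
    phase = torch.exp(1j * 2 * torch.pi * freqs[:, None] * xy[None, :, 0])  # [d, B]
    return torch.einsum('cdb,db->bc', interp, phase).real          # [B, C]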

4.3 Volume Regularization

Recall that many high-dimensional signal reconstruction problems, including NeRF, are ill-posed. Therefore, PREF alone, without additional priors, may still produce reconstruction artifacts on such problems. In traditional signal processing, regularization techniques such as Lasso regression (LR loss) and total variation (TV loss) have been imposed on natural signals to restrict the complexity (parsimony) of the reconstruction. Yet both LR and TV losses are designed for processing signals in the spatial domain and are not directly applicable to PREF. We therefore propose a novel Parseval regularizer:

$$\mathcal{L}_{\text{Parseval}} = \big\| 2\pi u \cdot P \big\|_2 + \big\| 2\pi v \cdot P \big\|_2, \qquad (6)$$

where the norms are taken over all entries of the frequency-weighted phasor volume.

We show that the Parseval regularizer behaves like anisotropic TV in the spatial domain.

Lemma 3.

Let $f$ be integrable and $P$ be its Fourier transform. The anisotropic TV loss of $f$ can be represented by $\| 2\pi u \cdot P \|_2 + \| 2\pi v \cdot P \|_2$.

Proof: Recall the TV loss can be computed as $\mathrm{TV}(f) = \big\| \partial f / \partial x \big\|_2 + \big\| \partial f / \partial y \big\|_2$. Since $f$ and $P$ are Fourier pairs, the Fourier transform preserves the energy of the original quantity based on Parseval’s theorem (Theorem 2), i.e.,

$$\iint |f(x, y)|^2 \, dx\, dy = \sum_{u}\sum_{v} |P[u, v]|^2. \qquad (7)$$

According to Theorem 1, $\partial f / \partial x$ and $i 2\pi u\, P[u, v]$ are also Fourier pairs. The integrated derivative along the $x$ axis is therefore

$$\iint \Big| \frac{\partial f}{\partial x} \Big|^2 \, dx\, dy = \sum_{u}\sum_{v} \big| 2\pi u\, P[u, v] \big|^2. \qquad (8)$$

By taking the square root on both sides, we have $\big\| \partial f / \partial x \big\|_2 = \| 2\pi u \cdot P \|_2$. And $\big\| \partial f / \partial y \big\|_2 = \| 2\pi v \cdot P \|_2$ can be derived with a similar proof.
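A minimal sketch of how such a frequency-weighted penalty could be computed on a dense 2D phasor grid (our own illustration; in practice the frequency indices would be the actual, possibly zero-centered and log-sampled, frequencies rather than raw array indices):

import torch

def parseval_regularizer(P):
    # P: [C, Nu, Nv] complex phasor volume with integer frequency indices.
    # Penalizes ||2*pi*u * P||_2 + ||2*pi*v * P||_2, which by Parseval's
    # theorem matches the L2 norms of the spatial derivatives of the field.
    C, Nu, Nv = P.shape
    u = torch.arange(Nu, dtype=torch.float32).view(1, Nu, 1)
    v = torch.arange(Nv, dtype=torch.float32).view(1, 1, Nv)
    loss_u = torch.linalg.vector_norm(2 * torch.pi * u * P)
    loss_v = torch.linalg.vector_norm(2 * torch.pi * v * P)
    return loss_u + loss_v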

4.4 Complexity Analysis

Our PREF approximation scheme reduces memory complexity by representing the feature space with a sparse set of frequencies: a multi-channel 3D feature volume is represented by three thin phasor volumes (one per axis), each keeping only $d$ frequencies along its reduced dimension, instead of a dense $N \times N \times N$ grid. Similar to NGP, we can equip PREF with a shallow MLP module to achieve full spectral reconstruction.

Finally, to further improve efficiency, we adopt a two-step procedure for computing the PREF features for all input coordinates: (1) we conduct FFT to compute the on-grid intermediate maps once per batch of inputs, and (2) we perform per-sample numerical integration (NI) of the Fourier transform. This indicates that the larger the batch, the lower the average complexity per input sample. For a feature volume of resolution $N$ with $M$ samples to be transformed and queried, our scheme reduces the complexity from naive 3D NI over the full volume to 2D FFTs (amortized over the batch) plus a cheap 1D NI per sample. Notice that $M$ is large in a single training batch. Therefore, our implementation only incurs a small increase in computational cost over spatially-embedded acceleration schemes. Yet PREF reduces memory consumption, improves modeling/rendering quality via full spectral reconstruction, and provides convenient frequency manipulations.

5 Experiments

We evaluate our PREF with a number of natural signal processing tasks, including 2D image completion, signed distance field regression, and radiance field reconstruction. For each task, we tailor a solution based on our PREF.

5.1 Baselines

We first analyze and compare our PREF with three standard backbones of coordinate-based MLPs, which also focus on boosting representation capability or training efficiency.

Positional Encoding with Fourier Features. Fourier features tancik2020fourfeat aim to learn high-frequency functions in low-dimensional problem domains. Fourier features transform low-dimensional coordinates $\mathbf{x}$ to $\gamma(\mathbf{x}) = \big[\sin(2^0 \pi \mathbf{x}), \cos(2^0 \pi \mathbf{x}), \ldots, \sin(2^{L-1} \pi \mathbf{x}), \cos(2^{L-1} \pi \mathbf{x})\big]$ before passing them into an MLP. Such a mapping function is deterministic and robust to the hyperparameter $L$. However, the exponentially spaced, on-axis frequency series is insufficient to cover the complete frequency domain. Therefore, PE is biased toward axis-aligned signals tancik2020fourfeat .
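For reference, a minimal sketch of such an axis-aligned encoding (our own illustration; the exact frequency convention varies across implementations):

import torch

def fourier_features(x, num_freqs=6):
    # x: [B, D] coordinates; returns [B, 2 * num_freqs * D] sinusoidal features.
    # Exponentially spaced, axis-aligned frequencies, as in standard positional encoding.
    freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi       # [L]
    angles = x[..., None] * freqs                              # [B, D, L]
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)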

Periodic Activation Functions. SIREN sitzmann2019siren proposes an alternative encoding that applies periodic activations throughout the network, i.e., $\sin(W\mathbf{x} + b)$ at each layer. A major advantage of SIREN is that it better preserves high-order gradients and subsequently supports network modulation by controlling the amplitude and phase of the active layer. A major challenge is that the periodic activation introduces a large number of local minima; thus it requires careful network initialization for stable training.
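Below is a minimal sine-layer sketch in the spirit of SIREN (the scaling constant w0 and the uniform initialization bounds follow the commonly used scheme and should be treated as assumptions rather than the paper's settings):

import torch
import torch.nn as nn

class SineLayer(nn.Module):
    # Linear layer followed by a scaled sine nonlinearity.
    def __init__(self, in_dim, out_dim, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_dim, out_dim)
        # First layer uses a wider uniform range; deeper layers are scaled by 1/w0.
        bound = 1.0 / in_dim if is_first else (6.0 / in_dim) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))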

Spatial Feature Grids. The latest acceleration schemes, including DVGO sun2021direct , i-NGP muller2022instant , and TensoRF chen2020tensor , use a learnable spatial feature grid with a local transform: TensoRF uses an orthogonal projection, DVGO uses linear interpolation, and i-NGP uses hashing to map spatial coordinates into a learnable embedding space. These schemes greatly reduce the dependency on large MLPs and are highly efficient in training (e.g., a PyTorch implementation reduces NeRF training time from hours to minutes, and a CUDA implementation further reduces the time cost to seconds). Yet, as local interpolants, they struggle to maintain continuity while avoiding diminishing high-order gradients.

PREF. Our PREF employs an inverse Fourier transform of a learnable embedding grid in the frequency domain. It continuously (after the inverse Fourier transform) and globally encodes spatial coordinates into a sum of inner products. Here we emphasize the "global" characteristics of PREF because each phasor impacts all spatial locations. Next, we will demonstrate that PREF can simultaneously achieve robustness via frequency decomposition and maintain high-order gradients via periodic modeling. It is also highly efficient for embedding-based learning equipped with shallow MLPs.

5.2 Applications

We evaluate our PREF on three neural reconstruction tasks, including image regression, SDF regression and radiance field reconstruction. The choice of phasor volume size and MLP varies with tasks, mainly for fair comparisons with the aforementioned baselines, which we detail in each task.

Figure 3: Quality results of our PREF and comparison methods (NeRF martin2021nerf , Plenoxels yu2021plenoxels , and DVGO sun2021direct ) on the synthetic NeRF scenes. Benefiting from our phasorial embedding, we manage to preserve details while rendering fewer outliers.

Table 1: 2D image completion results (mean ± standard deviation of PSNR) on the Natural and Text datasets, comparing Dense Grid sun2021direct , PE tancik2020fourfeat , SIREN sitzmann2019siren , and Ours.

2D Image Regression and Reconstruction.

2D image regression aims to evaluate the capability of the representation supervised with all image pixels, whereas image reconstruction or inpainting is trained with partial pixels and predicts the missing ones. We quantitatively evaluate the PSNR between the outputs and the ground-truth images for both tasks.

We first conduct pilot experiments of frequency learning on the image regression task, as shown in Figure 1. The dense grid setting, corresponding to the 3rd and the 5th columns, adopts linear interpolation in a learnable feature grid, optionally followed by an MLP; our PREF uses the same volume resolution as the dense grids and performs the standard inverse Fourier transform to predict the target image. Our method significantly outperforms the dense grid setting in both convergence speed and pixel reconstruction quality. We attribute this improvement to the periodic nature of our Fourier decomposition, which regularizes the whole signal globally rather than regionally as in dense grids. We show that local parameterization with linear interpolation can hurt generalization, consequently producing noisy results and losing high-order gradients.

We further evaluate our PREF on the inpainting task, where we train PREF on a regularly-spaced subset of pixels from each image in the Natural and Text datasets and predict the missing ones. In this experiment, our PREF consists of a phasor volume with reduced dimension d = 8 and a two-layer MLP. Following fathony2020multiplicative , we report the test error on the unobserved pixels in Tab. 1. As expected, our approach also achieves qualitatively better reconstruction results.

Method BatchSize Steps Time Size(MB) PSNR SSIM
SRN sitzmann2019scene - - 10h - 22.26 0.846
NeRF mildenhall2020nerf 4096 300k 35h 5.0 31.01 0.947
SNeRG hedman2021baking 8192 250k 15h 1771.5 30.38 0.950
NSVF liu2020neural 8192 150k 48h - 31.75 0.950
PlenOctrees yu2021plenoctrees 1024 200k 15h 1976.3 31.71 0.958
Plenoxels yu2021plenoxels 5000 128k 11.4m 778.1 31.71 0.958
DVGO sun2021direct 5000 30k 15.0m 612.1 31.95 0.957
TensoRF chen2022tensorf 4096 30k 17.4m 71.8 33.14 0.963
Ours 4096 30k 18.1m 34.4 32.08 0.952
Table 2: Application to fast radiance field reconstruction. We compare our approach with pure MLP models (SRN), positional-encoding-based approaches (NeRF, SNeRG, PlenOctrees), and grid-based approaches (Plenoxels, DVGO).

SDF regression and editing. Next, we explore the capability of PREF for geometric representation mescheder2018occupancy ; park2019deepsdf and editing. We evaluate 3D shape regression from given point clouds together with their SDF values. We adopt Armadillo and Gargoyle, two widely-used models. For each model, we normalize the training mesh into a bounding box and sample points for training: points on the surface, points around the surface obtained by adding Gaussian noise to surface points, and points uniformly sampled within the bounding box. We report the IOU of the ground truth mesh and the regressed signed distance field by discretizing them into two volumes. We also report the Chamfer distance metric by sampling surface points from the predicted mesh extracted with marching cubes lorensen1987marching . We implement DVGO ourselves and adopt an existing implementation of i-NGP (from https://github.com/ashawkey/torch-ngp) for comparisons. We include detailed parameter choices in the supplementary. The quantitative evaluation in Tab. 4 demonstrates that we achieve competitive results with existing state-of-the-art methods while maintaining a more compact model.

Our trained SDF model also allows implicit surface editing such as Gaussian smoothing, where we apply a point-wise multiplication of a Gaussian kernel on the trained phasorial embedding, as shown in Fig. 4. This may be useful for surface denoising or texture removal.

Armadillo Gargoyle
Memory (MB) IOU Chamfer IOU Chamfer
i-NGP muller2022instant 46.7 99.34 5.54e-6 99.42 1.03e-5
DVGO sun2021direct 128.0 98.81 5.67e-6 97.99 1.19e-5
PE tancik2020fourfeat 6.0 96.65 1.21e-5 80.46 1.03e-4
Ours 36.0 99.02 5.57e-6 99.05 1.06e-5
Table 3: SDF regression results. We compare our approach with both coordinate-based models (PE) and embedding-based models (i-NGP, DVGO).
Figure 4: SDF editing via phasorial embedding. After the SDF is regressed, PREF allows implicit surface editing by manipulating the phasorial embedding. This figure shows the smoothing effect obtained by point-wise scaling the embedding with a Gaussian function of the frequency coordinate. Notice we perform smoothing in the neural field, which differs from existing techniques such as mesh smoothing.

Neural Radiance Field Reconstruction. Finally, we evaluate PREF on the popular radiance field reconstruction task. NeRF reconstruction attempts to recover scene geometry and appearance given a set of multi-view input images with posed cameras. In this task, we evaluate the reconstruction quality of novel views, the model size for compactness, and the efficiency in terms of training speed. In practice, we model the density and color individually, then jointly optimize them via a volume rendering scheme mildenhall2020nerf supervised only with image color. However, this sometimes leads to overfitting to the training views and floaters in empty space due to the strong model capacity. We utilize the Parseval term to regularize the embedding field, as described in Sec. 4.3. We set the expected volume size to roughly 36MB of parameters. Please see the supplementary for detailed information. During optimization, we apply a coarse-to-fine training scheme starting from the low-frequency series. We then progressively unlock the remaining high-frequency series during training to reach the expected highest frequency. Tab. 2 shows the quantitative results; our model is on par with state-of-the-art radiance field reconstruction approaches while achieving compact modeling and fast training.

6 Conclusion

We have presented a novel neural approach for compact modeling that decomposes a natural signal into a Fourier series, and developed a new approximate inverse transform scheme for efficient reconstruction. PREF produces high-quality images, shapes, and radiance fields from limited data and outperforms recent works. Benefiting from our physically meaningful Fourier decomposition and fast transformation, PREF allows explicit manipulation of the learned embedding at different frequencies. One interesting future direction is to apply our representation to 3D-aware image generation, where PREF is potentially able to resolve the challenge of high-resolution rendering.

References

Appendix A Phasorial Embedding Fields Implementation Details

Phasor Volume Decomposition. Recall that PREF is a continuous embedding field corresponding to a multi-channel, multi-dimensional square Fourier volume. We elaborate on implementation details here. Let $P$ be a 3D phasor volume representing an embedding field $g$.

Note that $P$ is Hermitian symmetric when the feature embedding is real-valued, i.e., each entry at a negated frequency equals the complex conjugate of its counterpart. Further, based on the observation that natural signals are generally band-limited, we model their corresponding fields with band-limited phasor volumes, where we partially mask out some entries and factor the volume along the respective dimensions, as shown in Fig. 5. Thus we factorize the full spectrum into three thin embeddings by reusing the linearity of the Fourier transform: $\mathcal{T}(P) \approx \mathcal{T}(P_u) + \mathcal{T}(P_v) + \mathcal{T}(P_w)$.

Figure 5: Phasor volume decomposition. The phasor volume is shown with the zero frequency centered. We selectively mask out some entries of the phasor volume as zero, then approximate the spatial feature embedding of a large phasor volume in terms of the sum of the embeddings from three smaller ones, one for each dimension. The complete mapping from phasor volume to spatial feature embeddings is shown in the PyTorch pseudo-code in Algorithm 1.

IFT Implementation. Recall that for a 3D phasor volume, we approximate the inverse Fourier transform via sub-procedures of applying 2D fast Fourier transforms (FFTs) and 1D numerical integration (NI) to achieve high efficiency. Given a batch of spatial coordinates, our PREF representation transforms them into a batch of feature embeddings in parallel, so PREF can serve as a plug-and-play module. Such a module can be applied to many existing implicit neural representations to conduct task-specific neural field reconstructions. We present a sketch of the PyTorch pseudo-code in Algorithm 1.

import torch
import torch.nn as nn
import torch.nn.functional as F

def grid_sample_cmplx(vol, grid):
    # Bilinear sampling for complex tensors: sample the real and imaginary
    # parts separately with grid_sample and recombine.
    return torch.complex(F.grid_sample(vol.real, grid, align_corners=True),
                         F.grid_sample(vol.imag, grid, align_corners=True))

def batch_NI(P, coords, freqs):
    # Per-sample numerical integration over the reduced (log-sampled) frequencies:
    # sum_k P[:, k, b] * exp(i * 2*pi * f_k * x_b), keeping the real part.
    phase = torch.exp(1j * 2 * torch.pi * freqs[:, None] * coords.t())  # [d, B]
    return torch.einsum('kdb,db->kb', P, phase).real

class PREF(nn.Module):
    def __init__(self, res, d, ks):
        # res: resolution (Nx, Ny, Nz) of the full phasor volume
        # d:   reduced dimension size (number of retained frequencies per axis)
        # ks:  output kernel (feature channel) size
        super().__init__()
        Nx, Ny, Nz = res
        # Logarithmically sampled frequencies in the reduced dimension
        self.freq = torch.tensor([0.] + [2. ** i for i in range(d - 1)])
        # Three thin complex-valued phasor volumes, one per reduced axis
        self.Pu = nn.Parameter(torch.zeros(1, ks, d, Ny, Nz, dtype=torch.complex64))
        self.Pv = nn.Parameter(torch.zeros(1, ks, Nx, d, Nz, dtype=torch.complex64))
        self.Pw = nn.Parameter(torch.zeros(1, ks, Nx, Ny, d, dtype=torch.complex64))

    def forward(self, xyz):
        # xyz: [B, 3] coordinates, assumed normalized to the sampling grid range
        # 2D inverse FFT along the two dense axes of each thin volume
        Pu = torch.fft.ifftn(self.Pu, dim=(3, 4))
        Pv = torch.fft.ifftn(self.Pv, dim=(2, 4))
        Pw = torch.fft.ifftn(self.Pw, dim=(2, 3))
        # 2D linear interpolation on the intermediate maps
        xs, ys, zs = xyz.chunk(3, dim=-1)
        Px = grid_sample_cmplx(Pu.flatten(1, 2), torch.stack([zs, ys], dim=-1)[None]).reshape(Pu.shape[1], Pu.shape[2], -1)
        Py = grid_sample_cmplx(Pv.transpose(2, 3).flatten(1, 2), torch.stack([zs, xs], dim=-1)[None]).reshape(Pv.shape[1], Pv.shape[3], -1)
        Pz = grid_sample_cmplx(Pw.transpose(2, 4).flatten(1, 2), torch.stack([xs, ys], dim=-1)[None]).reshape(Pw.shape[1], Pw.shape[4], -1)
        # 1D numerical integration over the reduced-axis frequencies
        fx = batch_NI(Px, xs, self.freq)
        fy = batch_NI(Py, ys, self.freq)
        fz = batch_NI(Pz, zs, self.freq)
        # Summation (linearity of the Fourier transform)
        return fx + fy + fz
Algorithm 1: PyTorch pseudo-code for PREF

Phasor Volume Initialization. Our PREF approach can alternatively be viewed as a frequency-space learning counterpart to existing spatial coordinate-based MLPs. In our experiments, we found that zero initialization works well for applications ranging from 2D image regression to 5D radiance field reconstruction, while certain applications require more tailored initialization, e.g., the geometric initialization in SAL ; yariv2020multiview . This is because the synthesized embedding field needs to satisfy the constraints of that tailored initialization. We thus initialize the phasor volume as follows: let $g_0$ be the desired initialization of the embedding field. We set the phasor volume to the Fourier transform of $g_0$, so that transforming it back via the inverse Fourier transform (due to the duality between the forward and inverse transforms) approximates $g_0$. We found such a strategy enhances stability and efficiency.
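A minimal sketch of this strategy (our own illustration; the tensor layout is an assumption): rasterize the desired initialization onto a spatial grid, take its FFT, and use the result as the initial phasor volume.

import torch

def init_phasor_from_spatial(g0):
    # g0: [C, Nx, Ny, Nz] real-valued spatial field carrying the desired
    # initialization (e.g., a coarse sphere-like SDF). Returns a complex phasor
    # volume whose inverse Fourier transform approximates g0.
    return torch.fft.fftn(g0.to(torch.complex64), dim=(-3, -2, -1))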

Computation Time. One of the key benefits of PREF is its efficiency. As discussed in Section 4.3, we conduct frequency-based neural field reconstruction by employing the IFT, which is computationally low-cost and at the same time effective. When the input batch is sufficiently large (e.g., the large per-batch sample counts used in radiance field reconstruction), the per-sample numerical evaluation dominates the computational cost. Since such per-sample evaluation can be efficiently implemented as a matrix product, it is essentially equivalent to adding a tiny linear layer. The overall implementation makes PREF nearly as fast as the state-of-the-art, e.g., Instant-NGP for NeRF. For example, on the Lego scene, our PyTorch PREF produces the final result in 16 minutes on a single RTX 3090, considerably faster than the original NeRF and comparable to the PyTorch implementation of i-NGP. We are in the process of implementing an analogous CUDA version of PREF, which may achieve performance comparable to the CUDA version of i-NGP.

Appendix B Application to Image Regression

B.1 Implementation & Reproducibility Details

Pilot experiments. Before proceeding to more sophisticated tasks such as generation and reconstruction, we first explore a toy example that utilizes PREF to continuously parametrize an image, e.g., a grayscale image. To better provide insights into our frequency-based learning framework, we use a complex-valued grid to regress (upsample) the input image via the inverse Fourier transform, comparing PREF vs. bilinear upsampling with the same MLP network. From a signal processing perspective, if the image is band-limited within the grid’s Nyquist frequency, a frequency scheme should perfectly reconstruct the image. Yet bilinear interpolation exhibits aliasing due to its ill-behaved first- and second-order derivatives, as shown in the discussion and Fig. 1 in the paper. We then show the performance using embedding, where we expand the grid into an embedding volume followed by a three-layer MLP with a hidden dimension of 256 that maps the embeddings to pixels. Previous studies sitzmann2019siren have shown that improper activation functions can also lead to aliasing in high-order gradients, regardless of the choice of embedding technique. Therefore, for comprehensive studies, we further compare various commonly used activation functions, including ReLU, Tanh, and the most recent Sine sitzmann2019siren . Our experiments show that such a frequency-learning scheme consistently outperforms its spatial counterparts, potentially owing to its well-behaved derivatives and continuous nature, as shown in the line plot of Fig. 1.

Image completion. Next, we demonstrate PREF on image completion tasks. We use the commonly adopted setting huang2021textrm ; fathony2020multiplicative : given a regularly sampled 25% of the pixels of an image, we set out to predict the remaining pixels. We evaluate PREF against the state of the art on two benchmark datasets, Natural and Text. Specifically, we compare PREF with a dense grid counterpart and two state-of-the-art coordinate-based MLPs martin2021nerf ; sitzmann2019siren . The dense grid uses a full-resolution feature grid, whereas PREF uses two thin grids that correspond to the same highest frequency. The two embedding techniques use the same MLP with three linear layers, 256 hidden dimensions, and ReLU activation. Positional Encoding (PE) consists of a 5-layer MLP with frequency encoding. We adopt SIREN from sitzmann2019siren , which uses a 4-layer MLP with sine activations. Detailed comparisons are listed in Tab. 1.

Optimization details. All experiments use the same training configuration. Specifically, we adopt the Adam optimizer adam with default momentum parameters and a fixed learning rate. We use an ℓ2 loss and train for the same number of iterations across methods to produce the final results.

Armadillo Gargoyle
Memory (MB) IOU Chamfer IOU Chamfer
i-NGP muller2022instant 46.7 99.34 5.54e-6 99.42 1.03e-5
DVGO sun2021direct 128.0 98.81 5.67e-6 97.99 1.19e-5
PE tancik2020fourfeat 6.0 96.65 1.21e-5 80.46 1.03e-4
Ours 36.0 99.02 5.57e-6 99.05 1.06e-5
Table 4: SDF regression results. We compare our approach with both coordinate-based models (PE) and embedding-based models (i-NGP, DVGO).

Appendix C Application to Signed Distance Field Reconstruction

C.1 Task description

Next, we conduct the more challenging task of signed distance field (SDF) reconstruction. An SDF describes the shape in terms of a function

$$s(\mathbf{x}) = \begin{cases} \;\;\, d(\mathbf{x}, \mathcal{S}) & \text{if } \mathbf{x} \text{ is outside } \mathcal{S} \\ -d(\mathbf{x}, \mathcal{S}) & \text{if } \mathbf{x} \text{ is inside } \mathcal{S} \end{cases} \qquad (9)$$

where $\mathcal{S}$ is a closed surface, the positive and negative signs correspond to regions outside of and inside the surface respectively, and $d(\mathbf{x}, \mathcal{S})$ is the Euclidean distance from a point to the surface. Our goal is to recover a continuous SDF given a set of discretized samples of its value, which commonly come from a mesh.

C.2 Implementation & Reproducibility Details

Data preparation. We adopt two widely used models: Gargoyle (50k vertices) and Armadillo (49k vertices). For each training epoch, we scale the model within a bounding box and sample points for training: points on the surface, points around the surface obtained by adding Gaussian noise to surface points, and the remaining points uniformly sampled within the bounding box.
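A minimal sketch of such a sampling scheme (our own illustration; the split sizes and noise scale below are hypothetical placeholders, not the paper’s values):

import torch

def sample_sdf_points(surface_pts, n_surface=50_000, n_near=30_000, n_uniform=20_000, sigma=0.01):
    # surface_pts: [M, 3] points pre-sampled on the mesh surface, scaled to [-1, 1]^3.
    # Returns a mix of on-surface, near-surface (Gaussian-perturbed), and uniform samples.
    idx = torch.randint(0, surface_pts.shape[0], (n_surface,))
    on_surface = surface_pts[idx]
    idx = torch.randint(0, surface_pts.shape[0], (n_near,))
    near_surface = surface_pts[idx] + sigma * torch.randn(n_near, 3)
    uniform = torch.rand(n_uniform, 3) * 2 - 1
    return torch.cat([on_surface, near_surface, uniform], dim=0)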

Metric. We report the IOU of the ground truth mesh and the regressed signed distance field by discretizing them into two volumes. We report the Chamfer distance metric by sampling 30k surface points from the mesh extracted with the marching cubes technique lorensen1987marching .

Baseline implementation details. For the embedding-based baselines, we use our own implementation of the dense volume technique sun2021direct . It contains learnable parameters that transform the input coordinates to their feature embeddings by trilinear interpolation. We adopt the PyTorch implementation of i-NGP muller2022instant from torch-ngp (https://github.com/ashawkey/torch-ngp), which maintains a multi-level hash function to transform spatial coordinates into feature embeddings. We use a hash encoding with 16 levels and feature dimension 2; consequently, the output feature embedding has length 32. Please refer to muller2022instant for more details on the multi-level hash implementation. For our PREF, we use three complex-valued thin volumes to nonlinearly transform the spatial coordinates into feature embeddings. All the embedding-based baselines and our PREF adopt the same MLP structure for fairness, consisting of 3 layers that progressively map the input embedding to 64-dimensional features and then to a scalar, with ReLU as the intermediate activation. Another baseline we compare against is the positional encoding (PE) based NeRF that uses a wider and deeper coordinate-based MLP mildenhall2020nerf , where we encode the input coordinates into six frequencies in PE and use an MLP with 8 linear layers, 512 hidden dimensions, and ReLU activation. Tab. 4 lists individual model sizes and the performance of the baselines vs. PREF. Our method manages to be on par with the state-of-the-art i-NGP muller2022instant with a compact model size, and outperforms its spatial counterpart sun2021direct and frequency-based predecessors. We attribute the improvement of PREF to its globally continuous nature, which allows it to preserve details.

Training details. We provide additional details on how we train the baselines. As aforementioned, in each epoch we sample a batch of points to regress the SDF values. The MAPE loss is used for error back-propagation. To optimize the networks, we use the Adam optimizer with default momentum parameters. We use an initial learning rate that is reduced at the 10th epoch. We adopt the same batch size to optimize all baselines and our method.

Chair Drums Ficus Hotdog Lego Materials Mic Ship Mean Size (MB)
PlenOctrees yu2021plenoctrees 34.66 25.37 30.79 36.79 32.95 29.76 33.97 29.62 31.71 1976.3
Plenoxels yu2021plenoxels 33.98 25.35 31.83 36.43 34.10 29.14 33.26 29.62 31.71 778.1
DVGO sun2021direct 34.09 25.44 32.78 36.74 34.46 29.57 33.20 29.12 31.95 612.1
Ours 34.95 25.00 33.08 36.44 35.27 29.33 33.25 29.23 32.08 34.4
Table 5: PSNR results on each scene from the Synthetic-NeRF dataset mildenhall2020nerf . We show the comparisons of the dense volume variants with our PREF (frequency-based scheme).
Highest Freq Chair Drums Ficus Hotdog Lego Materials Mic Ship Mean Size (MB)
256 34.95 25.00 33.08 36.44 35.27 29.33 33.25 29.23 32.08 34.40
128 33.29 24.64 32.70 36.04 33.77 29.37 31.87 27.75 31.18 9.84
64 31.54 23.82 30.43 35.25 30.56 28.82 31.22 27.08 29.83 2.28
32 30.11 22.59 27.77 34.07 27.39 27.65 30.48 25.79 28.23 0.76
Table 6: Ablation study of phasor volume size (related to the highest frequency). We report the PSNR of each scene, the mean PSNR, and the corresponding mean model size. We train each scene in a matter of minutes using a pure PyTorch implementation on an RTX 3090, as discussed in the text.

Appendix D Radiance Fields Reconstruction

D.1 Task description

For radiance fields, we focus on rendering novel views from a set of images with known camera poses. Each RGB value at each pixel corresponds to a ray cast from the image plane. We adopt the volume rendering model martin2021nerf :

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\, \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big), \qquad (10)$$

where $\sigma_i$ and $\mathbf{c}_i$ are the corresponding density and color at the sampled location $\mathbf{x}_i$, and $\delta_i$ is the interval between adjacent samples. We then optimize the rendered color against the ground truth color with an ℓ2 loss:

$$\mathcal{L}_{\text{rgb}} = \sum_{\mathbf{r}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2. \qquad (11)$$
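For reference, a minimal single-ray sketch of this standard compositing model (our own illustration, not the authors' renderer):

import torch

def composite_ray(sigma, rgb, delta):
    # sigma: [N] densities, rgb: [N, 3] colors, delta: [N] sample intervals along one ray.
    alpha = 1.0 - torch.exp(-sigma * delta)                              # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0), dim=0)[:-1]
    weights = trans * alpha                                              # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)                           # [3] rendered color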

D.2 Implementation & Reproducibility Details

PREF model setting. We describe how PREF models the density and the radiance. We use three thin phasor volumes cascaded with a two-layer MLP with hidden dimension 64 and output dimension 1 for computing the density (a scalar). We then use a Softplus to map the raw output to a positive-valued density. For the view-dependent radiance branch, we use a relatively large volume followed by a linear layer to output the feature embedding. To render view-dependent radiance, we follow the TensoRF pipeline chen2020tensor : we concatenate the result with the positionally encoded view directions and feed them into a 2-layer MLP followed by a linear layer that maps the feature to color with a Sigmoid activation. All linear layers except the output layer use ReLU activation.

Rendering. To compare with SOTA yu2021plenoxels ; sun2021direct ; chen2020tensor , we train each scene for 30k iterations with a batch of 4096 rays. We adopt a progressive training scheme, growing the highest frequency from a low value up to 256 over the course of training; specifically, we gradually unlock the higher frequencies at fixed training steps. Accordingly, the number of samples per ray progressively increases from about 384 to about 1024. This allows us to achieve more stable optimization by first fitting the lower frequencies and later the high-frequency details. During training, we maintain an alpha mask to skip empty space and avoid unnecessary evaluations.
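One simple way to realize such a schedule is to mask the high-frequency entries of the phasor volume and widen the mask as training progresses (a hypothetical sketch; the step numbers are placeholders, not the paper’s schedule):

import torch

def frequency_mask(P, max_freq):
    # P: [C, Nu, Nv] phasor volume with the zero frequency at index 0.
    # Zero out all coefficients whose frequency index exceeds max_freq.
    C, Nu, Nv = P.shape
    u = torch.arange(Nu).view(1, Nu, 1)
    v = torch.arange(Nv).view(1, 1, Nv)
    keep = ((u <= max_freq) & (v <= max_freq)).to(P.dtype)
    return P * keep

# Hypothetical schedule: unlock higher frequencies at fixed training steps.
schedule = {0: 32, 5000: 64, 10000: 128, 15000: 256}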

Optimization. As mentioned in the paper, our PREF uses the Parseval regularizer to avoid overfitting, where the objective is the rendering loss plus a weighted Parseval term. Without regularization, PREF may overfit specific frequencies, as shown in Fig. 6. On the NeRF synthetic dataset, PREF converges in roughly 18 minutes on average with 30k iterations on a single RTX 3090, with an initial learning rate that is gradually decayed by a factor of 10 during training. The Adam optimizer uses its default momentum parameters.

(a) without the Parseval regularizer
(b) with the Parseval regularizer
Figure 6: Ablation study of the Parseval regularizer. Different from spatial embedding, which easily produces outliers owing to per-location parameterization, PREF globally parameterizes the scene’s embedding with sinusoidal waves, often producing fewer spatial outliers (isolated points). Yet, PREF runs the risk of frequency outliers (isolated waves), especially when the highest frequency used exceeds the minimal sampling rate of the observations (training samples). As Fig. 6 (a) shows, PREF without the regularizer can overfit to specific spatial frequencies (see the disparity map on the right), which results in a performance drop. We arrive at better results via our proposed Parseval regularizer.

D.3 Additional results

We report the breakdown results of our PREF on the Synthetic-NeRF dataset in Tab. 5. To further evaluate the effectiveness of our frequency encoding, we report the performance in Tab. 6 under different model sizes (by varying the phasor volume size). Notice that PREF produces reasonable results (a mean PSNR of 28.23) even when the model size is reduced to an ultra-small 0.76 MB, a potential benefit for downstream generative tasks that require training thousands of scenes.

Appendix E Application to Shape Editing

E.1 Implementation details

Recall that the continuous embedding field of PREF is synthesized from a phasor volume under various frequencies. Therefore, thanks to Fourier transforms, various tools such as convolution in the continuous embedding fields can be conveniently and efficiently implemented as multiplications. This is therefore a unique advantage of PREF compared with its spatial embedding alternatives chen2020tensor ; sun2021direct ; yu2021plenoxels ; muller2022instant .

Let $\Phi$ and $P$ be the optimized MLP and phasor volume, respectively, and let $\mathcal{T}$ represent the inverse Fourier transform. Recall that we obtain a reconstructed field by $f(\mathbf{x}) = \Phi\big(\mathcal{T}(P)(\mathbf{x})\big)$. Modification of the original signal via convolution-based filtering can now be derived as

$$f_H(\mathbf{x}) = \Phi\big(\mathcal{T}(P \odot H)(\mathbf{x})\big), \qquad (12)$$

where $\odot$ denotes element-wise multiplication and $H$ is a filter defined in the frequency domain.

Now, we explore how to manipulate the field via the optimized phasor volume and the kernel $H$. For simplicity, we only consider the Gaussian filter, while more sophisticated filters can be applied in the same way. Assume

$$H[\mathbf{k}] = \exp\!\big(-\|\mathbf{k}\|^2 / (2\sigma^2)\big), \qquad (13)$$

where the frequency coordinate $\mathbf{k}$ covers the complete frequency span of $P$; that is, we can scale the magnitude of the phasor features frequency-wise. For example, by varying the Gaussian kernel size via $\sigma$, PREF can denoise the neural representation of the signal at different scales, as shown in Fig. 7.
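A minimal sketch of such frequency-wise Gaussian scaling on a 3D phasor volume (our own illustration; the layout, with the zero frequency at index 0, is an assumption):

import torch

def gaussian_filter_phasor(P, sigma):
    # P: [C, Nu, Nv, Nw] complex phasor volume. Scales each coefficient by a
    # Gaussian of its frequency magnitude, which smooths the reconstructed field.
    C, Nu, Nv, Nw = P.shape
    u = torch.arange(Nu).view(Nu, 1, 1)
    v = torch.arange(Nv).view(1, Nv, 1)
    w = torch.arange(Nw).view(1, 1, Nw)
    k2 = (u ** 2 + v ** 2 + w ** 2).float()
    H = torch.exp(-k2 / (2 * sigma ** 2))
    return P * H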

Figure 7: Filtering signed distance fields via Gaussian smoothing using PREF.

Appendix F Limitations

We have demonstrated that PREF enables fast reconstruction of neural signals in the phasor (frequency) space, with smaller model sizes, comparable and sometimes better performance, and more efficient filtering capabilities. Compared with existing spatial embedding techniques, PREF, however, requires additional computation for conducting Fourier transforms and is therefore slightly slower than prior art such as i-NGP (PyTorch). Our immediate next step is to implement a CUDA version of PREF. However, certain autograd libraries do not readily support complex-valued parameter optimization; therefore, additional effort is required to write customized CUDA modules for PREF.

Similar to PE martin2021nerf , PREF masks out certain spatial frequencies in the phasor volume to achieve efficiency and compactness. This may lead to directional bias, as observed in prior art tancik2020fourfeat . However, since PREF uses more frequencies (3D sparse frequencies) than PE (axis-aligned 1D frequencies), PREF effectively reduces these artifacts, as shown in the experiments. For further improvement, one may adopt the non-uniform fast Fourier transform (NuFFT) fessler03 ; beatyy05 ; muckley20 to tackle non-uniform frequency sampling. Overall, by providing a new frequency perspective on neural signal representation, PREF may stimulate significant future work. To that end, we intend to make our code and data available to the community on GitHub.