1 Introduction
Although convolutional neural networks (ConvNets) have been successfully applied to solve nontrivial image processing problems since the 1990s (LeCun et al., 1989, 2012), their adoption as a de facto standard for image classification (Russakovsky et al., 2015) and segmentation (Long et al., 2015) is due largely to recent breakthroughs in network architecture. Beginning with AlexNet in 2012 (Krizhevsky et al., 2012)
, the annual ImageNet classification challenge (ILSVRC) has been dominated by progressively deeper networks with smaller kernels
(Szegedy et al., 2015; Simonyan & Zisserman, 2014b). Recent solutions to the issues of vanishing and exploding gradients (Glorot & Bengio, 2010) have allowed these networks to extend even deeper, with the ILSVRC15 winner (ResNet (He et al., 2016)) being 8fold deeper than VGG.It is easy to see why the “deeper is better” trend has led to better performing ConvNets. Constructing even a modest 7 x 7 receptive field with stacked kernels requires fewer parameters than a single kernel of size . Intuitively it also captures a richer set of features due to additional nonlinearity. Recent studies have begun to formalize the expressive power of deep versus shallow networks, finding that classification boundaries acquire local curvature and expressivity as an exponential function of network depth but not breadth (Poole et al., 2016). The only obvious tradeoff to this performance is the extra memory necessary to store the intermediate activations in deeper networks.
Motivated by the success of these models in image processing tasks, researchers have begun to investigate ConvNet applications in the video processing domain. Example applications include video classification (Karpathy et al., 2014), segmentation (Couprie et al., 2013) and denoising (Shi et al., 2016b). An important observation that has emerged from these studies is the importance of 3D convolutional primitives for modelling joint spatiotemporal features; the naïve application of traditional 2D ConvNets framebyframe does not capture motion continuity or other rich temporal correlations (Ledig et al., 2016; Tran et al., 2015). It is thus unsurprising that simple 3D ConvNets have yielded stateoftheart performance on video classification benchmarks (Tran et al., 2015)
and volumetric image segmentation, e.g. tracing neurons between electron microscopy samples
(Lee et al., 2015).Given the early success and conceptual simplicity of 3D ConvNets, it is interesting to note that many popular deep learning libraries (e.g. Caffe
(Jia et al., 2014)) do not provide native support. One simple explanation is that these libraries are optimized for execution on GPUs, and higherorder convolutions require prohibitively large volumes of data with respect to the 16 GB ceiling of today’s most advanced GPU hardware. These limitations are clear in previous studies, which either (a) limit the network size (Tran et al., 2015), (b) downsample images to lower resolution (Ji et al., 2013), or (c) include 3D primitives for only a subset of network layers (Lee et al., 2015).There are many potential options for circumventing the issue of ConvNet memory usage. The first is to split the network across multiple GPUs, which requires the careful coordination of activation and gradient flow (Dean et al., 2012). Even in the case of the most successful distributed frameworks for ConvNets (Abadi et al., 2016)
, GPU memory management is largely unresolved. The TensorFlow authors propose two partial solutions warranting further investigation: (a) recomputing versus storing large tensors; and (b) transferring longlived tensors from GPU to host CPU memory. Instead, we propose an alternative to horizontal scalability for overcoming GPU memory constraints – a fast implementation of
dimension convolution optimized for multicore CPU systems, which have access to practically unbounded memory on a single node.2 Prior Art
Algorithms for fast convolution have existed in signal processing literature since the 1980s (Winograd, 1980)
. The general recipe is to transform both data and kernel into a new space, where expensive sliding windowstyle convolutions reduce to cheaper elementwise products. The first examples of this approach in ConvNet literature involved Fast Fourier Transforms (FFTs) exploiting the convolution theorem
(Mathieu et al., 2013; Vasilache et al., 2014). More recently, Lavin and Gray have pioneered the use of the more general class of Winogradstyle algorithms (Lavin & Gray, 2016; Winograd, 1980). Their implementation and its cuDNN derivatives (Chetlur et al., 2014) have produced stateoftheart GPU performance on deep networks of small kernels. In this Section we provide a brief overview of the theory underlying this approach, focusing on the aspects that are important for exploiting the architecture of multicore CPUs.2.1 Fast Vector Convolution
Consider the 1dimension convolution , where the kernel and data vectors are of length and . This problem can be rephrased as one of polynomial multiplication by introducing the associated polynomials , and , where the coefficients of are the solution to the desired convolution. This computation can be distributed across a set of efficient local computations by considering the Chinese Remainder Theorem (CRT) (Ding et al., 1996), as summarized in Theorem 1. By observing that for any polynomial of sufficiently high degree, we can exploit Theorem 1 to efficiently calculate , as shown in Algorithm 1, which can be conveniently rephrased in terms matrixvector products:
(1) 
where , and are introduced as the kernel, data and inverse transforms respectively. With respect to Algorithm 1, Step (1) is implemented by the kernel and data transforms, Step (2) by their transformed elementwise product and Step (3) by the final inverse transform.
Theorem 1 (CRT for Polynomials)
Let , where are pairwise coprime. If are a set of polynomials then there must exist a unique polynomial which satisfies the set of congruences:
provided the degree of m(x) is not less than that of s(x).
2.2 Minimal Winograd Algorithms
In the above formulation (1), the matrices and are the remainders of the polynomial divisions and by respectively. The derivation of is more involved and is omitted for brevity. Importantly, the only parameter required to synthesize these matrices (in addition to the kernel and data ) is the polynomial .
Traditionally, the selection of has been subject to many constraints. First, it should be chosen such that the transform matrices contain only degree1 (scalar) values. Lavin has published code that automates this procedure using the CookToom algorithm to produce transformed kernels and data both of length (Lavin, 2016). For an unpadded convolution of length and ignoring the cost of applying the transforms, this fast algorithm therefore requires fewer computations to calculate than the standard slidingwindow approach. Inappropriate selection of would yield matrices of polynomials (degree ) that require considerably more scalar multiplications to compute.
In reality, the transformations themselves require expensive matrix multiplications that can outweigh the above saving. Accordingly, existing implementations of fast convolution aim to synthesize matrices enriched for “simple” (e.g. integer) values. There are two motivations for this. First, it improves numeric stability which can have an impact on doubleprecision convolutions (Lavin & Gray, 2016). More importantly, it supports the handcrafting of minimal algorithms. These algorithms reduce the cost of applying transform matrices by identifying and eliminating redundant subexpressions. A famous instance of this approach was documented by Winograd (Winograd, 1980). Consider the following matrices:
By substituting these matrices into (1) and factoring out redundant computations, we arrive at the following minimal algorithm for vector convolution:
where:
This is a socalled algorithm for vector convolution, here for and . Importantly, this algorithm only works for fixed length kernel and data vectors (here ). Generating algorithms for different combinations requires both (a) searching over the space of possible polynomials as input to Lavin’s or similar code (Lavin, 2016), and (b) reducing the matrix multiplications to a minimal set of addition, multiplication and shifting operations. To our knowledge there are no automated solutions to either step and thus only a small set of handcrafted Winogradstyle algorithms (e.g. , and ) have been released as fast CPU (Dukhan, 2016) or GPU primitives (Chetlur et al., 2014).
3 Deep Tensor Convolution
Below we present an alternative approach to fast convolution that removes the need to handcraft minimal algorithms. This new approach is better suited to video and volumetric image processing for two main reasons. First, the number of terms involved in a closedform solution for 3 and higherdimensional convolutions makes Winogradstyle refactoring impractical. Second, by removing numeric simplicity as a constraint we are instead able to synthesize transforms optimized to CPU architectural constraints, e.g. data that are integer multiples of the AVX register width. This is made possible by the relaxed memory constraints of CPUs and allows us to close the previous CPUGPU performance gap by a full orderofmagnitude.
We first define dimensional convolution and describe how existing fast algorithms can be extended to this general case. Instead of crafting a minimal algorithm, we show how relaxed memory constraints and efficient sparse linear algebra of CPU systems can be leveraged to amortize transform costs. Later we show how architectureaware transform synthesis can lead to further acceleration.
3.1 Convolution in Dimensions
Mathematically, the standard convolutional layer used in 2D ConvNets extends trivially to higherdimensional tensors. Consider a network where for each layer , kernel and channel , the kernel weights and resulting feature map are both 3D tensors. This calculation can be expressed elementwise as:
(2) 
where is the bias term and
is a ReLU or other nonlinear activation function. This extends to higher dimensions by looping over additional subscripts on
and .The dimensionality of feature maps is clearly preserved in (2), e.g. a video at the input produces a video at the output. The triple loop ranges from 0 to the layer kernel size to perform slidingwindow convolution, and the loop is a reduction over the previous layer’s output channels. This differs from previous studies where the temporal axis is encoded as network channels and flattened after the first layer (Karpathy et al., 2014; Simonyan & Zisserman, 2014a), producing a single 2D image or class label at the output. These methods have been shown to produce less accurate results on a broad range of video processing tasks when compared to true 3D ConvNets (Tran et al., 2015).
It is also evident from (2) why higherdimensional ConvNets suffer from issues of impractical memory consumption. Each layer of an dimensional network requires and to be stored as and –dimensional tensors, owing to their operation over multiple kernels and channels. We believe that this multiplicative effect has likely stalled the adoption of the deeper network architectures that dominate image processing tasks, with recent studies instead compromising on network expressiveness to fit within the 16 GB memory constraints of today’s topend GPUs (Ji et al., 2013; Lee et al., 2015; Tran et al., 2015).
3.2 Accelerating Tensor Convolution
Sidestepping memory constraints by shifting from GPU to CPU hardware is conceptually trivial, as most popular ConvNet frameworks support execution on both CPU and GPU environments. However, the issue preventing the widespread adoption of CPU implementations is not a lack of software support but the large perceived gap between CPU and GPU performance. This is reminiscent of a large ongoing CPUvsGPU debate, with various studies claiming that GPUs provide anywhere from 100to1000x speedup across broad problem domains (Lee et al., 2010). A recent review has demonstrated a similar performance gap in the order of 50x across the most popular ConvNet frameworks (Shi et al., 2016a). Even if distributed GPU solutions like TensorFlow require tensors to be recomputed or swapped between GPU and host CPU memory (Abadi et al., 2016), this overhead is easy to justify if the alternative is a 50fold increase in singlenode execution time.
Here we describe how fast algorithms for convolution can be extended to the general case of dimensional tensors, where the theoretical speedup is a substantial . Although recent studies have begun to explore extensions of FFTbased convolution to 3dimensions (Zlateski et al., 2016), to our knowledge there have been no attempts to extend Lavin and Gray’s Winogradstyle approach (Lavin & Gray, 2016). In order to extend the fast vector algorithm to 1 to dimensions, we consider the mode product of a tensor, , with a matrix, , herein denoted as (Kolda & Bader, 2009):
(3) 
In our case is sparse and is dense, so we implement (3) such that is traversed in the outermost two loops. We also introduce the following notation for brevity:
The fast algorithm for tensor convolution applies the transforms , and separately to each dimension of the kernel and data tensors, and :
(4) 
It is straightforward to show that (1) is a special case of (4) by considering the following equivalence:
where the matrix is the mode major unfolding of tensor (Kolda & Bader, 2009). In the 1dimensional case, is simply and thus . Likewise in 2D, as and then (4) reduces to the case reported by (Lavin & Gray, 2016):
3.3 Amortizing Transform Costs
Manually reducing transform costs via Winogradstyle minimal algorithms is important for 2dimensional GPU implementations. However, this is less important for a CPU implementation of higherdimensional convolution. The reasons are twofold: (a) the matrix multiplication cost can be amortized across a larger number of kernels and channels due to relaxed memory constraints; and (b) CPUs are able to directly leverage the sparse structure of these matrices for further acceleration. Although efficient sparse linear algebra is possible on GPUs, this typically involves reshuffling sparse matrices into a dense representation (e.g. COO, CSR or ELLPACK (Grewe & Lokhmotov, 2011)) and introduces unnecessary computational overhead.
As a simple example, consider Winograd’s minimal F(2,3) algorithm presented in Section 2.2. Computing the output of length requires a total of 6 multiplications – 4 between the data and kernel, and 2 by a constant factor of 0.5. The 4 additions are ignored as modern CPUs can compute fused multiplyaccumulate operations in a single cycle. By contrast, computing explicitly by equation (1) requires 28 multiplications – 4 for the elementwise product, 16 for the data transform and 8 for the inverse transform (assuming transformed kernels are cached at training time). Even leveraging sparsity in the transform matrices requires 19 multiplications, which is more than triple that required for Winograd’s minimal algorithm.
The game changes when one considers these approaches in the context of a ConvNet layer with multiple channels and kernels. Without loss of generality, assume the numbers of kernels and channels are both equal to . As the inverse transform can be applied once over the reduced output and the data transform once across all kernels, the required number of multiplications is just (versus for Winograd). This can be reduced further to by exploiting the sparsity of and .
Although it is also possible to restructure Winograd’s algorithm to exploit the size of the network, for larger networks the
multiplications required by the elementwise product quickly renders the linear transform cost negligible. It is also impractical to construct similar minimal algorithms in higher dimensions. Consider the C3D network of
kernels that has yielded stateoftheart performance across many video processing benchmarks (Tran et al., 2015). As an example, we synthesize the following transform matrices such that convolution reduces to a elementwise product:The theoretical ceiling on speedup obtainable using these matrices is 8fold, ignoring the cost of the matrixtensor products required when applying (4). Figure 1 demonstrates the actual reduction in computations as a function of kernels and channels. For a network of just 100 kernels and 100 channels, it is possible to obtain greater than 6fold acceleration with respect to direct slidingwindow convolution. This is triple the performance margin that could be gained if the network was constrained to 10 kernels and channels due to a lower memory ceiling.
We can further improve this performance margin by exploiting the sparsity of the matrices themselves, as it is comparatively straightforward to implement efficient sparse linear algebra for CPUs. One might worry that the transform matrix sparsity is inversely proportional to the degree of . However, this simply suggests that our fast algorithm is best suited for networks of small kernels, which is fortunately wellaligned with recent trends in deep ConvNet architecture (He et al., 2016; Simonyan & Zisserman, 2014b; Szegedy et al., 2015). Sparsity and numerical precision also decrease as a function of . In practice, the data matrix D is not the full feature map (e.g. an ImageNet image) but rather one of many small, overlapping input tiles (each of size , stepping by along both axes) whose outputs are stitched together to form the final convolution result. In Section 4.2 we discuss how the fullyautomated nature of our implementation can leverage this property for further performance improvement.
4 Optimizing for CPU Architecture
There are a myriad of algorithmic tricks that can be applied to reduce the number of computations required for convolution. Consider the special case where our transforms are the discrete Fourier transform (DFT) and inverse DFT matrices. As the Fourier transform of a realvalued signal has Hermitian symmetry, the number of unique terms in the elementwise product can be reduced (Mathieu et al., 2013). More generally, one could also apply the Strassen algorithm to reduce the number of steps required for matrix multiplication (Cong & Xiao, 2014).
In practice, the merit of any of these approaches depends intimately on whether they can be implemented to effectively leverage hardware. Consider the 50to1 performance ratio observed between existing GPU and CPU implementations (Shi et al., 2016a). For the devices used in this study (Titan X versus Xeon E78890), the ratio of theoretical throughput is actually less than to 5to1. This seems to suggest that current CPU performance limitations are largely issues of software rather than hardware.
Although some previous studies have discussed CPUspecific performance optimizations for neural networks (Vanhoucke et al., 2011), these guidelines have not necessarily translated to optimal implementations. For example, the Eigen 3.2 linear algebra library (used until recently by TensorFlow) does not provide native support for AVX (vectorized) instructions, introducing a tight bottleneck on theoretical throughput. Looking beyond a single core, a recent review demonstrates poor multicore scalability across all major ConvNet frameworks (Shi et al., 2016a). Solving these two issues alone has the potential to close the CPUGPU gap by a full orderofmagnitude, and this improvement is multiplicative with the algorithmic savings described earlier.
4.1 SingleCore Utilization
Although our fast algorithm requires theoretically fewer computations to execute than naïve convolution (e.g. 8fold for C3D kernels), it is considerably more difficult to implement with high CPU utilization. Consider the elementwise product , summed for each channel to produce the dimensional tensor . We can compute the ratio of computations, i.e. 1 multiply and 1 accumulate operation per pair, to the volume of memory loaded:
Little’s Law shows this is problematic for effective CPU utilization, as convolution expressed in this form is bottlenecked by memory bandwidth (Little, 1961). To solve this problem, recall that is one of many small, overlapping tiles that span the fullsize feature map. Considering of these tiles, we introduce the following matrices:
(5) 
where (tilesbychannels) and (channelsbykernels). Each matrix captures a single coordinate in the earlier elementwise product, which is fused with the channelwise reduction into endtoend matrix multiplications:
As can be any number of the small input tiles, we can select to demonstrate a computetomemory ratio that grows linearly in the number of kernels.
The fast convolutional form in (5) is also wellsuited to a number of other practical CPU performance optimizations (Vanhoucke et al., 2011). Foremost among these is the effective use of AVX (vectorized) and FMA (fused multiplyaccumulate) floatingpoint SIMD operations. Consider the function FMA(), which calculates the sum of vector with the elementwise product and stores the result in , all in a single CPU cycle. This function can be leveraged for an efficient practical implementation of (5), as presented in Algorithm 2 for a single tilekernel pair and an AVX vector of width . An illustration of the 2dimensional case is provided in Figure 2. On our Xeon CPU with 256bit AVX registers and two dedicated FMA units, this optimization alone can yield a 32fold speedup over naïve implementations. This margin is expected to double with the introduction of 512bit AVX registers for Intel Skylake and Xeon Phi.
We benchmarked the performance of our fast convolution algorithm on a 1.44 TFLOP/s Xeon E78890 CPU and observe that it executes at 70% maximum utilization. This includes all steps from input to output, including all necessary data reshuffling. As a point of comparison, Intel’s own MKL convolutional primitive runs at just (excluding reshuffling) on the same processor. The Eigen 3.2. linear algebra library is lower utilization still, capped at just
due to a lack of AVX and FMA support. Both of these libraries have been widely used by popular ConvNet frameworks including Caffe, CNTK, TensorFlow and Torch.
4.2 AVXAware Transform Synthesis
The fully automated nature of our transform generation allows for the synthesis of transform matrices that optimize for CPU architectural constraints. From Figure 2, it is clear the full utilization can only be achieved if is an integer multiple of the AVX vector width
. This is an important optimization, as data volumes are constantly small (invariant of numbers of channels and kernels) and thus there is little opportunity to amortize padding overhead.
Table 1 summarizes statistics for example transforms that we have generated for square 2 and 3dimensional kernels, enumerated automatically using (Lavin, 2016). In each case, we generate transforms for the smallest possible such that and . The matrices are provided in the Supplementary Materials.
size  sparsity  speedup  

2D  3D  
(a)  4  2  3  0.33  0.50  0.25  2.25  3.38 
(b)  4  3  2  0.25  0.50  0.33  2.25  3.38 
(c)  8  4  5  0.20  0.31  0.19  6.25  15.63 
(d)  8  5  4  0.19  0.31  0.20  6.25  15.63 
(e)  8  6  3  0.17  0.31  0.21  5.06  11.39 
4.3 Multicore Scalability
Singlecore utilization is just one dimension of performance optimization. Many modern systems contain both multiple CPU chips, with shared access to host RAM; and multiple cores per chip, with shared access to faster L3 cache. We adopt a relatively simple parallelization scheme where threads simultaneously operate on different subsets of input tiles. To avoid memory contention and other concurrency issues we adopt the Cilk Plus workstealing scheduler supported by GCC 4.8 (Blumofe et al., 1996; Robison, 2013), simply applying its forkjoin primitive to all forloops with no iteration dependencies. The number of tiles per thread is empirically tuned to simultaneously maximize L3 cache utilization ( cannot be too large) and computetomemory ratio ( cannot be too small).
We observe that even this simple parallelization scheme yields nearoptimal linear scalability. In Figure 3 we present ConvNet throughput as a function of processor cores for both (a) our fast algorithm, and (b) our own multicore implementation of naïve convolution (which is comparatively simple to implement). Scalability is measured across a single convolution layer for a image with kernels of size . To avoid NUMA issues relating to expensive interchip communication, we spawn independent instances for each CPU in our 4socket sharedmemory server such that all 18 threads in Figure 3 are bound to a single chip. When using all 18 cores of our Intel Xeon E78890 CPU the scalability of (a) is 95% theoretically optimal. As a point of comparison, a recent review examined the scalability of popular ConvNet frameworks Caffe, CNTK, TensorFlow and Torch on a similar 16core Xeon E52630 CPU (Shi et al., 2016a). They reported multicore scalability ranging from (Caffe) to (TensorFlow), which is equivalent to a 2.3 to 5.9fold improvement with our implementation.
4.4 Performance Benchmarking
The most popular ConvNet benchmarks focus exclusively on GPU performance (Chintala, 2015). The only study we could find presenting thorough CPU benchmarking is that of Shi et al., comparing the throughput of Caffe, CNTK, Tensorflow and Torch for the AlexNet and ResNet architectures (Shi et al., 2016a). Although this is a useful study for ballparking our multicore scalability, it is difficult to extrapolate fair comparisons to our overall system throughput for many reasons. Foremost is that the authors do not select CPUoptimized implementations. They adopt an earlier version of TensorFlow that uses the Eigen 3.2 library (no AVX/FMA support), and otherwise use the default frameworkspecific implementations of convolution rather than linking to optimized packages such as Intel MKL.
We benchmark 2D ConvNet performance against two popular frameworks: TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel’s optimized MKL library. We consider the propagation time of a ImageNet image through three convolution layers to capture any necessary interlayer reshuffling. We choose this simple architecture over a named network because we are not interested in comparing execution times of pooling, fullyconnected or other layers. We also select an obscure kernel size (
) for which there have been no Winogradstyle fast algorithms published, in order to demonstrate the generality of our implementation to arbitrary kernels. Each layer contains a modest 32 channels and 32 kernels for spreading the cost associated with applying transform matrices. Results presented are the fastest across batch sizes of 1, 20 and 200. An important innovation of our approach is that it is batch sizeagnostic, making it suitable for singleimage autoregressive models common in generative modelling and deep reinforcement learning.
Our performance benchmarks are presented in Figure 4. The singlecore throughput of (a) our fast algorithm is 0.89 MVox/s, compared to (b) 0.18 for TensorFlow and (c) 0.19 for Caffe. Increasing cores from 1 to 18, our throughput improves to 10.9 MVox/s compared to 1.77 for TensorFlow and 0.41 for Caffe. This is equivalent to an approximate 5 to 25fold improvement in overall performance. In terms of multicore scalability, this is (a) 68% versus (b) 55% and (c) 12%. We note that our performance here is lower than the presented in Figure 3 for a larger input size (i.e. is much larger, yielding a better computetomemory ratio), and that the scalability for TensorFlow and Caffe are both similar to those reported in (Shi et al., 2016a).
5 Discussion
Motivated by the recent success of 3dimensional ConvNets in video and volumetric image processing (Lee et al., 2015; Tran et al., 2015), we have proposed a transition to CPU hardware to overcome the memory constraints limiting the size and expressivity of these networks. Key to this transition is overcoming the impractical performance gap between existing CPU and GPU implementations. To achieve this, we extended previous algorithms of fast convolution to the dimensional case, yielding an orderofmagnitude reduction in computations for popular networks such as C3D. Importantly, our implementation diverges from previous studies that focus on the handcrafting of minimal Winogradstyle algorithms. We instead exploit the relaxed memory constraints, efficient sparse access and other architectural considerations of CPU hardware to overcome the cost of applying transform matrices.
The obvious alternative to our approach is to overcome memory constraints by splitting large networks across multiple GPU devices. Distributed frameworks such as TensorFlow are valuable for a broad class of machine learning problems, e.g. many of the data mining tasks faced by large organizations where the data itself is often sharded across different machines. However, it is important to recognize that the horizontal scalability paradigm is not a onesizefitsall solution. Consider the increasing demand for realtime CPU solutions to image and video processing, particularly on mobile devices. Moving forward, we expect that intensive ConvNetdriven tasks such as video classification and denoising will continue to migrate from the realm of academic research to practical realization (Shi et al., 2016b). Efficient CPU implementations of ConvNets and other deep learning algorithms will play a fundamental role in this transition.
At the opposite end of the spectrum, some “big data” problems in the image processing domain are, counterintuitively, too big to be solved in a distributed setting. Consider the emerging field of highthroughput connectomics (Meirovitch et al., 2016). Multibeam electron microscopes image crosssectional slices of neural tissue at nanometerresolution, which are then segmented by ConvNets to reconstruct the 3dimensional morphology and interconnectivity of individual neurons (Ronneberger et al., 2015). The major issue here is simply one of scale – a seemingly modest cubic millimeter volume of neural tissue takes several months to image at the TB/hr pace of modern electron microscopes, which exceeds maximum data transfer rates. To avoid introducing communication bottlenecks to the connectomics pipeline, it is necessary that segmentation can execute in realtime on a server physically colocated in the same room as the microscope (Lichtman et al., 2014; Matveev et al., 2017). Sharedmemory CPU systems can support hundreds of cores and terabytes of memory in a single server, and it is critical that systems be implemented to exploit these valuable resources.
Treating 2D ConvNets as a special case of tensor convolution, our implementation yields 5 to 25fold improved throughput compared to previous stateoftheart on CPU. This is an important step toward bridging the performance gap between CPU and GPU hardware and is particularly important in the context of emerging hardware trends, e.g. Intel announcing that future generations of CPUs will contain dedicated deep learning accelerators. More importantly, we believe that removing constraints on 3D ConvNet size will herald new opportunities in the machine learning community; particularly in the context of generative models (Denton et al., 2015; Goodfellow et al., 2014), where rich temporal correlations are currently ignored when learning latent manifolds (Ledig et al., 2016).
Acknowledgements
Support is gratefully acknowledged from the National Science Foundation (NSF) under grants IIS1447786 and CCF1563880, and the Intelligence Advanced Research Projects Activity (IARPA) under grant 1380765093555.
References
 Abadi et al. (2016) Abadi, Martın, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, et al. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 Blumofe et al. (1996) Blumofe, Robert D, Joerg, Christopher F, Kuszmaul, Bradley C, Leiserson, Charles E, Randall, Keith H, and Zhou, Yuli. Cilk: An efficient multithreaded runtime system. Journal of parallel and distributed computing, 37(1):55–69, 1996.
 Chetlur et al. (2014) Chetlur, Sharan, Woolley, Cliff, Vandermersch, Philippe, Cohen, Jonathan, Tran, John, Catanzaro, Bryan, and Shelhamer, Evan. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
 Chintala (2015) Chintala, Soumith. Convnet benchmarks. github.com/soumith/convnetbenchmarks, 2015.
 Cong & Xiao (2014) Cong, Jason and Xiao, Bingjun. Minimizing computation in convolutional neural networks. In International Conference on Artificial Neural Networks, pp. 281–290. Springer, 2014.
 Couprie et al. (2013) Couprie, Camille, Farabet, Clément, Najman, Laurent, and LeCun, Yann. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572, 2013.
 Dean et al. (2012) Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
 Denton et al. (2015) Denton, Emily L, Chintala, Soumith, Fergus, Rob, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pp. 1486–1494, 2015.
 Ding et al. (1996) Ding, Cunsheng, Pei, Dingyi, and Salomaa, Arto. Chinese remainder theorem: applications in computing, coding, cryptography. World Scientific, 1996.
 Dukhan (2016) Dukhan, M. Nnpack. https://github.com/Maratyszcza/NNPACK, 2016.
 Glorot & Bengio (2010) Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pp. 249–256, 2010.
 Goodfellow et al. (2014) Goodfellow, Ian, PougetAbadie, Jean, Mirza, Mehdi, Xu, Bing, WardeFarley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Grewe & Lokhmotov (2011) Grewe, Dominik and Lokhmotov, Anton. Automatically generating and tuning gpu code for sparse matrixvector multiplication from a highlevel representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pp. 12. ACM, 2011.

He et al. (2016)
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian.
Deep residual learning for image recognition.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, June 2016.  Ji et al. (2013) Ji, Shuiwang, Xu, Wei, Yang, Ming, and Yu, Kai. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2013.
 Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. ACM, 2014.
 Karpathy et al. (2014) Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, and FeiFei, Li. Largescale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
 Kolda & Bader (2009) Kolda, Tamara G and Bader, Brett W. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Lavin (2016) Lavin, A. Wincnn. https://github.com/andravin/wincnn, 2016.
 Lavin & Gray (2016) Lavin, Andrew and Gray, Scott. Fast algorithms for convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 LeCun et al. (1989) LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E, Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
 LeCun et al. (2012) LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, KlausRobert. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.
 Ledig et al. (2016) Ledig, Christian, Theis, Lucas, Huszár, Ferenc, Caballero, Jose, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photorealistic single image superresolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
 Lee et al. (2015) Lee, Kisuk, Zlateski, Aleksandar, Ashwin, Vishwanathan, and Seung, H Sebastian. Recursive training of 2d3d convolutional networks for neuronal boundary prediction. In Advances in Neural Information Processing Systems, pp. 3573–3581, 2015.
 Lee et al. (2010) Lee, Victor W, Kim, Changkyu, Chhugani, Jatin, Deisher, Michael, Kim, Daehyun, Nguyen, Anthony D, Satish, Nadathur, Smelyanskiy, Mikhail, Chennupaty, Srinivas, Hammarlund, Per, et al. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. ACM SIGARCH Computer Architecture News, 38(3):451–460, 2010.
 Lichtman et al. (2014) Lichtman, Jeff W, Pfister, Hanspeter, and Shavit, Nir. The big data challenges of connectomics. Nature neuroscience, 17(11):1448–1454, 2014.
 Little (1961) Little, John DC. A proof for the queuing formula: L= w. Operations research, 9(3):383–387, 1961.
 Long et al. (2015) Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
 Mathieu et al. (2013) Mathieu, Michael, Henaff, Mikael, and LeCun, Yann. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
 Matveev et al. (2017) Matveev, Alexander, Meirovitch, Yaron, Saribekyan, Hayk, Jakubiuk, Wiktor, Kaler, Tim, Odor, Gergely, Budden, David, Zlateski, Aleksandar, and Shavit, Nir. A multicore path to connectomicsondemand. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2017.
 Meirovitch et al. (2016) Meirovitch, Yaron, Matveev, Alexander, Saribekyan, Hayk, Budden, David, Rolnick, David, Odor, Gergely, Jones, Seymour KnowlesBarley Thouis Raymond, Pfister, Hanspeter, Lichtman, Jeff William, and Shavit, Nir. A multipass approach to largescale connectomics. arXiv preprint arXiv:1612.02120, 2016.
 Poole et al. (2016) Poole, Ben, Lahiri, Subhaneil, Raghu, Maithra, SohlDickstein, Jascha, and Ganguli, Surya. Exponential expressivity in deep neural networks through transient chaos. arXiv preprint arXiv:1606.05340, 2016.
 Robison (2013) Robison, Arch D. Composable parallel patterns with intel cilk plus. Computing in Science and Engineering, 15(2):66–71, 2013.
 Ronneberger et al. (2015) Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pp. 234–241. Springer, 2015.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Shi et al. (2016a) Shi, Shaohuai, Wang, Qiang, Xu, Pengfei, and Chu, Xiaowen. Benchmarking stateoftheart deep learning software tools. arXiv preprint arXiv:1608.07249, 2016a.

Shi et al. (2016b)
Shi, Wenzhe, Caballero, Jose, Huszár, Ferenc, Totz, Johannes, Aitken,
Andrew P, Bishop, Rob, Rueckert, Daniel, and Wang, Zehan.
Realtime single image and video superresolution using an efficient subpixel convolutional neural network.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, 2016b.  Simonyan & Zisserman (2014a) Simonyan, Karen and Zisserman, Andrew. Twostream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576, 2014a.
 Simonyan & Zisserman (2014b) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014b.
 Szegedy et al. (2015) Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
 Tran et al. (2015) Tran, Du, Bourdev, Lubomir, Fergus, Rob, Torresani, Lorenzo, and Paluri, Manohar. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE, 2015.
 Vanhoucke et al. (2011) Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on CPUs. 2011.
 Vasilache et al. (2014) Vasilache, Nicolas, Johnson, Jeff, Mathieu, Michael, Chintala, Soumith, Piantino, Serkan, and LeCun, Yann. Fast convolutional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
 Winograd (1980) Winograd, Shmuel. Arithmetic complexity of computations, volume 33. Siam, 1980.
 Zlateski et al. (2016) Zlateski, Aleksandar, Lee, Kisuk, and Seung, H Sebastian. Znnimaximizing the inference throughput of 3d convolutional networks on multicore cpus and gpus. arXiv preprint arXiv:1606.05688, 2016.