Hardware faults can be separated into two categories: fail-stop and fail-continue. Fail-stop faults crash the executing process, are thus detectable by the operating system, and can be handled by well-studied checkpoint-and-restart techniques. In contrast, fail-continue faults silently corrupt the results of an executing process without interrupting it. The induced errors are usually called soft errors or silent data corruptions and are the focus of this paper.
Soft errors are much more prevalent than one may realize: even experienced practitioners grossly underestimate their frequency of occurrence. The supercomputer Jaguar, for example, suffered a double-bit memory error once every 24 hours. For another example, a recent research study spanning more than 18 months confirmed that Facebook's large-scale infrastructure experiences silent data corruptions due to device characteristics inside hundreds of Central Processing Units (CPUs). The situation is only worsening: not only does cosmic radiation trigger soft errors, but mundane factors such as temperature and power consumption can also be the culprit. Further exacerbating the situation is the rapid emergence of deep-learning ASIC accelerators, which are prone to higher error rates than general-purpose computing hardware.
Enterprise computing was the first to employ soft error detection, followed by the HPC community [20, 2, 15, 3, 19]; most recently, error detection methods have been explored for convolutional neural networks (CNNs) deployed in autonomous driving. While deep learning recommendation models (DLRMs) may not be critical to personal safety, their computational integrity is crucial to maintaining a good experience for billions of users per day. A deployable soft-error detection method for DLRMs must incur only low performance overhead lest the goal of maintaining user experience be self-defeating, making algorithm-based fault tolerance (ABFT) the prime candidate. However, to the best of our knowledge, there is no previous ABFT work targeting DLRMs, which typically compute in low-precision quantized arithmetic (details in Section III-A).
The two workhorse operators of DLRM are general matrix-matrix multiplication (GEMM) and EmbeddingBag (EB), which together account for over 70% of a DLRM's compute latency. Although ABFT for GEMM has been well studied in the literature, its straightforward adaptation to DLRM results in high overhead due to DLRM's peculiar matrix sizes and shapes and its use of quantized arithmetic. In addition, EB is an operator not present in HPC or even in convolutional neural networks. This paper considers error detection by ABFT for these two operators. We do not focus on error resilience, as that is relatively simple for recommendation systems: once an error is detected, a recommendation score can easily be recomputed, assuming an error striking twice is very rare.
The paper makes the following contributions on efficient soft-error detection for the key building blocks of DLRM in the quantized arithmetic domain:
We propose the first ABFT implementation for quantized GEMM. By carefully customizing ABFT for GEMM, we optimize its performance and analyze its error detection ability;
We propose the first ABFT implementation for EB, which is especially important for recommendation models.
The rest of the paper is organized as follows. Section II reviews related work. Section III explains the low-precision arithmetic used in many industrial machine learning models, including DLRMs, and the two main operators that we focus on. Sections IV and V present our two ABFT algorithms and implementation considerations. Section VI presents our experiments to support our statements on the proposed algorithms' performance and efficacy. Section VII makes some concluding remarks.
II Related Work
Soft error resiliency in deep learning models has been attracting more and more attention in recent years. Redundancy-based protections are the most general and reliable solutions, where redundancy can be introduced at the hardware or software level. Hardware-level redundancy is usually used in safety-critical tasks such as self-driving. Software-level redundancy can be done on the same hardware but with duplicated or triplicated program or instruction execution. Error detection by redundancy incurs at least 100% overhead.
ABFT is a low-overhead error detection method. Though less general than redundancy, it has been shown to be effective on convolutional neural networks (CNNs) [6, 10, 23]. These ABFT works either target convolution specifically or rely on extra-precision intermediate computation. While one can adapt these works to GEMM to some extent, we aim to eliminate the use of extra precision to further reduce overhead and to devise ABFT for GEMM that is DLRM-specific. Furthermore, the EB operator is peculiar to DLRMs and hitherto unexplored.
III Arithmetic and Operators
III-A Quantized Arithmetic
Deep learning intrinsically relies on computing with real numbers. If the representation and computation of these real numbers can use, say, 8-bit integers instead of 32-bit floating-point numbers, significant memory savings and performance boosts can be obtained. Using integer arithmetic to approximate floating-point computation is commonly called quantized arithmetic. One first transforms linearly an interval of interest into the domain of the integer arithmetic in question. For example, for 8-bit unsigned integers: determine floating-point numbers s (scale) and b (bias) so that (x - b)/s lies in [0, 255] for all x in the interval of interest. The resulting value is then rounded to an integer q, hence x ≈ s·q + b. In quantized arithmetic, instead of multiplying two floating-point matrices X and Y of dimensions m-by-k and k-by-n, the matrices are represented by (s_X, b_X, Q_X) and (s_Y, b_Y, Q_Y) and the corresponding matrix product is realized as an integer matrix product:

  X·Y ≈ s_X s_Y Q_X Q_Y + s_X b_Y (Q_X 1_k) 1_n^T + s_Y b_X 1_m (1_k^T Q_Y) + k b_X b_Y 1_m 1_n^T,

where 1_d is the dimension-d vector of all ones. Note all the terms following Q_X Q_Y are rank-1 matrices.
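The decomposition above can be checked numerically. The following NumPy sketch (purely illustrative; the helper `quantize` and its symmetric treatment of scale and bias are our own, not FBGEMM's API) quantizes two small random matrices and reconstructs their product from the integer product plus the rank-1 correction terms:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 6, 5

def quantize(X):
    """Affine-quantize X to 8-bit unsigned: X ~= s * Q + b (elementwise)."""
    lo, hi = X.min(), X.max()
    s = (hi - lo) / 255.0
    b = lo
    Q = np.round((X - b) / s).astype(np.int64)
    return s, b, Q

X = rng.uniform(-1, 1, (m, k))
Y = rng.uniform(-1, 1, (k, n))
sx, bx, Qx = quantize(X)
sy, by, Qy = quantize(Y)

ones_k = np.ones(k)
ones_m = np.ones((m, 1))
ones_n = np.ones((1, n))

# X Y ~= sx sy Qx Qy + sx by (Qx 1_k) 1_n^T + sy bx 1_m (1_k^T Qy) + k bx by 1_m 1_n^T
approx = (sx * sy * (Qx @ Qy)
          + sx * by * (Qx @ ones_k)[:, None] @ ones_n
          + sy * bx * ones_m @ (ones_k @ Qy)[None, :]
          + k * bx * by * ones_m @ ones_n)

# Only rounding error remains; the rank-1 terms are recovered exactly.
assert np.max(np.abs(approx - X @ Y)) < 0.1
```

Note that every term after the integer product Qx @ Qy is an outer product of two vectors, i.e., rank-1, which is what makes the quantized GEMM workflow dominated by the integer matrix product.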
III-B Low-precision GEMM in DLRM
Industrial implementations of DLRMs exploiting quantized arithmetic typically use specialized high-performance libraries such as FBGEMM. As shown in the equation of Section III-A, the dominant operation is the integer matrix product Q_X Q_Y, consisting of O(mkn) operations. This product, a 32-bit integer matrix, together with the other rank-1 matrices and miscellaneous scale factors, is then combined in a requantization process producing the low-precision output Z, where Z is represented by the tuple (s_Z, b_Z, Q_Z). We show the workflow in Figure 1. In the rest of the paper, when we refer to the matrices A, B, and C, they correspond to the integer matrices Q_X, Q_Y, and Q_X Q_Y for notational simplicity.
III-C EmbeddingBag and its low-precision variant
Embedding is a technique that maps discrete categorical data into a D-dimensional Euclidean space of real numbers. It is widely used in many recommendation systems [25, 24, 21]. An embedding table contains a number of D-length row vectors, each corresponding to a categorical value, and algebraic operations on them correspond to combinations of these categories. EmbeddingBag (EB) (https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) is one of the most frequently called operators in these embedding-based recommendation systems. An EB with a batch size of one simply picks out the set of rows given by an index set from an embedding table and sums them up, as illustrated in Figure 2. It is also called one embedding lookup. Mathematically, given an index set I and an embedding table T, EB returns the sum of T_i over all i in I, where T_i is the i-th row of the embedding table T. Note that for notational convenience we use T_i here to denote a row vector instead of the usual convention of a column vector.
Industrial-scale DLRMs often have many embedding tables totaling hundreds of billions of parameters. Hence, instead of using floating point to represent these real numbers, quantized arithmetic is often used to reduce the DLRMs' memory footprints. Specifically, each D-length embedding row at index i is represented by a D-length vector v_i of short (8-bit, for example) integers and one pair of floating-point quantization parameters (s_i, b_i). The corresponding EB operator must then compute the sum over all i in I of s_i·v_i + b_i·1, where 1 is a D-length row vector of all ones.
As an EB with a batch size of one returns one row vector, an EB with batch size N returns N vectors, each corresponding to the sum of the relevantly selected embedding rows from a particular embedding table.
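To make the operator concrete, here is a minimal NumPy sketch of a per-row-quantized EB (names such as `emb_bag`, `V`, `s`, and `b` are ours for illustration and do not correspond to the FBGEMM or PyTorch API):

```python
import numpy as np

rng = np.random.default_rng(1)
R, D = 10, 8                      # table rows, embedding dimension

# Per-row quantized table: 8-bit integer rows plus per-row (scale, bias).
V = rng.integers(0, 256, (R, D), dtype=np.uint8)
s = rng.uniform(0.01, 0.1, R)     # per-row scale s_i
b = rng.uniform(-1.0, 1.0, R)     # per-row bias b_i

def emb_bag(indices):
    """EB with batch size one: sum of dequantized rows s_i*v_i + b_i*1."""
    out = np.zeros(D)
    for i in indices:
        out += s[i] * V[i].astype(np.float64) + b[i]
    return out

batch = [[0, 3, 7], [2, 2, 5, 9]]          # a batch of two index lists
results = [emb_bag(idx) for idx in batch]  # one D-vector per lookup
```

With a batch size of N, the operator is simply applied once per index list, producing N such D-length vectors.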
IV Optimized ABFT for GEMM in DLRMs
The bulk of the computation in a quantized matrix product (Section III-A) is the usual matrix product of two integer matrices of the form C = A·B (dropping the subscripts of Q_X and Q_Y), where A and B are matrices of dimensions m-by-k and k-by-n, respectively. Our aim is to detect soft errors that happen during this computation after both A and B have been loaded into memory.
We start with this common ABFT method: one encodes A into an augmented matrix A_f by appending a row vector equal to the column sums of A. Similarly, B is encoded into an augmented matrix B_f with an extra column vector equal to the row sums of B. Figure 3 illustrates the augmented matrices A_f and B_f and their product C_f = A_f·B_f.
Mathematically, the upper-left m-by-n block of C_f is C = A·B; the first n entries of C_f's last row are the column sums of C; the first m entries of C_f's last column are the row sums of C; and the bottom-right entry of C_f is the sum of all elements of C. Simple algebraic derivations show that a correctly computed C_f satisfies the relationships

  C_f[m+1][j] = C[1][j] + ... + C[m][j] for each column j, (3a)
  C_f[i][n+1] = C[i][1] + ... + C[i][n] for each row i. (3b)
Equality checks of these equations on the computed C_f form the basis of ABFT: if equality fails to hold at exactly one row i for Equation 3b together with exactly one column j for Equation 3a, then the value at the computed C[i][j] is faulty. Furthermore, a corrupted C[i][j] (revealed as a single violation at row i and at column j) can be corrected by subtracting the other entries of column j from the column checksum C_f[m+1][j].
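The variant we use later, encoding only B, can be sketched in a few lines of NumPy (an illustration of the checksum idea, not our optimized FBGEMM-based implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, n = 3, 5, 4
A = rng.integers(0, 256, (m, k)).astype(np.int64)
B = rng.integers(-128, 128, (k, n)).astype(np.int64)

# Encode B with an extra checksum column of row sums: B_f = [B | B·1].
B_f = np.hstack([B, B.sum(axis=1, keepdims=True)])
C_f = A @ B_f                      # one GEMM computes C and its checksums

C, check = C_f[:, :n], C_f[:, n]
assert np.array_equal(C.sum(axis=1), check)   # error-free: Equation 3b holds

C[1, 2] += 5                       # simulate a soft error striking C
bad_rows = np.flatnonzero(C.sum(axis=1) != check)
assert bad_rows.tolist() == [1]    # the corrupted row is flagged
```

Note that because the checksum column is carried through the same GEMM, no separate verification pass over B is needed; only the O(mn) row-sum comparison remains.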
Straightforward as this common ABFT method for GEMM is, adopting it with low enough overhead that it does not impede the DLRM user experience requires a number of techniques that we now discuss.
IV-A Performance optimizations
IV-A1 Encoding only matrix B
Existing work on ABFT for GEMM considers soft error detection and single error correction. We stated previously (Section I) that we aim solely at error detection. Thus, we need to encode only A or B, but not both. The question is which matrix to encode. To better understand this question, we first derive the theoretical error detection overheads of encoding A versus encoding B. Recall that error detection consists of basically three stages: encode matrix A (or B); do GEMM with the encoded A (or B); verify the result matrix by checking Equation 3. The overheads are roughly mk additions for encoding A plus 2kn extra GEMM operations, versus kn additions for encoding B plus 2mk extra GEMM operations; verification costs about mn operations in both cases.
We follow the convention in PyTorch where A corresponds to the activations and B to the weight parameters of the neural network. As is common in DLRMs, m is relatively much smaller than n or k. According to the theoretical overheads above, encoding matrix B will have smaller overhead than encoding A.
Another fact also makes encoding B preferable from the standpoint of both performance and memory error detection ability: B, being the trained weight matrix, stays in memory for a much longer time. From the performance perspective, this fact implies we can encode matrix B once for multiple GEMM operations, thus amortizing the encoding overhead. From the detection perspective, it implies matrix B has a much higher chance of experiencing memory errors than matrix A. (Recall that encoding matrix A will not detect memory errors in B, while encoding matrix B will. In order to cover the errors in B, we choose to encode B.) In conclusion, we encode B instead of A so as to minimize the ABFT performance overhead while maximizing detection ability.
IV-A2 Keeping the encoded column in low precision
The encoded row-sum vector for matrix B seems to require 32-bit integers as value containers to ensure correctness. This implies a high overhead because the ABFT work has to be done in 32-bit integers while the original GEMM work is done in 8-bit integers. Computation with 32-bit integers can be 2 to 4 times slower than with 8-bit integers. To reduce the overhead, we use modulo operations to map the 32-bit row sums into 8 bits. Equations 3 can be shown to still hold under the same modulus p. Using modulo operations in the ABFT context is not novel, but we exploit them for better performance rather than to bypass the limitation of computer word length.
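A small NumPy sketch of the modular check (with p = 127, the modulus chosen in Section IV-C; the variable names are ours):

```python
import numpy as np

p = 127
rng = np.random.default_rng(3)
m, k, n = 3, 5, 4
A = rng.integers(0, 256, (m, k)).astype(np.int64)
B = rng.integers(-128, 128, (k, n)).astype(np.int64)

# Keep the checksum small: store row sums of B modulo p (fits in 8 bits).
b_mod = B.sum(axis=1) % p
C = A @ B
check = A @ b_mod                  # checksum side of the comparison

# Equation 3b still holds modulo p.
assert np.array_equal(C.sum(axis=1) % p, check % p)

C[0, 1] += 3                       # a difference not divisible by 127 is caught
assert not np.array_equal(C.sum(axis=1) % p, check % p)
```

The trade-off, analyzed in Section IV-C, is that a corruption whose induced difference happens to be a multiple of p slips through.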
IV-A3 Keeping BLAS level-3 updating
A straightforward implementation of ABFT for GEMM (encoding only B) would be: (1) calculate the row sums of B and store the result b = B·1 in a separate vector; (2) compute C = A·B; (3) compute c = A·b; (4) check whether the row sums of C equal c. This implementation does not need to modify the normal data structure of B to accommodate an extra column, but it results in high performance overhead. This is because step (3) is a matrix-vector product, a BLAS (Basic Linear Algebra Subprograms) level-2 operation. An alternative implementation that relies on BLAS level-3 operations could be: allocate new memory for the encoded matrix B_f and new memory for C_f; do GEMM between A and B_f and store the result in C_f; check Equation 3; copy the first n columns of C_f back into C. The drawback of this implementation is its high memory overhead.
We found a way to implement ABFT for low-precision GEMMs with BLAS level-3 operations and small memory overhead. This is possible because of two facts: (1) matrix B is packed into blocks before being sent to the efficient GEMM kernel; (2) the matrix C in 32-bit integers is an intermediate result (as shown in Figure 1). The first fact means that we can pack the original B and the separate vector storing its row sums together into blocks, so that the blocks look as if they came from an encoded B_f in contiguous memory. The second fact means that we can directly allocate one more column for the intermediate result matrix than before. Notice we are not increasing the number of columns of the 8-bit result matrix; we just need to modify the requantization procedure to let it exclude the last column of the intermediate 32-bit matrix.
IV-B ABFT detection before requantization
One may ask whether we can delay the checksum equality check from examining C (Equation 3) to examining the requantized output so as to detect silent errors in the requantization process. Unfortunately, the answer is no. The main reason is that requantization is not a linear operation, i.e., generally R(x) + R(y) ≠ R(x + y), where R is the requantization operator. Thus, our linear encoding scheme cannot make the equality hold after requantization. The lack of error detection for the requantization process is not serious considering that this process is less error prone: it only takes around 2% of execution time for larger matrices and around 5% for smaller matrices.
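The nonlinearity is easy to see with a toy requantizer (a simplified stand-in for the real pipeline, which also clips and rescales): rounding does not commute with addition.

```python
def requant(x, scale=10):
    """Toy requantization: scale down and round (real pipelines also clip)."""
    return round(x / scale)

x, y = 14, 14
assert requant(x) + requant(y) == 2        # round(1.4) + round(1.4)
assert requant(x + y) == 3                 # round(2.8)
```

Because the two sides disagree even without any fault, a checksum carried through R would raise false alarms, which is why the check is applied to the 32-bit intermediate C instead.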
IV-C Modulus selection and detection ability analysis
As we use modulo operations to keep the encoded column in low precision to reduce ABFT overhead, the downside is weakened error detection ability. In this section, we discuss how to choose the modulus wisely so that the degradation in detection ability is minimized. We assume the elements of the 8-bit unsigned integer matrix A and the 8-bit signed integer matrix B are independently and uniformly distributed random numbers. We also assume there are no errors in the encoded column, considering its much smaller memory usage and operation count compared to the original computation. First, let us look at the situations in which the modulus, p, will fail to detect errors. For some row of the result matrix C, denote its row sum (excluding the encoded checksum column) by S in the absence of soft errors, and by S' if soft errors corrupt that row. When |S - S'| is divisible by p, the soft error will not be detected; moreover, that is the only case in which the errors are not detected. More formally, soft errors corrupting that row are not detected if and only if S ≡ S' (mod p).
We consider two fault models. The first, commonly used, fault model is the random single-bit-flip model, which means a random bit of the data in memory or a register flips from 0 to 1 or 1 to 0. The intuition for this model is that the induced difference |S - S'| will be a power of 2, so any odd modulus can detect all errors in this model. The second model is random data fluctuation, which means the correct value of the data is changed to some arbitrary value representable in its data type. For example, a 32-bit signed integer can be changed to any value in the range [-2^31, 2^31 - 1] with equal likelihood. The intuition for this model is that the larger the modulus, the fewer multiples it has (i.e., the better the detection ability). Those two intuitions give us a good modulus for matrix B, namely 127, as it is the biggest odd number representable in 8-bit signed integers. In the rest of this section, we use 127 as the modulus to simplify the calculation of detection ability.
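Both intuitions can be verified mechanically (a quick self-check of the modulus choice, not part of the detection pipeline):

```python
# A single bit flip changes an integer by +/- 2^t; an odd modulus such as 127
# never divides 2^t, so every bit-flip-induced difference is detectable.
p = 127
assert all(pow(2, t) % p != 0 for t in range(32))

# Among odd values representable in 8 signed bits, 127 is the largest,
# minimizing the density of undetectable (multiple-of-p) differences.
assert max(v for v in range(-128, 128) if v % 2 == 1) == 127
```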
We quantify the detection ability in terms of probability. Specifically, the detection ability is measured by the probability that our modulus-based error detection method detects the error(s) when the result matrix C is indeed corrupted. This metric is also known as the true positive rate.
IV-C1 Memory error in the 8-bit matrix B
An error in matrix B can propagate to corrupt a whole column of matrix C. Specifically, suppose the corruption happens at B[t][j] and results in a difference of d; then the j-th column of the result matrix C will be corrupted. Since d is multiplied by A[i][t] for every row i, the difference of the corresponding row sum in the result matrix will be A[i][t]·d. Recall that ABFT cannot detect the soft error in row i if A[i][t]·d is a multiple of 127. Notice that 127 is a prime number. By Euclid's lemma (if a prime number p divides a product ab, then p divides a or b), A[i][t]·d is a multiple of 127 if and only if A[i][t] or d is. In the first fault model (a random bit flip at B[t][j]), d is ±2^r with 0 ≤ r ≤ 7, which is never a multiple of 127. A[i][t]·d is then a multiple of 127 if and only if A[i][t] equals 0, 127, or 254, since matrix A is in 8-bit unsigned integers; i.e., the i-th row fails to detect the soft error with probability 3/256, assuming A[i][t] ranges uniformly over [0, 255]. Since all m rows are checked by ABFT, the probability that the error evades all rows is (3/256)^m. Thus, the error is detected with probability 1 - (3/256)^m.
In the second fault model, d can be any nonzero value in the range [-255, 255]. A[i][t]·d is a multiple of 127 if and only if d equals ±127 or ±254, or A[i][t] equals 0, 127, or 254. That is, the i-th row fails to detect the soft error with probability q = 4/510 + 3/256 - (4/510)(3/256), assuming d is uniformly distributed over the 510 nonzero values in [-255, 255]. Similar to the above analysis, the probability that the error evades all rows is q^m. Thus, the error is detected with probability 1 - q^m.
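The 3/256 per-row miss probability in the bit-flip model can be checked exhaustively (an illustrative self-check; the loop variables are ours):

```python
# For a bit flip in B, the row-sum difference is a * d with d = +/- 2^t.
# Count the 8-bit unsigned multipliers a for which a * d ≡ 0 (mod 127).
p = 127
for t in range(8):
    d = 2 ** t                      # d itself is never a multiple of 127
    misses = [a for a in range(256) if (a * d) % p == 0]
    assert misses == [0, 127, 254]  # exactly 3 of 256 values: miss rate 3/256
```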
IV-C2 Memory error in the 32-bit intermediate result matrix C
In the first fault model, a random bit flip in C implies the absolute value of the difference between the corrupted row sum and its expected value is 2^r for some 0 ≤ r ≤ 31. Thus, the error will be detected with probability 100%, since 127 cannot divide any 2^r for 0 ≤ r ≤ 31.
In the second fault model, suppose a random element v in C is changed to another arbitrary value v'. Then the absolute difference between the corrupted row sum and the expected one is |d| = |v' - v|. Since v' lies somewhere in the interval [-2^31, 2^31 - 1], d ranges over an interval of length 2^32. The number of multiples of 127 in any interval of length L is at most L/127 + 1; the key observation is that if 127 divides both x and y, it divides x - y, so consecutive multiples are exactly 127 apart. Thus, out of the roughly 2^32 possible values of d, at most 2^32/127 + 1 are multiples of 127, and the detection probability of an error in this model is at least 1 - 1/127 ≈ 99.2%.
IV-C3 Memory error in matrix A and computational error
As mentioned, matrix B takes much more memory space and resides in memory much longer than matrix A. To keep the ABFT overhead low, we only encode matrix B, which means we do not provide memory error detection for matrix A. A computational soft error corrupts the intermediate result of C = A·B; thus it behaves the same as memory errors in the 32-bit result matrix, which we discussed in Section IV-C2.
The complete look of our customized ABFT for low-precision GEMM is presented in Algorithm 1.
V ABFT for low-precision EmbeddingBag
V-A ABFT for EB
Based on the EB operator introduced in Section III-C, we propose an ABFT technique for EB. To the best of our knowledge, this is the first ABFT technique for the EB operator. Recall that we use D to denote the embedding row dimension. The method is illustrated in Figure 4.
Let S be a column vector storing all the row sums of the embedding table. If we sum the elements of S at the indices of I, it is easy to see that the result will be equal to the sum of all elements in the EB output. Specifically, the following equality holds:

  Σ_{i∈I} S_i = (Σ_{i∈I} T_i)·1^T, (4)

where 1 is the D-length row vector of all ones. ABFT checks whether this equality holds to detect soft errors.
If the batch size is more than one, we just apply the equality check for all EBs in the batch.
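A minimal NumPy sketch of the check, for an integer table and one lookup per batch entry (the names `T`, `S`, and `eb_with_abft` are ours for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
R, D = 10, 8
T = rng.integers(0, 256, (R, D)).astype(np.int64)   # integer embedding table
S = T.sum(axis=1)                                    # precomputed row sums

def eb_with_abft(indices):
    out = T[indices].sum(axis=0)          # the EB result
    assert out.sum() == S[indices].sum()  # checksum equality (Equation 4)
    return out

eb_with_abft([1, 4, 6])                   # error-free lookup passes the check

T[4, 2] ^= 0b100                          # flip a table bit after S was built
try:
    eb_with_abft([1, 4, 6])
    detected = False
except AssertionError:
    detected = True
assert detected                           # corruption of a selected row is caught
```

Because S is computed once, any later corruption of a row that a lookup touches breaks the equality, while the check itself costs only one extra sum over the selected indices plus one sum over the output.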
V-B Adaption to low-precision EB
Recall the low-precision EB variant introduced in Section III-C: each embedding row vector v_i in low-precision integers is multiplied by a scale factor s_i, and a bias value b_i is added. Equation 4 should then be updated to accommodate the scale factors and biases, as shown in Equation 5:

  Σ_{i∈I} (s_i·S_i + D·b_i) = (Σ_{i∈I} (s_i·v_i + b_i·1))·1^T. (5)
The correctness of the above equation follows by direct computation. Recall that 1 is a D-length row vector of all ones, so (s_i·v_i + b_i·1)·1^T = s_i·(v_i·1^T) + D·b_i = s_i·S_i + D·b_i; summing over i ∈ I gives Equation 5.
Notice that instead of storing the scaled and biased row sums in 32-bit floating point, we store the row sums in S as 32-bit integers, without scaling or biasing. This way we minimize the accumulation of round-off errors when we sum up the elements of S. The details of ABFT for low-precision EB are presented in Algorithm 2.
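The low-precision check can be sketched as follows: the exact integer sums in S are scaled only at verification time, and the two sides of Equation 5 are compared under a relative bound (the bound and all names are illustrative; the 1e-5 value matches the choice discussed in Section V-D):

```python
import numpy as np

rng = np.random.default_rng(5)
R, D = 10, 8
V = rng.integers(0, 256, (R, D)).astype(np.int64)   # 8-bit integer rows
s = rng.uniform(0.01, 0.1, R)                        # per-row scales
b = rng.uniform(-1.0, 1.0, R)                        # per-row biases
S = V.sum(axis=1)                                    # exact integer row sums

idx = [0, 3, 7]
out = sum(s[i] * V[i] + b[i] for i in idx)           # low-precision EB output

# Checksum side: apply scale and bias to the exact integer sums at check time.
lhs = out.sum()
rhs = sum(s[i] * S[i] + D * b[i] for i in idx)
assert abs(lhs - rhs) <= 1e-5 * max(abs(lhs), abs(rhs))   # relative bound
```

Keeping S in integers means the two sides differ only by floating-point round-off from the final scaled sums, not from accumulating pre-scaled values.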
V-C Overhead analysis
Denote the number of selected indices by n and the length of the embedding vector by D. Notice that in Algorithm 2, the row sums of the embedding table are pre-computed. This can be done because once the embedding table is trained, it stays unchanged, like the weight matrix B in FC layers. Thus, we do not count the operations calculating the row sums as ABFT overhead. The number of operations in the original EB without ABFT is about nD, and the extra operations for ABFT number about n + D. So the overhead fraction is (n + D)/(nD) = 1/D + 1/n. In terms of memory overhead, the 32-bit row sums take 32/(bD) more memory space, where b is the number of bits (4 or 8) of the low-precision integers in the embedding table.
V-D Round off error bound
Unlike low-precision GEMM, where all calculations involve only integers, EmbeddingBag operators involve floating-point numbers, where round-off error can accumulate. We set a bound to differentiate soft errors from round-off error between the two sides of the checksum equality (as shown in line 5 of Algorithm 2). Setting an appropriate bound is nontrivial because too large a bound lets many soft errors escape detection, while too small a bound means a very high false positive rate. We choose a relative bound of 1E-5 for our EmbeddingBag operators. This is a loose bound, but its detection accuracy is good enough, as we show later. We choose a loose bound because soft errors leading to small fluctuations of floating-point results usually do not have a big impact on the final machine learning inference.
VI Experiments
In this section, we evaluate our proposed ABFT soft error detection for low-precision GEMM and EmbeddingBag. The solutions are evaluated in both the error-free case and the erroneous case. A good soft error detector should have two properties: low performance overhead and low (or no) false positives in the error-free case, and great detection ability (high true positives) in the erroneous case.
VI-A Performance overhead
VI-A1 ABFT for low-precision GEMM
Figure 5 shows the performance overhead of our ABFT for low-precision GEMM with different input matrix shapes in the absence of soft errors. Notice that those shapes are frequently used in DLRM and they are not square. We can see from the figure that the ABFT overheads are under 20% for all 28 shapes; in fact, they are under 10% for many of the shapes (17 out of 28) and under 5% for 7 of them. Notice that for the shape (1, 800, 3200), the ABFT-protected GEMM runs faster than its unprotected version. We believe the reason is that for that specific setting, adding one more column to matrix B improves the cache behavior.
VI-A2 ABFT for low-precision EmbeddingBag
We test the performance overhead of our proposed error detection method (Algorithm 2) using a quantized 8-bit integer embedding table. We flush the cache, since the embedding table is too large to be held in the cache in real-world scenarios. We tested both regular sum and weighted sum with the prefetching optimization turned on and off. The specific parameters we use are listed in Table I; the table columns are also known as embedding dimensions. The average pooling size is the average number of embedding table rows pooled by all EBs in a batch. For example, suppose a batch of two EBs where the first takes 3 rows from the table and the second takes 5; the average pooling size is then 4.
Table I. EB benchmark parameters: table rows, table columns, average pooling size, batch size.
VI-B Experiments with simulated errors
We evaluate the detection accuracy of our proposed detection with errors simulated at the source code level: we randomly select an element in the input or output and flip a random bit in that element.
VI-B1 ABFT for low-precision GEMM
We first inject a random bit flip in the input matrix B after the checksum of B has been calculated, repeating the experiment 100 times for each shape, totaling 2800 samples. Then we do the same random bit-flip injection into the 32-bit intermediate result matrix C for another 2800 samples. The results are shown in Table II. The detection accuracy when matrix B is injected with an error is 95.1% (2663 of 2800 runs). This is 3.72% less than the theoretical estimate in Section IV-C1 but still very high. We achieve 100% detection accuracy when the random bit flip happens in matrix C, consistent with our analysis in Section IV-C2. It is worth noting that we also conducted 2800 error-free runs to validate that our false positive rate is zero, since there is no round-off error in integer operations.
Table II. Detection results for low-precision GEMM (2800 runs per column).

| | error in B | error in C | no error |
| not detected runs | 137 | 0 | 2800 |
VI-B2 ABFT for low-precision EmbeddingBag
We tested the detection accuracy of our proposed solution with an 8-bit integer embedding table. For each run, we randomly choose an element and flip a random bit in it. We repeated 400 runs with injected errors and 400 runs without. Among the 400 runs with errors, 200 are injected with bit flips in the upper 4 significant bits and the other 200 in the lower 4 insignificant bits. The results are shown in Table III. The detection rate for the significant 4 bits is quite high at 99.5%, while the detection rate for the insignificant 4 bits drops to 47%. The false positive rate is 9.5%. As the results show, our bound is chosen to be loose so that we can have a lower false positive rate; the downside is that for an insignificant bit flip, the detection rate is not high.
Table III. Detection results for low-precision EmbeddingBag (200 high-bit, 200 low-bit, and 400 error-free runs).

| | high bits | low bits | no error |
| not detected runs | 1 | 106 | 362 |
VII Conclusion and Future Work
In this paper, we propose efficient algorithm-based soft error detection for two important low-precision operators in deep learning recommendation models, GEMM and EmbeddingBag. This is the first ABFT work to benefit these operators, unlike prior works focusing on convolutional workloads. By careful design and optimization, our proposed soft-error detection achieves greater than 95% error detection ability while introducing small overheads of less than 26%.
Directions we plan to explore in the future include migration to and optimization on GPU platforms, deployment to deep learning supercomputers to discover failure-prone nodes, and exploration of efficient software-level error detection for other operators in DLRMs.
References
[1] (2005) Soft errors in advanced computer systems. IEEE Design & Test of Computers 22 (3), pp. 258–266.
[2] (2018) Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 854–865.
[3] (2013) Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. ACM SIGPLAN Notices 48 (8), pp. 167–176.
[4] (2014) Optimization of multi-level checkpoint model for large scale HPC applications. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1181–1190.
[5] (2021) Silent data corruptions at scale.
[6] (2018) Analyzing and increasing the reliability of convolutional neural networks on GPUs. IEEE Transactions on Reliability 68 (2), pp. 663–677.
[7] (2016) How to kill a supercomputer: dirty power, cosmic rays, and bad solder. IEEE Spectrum 10, pp. 2–3.
[8] (2016) Supercomputing's monster in the closet. IEEE Spectrum 53 (3), pp. 30–35.
[9] (2019) Post-training 4-bit quantization on embedding tables. arXiv preprint arXiv:1911.02079.
[10] (2020) Making convolutions resilient via algorithm-based error detection techniques. arXiv preprint arXiv:2006.04984.
[11] (1984) Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers 100 (6), pp. 518–528.
[12] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. pp. 2704–2713.
[13] (2021) FBGEMM: enabling high-performance low-precision deep learning inference. arXiv preprint arXiv:2101.05615.
[14] (2017) Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12.
[15] Correcting soft errors online in fast Fourier transform. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12.
[16] (2017) Characterizing temperature, power, and soft-error behaviors in data center systems: insights, challenges, and opportunities. In 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 22–31.
[17] (2005) SWIFT: software implemented fault tolerance. In International Symposium on Code Generation and Optimization, pp. 243–254.
[18] (2020) Compute solution for Tesla's full self-driving computer. IEEE Micro 40 (2), pp. 25–35.
[19] (2016) New-Sum: a novel online ABFT scheme for general iterative methods. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 43–55.
[20] (2017) Silent data corruption resilient two-sided matrix factorizations. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 415–427.
[21] (2020) Kraken: memory-efficient continual learning for large-scale real-time recommendations. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 278–294.
[22] (2018) ThunderVolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators. In Proceedings of the 55th Annual Design Automation Conference, pp. 1–6.
[23] (2020) Algorithm-based fault tolerance for convolutional neural networks. arXiv preprint arXiv:2003.12203.
[24] (2020) Distributed hierarchical GPU parameter server for massive scale deep learning ads systems. arXiv preprint arXiv:2003.05622.
[25] (2019) AIBox: CTR prediction model training on a single node. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 319–328.