Similarity search is a key computational primitive found in a wide range of applications, such as computational biology, computer graphics, image and video retrieval [3, 4], image classification, content deduplication [6, 7], machine learning, databases, data mining, and computer vision. Although prior work has shown that feature extraction can be performed efficiently [11], there has been relatively little work focused on accelerating the task that follows: taking the resulting feature vectors and searching the vast corpus of data for similar content. In recent years, the importance and ubiquity of similarity search has increased dramatically with the explosive growth of visual content: users shared over 260 billion images on Facebook in 2010, and uploaded over 300 hours of video on YouTube every minute in 2014. This volume of visual data is only expected to continue growing exponentially, and has motivated search-based graphics and vision techniques such as visual memex, 3D reconstruction, and cross-domain image matching.
Similarity search manifests as a simple algorithm: k-nearest neighbors (kNN). At a high level, kNN is an approximate associative computation which tries to find the most similar content with respect to the query content. At its core, kNN consists of many parallelizable distance calculations and a single global top-k sort, and is often supplemented with indexing techniques to reduce the volume of data that must be processed. While computationally very simple, kNN is notoriously memory intensive on modern CPUs and heterogeneous computing substrates, making it challenging to scale to large datasets. In kNN, distance calculations are cheap and abundantly parallelizable across the dataset, but moving data from memory to the computing device is a significant bottleneck. Moreover, this data is used only once per kNN query and then discarded, since the result of a kNN query is only a small set of identifiers. Batching requests to amortize this data movement has limited benefits, as time-sensitive applications have stringent latency budgets. Indexing techniques such as kd-trees, hierarchical k-means clustering, and locality sensitive hashing are often employed to reduce the search space, but they trade reduced search accuracy for enhanced throughput. Indexing techniques also suffer from the curse of dimensionality; in the context of kNN, this means indexing structures effectively degrade to linear search for increasing accuracy targets.
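As a minimal sketch of the computation described above, the following Python performs one distance calculation per database vector followed by a single global top-k selection. The function names and the toy dataset are illustrative assumptions, not the implementation evaluated in this paper.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_linear(dataset, query, k):
    """Exact kNN by linear scan.

    Every database vector is read once to compute a distance, and only
    the identifiers (list indices) of the k closest vectors are kept,
    nearest first.
    """
    distances = ((euclidean(vec, query), idx) for idx, vec in enumerate(dataset))
    return [idx for _, idx in heapq.nsmallest(k, distances)]


# Toy example with hypothetical 2-dimensional vectors.
dataset = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
print(knn_linear(dataset, [0.0, 0.1], k=2))
```

Note that the scan touches the entire dataset while the output is only k identifiers, which is the data-reduction property that motivates processing near memory.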
Because of its significance, generality, parallelism, underlying simplicity, and small result set, kNN is an ideal candidate for near-data processing. The key insight is that a small accelerator can reduce the traditional bottlenecks of kNN by applying orders of magnitude data reduction near memory, substantially reducing the need for data movement. While there have been many attempts at processing-in-memory (PIM) in the past [22, 23, 24, 25, 26, 27], much of the prior work suffered from DRAM technology limitations. Logic created in DRAM processes was too slow, while DRAM implemented in logic processes suffered from poor retention and high power demands; attempts at hybrid processes result in the worst of both. PIM architectures are more appealing today with the advent of die-stacked memory technology, which enables the co-existence of an efficient DRAM layer and an efficient logic layer.
We propose Similarity Search Associative Memory (SSAM), which integrates a programmable accelerator into a die-stacked memory module. Semantically, an SSAM takes a query as input and returns the top-k closest neighbors stored in memory as output. We evaluate the performance and energy efficiency gains of SSAM by implementing, synthesizing, and simulating the design down to layout. We then compare SSAM against current CPUs, GPUs, and FPGAs, and show that it can achieve better area-normalized throughput and energy efficiency.
Our paper makes the following contributions:
A characterization of state-of-the-art k-nearest neighbors including both application-level and architectural opportunities that justify acceleration.
An application-driven codesign of a near memory vector processor-based accelerator architecture with hardware support for similarity search on top of Hybrid Memory Cube (HMC).
Instruction extensions to leverage hardware units to accelerate similarity search.
The rest of the paper is organized as follows. Section 2 introduces and characterizes the kNN algorithm. Section 3 describes the SSAM architecture and the hardware/software interface. Section 4 outlines evaluation methodology, and Section 5 presents evaluation results. Section 6 discusses the impact of these results on different application areas and design points. Finally, Section 7 discusses related work.
2 Characterization of kNN
We now introduce and characterize the kNN algorithm pipeline and indexing techniques, and highlight the application-level and architectural opportunities for acceleration.
2.1 Case study: content-based search
A typical kNN software application pipeline for content-based search (Figure 1) has five stages: feature extraction, feature indexing, query generation, k-nearest neighbors search, and reverse lookup. In feature extraction (Figure 1a), the raw multimedia corpus is converted into an intermediary feature vector representation. Feature vectors may represent pixel trajectories in a video, word embeddings of a document, or shapes in an image [30, 6, 31], and are extracted using feature descriptors or convolutional neural networks [31, 32, 33, 34, 35, 36]. While feature extraction is an important component of this pipeline, it only needs to be performed once for the dataset and can be done offline; a significant portion of work has also shown feature extraction can be achieved efficiently [37, 38, 39, 40, 11]. In indexing (Figure 1b), feature vectors from feature extraction are organized into data structures (discussed in Section 2.3). At query time, these data structures are used to quickly prune the search space; intuitively, these data structures should be able to reduce the search time from linear to logarithmic in the size of the data. Indexing, like feature extraction, can be performed offline and away from the critical path of the query.
While feature extraction and indexing can be performed offline, the query generation stage (Figure 1c) of the search pipeline occurs online. In query generation, a user uploads a multimedia file (image, video, etc.) and requests similar content back. The query runs through the same feature extractor used to create the database before being passed to the search phase. Once a query is generated, the k-nearest neighbors stage (Figure 1d) attempts to search for the most similar content in the database. The kNN algorithm consists of many highly parallelizable distance calculations and a global top-k sort; indexing structures may also be employed to prune the search space but trade accuracy for performance. The similarity metric employed by the distance calculation often depends on the application, but common distance metrics include Euclidean distance, Hamming distance [41, 42, 43, 44, 45, 46, 47, 48] [49], and learned distance metrics. The final step in the pipeline is reverse lookup, where the resulting nearest neighbors are mapped to their original database content. The resulting media is then returned to the user.
2.2 Typical workload parameters
Prior work shows the feature dimensionality for descriptors such as Speeded Up Robust Feature (SURF), word embeddings, Scale Invariant Feature Transform (SIFT), GIST descriptors, AlexNet, and ResNet ranges from 64 to 4096 dimensions. For higher dimensional feature vectors, it is common to apply techniques such as principal component analysis to reduce feature dimensionality to tractable lengths. The number of nearest neighbors for an array of search applications has been shown to range from 1 (nearest neighbor) up to 20 [31, 6, 54, 55, 13]. Each kNN algorithm variant also has a number of additional parameters such as indexing technique, distance function, bucket size, index-specific hyperparameters, and hardware specific optimizations.
To simplify the characterization, we limit our initial evaluation to Euclidean distance and three real world datasets: the Global Vectors for Word Representations (GloVe) dataset, the GIST dataset, and the AlexNet dataset. The GloVe dataset consists of 1.2 million word embeddings extracted from Twitter tweets and the GIST dataset consists of 1 million GIST feature vectors extracted from images. We also constructed an AlexNet dataset by taking 1 million images from the Flickr dataset and applying AlexNet to extract the feature vectors. We separate each dataset into a “training” set used to build the search index, and a “test” set used as the queries when measuring application accuracy. Exact dataset parameters used for our characterization and evaluation are shown in Table 1.
2.3 Approximate kNN algorithms tradeoffs
We now characterize three canonical indexing techniques employed by approximate kNN algorithms: kd-trees, hierarchical k-means, and multi-probe locality sensitive hashing (MPLSH). Indexing techniques employ hierarchical data structures which are traversed at query time to prune the search space. In kd-trees, the index is constructed by randomly cutting the dataset along the top-N vector dimensions with the highest variance. The resulting index is a tree data structure where each leaf in the tree contains a bucket of similar vectors; the size of each bucket depends on the height limit imposed on the tree. Queries which traverse the index and end up in the same bucket should be similar; multiple trees with different cut orders are often searched in parallel. Multiple leaves in the tree can be visited to improve the quality of the search; to do this, the traversal employs backtracking to check additional “close by” buckets in a depth first search-like fashion. A user-specified bound typically limits the number of additional buckets visited when backtracking.
Similarly, in hierarchical k-means the dataset is partitioned recursively based on k-means cluster assignments to form a tree data structure. Like kd-tree indices, the height of the tree is restricted, and each leaf in the tree holds a bucket of similar vectors which are searched when a query reaches that bucket; backtracking is also used to expand the search space and search “close by” buckets.
Finally, MPLSH constructs a set of hash tables where each hash location is associated with a bucket of similar vectors. In MPLSH, hash functions are designed to intentionally cause hash collisions to map similar vectors to the same bucket. To improve accuracy, MPLSH applies small perturbations to the hash result to create additional probes into the same hash table to search “close by” hash partitions. In our evaluation, we use hyperplane MPLSH (HP-MPLSH), which cuts the space with random hyperplanes, and set the number of hash bits or hyperplane cuts to 20.
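To make the hashing and probing steps concrete, here is a simplified single-table sketch in Python. All names are hypothetical, and the probing order (the query's own bucket, then buckets whose hash differs in one bit) is a simplifying assumption; it is not the perturbation sequence used by any particular library.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random hyperplane normal per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(planes, vec):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0.0 else 0)
    return code


def build_table(planes, dataset):
    """Map each hash code to the bucket of vector ids that collide on it."""
    table = defaultdict(list)
    for idx, vec in enumerate(dataset):
        table[hash_vector(planes, vec)].append(idx)
    return table


def probe_candidates(planes, table, query, num_probes):
    """Gather ids from the query's bucket, then from single-bit-flip buckets."""
    base = hash_vector(planes, query)
    codes = [base] + [base ^ (1 << bit) for bit in range(len(planes))]
    candidates = []
    for code in codes[:num_probes]:
        candidates.extend(table.get(code, []))
    return candidates


# Hypothetical toy data: 4 hash bits over 3-dimensional vectors.
planes = make_hyperplanes(num_bits=4, dim=3)
data = [[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0], [1.0, 2.0, 3.1]]
table = build_table(planes, data)
print(probe_candidates(planes, table, data[0], num_probes=3))
```

The candidate ids returned by probing would then be scanned linearly with the true distance function; increasing `num_probes` widens that scan and trades throughput for accuracy.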
Each of these approximate kNN algorithms trades accuracy for enhanced throughput. In kNN, accuracy is defined as $|S_E \cap S_A| / |S_E|$, where $S_E$ is the true set of neighbors returned by exact floating point based linear kNN search, and $S_A$ is the set of neighbors returned by approximate kNN. In general, searching more of the dataset improves search accuracy for indexing techniques. To quantify the accuracy of indexing structures, we benchmark the accuracy and throughput of indexing techniques for the GloVe, GIST, and AlexNet datasets. We use the Fast Library for Approximate Nearest Neighbors (FLANN) to benchmark kd-trees and hierarchical k-means, and Fast Lookups for Cosine and Other Nearest Neighbors Library (FALCONN) to benchmark HP-MPLSH. For kd-trees and hierarchical k-means we vary the number of leaf nodes or buckets in the tree that backtracking will check, while for HP-MPLSH we increase the number of probes used per hash table. Each of these modifications effectively increases the fraction of the dataset searched per query and lowers overall throughput.
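A short sketch of the accuracy measurement, assuming accuracy is the fraction of the true neighbors (from exact linear search) that also appear in the approximate result. The function name and the example identifiers are hypothetical.

```python
def knn_accuracy(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors recovered by approximate search."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical neighbor ids for one query with k = 4.
exact = [1, 2, 3, 4]
approx = [2, 4, 9, 7]
print(knn_accuracy(exact, approx))  # 2 of the 4 true neighbors were found
```

Averaging this value over all test queries gives a single accuracy figure for one setting of the backtracking or probe parameters.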
The resulting throughput versus accuracy curves are shown in Figure 2 for single threaded implementations. In general, our results show indexing techniques can provide up to 170× throughput improvement over linear search while still maintaining at least 50% search accuracy, but only up to 13× when 90% accuracy is required. Past 95-99% accuracy, we find that indexing techniques effectively degrade to linear search (blue solid line). More importantly, our results show there is also a significant opportunity for accelerating approximate kNN techniques. Hardware acceleration of approximate kNN search can either increase throughput at iso-accuracy by simply speeding up the computation, or increase search accuracy at iso-latency by searching larger volumes of data.
2.4 Alternative numerical representations and distance metrics
We now briefly discuss the space of numerical representations and distance metrics used in kNN search.
Fixed-Point Representations: Fixed-point arithmetic is much cheaper to implement in hardware than floating point units. To evaluate whether floating point is necessary for kNN, we converted each dataset to a 32-bit fixed-point representation and repeated the throughput versus accuracy experiments. Overall, we find there is negligible accuracy loss between 32-bit floating-point and 32-bit fixed-point data representations.
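The conversion can be sketched as follows. The Q16.16 split (16 integer bits, 16 fractional bits) is an assumption made for illustration; the text above only specifies a 32-bit fixed-point format. Python integers do not overflow, so the accumulator width a hardware implementation would need is not modeled here.

```python
FRAC_BITS = 16  # assumed Q16.16 layout, chosen only for illustration


def to_fixed(x):
    """Quantize a float to a signed 32-bit fixed-point integer, saturating."""
    scaled = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << 31), min((1 << 31) - 1, scaled))


def squared_distance_fixed(a, b):
    """Squared Euclidean distance using integer arithmetic only.

    The result carries 2 * FRAC_BITS fractional bits.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fixed_distance_to_float(d):
    """Convert a squared fixed-point distance back to a float for comparison."""
    return d / float(1 << (2 * FRAC_BITS))


# Hypothetical vectors whose components are exactly representable in Q16.16.
a = [0.5, -1.25]
b = [0.25, 2.0]
fixed = squared_distance_fixed([to_fixed(x) for x in a], [to_fixed(x) for x in b])
print(fixed_distance_to_float(fixed))
```

Because kNN only needs the relative ordering of distances, small quantization errors matter only when two candidates are nearly equidistant from the query.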
Binarization: Binarization techniques trade accuracy for higher throughput. Precision is lost by binarizing floating point values, but throughput increases since the dataset is smaller; binarization also enables Hamming distance calculations, which are cheaper to implement in hardware. In practice, carefully constructed Hamming codes have been shown to achieve excellent results.
Alternative Distance Metrics: While the canonical distance metric for kNN is the Euclidean norm, there still exist a wide variety of alternative distance metrics. Such alternative metrics include Manhattan distance, cosine similarity, Chi squared distance, Jaccard similarity, and learned distance metrics.
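For reference, three of the alternative metrics named above can be written in a few lines of Python using their standard textbook definitions. The function names are illustrative; Jaccard similarity is shown on sets of item identifiers, which is an assumption about how the inputs are encoded.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


print(manhattan([1, 2], [4, 6]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(jaccard_similarity([1, 2, 3], [2, 3, 4]))
```

Manhattan distance has the same per-dimension structure as Euclidean distance (one subtract and one accumulate), whereas cosine similarity needs a dot product plus two norms and a division.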
2.5 Architectural Characterization
We now examine the instruction profile for linear, kd-tree, k-means, and MPLSH based algorithms. Recall that linear search performance is still valuable since higher accuracy targets reduce to linear search; in addition, approximate algorithms still use linear search to scan buckets of vectors at the end of their traversals. As expected, the instruction profile shows that vector operations and extensions are important for kNN workloads due to the many vector-parallel distance calculations. In addition, the high percentage of memory reads confirms that the computation has high data movement demands. Approximate kNN techniques like kd-trees and MPLSH exhibit less skew towards vectorized instructions, but they still exhibit similar memory intensive behavior and show that vectorization is valuable.
3 SSAM Architecture
Based on the characterization results in Section 2, it is clear that similarity search algorithms (1) are an ideal match for vectorized processing units, and (2) can benefit from higher memory bandwidth to better support their data intensive execution phases. We now present our application-guided SSAM module and accelerator architecture, which exploits near-data processing and specialized vector compute units to address these bottlenecks.
3.1 System integration and software interface
SSAM integrates into a typical system as a memory module, similar to existing DRAM, as shown in Figure 3. A host processor interfaces with an SSAM module much as it interacts with a DRAM memory module. The host processor is connected to each SSAM module over a communication bus; additional communication links are used if multiple SSAM-enabled modules are instantiated. Since HMC modules can be composed together, these additional links and SSAM modules allow us to scale up the capacity of the system. A typical system may also have multiple host processors (not shown) as the number of SSAM modules that the system must maintain increases.
To abstract the lower level details of SSAMs away from the programmer, we assume a driver stack exposes a minimal memory allocation API which manages user interaction with SSAM-enabled memory regions. An SSAM-enabled memory region is defined as a special part of the memory space which is physically backed by an SSAM instead of a standard DRAM module. A sample programming interface of how one would use SSAM-enabled memory regions is shown in Figure 4. SSAM-enabled memory regions would be tracked and stored in a free list similar to how standard memory allocation is implemented in modern systems. Allocated SSAM memory regions come with a set of special operations that allow the user to set the indexing mode, in addition to handling standard memory manipulation operations like memcpy. Similar to the CUDA programming model, we use analogous memory and execution operations to operate SSAM-enabled memory. Pages with data subject to SSAM queries are pinned (not subject to swapping by the OS).
3.2 SSAM architecture and hybrid memory cube
The SSAM architecture is built on top of a Hybrid Memory Cube 2.0 (HMC) memory substrate to capitalize on enhanced memory bandwidth. The HMC is a die-stacked memory architecture composed of multiple DRAM layers and a compute layer. The DRAM layers are vertically partitioned into a number of vaults (Figure 5a). Each vault is accessed via a vault controller which resides on a top-level compute layer. In HMC 2.0, the module is partitioned into a maximum of 32 vaults (only 16 are shown), where each vault controller operates at 10 GB/s, yielding an aggregate internal memory bandwidth of 320 GB/s. The HMC architecture also includes four external data links (240 GB/s external bandwidth) which send and receive information to the host processor or other HMC modules. These external data links allow one or more HMC modules to be composed to effectively form a larger network of SSAMs if data exceeds the capacity of a single SSAM module.
Our SSAM architecture leverages the existing HMC substrate and introduces a number of SSAM accelerators to handle the kNN search. These SSAM accelerators are instantiated on the compute layer next to existing vault controllers as shown in Figure 5b. SSAM accelerators are further decomposed into processing units (Figure 5d). To fully harness the available memory bandwidth, we measure the peak bandwidth needs of each processing unit across all indexing techniques and replicate processing units accordingly. For kNN, we expect to achieve near optimal memory bandwidth since almost all data accesses to memory are large contiguous blocks such as bucket scans and data structures, which are contiguously allocated in memory. Our modifications are made orthogonal to the functionality of the HMC control logic so that the HMC module can still operate as a standard memory module (i.e. acceleration logic can be bypassed). Our processing units do not implement a full cache hierarchy since there is little data reuse outside of the query vector and indexing data structure per query. Unlike GPU cores, processing units are not restricted to operating in lockstep, and multiple different indexing kernels can coexist on each SSAM module. Finally, we do not expect external data links to become a bottleneck, as a vast majority of the data movement occurs within SSAM modules themselves. As a result, we only expect the communication network between the host processors and SSAM units to carry kNN results, which are a fraction of the original dataset size, and configuration data.
3.3 Processing unit architecture
Each processing unit consists of a fully integrated scalar and vector processing unit, similar to prior designs, but augmented with several instructions and hardware units to better support kNN. Fully-integrated vector processing units are naturally well-suited for accelerating kNN distance calculations because they are (1) able to exploit the abundant data parallelism in kNN and (2) well-suited for streaming computations. Using vector processing units also introduces flexibility in the types of distance calculations that can be executed. The scalar unit is better suited for executing index traversals, which are sequential in nature, and provides flexibility in the types of indexing techniques that can be employed. Vector units, on the other hand, are better suited for high throughput data parallel distance calculations in kNN. We use a single instruction stream to drive both the scalar and vector processing units, since at any given time a processing unit will only be performing either distance calculations or index traversals in kNN. For our evaluation, we perform a design sweep over several different vector lengths: 2, 4, 8, and 16. We find that 32 scalar registers and 8 vector registers are sufficient to support our kNN workloads. Finally, we use forwarding paths between pipeline stages to implement chaining of vector operations.
We also integrate several hardware units that are useful for accelerating similarity search. First, we introduce a priority queue unit, implemented using a shift register architecture from prior work, which is used to perform the sort and global top-k calculations. For our SSAM design, priority queues are 16 entries deep. We opt to provide a hardware priority queue instead of a software one since the overhead of a priority queue insert becomes non-trivial for shorter vectors. Because of its modular design, the priority queues can be chained to support higher values of k; likewise, priority queues in the chain can also be disabled if they are not needed. Second, we introduce a small hardware stack unit instantiated on the scalar datapath to aid kNN index traversals. The stack unit is a natural choice to facilitate backtracking when traversing hierarchical index structures. Finally, we integrate a 32 KB scratchpad to hold frequently accessed data structures, such as the query vector and indexing structures. We find that a modestly sized scratchpad memory is sufficient for kNN since the only heavily reused data are the query vectors and indices (data vectors are scanned and immediately discarded).
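The behavior of a fixed-depth, always-sorted queue can be modeled in software as below. This is a behavioral sketch only: the class and method names are hypothetical, and the assumption that the queue retains the entries with the smallest values (the closest neighbors) is ours.

```python
class PriorityQueueUnit:
    """Software model of a fixed-depth, sorted top-k queue."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = []  # (value, id) pairs, ascending by value

    def insert(self, ident, value):
        """Insert an (id, value) pair, keeping only the `depth` smallest values."""
        # Find the slot for the new entry. Entries behind it shift back by
        # one position, as in a shift register, and the last one falls off.
        pos = 0
        while pos < len(self.entries) and self.entries[pos][0] <= value:
            pos += 1
        if pos < self.depth:
            self.entries.insert(pos, (value, ident))
            del self.entries[self.depth:]

    def load(self, position):
        """Read the (id, value) pair stored at a queue position."""
        value, ident = self.entries[position]
        return ident, value

    def reset(self):
        """Clear the queue before the next query."""
        self.entries.clear()


# Hypothetical usage with a depth-2 queue.
queue = PriorityQueueUnit(depth=2)
for ident, dist in [(10, 5.0), (11, 1.0), (12, 3.0), (13, 9.0)]:
    queue.insert(ident, dist)
print(queue.load(0), queue.load(1))
```

In this software model each insert costs a scan and a shift, which illustrates why the relative overhead of a software queue grows as the per-vector distance computation gets shorter.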
Unlike conventional scalar-vector architectures, we introduce several new instructions to exercise new hardware units for similarity search. First, we introduce priority queue insert (PQUEUE_INSERT), load (PQUEUE_LOAD), and reset (PQUEUE_RESET) instructions which are used to manipulate the hardware priority queue. The PQUEUE_INSERT instruction takes two registers and inserts them into the hardware priority queue as an (id, value) tuple. The PQUEUE_LOAD instruction reads either the id or value of a tuple in the priority queue at a designated queue position, while the PQUEUE_RESET clears the priority queue. We also introduce a scalar and vector 32-bit fused xor-population count instruction (SFXP and VFXP) which is similar to a fused multiply add instruction. The FXP instruction is useful for cheaply implementing Hamming distance calculations and assumes that each 32-bit word is 32 dimensions of a binary vector. The FXP instruction is also cheap to implement in hardware since the XOR only adds one additional layer of logic to the population count hardware. Finally, we introduce a data prefetch instruction MEM_FETCH since the linear scans through buckets of vectors exhibit predictable contiguous memory access patterns.
| Instruction type | Instructions |
|---|---|
| Arithmetic (S/V) | ADD, SUB, MULT, POPCOUNT, ADDI, SUBI, MULTI |
| Bitwise/Shift (S/V) | OR, AND, NOT, XOR, ANDI, ORI, XORI, SR, SL, SRA |
| Control (S) | BNE, BGT, BLT, BE, J |
| Stack Unit (S) | POP, PUSH |
| Register Move/Memory Instructions (S/V) | SVMOVE, VSMOVE, MEM_FETCH, LOAD, LOAD, STORE |
| New SSAM Instructions | (S)PQUEUE_INSERT, (S)PQUEUE_LOAD, (S)PQUEUE_RESET, (S/V)FXP |
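The effect of the fused xor-population count operation on binary vectors can be emulated in Python as follows. This is an illustrative software model, not the hardware implementation; the function names and the bit-packing order (first dimension in the least significant bit) are assumptions.

```python
MASK32 = 0xFFFFFFFF


def fxp(a_word, b_word):
    """Fused XOR + population count on one 32-bit word."""
    return bin((a_word ^ b_word) & MASK32).count("1")


def pack_bits(bits):
    """Pack a list of 0/1 values into 32-bit words, 32 dimensions per word."""
    words = []
    for start in range(0, len(bits), 32):
        word = 0
        for offset, bit in enumerate(bits[start:start + 32]):
            word |= (bit & 1) << offset
        words.append(word)
    return words


def hamming_distance(a_words, b_words):
    """Hamming distance between two packed binary vectors."""
    return sum(fxp(a, b) for a, b in zip(a_words, b_words))


# Hypothetical 4-dimensional binary vectors that differ in two positions.
a = pack_bits([1, 0, 1, 1])
b = pack_bits([1, 1, 0, 1])
print(hamming_distance(a, b))
```

Each call to `fxp` processes 32 dimensions at once, which is why binarized datasets need far fewer operations and far less memory traffic per vector than 32-bit-per-dimension representations.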
3.4 SSAM configuration
We assume that the host processor driver stack is able to communicate with each SSAM to initialize and bring up SSAM devices using a special address region dedicated to operating SSAMs. Execution binaries are written to instruction memories on each processing unit and can be recompiled to support different distance metrics, indexing techniques, and kNN parameters. In addition, any indexing data structures are written to the scratchpad memory or larger DRAM prior to executing any queries on SSAMs. Large data structures such as hash function weights in MPLSH or centroids in k-means are stored in SSAM memory since they are larger and experience limited reuse. If hierarchical indexing structures such as kd-trees or hierarchical k-means do not fit in the scratchpad, they are partitioned such that the top half of the hierarchy resides in scratchpad, and the bottom halves are dynamically loaded to the scratchpad from DRAM as needed during execution. A small portion of the scratchpad is also allocated for holding the query vector; this region is continuously rewritten as an SSAM services queries. If a kNN query must touch multiple vaults, the host processor broadcasts the search across SSAM processing units and performs the final set of global top-k reductions on the host processor. Finally, if SSAM capabilities are not needed, the host processor can disable the SSAM accelerator logic so that it operates simply as a standard memory.
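The final host-side reduction can be sketched as a merge of per-unit partial results. The function name and the (distance, id) tuple layout are assumptions for illustration. The merge is exact because every member of the global top-k must also be in the top-k of the partition that holds it.

```python
import heapq


def merge_topk(partial_results, k):
    """Reduce per-unit top-k lists of (distance, id) into a global top-k."""
    all_pairs = (pair for part in partial_results for pair in part)
    return heapq.nsmallest(k, all_pairs)


# Hypothetical partial results returned by two processing units.
unit0 = [(0.1, 17), (0.9, 4)]
unit1 = [(0.2, 230), (0.3, 311)]
print(merge_topk([unit0, unit1], k=3))
```

The host only handles k entries per unit, so the traffic on the external links stays small regardless of how much data each unit scanned.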
4 Evaluation Methodology
We now outline our evaluation methodology used to compare and contrast SSAMs with competing CPUs, GPUs, and FPGAs shown in Table 4. To provide fair energy efficiency and performance measurements, we normalize each platform to a 28 nm technology process.
SSAM: To evaluate SSAM, we implemented our design in Verilog, then synthesized and placed-and-routed it with the Synopsys Design Compiler and IC Compiler using a TSMC 65 nm standard cell library; SRAM memories were generated using the ARM Memory Compiler. We also built an assembler and simulator to generate program binaries, benchmark assembly programs, and validate the correctness of our design. To measure throughput, we use post-placement and route frequency estimates and simulate the time it takes to process each of the workloads in Table 1. Each benchmark is handwritten using our instruction set defined in Table 3. For power and energy efficiency estimates, we generate traces from real datasets to measure realistic activity factors. We then use the PrimeTime power analyzer to estimate power and multiply by the simulated run time to obtain energy efficiency estimates. Finally, we report the area estimates provided in the post-placement and route reports normalized to a 28 nm technology.
Xeon E5-2620 CPU: We evaluate a six core Xeon E5-2620 as our CPU baseline. We benchmark wall-clock time using the implementations of kNN provided by the FLANN library for linear, kd-tree, and k-means based search, and the FALCONN library for hyperplane MPLSH. For power and energy efficiency measurements, we use an external power meter to measure dynamic compute power. Dynamic compute power is computed by taking the difference between the load and idle power when running each benchmark. Energy consumption is then calculated as the product of the run time and dynamic power, from which we derive energy efficiency. Estimates of the CPU die size are taken from the literature.
Titan X GPU: For our GPU comparison, we use an NVIDIA Titan X GPU with a well-optimized, off-the-shelf implementation provided by Garcia et al. We again record wall-clock time, and measure idle and load power using a power meter to obtain run time and energy efficiency. We estimate the die size of the Titan X from the literature.
Kintex-7 FPGA: We measure the performance and energy efficiency of our implementation on a Xilinx Kintex-7 FPGA using Vivado 2014.5. We use post-placement and route frequency estimates and simulated run times to estimate the throughput of kNN on the FPGA fabric. For power measurements, we use the Vivado Power Analyzer tool, and we refer to the literature for device area estimates.
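The power-meter based calculation used for the CPU and GPU baselines can be sketched as follows. The numeric inputs are hypothetical, and the queries-per-joule figure is one plausible way to express energy efficiency, not necessarily the metric plotted in our figures.

```python
def dynamic_power_watts(load_watts, idle_watts):
    """Dynamic compute power: load power minus idle power."""
    return load_watts - idle_watts


def energy_joules(run_time_s, load_watts, idle_watts):
    """Energy consumed by the benchmark: run time times dynamic power."""
    return run_time_s * dynamic_power_watts(load_watts, idle_watts)


def queries_per_joule(num_queries, run_time_s, load_watts, idle_watts):
    """Assumed efficiency metric: queries served per joule of dynamic energy."""
    return num_queries / energy_joules(run_time_s, load_watts, idle_watts)


# Hypothetical measurement: 1000 queries in 2 s at 150 W load and 100 W idle.
print(energy_joules(2.0, 150.0, 100.0))
print(queries_per_joule(1000, 2.0, 150.0, 100.0))
```

Subtracting idle power isolates the energy attributable to the kNN computation from the baseline draw of the rest of the system.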
5 Evaluation Results
We now present throughput, power, and energy efficiency measurements of SSAMs relative to competing heterogeneous computing platforms. For brevity, we first evaluate Euclidean distance kNN then separately evaluate different indexing techniques and distance metrics.
5.1 Accelerator power and area
Our post-placement and route power and area results are shown in Figures 5(a) and 5(b) respectively for different processing unit vector lengths and different submodules in the design. Area and power measurements are normalized to 28 nm technology using linear scaling factors. In terms of area, a large portion of the accelerator design is devoted to the SRAMs composing the scratchpad memory. However, relative to the CPU or GPU, the SSAM acceleration logic is still significantly smaller. Compared to the Xeon E5-2620, an SSAM is 6.23-15.62× smaller, while compared to the Titan X an SSAM is 9.84-24.66× smaller. For comparison, the die size reported for HMC 1.0 in a 90 nm process is 729 mm²; normalized to a 28 nm process, the die size would be 70.6 mm², which is roughly the same or larger than our SSAM accelerator design. (Die sizes for HMC 2.1 are not publicly available.) In terms of power, an SSAM uses no more than a typical memory module, which makes it compatible with the power consumption of die-stacked memories. Prior work by Puttaswamy et al. shows temperature increases from integrating logic on die-stacked memory are not fatal to the design even for a general purpose core. Since SSAM consumes less power than general purpose cores, we do not expect thermal issues to be fatal.
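The normalized HMC die size quoted above is consistent with scaling area by the square of the feature-size ratio. This is a reading of the two reported numbers rather than a formula stated in the text, so the scaling rule should be treated as an assumption:

```latex
A_{28\,\mathrm{nm}} \approx A_{90\,\mathrm{nm}} \left(\frac{28}{90}\right)^{2}
  = 729\ \mathrm{mm}^2 \times 0.0968
  \approx 70.6\ \mathrm{mm}^2
```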
5.2 Throughput and energy efficiency
We now report area-normalized throughput and energy efficiency gains across each platform for exact linear search, which is agnostic to dataset composition and index traversal overheads. This quantifies the gains attributed to different heterogeneous computing technologies. Figures 6(a) and 6(b) show the area-normalized throughput and energy efficiency of an SSAM against competing heterogeneous solutions. The FPGA and SSAM designs are suffixed by the design vector length; for instance, SSAM-4 refers to an SSAM design with processing units that have vector length 4. We observe that SSAM achieves area-normalized throughput improvements of up to 426×, and energy efficiency gains of up to 934×, over multi-threaded Xeon E5-2620 CPU results. We also observe that GPUs and the FPGA implementation of the SSAM acceleration logic exhibit comparable throughput and energy efficiency. The FPGA in some cases underperforms the GPU since it effectively implements a soft vector core instead of a fixed-function unit; we expect that a fixed-function FPGA core would fare better.
In terms of the enhanced bandwidth, we attribute roughly one order of magnitude run time improvement to the higher internal bandwidth of HMC 2.0. Optimistically, standard DRAM modules provide up to 25 GB/s of memory bandwidth whereas HMC 2.0 provides 320 GB/s. For similarity search, the difference in available bandwidth directly translates to raw performance. The remaining gains in energy efficiency and performance can be attributed mostly to accelerator specialization. To quantify the impact of the priority queue, we simulate the performance of SSAM using a software priority queue instead of leveraging the hardware queue. At a high level, the hardware queue improves performance by up to 9.2% for wider vector processing units.
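A back-of-the-envelope estimate shows why the bandwidth gap alone accounts for roughly one order of magnitude. The sketch assumes a purely bandwidth-bound linear scan; the dataset shape (1 million vectors, 128 dimensions, 4 bytes per dimension) is a hypothetical example, while the 25 GB/s and 320 GB/s figures come from the text.

```python
def scan_time_seconds(num_vectors, dims, bytes_per_dim, bandwidth_gb_per_s):
    """Time to stream the whole dataset once at a given memory bandwidth."""
    total_bytes = num_vectors * dims * bytes_per_dim
    return total_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical dataset shape; bandwidths are the figures quoted above.
dram_time = scan_time_seconds(1_000_000, 128, 4, 25.0)
hmc_time = scan_time_seconds(1_000_000, 128, 4, 320.0)

# The ratio is 320 / 25 = 12.8, up to floating-point rounding, and does
# not depend on the dataset shape.
print(dram_time, hmc_time, dram_time / hmc_time)
```

Any speedup beyond this bandwidth ratio has to come from the accelerator itself, which is the specialization component discussed above.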
5.3 Approximate kNN search
We now evaluate the impact of approximate indexing structures and specialization on throughput and energy efficiency. Figure 8 compares the throughput versus accuracy curves for a SSAM and Xeon E5-2620 CPU for each dataset. In general, at a 50% accuracy target we observe up to two orders of magnitude throughput improvement for kd-tree, k-means, and HP-MPLSH over CPU baselines. The kd-tree and k-means indexing structures are still dominated by distance calculations and benefit greatly from augmented bandwidth when sequentially scanning through buckets for neighbors. HP-MPLSH on the other hand is composed of a combination of many hash function calculations and bucket traversals; we find that for the parameter sets used in our characterization, the performance of HP-MPLSH is dominated mostly by hashing rate. However, the parameters for HP-MPLSH can be adjusted to reduce the dependence on hash performance by reducing the number of hash bits; this would increase the number of vectors hashed to the same bucket and shift the performance bottleneck from hashing performance back to linear bucket scans.
5.4 Alternative distance metrics
We now briefly quantify the performance of SSAM for three additional distance metrics: Hamming distance, cosine similarity, and Manhattan distance. Unsurprisingly, binarizing data vectors and using Hamming distance provides good throughput improvement (up to 9.38×), since less data must be loaded to process a vector and Hamming distances computed with the FXP instruction on SSAMs are cheap. Manhattan and Euclidean distances cost the same since they require roughly the same number of operations. Meanwhile, cosine similarity, defined as $\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, is about twice as expensive as Euclidean distance because of the additional numerator and divisor terms. Fixed-point division for cosine similarity is performed in software using shifts and subtracts; however, the software division is still much cheaper than the rest of the distance calculation.
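For reference, the four metrics compared above can be written as the short sketch below. It uses Python floats and arbitrary-precision integers for readability; the fixed-point arithmetic and the FXP instruction used on SSAM are not modeled, and the function names are hypothetical.

```python
import math


def euclidean_sq(a, b):
    """Squared Euclidean distance: one subtract and one multiply-add per dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """Manhattan (L1) distance: one subtract and one absolute-value add per dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a_bits, b_bits):
    """Hamming distance between two bit vectors packed into integers (XOR, then popcount)."""
    return bin(a_bits ^ b_bits).count("1")


def cosine_similarity(a, b):
    """Cosine similarity: a dot product plus two norms and a division."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The operation counts in the docstrings mirror the cost argument in the text: Euclidean and Manhattan distance each need one accumulation per dimension, while cosine similarity needs three accumulations (the dot product and two norms) before the final division.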
We now briefly evaluate the generality of SSAMs for other workloads, and with respect to content addressable memories, then compare SSAMs to alternative near-data processing technologies.
6.1 Index construction and other applications
The SSAM is not limited to approximate kNN search and can also perform other data intensive operations such as index construction or data intensive applications. In kNN, the overhead of building indexing structures is amortized away by the number of queries executed; however, index construction is still three orders of magnitude slower than single query execution. SSAMs can be reprogrammed to also perform these data intensive tasks; index construction also benefits from near-data processing since techniques like k-means and kd-tree construction require multiple scans over the entire dataset. For instance, to build a hierarchical k-means index we execute k-means by treating cluster centroids as the dataset, and streaming the dataset in as kNN queries to determine the closest centroid. While a host processor must still handle the short serialized phases of k-means, SSAMs are able to accelerate the data intensive scans in the k-means kernel by performing the computation near memory. Similarly for kd-tree index construction SSAMs can be used to quickly scan the dataset and compute the variance across all dimensions; the host processor can then assign bifurcation points and generate the tree. In both cases, the host processor must provide some control and high level orchestration but the bulk of each index construction kernel can be offloaded to exploit the benefits of high memory bandwidth.
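The k-means formulation described above, in which centroids play the role of the dataset and data points are streamed in as nearest-neighbor queries, can be sketched as follows. This is a plain software illustration under assumed names (`nearest_centroid`, `kmeans_step`), not the SSAM kernel; the comments mark which part corresponds to the data intensive scan and which to the short serialized phase handled by the host.

```python
def nearest_centroid(point, centroids):
    """1-NN query in which the 'dataset' is the list of centroids."""
    best_idx, best_d = -1, float("inf")
    for idx, c in enumerate(centroids):
        d = sum((x - y) ** 2 for x, y in zip(point, c))
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx


def kmeans_step(dataset, centroids):
    """One k-means iteration: assign every point, then update centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)

    # Data intensive phase: one full scan over the dataset.
    for point in dataset:
        c = nearest_centroid(point, centroids)
        counts[c] += 1
        for j, x in enumerate(point):
            sums[c][j] += x

    # Short serialized phase: recompute each centroid as a mean.
    new_centroids = []
    for c, old in enumerate(centroids):
        if counts[c] == 0:
            new_centroids.append(list(old))  # keep empty clusters unchanged
        else:
            new_centroids.append([s / counts[c] for s in sums[c]])
    return new_centroids
```

Applying `kmeans_step` recursively within each resulting cluster would yield a hierarchical index; that recursion is omitted here for brevity.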
SSAMs can also be used to accelerate other data intensive applications that benefit from vectorization and enhanced memory bandwidth. Applications such as support vector machines, k-means, neural networks, and frequent itemset mining can all be implemented on SSAM. In particular, the vectorized FXP instruction is useful for evaluating classes of applications which rely on many Hamming distance calculations, such as binary neural networks and binary hash functions.
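As an illustration of why Hamming distance is the core operation in binary neural networks, the sketch below uses the standard identity that the dot product of two length-n vectors with entries in {-1, +1} equals n minus twice their Hamming distance. The packing convention (bit 1 encodes +1, bit 0 encodes -1) and the function names are assumptions made for this example; the FXP instruction itself is not modeled.

```python
def hamming(a_bits, b_bits):
    """Hamming distance between two bit vectors packed into integers."""
    return bin(a_bits ^ b_bits).count("1")


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors packed as n-bit integers.

    Matching bits contribute +1 and differing bits contribute -1, so the
    result is (n - h) - h = n - 2 * h, where h is the Hamming distance.
    """
    return n - 2 * hamming(a_bits, b_bits)
```

For example, with n = 4, the packed vectors 0b1011 and 0b0010 differ in two positions, so their dot product is 4 - 2 * 2 = 0.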
6.2 SSAM as a high density generalized content addressable memory
Semantically, SSAMs are a generalization of a content addressable memory (CAM) and ternary content addressable memory (TCAM); more importantly, SSAMs are a semantically more powerful associative computing substrate. To use an SSAM as a CAM, we simply fix $k = 1$ and check if a neighbor has a distance of zero to the query. To use an SSAM as a TCAM, we use a modified Hamming distance which adjusts for ternary matches. To do this, a ternary bit mask $T$ is supplied to an SSAM with the query vector $Q$. The ternary bit mask holds a 0 at positions in the query vector where a ternary match should occur and 1 otherwise. The query vector $Q$ and input data vector $D$ are XOR’ed and the result is AND’ed with the ternary bit mask; this effectively masks off mismatches at the ternary positions. We then check if the distance between $D$ and $Q$ is zero to determine if there was a ternary match. The resulting modified Hamming distance can be expressed as $\mathrm{POPCOUNT}((D \oplus Q) \,\&\, T)$.
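The masked Hamming distance can be sketched in a few lines. Here D is a stored data vector, Q the query, and T the ternary mask, all packed into Python integers; the helper names `ternary_distance`, `tcam_lookup`, and `cam_lookup` are hypothetical and the linear scan stands in for the hardware search.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def ternary_distance(d, q, t):
    """Modified Hamming distance POPCOUNT((D xor Q) & T).

    Bits of T that are 0 mark don't-care positions and are ignored.
    """
    return popcount((d ^ q) & t)


def tcam_lookup(memory, q, t):
    """Return the indices of all stored words that ternary-match the query."""
    return [i for i, d in enumerate(memory) if ternary_distance(d, q, t) == 0]


def cam_lookup(memory, q, width):
    """Exact-match (CAM) lookup: a TCAM lookup with no don't-care bits."""
    return tcam_lookup(memory, q, (1 << width) - 1)
```

For instance, with stored words 0b1010, 0b1000, and 0b0110, the query 0b1010 with mask 0b1101 (bit 1 is a don't-care) matches the first two words, while an exact lookup of 0b1010 matches only the first.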
Unlike CAMs and TCAMs, SSAMs more generally support approximate and arbitrary width matches. Approximate matches are more powerful since they are able to return similar content as opposed to only exact content matches. SSAMs can also return multiple matches, while CAMs and TCAMs are typically designed to return only one memory location.
Architecturally, SSAMs realize a different design point from traditional CAMs or TCAMs since the compute units used to perform the matching are separate from the underlying memory cell implementation. Binary CAMs and TCAMs are organized to simultaneously search all the data in parallel, while SSAMs create the illusion of an associative memory but rely on high internal bandwidth to quickly scan data and process a query. SSAMs also benefit from the much higher bit density, higher capacity, and reduced operating power of DRAMs while maintaining flexible associative computing capabilities. Table 6 provides a comparison of effective bit density between TCAM, DRAM, and SSAM for a 16 GB capacity memory. In terms of area, SSAMs effectively achieve the same bit density as DRAM because the additional acceleration logic is small relative to DRAM macro sizes. In contrast, TCAMs are more than 19× less dense than standard DRAMs and the SSAM design presented in this paper. Finally, in terms of power, a 20 Mb TCAM macro consumes 10.6 W, which is already significantly more than the SSAM acceleration logic and most DRAM modules.
6.3 Alternative near-data processing substrates
Near-data processing manifests in many different shapes and forms; in this section, we briefly contrast our approach against alternative near-data processing architectures.
Micron Automata Processor (AP): The AP is a near-data processing architecture specialized for high speed non-deterministic finite automata (NFA) evaluation. Unlike SSAM, the AP cannot efficiently implement arithmetic operations and is limited to distance metrics like Hamming distance or Jaccard similarity. At a high level, the automata design is composed of multiple parallel NFAs where each NFA encodes a distance calculation with a single dataset vector (details of the automata design are beyond the scope of this paper). A query vector is streamed into the AP and compared against all NFAs in parallel and sorted. To support execution of different NFAs, the AP can be reconfigured much like reconfiguration on an FPGA. We briefly evaluate the AP by designing and compiling a design for each dataset, and use the results to build an analytical model to estimate performance for a current generation AP device. Table 7 shows the AP’s performance and energy efficiency compared to SSAM. At a high level, we find that the large datasets presented in this paper do not fit on one AP board configuration, and as a result the AP is bottlenecked by the high reconfiguration overheads compared to SSAM.
Compute in Resistive RAM (ReRAM): Computation in ReRAM is an emerging near-data processing technique that can perform limited compute operations by directly exploiting the resistive properties of ReRAM memory cells. This allows augmented computational capabilities beyond what is available to DRAM-based near-data processing techniques such as SSAM. Most notably, Chi et al. have shown how in-situ ReRAM computation can be used to accelerate convolutional neural networks without moving data out of the ReRAM cells themselves. As the technology matures, it would not be unprecedented to replace DRAM with ReRAM and its augmented computing capabilities.
In-Storage Processing: There has also been renewed interest in instantiating computation near disk or SSD. Recent work such as Intelligent SSD [80, 81] and Biscuit has proposed adding computation near mass storage devices and shown promising improvements for applications like databases. However, compared to SSAM, in-storage processing architectures target a different bandwidth to storage capacity design point. Unlike SSAM, SSD-based near-data processing handles terabytes of data at lower bandwidth, which is less ideal for latency-critical applications like similarity search.
Die-Stacked HMC Architectures (This Paper): Instantiating an accelerator adjacent to HMC is not a new proposal [83, 84, 85]; prior work has shown that such an architectural abstraction is useful for accelerating graph processing and neural networks. This architecture has several advantages over in-situ ReRAM computation and the automata processor. First, by abstracting the computation away from the memory substrate, the types of computation supported are decoupled from the restrictions of underlying memory implementations. Second, by separating the computation from actual memory cells, this architectural abstraction achieves much higher compute and memory density; this is unlike substrates like the AP where compute and memory are both instantiated in the same memory process.
7 Related Work
The concept of near-data processing has been studied in the literature for decades. More interestingly, the concept of integrating similarity search accelerators with memory also has an established history, indicating ongoing interest.
CAMs: As far back as the 1980s and 1990s, near-memory accelerators were proposed to improve the performance of nearest neighbor search using CAMs . Kohonen et al.  proposed using a combination of CAMs and hashing techniques to perform nearest neighbor search. Around the same time, Kanerva et al.  propose sparse distributed memory (SDM) and a “Best Match Machine” to implement nearest neighbor search. The ideas behind SDM were later employed by Roberts in PCAM  which is, to the best of our knowledge, the first fabricated CAM-based accelerator capable of performing nearest neighbor search on its own.
Algorithms that exploit TCAMs to perform content addressable search, such as ternary locality sensitive hashing (TLSH) and binary-reflected Gray code, do exist. However, TCAMs suffer from lower memory density, higher power consumption, and smaller capacity than emerging memory technologies. While prior work shows promising increases in performance, energy efficiency, and capacity, TCAM cells are less dense than DRAM cells. For the massive scale datasets in kNN workloads, the density disparity translates to an order of magnitude in cost. Despite these limitations, there is still active work on TCAMs for data-intensive applications to accelerate associative computations.
Multiprocessor and Vector PIMs: In the late 1990s, Patterson et al. proposed IRAM, which introduced processing units integrated with DRAM. In particular, Gebis et al. and Kozyrakis et al. proposed VIRAM, which used a vector processor architecture with embedded DRAM. Similar to our work, the intention of VIRAM was to capitalize on the higher bandwidth and reduce energy consumption by co-locating general MIPS cores and vector register and compute units near DRAM. Unlike VIRAM, SSAM does not implement a full cache hierarchy, targets a different class of algorithms, and uses a 3D die-stacked solution.
Kogge et al. propose the EXECUBE architecture, which integrates general purpose cores with DRAM macros. Elliott et al. propose C-RAM, which adds SIMD processing units adjacent to the sense amplifiers capable of bit serial operations. Active Pages and FlexRAM envisioned a programmable processing element near each DRAM macro block which could be programmed for kNN acceleration. However, none of these prior efforts directly addresses the kNN search problems we discuss.
More recently, Active Memory Cube (AMC)  proposes a similar vector processing unit and cache-less system on top of HMC. While both SSAM and AMC arrive at the same architecture conclusion - that vector PIM on die-stacked DRAM is useful - our work provides a more application-centric approach which allows us to codesign architectural features such as the priority queue.
Application-Driven PIM: Application-justified PIM design is not a new idea. Deering et al. propose FBRAM, a “new form” of memory optimized to improve random access writes to accelerate z-buffering for 3D graphics algorithms. Lipman and Yang propose a DRAM-based architecture called smart access memory (SAM) for nearest-neighbor search targeting DB applications. Their design tightly integrates a k-nearest neighbor accelerator engine and microarchitecturally shares common elements with our design. Agrawal et al. exploit accelerators to reduce the total cost of ownership of high-dimensional similarity search. Yu et al. optimize all-pairs nearest neighbors by fusing neighbor selection with distance computations. Finally, Tandon et al. propose an all-pairs similarity accelerator for NLP; however, their work integrates their accelerator with the last level cache instead of memory.
The emergence and maturity of die-stacked memory and alternative memory substrates has also enabled a wide variety of PIM accelerator proposals [103, 83, 84, 104, 105, 106, 107, 108, 85]. Chi et al. , Kim et al. , and Gao et al.  all propose PIM solutions for accelerating neural networks. Ahn et al.  propose PIM on top of HMC for graph processing, and Hsieh et al.  and Zhang et al.  propose PIM-based GPU architectures. Imani et al.  propose MPIM for linear kNN search; however, their architecture uses a resistive RAM-based approach and is limited to bitwise operations. Furthermore, MPIM does not consider modern approximate kNN indexing algorithms nor does it evaluate the quality versus accuracy tradeoffs that these algorithms make.
We presented SSAM, an application-driven near-data processing architecture for similarity search. We showed that by moving computation closer to memory, SSAM is able to address the data movement challenges of similarity search and exploit application codesign opportunities to accelerate similarity search. While we used HMC as our memory backend, the high-level accelerator design and insights still generalize to alternative memory technologies and in-memory processing architectures. The PIM proposal presented in this paper is also relevant to other data intensive workloads where data movement is becoming an increasingly fundamental challenge in improving system efficiency.
-  A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, “Gpu-fs-k nn: A software tool for fast and scalable k nn computation using gpus,” vol. 7, p. e44000, Public Library of Science, 2012.
-  T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan, “Photon mapping on programmable graphics hardware,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 41–50, Eurographics Association, 2003.
-  J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ICCV ’03, (Washington, DC, USA), pp. 1470–, IEEE Computer Society, 2003.
-  H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, pp. 1704–1716, Sept. 2012.
-  O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8, IEEE, 2008.
-  M. Kusner, Y. Sun, N. Kolkin, and K. Q. Weinberger, “From Word Embeddings To Document Distances,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (D. Blei and F. Bach, eds.), pp. 957–966, JMLR Workshop and Conference Proceedings, 2015.
-  L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein, “The design of a similarity based deduplication system,” in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR ’09, (New York, NY, USA), pp. 6:1–6:14, ACM, 2009.
-  R. Agrawal, C. Faloutsos, and A. N. Swami, “Efficient similarity search in sequence databases,” in Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, FODO ’93, (London, UK, UK), pp. 69–84, Springer-Verlag, 1993.
-  X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, pp. 1–37, Dec. 2007.
-  A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, pp. 1958–1970, Nov. 2008.
-  Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “Dadiannao: A machine-learning supercomputer,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–622, Dec 2014.
-  D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a needle in haystack: Facebook’s photo storage,” in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, (Berkeley, CA, USA), pp. 47–60, USENIX Association, 2010.
-  YouTube, “Statistics - YouTube,” 2014.
-  “Rebooting the it revolution: A call to action,” 2015.
-  T. Malisiewicz and A. Efros, “Beyond categories: The visual memex model for reasoning about object relationships,” in Advances in neural information processing systems, pp. 1222–1230, 2009.
-  S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski, “Building rome in a day,” Commun. ACM, vol. 54, pp. 105–112, Oct. 2011.
-  A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, “Data-driven visual similarity for cross-domain image matching,” ACM Transaction of Graphics (TOG) (Proceedings of ACM SIGGRAPH ASIA), vol. 30, no. 6, 2011.
-  C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptor matching,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8, IEEE, 2008.
-  M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in International Conference on Computer Vision Theory and Application VISSAPP’09), pp. 331–340, INSTICC Press, 2009.
-  M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG ’04, (New York, NY, USA), pp. 253–262, ACM, 2004.
-  P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, (New York, NY, USA), pp. 604–613, ACM, 1998.
-  W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” SIGARCH Comput. Archit. News, vol. 23, pp. 20–24, Mar. 1995.
-  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent ram,” IEEE Micro, vol. 17, pp. 34–44, Mar. 1997.
-  Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, “Flexram: toward an advanced intelligent memory system,” in Computer Design, 1999. (ICCD ’99) International Conference on, pp. 192–201, 1999.
-  J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca, “The architecture of the diva processing-in-memory chip,” in Proceedings of the 16th International Conference on Supercomputing, ICS ’02, (New York, NY, USA), pp. 14–25, ACM, 2002.
-  J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama, “Impulse: building a smarter memory controller,” in High-Performance Computer Architecture, 1999. Proceedings. Fifth International Symposium On, pp. 70–79, Jan 1999.
-  M. Oskin, F. T. Chong, and T. Sherwood, “Active pages: A computation model for intelligent memory,” SIGARCH Comput. Archit. News, vol. 26, pp. 192–203, Apr. 1998.
-  IBM, “Blue Logic SA-27E ASIC,” 1999. http://www.ic72.com/pdffile/s/381279.pdf.
-  J. T. Pawlowski, “Hybrid memory cube (hmc),” Hot Chips, vol. 23, 2011.
-  Heng Wang and Klaser, A. and Schmid, C. and Cheng-Lin Liu, “Action recognition by dense trajectories,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 3169–3176, June 2011.
-  D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int. J. Comput. Vision, vol. 60, pp. 91–110, Nov. 2004.
-  H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008. Similarity Matching in Computer Vision and Multimedia.
-  M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of GIST Descriptors for Web-scale Image Search,” in Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR ’09, (New York, NY, USA), pp. 19:1–19:8, ACM, 2009.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25 (F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition (CVPR), 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
-  J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, A. Khurana, R. Dreslinski, T. Mudge, V. Petrucci, L. Tang, and J. Mars, “Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers,” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ASPLOS ’15, (New York, NY, USA), ACM, 2015.
-  Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shifting vision processing closer to the sensor,” in Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 92–104, ACM, 2015.
-  J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, R. Dreslinski, T. Mudge, J. Mars, and L. Tang, “Djinn and tonic: Dnn as a service and its implications for future warehouse scale computers,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), ISCA ’15, (New York, NY, USA), ACM, 2015.
-  T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” SIGARCH Comput. Archit. News, vol. 42, pp. 269–284, Feb. 2014.
-  A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, (San Francisco, CA, USA), pp. 518–529, Morgan Kaufmann Publishers Inc., 1999.
-  M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, eds.), pp. 1509–1517, Curran Associates, Inc., 2009.
-  M. M. B. C. Strecha, A. M. Bronstein and P. Fua, “LDAHash: Improved Matching with Smaller Descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, 2012.
-  A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in In Proceedings of the IEEE Conf on Computer Vision and Pattern Recognition, 2008.
-  J. Wang, S. Kumar, and S.-F. Chang, “Semi-Supervised Hashing for Large-Scale Search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, pp. 2393–2406, Dec. 2012.
-  Y. Weiss, A. Torralba, and R. Fergus, “Spectral Hashing,” in Advances in Neural Information Processing Systems 21 (D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds.), pp. 1753–1760, Curran Associates, Inc., 2009.
-  K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, (Washington, DC, USA), pp. 2938–2945, IEEE Computer Society, 2013.
-  M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS’12, (USA), pp. 1061–1069, Curran Associates Inc., 2012.
-  J. Weston, S. Chopra, and K. Adams, “#tagspace: Semantic embeddings from hashtags,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1822–1827, 2014.
-  E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02, (Cambridge, MA, USA), pp. 521–528, MIT Press, 2002.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1532–1543, 2014.
-  Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  I. Jolliffe, “Principal component analysis,” 2014.
-  A. Yershova and S. LaValle, “Improving Motion-Planning Algorithms by Efficient Nearest-Neighbor Searching,” Robotics, IEEE Transactions on, vol. 23, pp. 151–157, Feb 2007.
-  D. Kelly and L. Azzopardi, “How many results per page?: A study of serp size, search behavior and user experience,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, (New York, NY, USA), pp. 183–192, ACM, 2015.
-  H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, pp. 117–128, Jan. 2011.
-  Yahoo!, “Webscope | Yahoo Labs,” 2014.
-  C. Silpa-Anan and R. I. Hartley, “Optimised kd-trees for fast image descriptor matching.,” in CVPR, IEEE Computer Society, 2008.
-  Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe lsh: Efficient indexing for high-dimensional similarity search,” in Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pp. 950–961, VLDB Endowment, 2007.
-  A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal lsh for angular distance,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1225–1233, Curran Associates, Inc., 2015.
-  Y. Gong and S. Lazebnik, “Iterative Quantization: A Procrustean Approach to Learning Binary Codes,” in CVPR ’11.
-  J. Wang, S. Kumar, and S.-F. Chang, “Semi-Supervised Hashing for Large-Scale Search,” in IEEE Trans. Pattern Anal. Mach. Intell. , pp. 2393–2406.
-  A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in In Proceedings of the IEEE Conf on Computer Vision and Pattern Recognition .
-  K. Lin, J. Lu, C.-S. Chen, and J. Zhou, “Learning compact binary descriptors with unsupervised deep neural networks,” in CVPR, 2016.
-  C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, (New York, NY, USA), pp. 190–200, ACM, 2005.
-  Hybrid Memory Cube Consortium, “Hybrid Memory Cube Specification 2.1,” 2014.
-  R. M. Russell, “The cray-1 computer system,” Commun. ACM, vol. 21, pp. 63–72, Jan. 1978.
-  S.-W. Moon, K. G. Shin, and J. Rexford, “Scalable hardware priority queue architectures for high-speed packet switches,” IEEE Trans. Comput., vol. 49, pp. 1215–1227, Nov. 2000.
-  R. S. Anand Lal Shimpi, “The intel ivy bridge (core i7 3770k) review.” Accessed: 2016-11-10.
-  V. Garcia, E. Debreuve, and M. Barlaud, “Fast k nearest neighbor search using GPU,” in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pp. 1–6, June 2008.
-  “Nvidia geforce gtx titan x.” Accessed: 2016-11-10.
-  U. Technologies, “Logic detailed structural analysis of the xilinx kintex-7 28nm fpga (es),” 2011. Accessed: 2016-11-10.
-  K. Puttaswamy and G. H. Loh, “Thermal analysis of a 3d die-stacked high-performance microprocessor,” in Proceedings of the 16th ACM Great Lakes Symposium on VLSI, GLSVLSI ’06, (New York, NY, USA), pp. 19–24, ACM, 2006.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” CoRR, vol. abs/1603.05279, 2016.
-  K. Nii, N. Watanabe, M. Yamawaki, K. Yoshinaga, M. Wada, and I. Hayashi, “A 28nm 400mhz 4-parallel 1.6gsearch/s 80mb ternary cam,” IEEE, 2014.
-  S. Romanovsky, A. Katoch, A. Achyuthan, C. O’Connell, S. Natarajan, C. Huang, C.-Y. Wu, M.-J. Wang, C. J. Wang, P. Chen, and R. Hsieh, “A 500mhz random-access embedded 1mb dram macro in bulk cmos.,” in ISSCC, pp. 270–271, IEEE, 2008.
-  P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, “An efficient and scalable semiconductor architecture for parallel automata processing,” Parallel and Distributed Systems, IEEE Transactions on, vol. 25, no. 12, pp. 3088–3098, 2014.
-  X. Guo, E. Ipek, and T. Soyata, “Resistive computation: Avoiding the power wall with low-leakage, stt-mram based computing,” SIGARCH Comput. Archit. News, vol. 38, pp. 371–382, June 2010.
-  P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture , pp. 27–39.
-  D.-H. Bae, J.-H. Kim, S.-W. Kim, H. Oh, and C. Park, “Intelligent ssd: a turbo for big data mining,” in Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, CIKM ’13, (New York, NY, USA), pp. 1573–1576, ACM, 2013.
-  S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, “Active disk meets flash: A case for intelligent ssds,” in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, (New York, NY, USA), pp. 91–102, ACM, 2013.
-  B. Gu, A. S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang, “Biscuit: A framework for near-data processing of big data workloads,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 153–165, IEEE Press, 2016.
-  J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 105–117, ACM, 2015.
-  R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C. Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. H. Moreno, J. K. O’Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. D. M. Siegl, K. Sugavanam, and Z. Sura, “Active memory cube: A processing-in-memory architecture for exascale systems,” IBM Journal of Research and Development, vol. 59, pp. 17:1–17:14, March 2015.
-  D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 380–392, IEEE Press, 2016.
-  J. D. Roberts, “Proximity content-addressable memory: An efficient extension to k-nearest neighbors search (M.S. thesis),” tech. rep., Santa Cruz, CA, USA, 1990.
-  T. Kohonen, Self-organization and Associative Memory: 3rd Edition. New York, NY, USA: Springer-Verlag New York, Inc., 1989.
-  P. Kanerva, Sparse Distributed Memory. Cambridge, MA, USA: MIT Press, 1988.
-  R. Shinde, A. Goel, P. Gupta, and D. Dutta, “Similarity search and locality sensitive hashing using ternary content addressable memories,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, (New York, NY, USA), pp. 375–386, ACM, 2010.
-  A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Hel-Or, “Ultra-fast similarity search using ternary content addressable memory,” in Proceedings of the 11th International Workshop on Data Management on New Hardware, DaMoN’15, (New York, NY, USA), pp. 12:1–12:10, ACM, 2015.
-  Q. Guo, X. Guo, Y. Bai, and E. İpek, “A resistive tcam accelerator for data-intensive computing,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 339–350, ACM, 2011.
-  Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “Ac-dimm: Associative computing with stt-mram,” SIGARCH Comput. Archit. News, vol. 41, pp. 189–200, June 2013.
-  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent ram,” IEEE Micro, vol. 17, pp. 34–44, Mar. 1997.
-  J. Gebis, S. Williams, D. Patterson, and C. Kozyrakis, “Viram1: a media-oriented vector processor with embedded dram,” 2004.
-  C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, and D. Patterson, “Vector iram: A media-oriented vector processor with embedded dram,” 2000.
-  P. M. Kogge, T. Sunaga, H. Miyataka, K. Kitamura, and E. Retter, “Combined DRAM and logic chip for massively parallel systems,” in 16th Conference on Advanced Research in VLSI (ARVLSI ’95), March 27-29, 1995, Chapel Hill, North Carolina, USA, pp. 4–16, 1995.
-  D. Elliott, M. Stumm, W. M. Snelgrove, C. Cojocaru, and R. McKenzie, “Computational ram: Implementing processors in memory,” IEEE Des. Test, vol. 16, pp. 32–41, Jan. 1999.
-  M. F. Deering, S. A. Schlapp, and M. G. Lavelle, “Fbram: A new form of memory optimized for 3d graphics,” in Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’94, (New York, NY, USA), pp. 167–174, ACM, 1994.
-  A. Lipman and W. Yang, “The Smart Access Memory: An Intelligent RAM for Nearest Neighbor Database Searching,” in ISCA Workshop on Mixing Logic and DRAM, 1997.
-  S. R. Agrawal, C. M. Dee, and A. R. Lebeck, “Exploiting Accelerators for Efficient High Dimensional Similarity Search,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’16, (New York, NY, USA), pp. 3:1–3:12, ACM, 2016.
-  C. D. Yu, J. Huang, W. Austin, B. Xiao, and G. Biros, “Performance Optimization for the K-nearest Neighbors Kernel on x86 Architectures,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, (New York, NY, USA), pp. 7:1–7:12, ACM, 2015.
-  P. Tandon, J. Chang, R. G. Dreslinski, V. Qazvinian, P. Ranganathan, and T. F. Wenisch, “Hardware acceleration for similarity measurement in natural language processing,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ISLPED ’13, (Piscataway, NJ, USA), pp. 409–414, IEEE Press, 2013.
-  P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 27–39, IEEE Press, 2016.
-  K. Hsieh, E. Ebrahim, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar, O. Mutlu, and S. W. Keckler, “Transparent offloading and mapping (tom): Enabling programmer-transparent near-data processing in gpu systems,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 204–216, June 2016.
-  K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and O. Mutlu, “Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation,” in 2016 IEEE 34th International Conference on Computer Design (ICCD), pp. 25–32, Oct 2016.
-  G. H. Loh, N. Jayasena, M. H. Oskin, M. Nutter, D. Roberts, M. M. Dong, P. Zhang, and M. Ignatowski, “A processing-in-memory taxonomy and a case for studying fixed-function pim,” 2013.
-  C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu, “Processing-in-memory enabled graphics processors for 3d rendering,” in 2017 IEEE 23rd International Symposium on High Performance Computer Architecture (HPCA), Feb 2017.
-  D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “Top-pim: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, (New York, NY, USA), pp. 85–98, ACM, 2014.
-  D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 380–392, IEEE Press, 2016.
-  M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris: Scalable and efficient neural network acceleration with 3d memory,” in Proceedings of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ASPLOS ’17, (New York, NY, USA), ACM, 2017.
-  D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “Top-pim: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, (New York, NY, USA), pp. 85–98, ACM, 2014.
-  M. Imani, Y. Kim, and T. Rosing, “Mpim: Multi-purpose in-memory processing using configurable resistive memory,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 757–763, Jan 2017.