Large-scale image retrieval has been a mainstay of both academic research and commercial products for several years [2, 17, 18]. Particular emphasis has been placed on architectures that are scalable, redundant and, most importantly, quicker than their predecessors [29, 9, 6]. This has resulted not only in efficient methods for indexing or hashing large databases [12, 27] [5]. Tasks that relied on Viola-Jones or SVM based classifiers to produce hand-crafted descriptors [13, 28] [20]. The emphasis has lately shifted from “hand-crafted” feature descriptors to “hand-crafted” network architectures. Unlike statistical learning theory, only a handful of studies have attempted to move on from this burgeoning resurgence of “deep-learning” architectures to understanding the basic principles on which such networks are based [15, 4, 16]. Nevertheless, deep neural networks have had immense commercial success.
Whereas efficient algorithms are unequivocally essential to achieve efficient classification, localisation and retrieval, commercial applications introduce new challenges. For commercial applications, a combination of efficient algorithms with a flexible and scalable architecture is necessary. This paper describes one such architecture (Cortexica) for large-scale image localisation and retrieval in a cost-effective and time-efficient manner. Our aim is to present the architecture and tools needed to develop a fast, scalable and flexible system by focusing on a particular commercial application: a multi-product search algorithm. This application is explained in Section 2. The architecture described has proven to be adequate in a commercial environment (https://www.cortexica.com/). An important property of our system is that it is not limited to the particular algorithms detailed in this paper. Thus, the same architecture can be used with different algorithms for classification, localisation or retrieval.
2 Multi-product search
It is typical in computer vision that a single image contains more than one item that needs to be identified; an additional aim can be to recommend a similar item from a vendor’s database. Our approach is tailored to fashion items (clothes, accessories, etc.), yet the same approach can be followed for any type of object, be it in medical imaging or transportation analytics, amongst many others. For example, we have deployed our framework to analyze streaming video from a walk-through at a fashion show. Our algorithm is able to identify all of the clothing items that a ‘fashion model’ wears and, additionally, to return similar items from a vendor’s database for each identified item.
One methodology to achieve such an aim is to first localise the items of interest, i.e. detect the items and find the associated bounding boxes, and then perform retrieval for each detected item. In the following section, we provide an overview of the different methods to perform object localisation as well as details on the architecture and the implementation for achieving good classification accuracy.
The first step of the multi-product search is object localisation. Deep learning has proven to yield high accuracy at relatively low computational cost when using GPU-based computing. In this section, three networks for object localisation are introduced and compared. The most relevant property of our deployment is that it allows us to hot-swap the neural network used in a production environment, making it straightforward to introduce neural network modifications or, for that matter, other non-neural-network algorithms.
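The hot-swapping idea can be illustrated with a minimal sketch. The paper does not describe how swapping is implemented, so everything below (the `ModelRegistry` class, its `swap` and `localise` methods, and the two dummy models) is a hypothetical illustration rather than the production mechanism.

```python
import threading
from typing import Callable, Dict, List

# A "localiser" is any callable mapping image bytes to a list of detections.
# All names in this sketch are hypothetical.
Detection = Dict[str, object]
Localiser = Callable[[bytes], List[Detection]]


class ModelRegistry:
    """Holds the active localiser and lets it be replaced while serving."""

    def __init__(self, initial: Localiser) -> None:
        self._lock = threading.Lock()
        self._active = initial

    def swap(self, new_model: Localiser) -> None:
        # Replace the active model under the lock; calls already in flight
        # keep using the reference they obtained earlier.
        with self._lock:
            self._active = new_model

    def localise(self, image: bytes) -> List[Detection]:
        with self._lock:
            model = self._active
        return model(image)


def dummy_rfcn(image: bytes) -> List[Detection]:
    return [{"category": "dress", "score": 0.9, "model": "r-fcn"}]


def dummy_ssd(image: bytes) -> List[Detection]:
    return [{"category": "dress", "score": 0.8, "model": "ssd"}]


registry = ModelRegistry(dummy_rfcn)
before = registry.localise(b"fake image bytes")
registry.swap(dummy_ssd)  # swap the model without restarting the service
after = registry.localise(b"fake image bytes")
```

Because callers only depend on the callable interface, the same registry could in principle hold a non-neural-network localiser.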
2.1 Image localisation approach
The state-of-the-art neural networks for object detection that we describe below are Faster R-CNN (Faster Region-based Convolutional Neural Network), SSD (Single Shot Multi-Box Detector), and R-FCN (Region-based Fully Convolutional Networks):
Earlier work illustrated that a convolutional neural network (CNN) framework can be successfully applied to object detection, significantly increasing the probability of detecting an object. As a first step, the Visual Geometry Group (VGG) network is trained for image classification; this is then further fine-tuned for object detection. Faster R-CNN includes a region proposal network (RPN) that builds upon Fast R-CNN, enabling the generation of fast proposals with little overhead. The features of the convolutional part of VGG are shared between two tasks: region proposal and classification. Just like the Spatial Pyramid Pooling (SPP) network, the convolutional part of the network works just with images that have fixed aspect ratios. In our experience, based on prior suggestions, training the classifier on a large number of ‘hard’ negatives strengthens the classification performance and leads to better precision.
The Single Shot Multi-Box Detector (SSD) uses an architecture that is equivalent to many class-specific RPNs, each working on different feature maps. This is done to improve the detection of objects at different scales, thereby illustrating the benefits of a multi-scale architecture.
SSD uses a fully convolutional layer and an optimized VGG architecture wherein the input dimensions are fixed at 500x500 (similar to the classical VGG network). This allows for a competitive execution time, fast enough to be real-time, without compromising on the accuracy of Faster R-CNN. The primary disadvantage of this framework lies in the detection of small objects.
The Region-based Fully Convolutional Network has an architecture similar to Faster R-CNN, albeit without the fully connected layer of the network. Not having the fully connected layer allows us to calculate, in a single forward pass, the loss function for a considerably large number of region proposals. These regions are then quickly sorted according to their loss function values, retaining only those regions that exceed a fixed threshold. These are then utilized for gradient computation. This approach is called online hard example mining (OHEM); such a scheme is generally computationally infeasible to apply on a Faster R-CNN based network. In Faster R-CNN, for each region of interest (ROI), the network calculates a forward pass only when the images propagate to the fully connected layer. Therefore, if there are numerous ROIs, say more than a few hundred, it becomes increasingly time-consuming to train the network.
|Model|Time|AP|AP|AP|AP|AP|AP|AP|mAP|
|---|---|---|---|---|---|---|---|---|---|
|Faster R-CNN end2end|141ms|83%|73%|75%|69%|84%|63%|85%|76.3%|
|R-FCN ohem ResNet-50|96ms|88%|78%|80%|77%|90%|68%|90%|81.6%|
|R-FCN ohem ResNet-101|137ms|89%|78%|81%|78%|91%|69%|90%|82.3%|
Removing the fully connected section from Faster R-CNN makes the training unstable. Hence, to guarantee stability, the ROIs are subdivided into a 7x7 grid and 49 losses are calculated for each category. Such a scheme allows each part of the grid to recognize features that are typical of that part of the object. Empirically, whilst one does not gain much in classification accuracy by training the network without the fully connected layer, doing so allows us to use the OHEM method.
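The selection step of online hard example mining, as described above, can be sketched in a few lines: region losses are sorted and only those above a threshold are retained for the gradient computation. This is a simplified illustration, not the training code used for the reported results; the function name `select_hard_examples` and the `max_examples` cap are assumptions introduced here.

```python
from typing import List, Sequence


def select_hard_examples(losses: Sequence[float],
                         threshold: float,
                         max_examples: int) -> List[int]:
    """Return indices of the hardest regions, highest loss first."""
    # Rank region indices by loss, largest first.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    # Keep only regions whose loss exceeds the threshold.
    hard = [i for i in ranked if losses[i] > threshold]
    # The cap on the number of retained regions is an assumption of this sketch.
    return hard[:max_examples]


# Per-region losses from one (hypothetical) forward pass.
roi_losses = [0.05, 1.30, 0.40, 2.10, 0.02, 0.90]
hard_indices = select_hard_examples(roi_losses, threshold=0.5, max_examples=2)
print(hard_indices)  # indices of the regions that would contribute gradients
```

Only the regions in `hard_indices` would then take part in the backward pass, which is what keeps the scheme affordable when all region losses come from a single forward pass.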
To evaluate the performance of the three networks we use a dataset containing 45,000 Street Style images (Figure 1; Table 1) where seven fashion categories have been manually annotated with a bounding box (bb) around the object. The categories used are: jackets (18k bb), dress (9k bb), skirt (15k bb), tops (30k bb), trousers (13k bb), handbags (23k bb) and shoes (50k bb). The 45k images are split into two sets where 40k images are used for training and 5k images are used as a test set.
To train the networks we used the default parameters chosen in the original papers.
Table 1 shows the average precisions (APs) of the results of the models.
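As a quick arithmetic check, the last column of Table 1 can be read as the mean of the seven per-category APs. The sketch below recomputes it from the rounded per-category values in the table; interpreting the last column as the mean AP is an assumption made here, and the variable names are illustrative.

```python
# Per-category APs (in percent) copied from the rows of Table 1.
table_1 = {
    "Faster R-CNN end2end": [83, 73, 75, 69, 84, 63, 85],
    "R-FCN ohem ResNet-50": [88, 78, 80, 77, 90, 68, 90],
    "R-FCN ohem ResNet-101": [89, 78, 81, 78, 91, 69, 90],
}

# Mean AP per model, rounded to one decimal place.
mean_ap = {
    model: round(sum(aps) / len(aps), 1) for model, aps in table_1.items()
}

for model, value in mean_ap.items():
    print(f"{model}: {value}%")
```

The two R-FCN rows reproduce the tabulated 81.6% and 82.3%. The Faster R-CNN row gives 76.0% rather than the tabulated 76.3%, which may reflect rounding of the per-category figures.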
Figure 1 shows the object detections of the R-FCN ResNet-101 model on one image of the validation set. The two newer methods (SSD and R-FCN) both improve over Faster R-CNN. SSD is especially suited to cases where speed is the main concern: the smaller SSD model (300x300) achieves real-time performance. The bigger R-FCN is slower, but achieves a higher precision (Table 1).
The localisation service was designed to be horizontally scalable in order to handle a high number of requests per second. Such scalability was made possible by employing a load-balanced micro-service architecture. The micro-service architecture provides a method of developing software applications as a suite of small, modular and independently deployable services in which each service runs a unique process and communicates through a lightweight and well-defined mechanism.
A load-balancer is used to distribute the requests among any number of micro-service instances distributed across any number of GPU servers. Currently, we use servers with 4x NVIDIA GeForce GTX 1080, each featuring 2560 cores and 8GB of memory. Each server can host up to 16 instances (4 per GPU), and each instance can process approximately 5 requests/second, resulting in a total of 80 requests/second per server. Additional instances can be dynamically deployed on more servers to cope with increasing throughput requirements. The deployment of new instances is straightforward since each micro-service runs from within a Docker container, which guarantees that the software will always run the same, regardless of its environment. This is because each Docker container comes with the source code, runtime executable, system tools and libraries that are needed to run the localiser micro-service.
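The throughput figures above translate directly into a capacity estimate. The following sketch uses only the numbers quoted in the text (4 GPUs per server, 4 instances per GPU, roughly 5 requests/second per instance); the function names and the example target load are illustrative assumptions.

```python
import math

GPUS_PER_SERVER = 4
INSTANCES_PER_GPU = 4
REQUESTS_PER_INSTANCE = 5  # approximate requests/second per instance


def server_throughput() -> int:
    """Approximate requests/second that one fully loaded server can handle."""
    return GPUS_PER_SERVER * INSTANCES_PER_GPU * REQUESTS_PER_INSTANCE


def servers_needed(target_rps: float) -> int:
    """Smallest number of servers that covers the target request rate."""
    return math.ceil(target_rps / server_throughput())


print(server_throughput())   # 80 requests/second per server
print(servers_needed(250))   # 4 servers for a hypothetical 250 requests/second
```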
At a very high level, all queries are sent to a single load-balanced endpoint. The load-balancer then distributes these requests between a number of Docker containers that are further distributed amongst a number of GPU-enabled servers. Each Docker container runs a software load-balancer (HAProxy), which further distributes the requests amongst the localiser instances running within that container. The maximum number of these instances depends on various constraints, such as GPU compute and memory capacity as well as the size of the model used for localisation. Different Docker containers can run on the same or different GPUs as well as run different models for localisation. Various localisation services are made accessible via dedicated ports.
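The two-level dispatch described above (an outer load-balancer choosing a container, then an in-container balancer choosing a localiser instance) can be mimicked with a small simulation. Round-robin scheduling is an assumption of this sketch, since the text does not state which balancing policy is configured, and the container and instance names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple


class TwoTierBalancer:
    """Round-robin over containers, then over instances in that container."""

    def __init__(self, containers: Dict[str, List[str]]) -> None:
        self._containers: Iterator[str] = cycle(list(containers))
        self._instances: Dict[str, Iterator[str]] = {
            name: cycle(instances) for name, instances in containers.items()
        }

    def route(self) -> Tuple[str, str]:
        container = next(self._containers)           # outer load-balancer
        instance = next(self._instances[container])  # in-container balancer
        return container, instance


balancer = TwoTierBalancer({
    "container-a": ["a0", "a1"],
    "container-b": ["b0", "b1"],
})
routes = [balancer.route() for _ in range(4)]
print(routes)
```

Each request alternates between containers, and within a container successive requests go to successive instances, so the load spreads evenly when all instances have similar capacity.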
3 Image retrieval approach
After having localised a specific object (e.g., coat, jacket, etc.) corresponding to one of the seven categories used for multi-product search, image retrieval is performed against a particular database of inventory items. This retrieval is performed in terms of visual similarity. The main factors that make clothes perceptually similar are their colour and texture properties. Such properties need to be extracted in a way that is relevant to human perception. In this context, simple approaches such as colour histograms are not adequate. Furthermore, relevant features must be extracted in a compute- and memory-efficient manner.
Building upon prior work on psycho-physics and human neuroscience, our algorithm encodes the texture and the colour of the image in a similar way as the human brain does. This information is encoded and the matching against a large-scale dataset is performed in an efficient way. Two steps are required to perform retrieval. First of all we extract a signature for every image. Second, we match the signature of a query image to a large dataset of images.
3.1 Architecture for signature extraction
In order to build a signature of an image, the following steps are required: detection of points of interest (key-points) in the image, feature extraction and encoding. The key-point detection aims at identifying the most important locations in the image. Features are extracted from the patches around the key-points, which capture the texture and colour characteristics of the patch. In order to have an efficient protocol to match the descriptors of an image against a large database of images, feature encoding is required. A bag of words model is used to convert the descriptors of all patches to codeword vectors. This gives a unique signature for each image, which is further bit-encoded to save memory space.
A point of interest in an image is a distinctive location that can be robustly localised from a range of viewpoints, rotations, scales, and illuminations. In order to find these key-points in every image we use biologically-inspired non-linear orientation channels. The key-points are extracted using a pyramid structure, i.e. four scales are used for each image. These scales are used in order to capture information at different levels of abstraction, so that a harmonic (multi-scale) representation of each image is constructed. For each key-point we also compute an associated saliency, which indicates how distinct that point is. A threshold is defined to select the most salient key-points, and a maximum of 512 key-points is used.
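The final selection step, thresholding on saliency and keeping at most 512 key-points, can be sketched as follows. The saliency computation itself is not reproduced here; the `KeyPoint` tuple, the function name and the example values are hypothetical.

```python
from typing import List, NamedTuple


class KeyPoint(NamedTuple):
    x: int
    y: int
    scale: int       # index of the pyramid scale (0 to 3)
    saliency: float  # how distinct the point is


def select_keypoints(candidates: List[KeyPoint],
                     threshold: float,
                     max_keypoints: int = 512) -> List[KeyPoint]:
    """Keep key-points above the saliency threshold, most salient first."""
    salient = [kp for kp in candidates if kp.saliency >= threshold]
    salient.sort(key=lambda kp: kp.saliency, reverse=True)
    return salient[:max_keypoints]


candidates = [
    KeyPoint(5, 5, 0, 0.9),
    KeyPoint(8, 2, 1, 0.2),
    KeyPoint(1, 7, 2, 0.7),
    KeyPoint(3, 3, 3, 0.6),
]
selected = select_keypoints(candidates, threshold=0.5, max_keypoints=2)
print(selected)
```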
For every key-point, a patch of a fixed size of 32x32 on each scale is selected; features are then calculated for each patch. The algorithm captures texture and colour information for every patch by applying specific filters to the image data. This enables it to identify texture characteristics and colour information. The processing after the extraction of the texture information includes weighting the texture data of each one of the colour channels with intensity data of the image patch. Texture and colour information is combined to obtain a descriptor which allows similar image patches to be identified.
Based on human visual perception, a set of steerable complex wavelet filters is used to extract texture features such as edges and lines/bars at different scales and orientations. If no features are identified, for example, if the image is monochromatic with each pixel intensity being the same, the image will still have a texture, albeit a smooth and uniform one. To capture the colour information, the CIE Lab colour space is used; this mimics the human brain’s perception of colour. In particular, each colour patch is first converted into CIE Lab colour space and normalized to a fixed range. 32 band-pass filters are applied to each colour channel, generating a total of 96 filter responses. Texture-weighted colour histograms are subsequently generated. In the end, we obtain a 576-dimensional colour-texture descriptor (3 colour channels x 6 bins x 4 scales x 4 directions x 2 values).
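The dimensions quoted above can be checked with a few lines of arithmetic. Reading the 32 band-pass filters per channel as 4 scales x 4 directions x 2 values is an assumption of this sketch; it is consistent with the stated numbers but is not spelled out in the text.

```python
COLOUR_CHANNELS = 3  # L, a and b channels of CIE Lab
HISTOGRAM_BINS = 6
SCALES = 4
DIRECTIONS = 4
VALUES = 2

# Assumed decomposition of the 32 band-pass filters per colour channel.
filters_per_channel = SCALES * DIRECTIONS * VALUES
filter_responses = COLOUR_CHANNELS * filters_per_channel
descriptor_dim = (COLOUR_CHANNELS * HISTOGRAM_BINS
                  * SCALES * DIRECTIONS * VALUES)

print(filters_per_channel, filter_responses, descriptor_dim)
```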
In particular, the texture intensity data describes not only the relative difference between different points in an image patch, but also the absolute intensity of the texture. This yields a descriptor for finding images that a human viewer would perceive as similar in terms of texture, intensity and colour. For illustration we present an example scenario in Figure 2. The key-point and feature extraction is implemented using CUDA for compute time efficiency; the entire calculation takes 220 ms on average.
In order to encode the feature vectors from the different patches in one signature, a codebook needs to be defined. This is formed from a large number of descriptors of various types of images. The centres of the clusters, which emerge from a k-means algorithm, are used to map the vector values. The feature vectors of each image are mapped to a signature using the codebook, which consists of 5000 centres. This results in a signature that is approximately 3.5kB after being bit-encoded. In order to compare the similarity between two images, the distance between the signatures is calculated. The next section discusses how the matching is operationalized in an efficient manner.
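To make the encoding step concrete, the sketch below maps descriptors to their nearest codebook centre and packs the result into a bit vector. The exact bit-encoding and signature distance are not specified in the text, so the one-bit-per-codeword presence encoding and the Hamming distance used here are assumptions (this simplified encoding would not reproduce the quoted 3.5kB size). The toy codebook has 4 two-dimensional centres instead of 5000 centres of dimension 576.

```python
from typing import Sequence

Vector = Sequence[float]


def nearest_centre(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook centre closest to the descriptor (squared L2)."""
    def sq_dist(centre: Vector) -> float:
        return sum((d - c) ** 2 for d, c in zip(descriptor, centre))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def bit_signature(descriptors: Sequence[Vector],
                  codebook: Sequence[Vector]) -> bytes:
    """One bit per codeword, set if any descriptor maps to that codeword."""
    bits = bytearray((len(codebook) + 7) // 8)
    for descriptor in descriptors:
        word = nearest_centre(descriptor, codebook)
        bits[word // 8] |= 1 << (word % 8)
    return bytes(bits)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of codewords present in exactly one of the two signatures."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Toy codebook; in practice the centres would come from k-means.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
sig_a = bit_signature([(0.1, 0.0), (4.8, 5.2)], codebook)  # words 0 and 2
sig_b = bit_signature([(0.9, 1.1), (5.1, 4.9)], codebook)  # words 1 and 2
print(hamming_distance(sig_a, sig_b))
```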
3.2 Architecture for matching
After the bags of words (BOWs) have been produced as described in Section 3.1, they are stored in a specific data structure that minimises search and retrieval times. In particular, all of the BOW files created with the customers’ images are processed and an “Inverted Index” is produced. Here, we explain how this index is created and utilized at query time by a distributed in-memory data-grid.
To understand the structure of the “Inverted Index” we need to first understand the data that needs to be stored. In a single bag of words, for instance for the Colour-Texture descriptor, we usually have 2.5k words in a dictionary of around 5k. This means that if we were to store the BOWs in a big matrix, where on one axis we had the IDs of the images and on the other we had all the words, the matrix would have at least half of its entries being zeroes – a sparse matrix. The logical solution is to store the data not including the zeroes and use an extra array to “index” it. We choose to use an inverted indexing strategy, as is common practice in storing sparse matrices.
More precisely, this means that the “Index” array will point to the beginning of each “word” in the data array. The data array is, in the simplest case, just a list of image-IDs, stored contiguously for each word. By following the pointers in the inverted index, we can quickly find images that contain a specific word. Once we have a data array that is indexed by word, it is easy to distribute it across many machines, each of which carries a range of words. In this way, if we have enough instances, using an inverted index gives us the added benefit of making our system robust: if one of the machines is faulty, the search will still scan through all the customers’ images, and will ignore just the small range of words present on the faulty server, thereby minimally affecting the query results. The size and distribution of the grid is fully customisable depending on the configuration and number of images one needs to store. Each query runs in parallel on all the grid instances, each of which uses multiple threads to access the index.
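The index and data arrays described above, together with the word-range sharding, can be sketched in plain Python. This is a single-process toy simulation rather than the distributed in-memory data-grid itself; the function names, the vote-counting score and the example data are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_inverted_index(bows: Dict[int, Set[int]],
                         n_words: int) -> Tuple[List[int], List[int]]:
    """Return (index, data): data holds image IDs grouped by word, and
    data[index[w]:index[w + 1]] are the images containing word w."""
    postings: List[List[int]] = [[] for _ in range(n_words)]
    for image_id in sorted(bows):
        for word in bows[image_id]:
            postings[word].append(image_id)
    index, data = [0], []
    for ids in postings:
        data.extend(ids)
        index.append(len(data))
    return index, data


def search(index: List[int], data: List[int], query_words: Set[int],
           shards: Iterable[range]) -> Dict[int, int]:
    """Count shared words per image over the word ranges that are reachable."""
    votes: Dict[int, int] = {}
    for word_range in shards:  # each shard serves one range of words
        for word in query_words:
            if word in word_range:
                for image_id in data[index[word]:index[word + 1]]:
                    votes[image_id] = votes.get(image_id, 0) + 1
    return votes


bows = {10: {0, 1, 4}, 11: {1, 2, 5}, 12: {3, 4, 5}}
index, data = build_inverted_index(bows, n_words=6)
shards = [range(0, 3), range(3, 6)]

full = search(index, data, {0, 1, 4}, shards)
degraded = search(index, data, {0, 1, 4}, shards[:1])  # second shard is down
print(full, degraded)
```

With one shard unavailable the scores drop, but every image can still be reached through the remaining words and the best match (image 10) is unchanged, which is the robustness property described in the text.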
4 Combining localisation and retrieval for multi-product search
In order to achieve a multi-product search based on one image or a video stream, the two methods described above are combined. The multi-product search runs sequentially – localisation runs first based on a deep learning architecture and subsequently retrieval based on colour-texture descriptors is initiated.
The input image is processed by the micro-service for localisation. The output consists of the identified categories and the associated bounding boxes. Some extra logic is applied after localisation to ensure that mutually exclusive items do not appear together. For example, if a dress, a top and a skirt are detected in the same area, then the one(s) with the higher confidence are kept. This ensures that there are fewer false positives, and that the results are more visually agreeable to the user.
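One way to express the mutual-exclusion logic described above is a greedy filter over detections sorted by confidence. The conflict table, the overlap measure (intersection over union) and the 0.3 overlap threshold below are assumptions chosen for illustration; the text does not specify how "the same area" is determined.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical exclusivity rules: a dress conflicts with a top or a skirt.
CONFLICTS: Dict[str, Set[str]] = {
    "dress": {"tops", "skirt"},
    "tops": {"dress"},
    "skirt": {"dress"},
}


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def resolve_conflicts(detections: List[dict],
                      min_overlap: float = 0.3) -> List[dict]:
    """Keep the most confident detections, dropping conflicting overlaps."""
    kept: List[dict] = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        clash = any(
            other["category"] in CONFLICTS.get(det["category"], set())
            and iou(det["box"], other["box"]) >= min_overlap
            for other in kept
        )
        if not clash:
            kept.append(det)
    return kept


detections = [
    {"category": "dress", "score": 0.6, "box": (10, 10, 110, 210)},
    {"category": "tops", "score": 0.9, "box": (10, 10, 110, 100)},
    {"category": "skirt", "score": 0.8, "box": (10, 100, 110, 210)},
    {"category": "shoes", "score": 0.7, "box": (20, 220, 60, 260)},
]
kept = resolve_conflicts(detections)
print([d["category"] for d in kept])
```

In this example the lower-confidence dress is dropped because it overlaps the more confident top, while the top, skirt and shoes are retained.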
Each cropped image around every detected bounding box is used as a query against the identified category. The final retrieval results for every category are returned. Using the same snapshot of the video in Figure 1, the category of each item localized (query object) is returned followed by items that are similar to this query object. Similar items obtained from two queries are shown in Figure 3.
For each potential category that can be detected by the trained deep learning model, there is an associated database. Splitting the database according to object type results in more accurate inference, whereas performing retrieval in a very large database that includes items of all object types degrades performance. We use 7 separate databases in this case, one for each product category, against which similar items can be found. This means that there are 7 different database IDs and associated inverted files. In our experiment each database includes 100 thousand to 2 million images. Certain categories, like tops, have a vast amount of data available, while for others, like purses, data is limited; this results in databases of varying sizes. The queries against the different databases are performed in parallel. On the one hand, this keeps the overall timing below 1; on the other hand, this timing is not affected by the number of items detected. In addition, the timing of each query against each database is approximately the same regardless of its size, since the retrieval is performed under a distributed architecture. Overall the process achieves high accuracy and compute time efficiency without requiring any user intervention.
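The parallel querying of per-category databases can be sketched with a thread pool. The `query_database` function below is a hypothetical stand-in for a retrieval call against one category database, and the crop identifiers are placeholders; only the fan-out pattern is the point of the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def query_database(category: str, crop_id: str) -> List[str]:
    """Hypothetical stand-in for retrieval against one category database."""
    return [f"{category}-item-{rank}" for rank in range(3)]


def multi_product_search(detections: Dict[str, str]) -> Dict[str, List[str]]:
    """Query the database of every detected category concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(detections))) as pool:
        futures = {
            category: pool.submit(query_database, category, crop_id)
            for category, crop_id in detections.items()
        }
        return {category: f.result() for category, f in futures.items()}


# Detected category mapped to the cropped query image (placeholder IDs).
results = multi_product_search({"dress": "crop-0", "shoes": "crop-1"})
print(results)
```

Because the queries run concurrently, the overall latency is governed by the slowest single query rather than by the number of detected items.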
In this paper we have described ‘Cortexica’s multi-product search’, a framework for large-scale localisation, classification and retrieval. This is being used for a wide variety of commercial applications, from the fashion industry and health and safety critical applications to medical imaging. It harnesses software scalability using Docker containers, whilst hardware scalability is achieved using GPUs that are deployed on a wide variety of cloud computing providers. Due to the hot-swappable nature of our implementation, one can use not only any number of compute-efficient deep-learning frameworks but also, in the near future, non-parametric Bayesian classification and regression algorithms. This gives us immense flexibility to choose the most statistically and computationally efficient algorithm for the application of interest. Our current research aims not only to provide a similar architecture for image and video data-streams but also to add a layer of security for sensitive data by employing privacy-preserving machine learning techniques.
-  Anil Bharath and Jeffrey Kwong. Patent on method of image processing, 2014.
-  La Cascia and Sethi M. Combining textual and visual cues for content-based image retrieval on the world wide web. IEEE Workshop on Content-Based Access of Image and Video Libraries Proceedings, pages 24–28, 1998.
-  J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via region-based fully convolutional networks. CoRR, 1605, 2016.
-  Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27, pages 2933–2941. 2014.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
-  Steven K. Esser, Paul A. Merolla, and Dharmendra S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016.
-  Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 524–531. IEEE, 2005.
-  R. B. Girshick. Fast R-CNN. CoRR, 2015.
-  JE Gonzales, RS Xin, D Crankshaw, A Dave, MJ Franklin, and I Stoica. GraphX: Unifying data-parallel and graph-parallel analytics. OSDI, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361, 2014.
-  Richard Sewall Hunter. Accuracy, precision, and stability of new photoelectric color-difference meter. In Journal of the Optical Society of America, volume 38, pages 1094–1094, 1948.
-  B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In IEEE 12th International Conference on Computer Vision, pages 2130–2137, 2009.
-  Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: fast feature extraction and SVM training. IEEE Conference on Computer Vision and Pattern Recognition, pages 1689–1696, 2011.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: single shot multibox detector. CoRR, 2015.
-  Stéphane Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 374(2065), 2016.
-  Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27, pages 2924–2932. 2014.
-  M. Perd’och, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. IEEE Conference on Computer Vision and Pattern Recognition, pages 9–16, 2009.
-  F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. International Journal of Computer Vision, 115(3):3384–3391, 2010.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
-  A. Shrivastava, A. Gupta, and R. B. Girshick. Training region-based object detectors with online hard example mining. CoRR, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
-  J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
-  Vladimir Naoumovitch Vapnik. Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998.
-  P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001.
-  J. Wang, S. Kumar, and S.-f. Chang. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2393–2406, 2012.
-  J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. Computer Vision and Pattern Recognition, pages 1794–1801, 2009.
-  Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. HotCloud, 10:10–10, 2010.