Semantic segmentation is a fundamental task for scene understanding: it aims to assign a semantic label to each pixel in an image. As semantic segmentation provides diverse information, including the categories, locations, and shapes of objects, it is critical to a variety of real-world applications, such as robot vision, autonomous driving, and medical diagnosis. However, achieving high accuracy on this task is challenging due to the variety of labels and shapes.
With the development of deep neural networks, Shelhamer et al. proposed the Fully Convolutional Network (FCN), which attained an impressive improvement in semantic segmentation accuracy. Due to its effectiveness, the FCN is employed as the core framework in state-of-the-art semantic segmentation methods. Some efforts [5, 6, 7] aggregate multi-scale contextual information to capture multi-scale objects, while others [8, 9, 10] use an attention mechanism [11, 12] to capture richer global contextual information.
Another factor driving the progress of semantic segmentation is the availability of semantic segmentation datasets [13, 14]. These datasets were collected from various real-world environments and have advanced the technique in real-world applications. In particular, several datasets tailored for autonomous driving in real-world urban scenes [15, 16, 17, 18] have been provided and exploited as a primary source of data for navigating autonomous vehicles. As urban scenes are structured environments with low variation in scenes and illumination, they are relatively easy to segment precisely. Thus, significant progress for autonomous driving in urban environments has been achieved.
However, when navigating off-road, unstructured natural environments, an autonomous platform faces formidable challenges in recognizing its surroundings and the objects therein. Such scenarios involve not only a wide range of scenes but also significant illumination changes, as shown by the examples in Fig. 1. Unfortunately, since many factors, such as camera sensitivity or lighting conditions, cause illumination changes, these changes are unavoidable during navigation. At worst, an object becomes as bright (or dark) as its surrounding regions and is hard to distinguish from them, as with the building in the 'Test sample' of Fig. 2. Since illumination plays a critical role in capturing appearance, inconsistent illumination results in performance degradation.
Motivated by the above issues, in this paper we propose a built-in memory module that improves semantic segmentation accuracy in off-road, unstructured natural environments. The memory module stores the significant representations of the training images as memory items. The memory items are then recalled to cluster instances of the same class closer together within the embedding space learned from the training images. The memory module therefore mitigates significant variances in the embedded features, and segmentation networks equipped with it better handle unexpected illumination changes, as illustrated in Fig. 2. In our experimental configuration, the proposed memory module follows the encoder and refines the global contextual features (the encoder output feature maps) using the memory items. The decoder then takes the refined features and produces the segmentation masks. In order not to affect the computational cost, the memory module contains only a few items. In addition, the triplet loss is used to push the items apart and minimize their redundancy. Our proposed memory module is very general and can thus be adopted on a wide range of networks.
Experiments are conducted on the Robot Unstructured Ground Driving (RUGD) dataset and the RELLIS dataset, both collected by an unmanned ground robot in off-road, unstructured natural environments. The quantitative results show that the proposed memory module improves the performance of existing segmentation networks, whether compact or complex, at equivalent computational cost and with the same number of network parameters. The qualitative results demonstrate its effectiveness in capturing unclear objects over a variety of off-road, unstructured scenes. Applied to compact segmentation networks, our memory module delivers improved outdoor scene segmentation in real time, thus enabling better autonomous navigation for resource-limited small autonomous platforms.
II Related Work
II-A Semantic Segmentation
Driven by the development of Convolutional Neural Networks (CNNs), current semantic segmentation methods typically employ deep CNNs (e.g., ResNet, ResNeXt, etc.) as an encoder to extract feature representations. To improve the performance, various decoder modules have been proposed to produce precise segmentation masks.
A variety of early methods [4, 24, 6, 25, 26] have been proposed, but these approaches often resulted in poor performance due to their simplicity. To boost performance, more advanced methods have been developed. The PSPNet and Deeplab [7, 27] frameworks incorporated multi-scale contextual information using spatial pyramid modules, while some methods [28, 29, 30] used dense layers at the decoder end to infuse multi-level features for the benefit of feature reuse. Some efforts [9, 31, 32, 8, 10, 33, 34] exploited attention mechanisms to capture long-range dependencies for richer global contextual information. In addition, a gate mechanism is employed to selectively fuse multi-level or multi-modal features to further improve performance [36, 37, 38, 30, 39]. The effectiveness of these methods has been verified on datasets collected from structured environments (e.g., Cityscapes, ADE20K, etc.), but their effectiveness on off-road, unstructured environments has not yet been verified.
II-B Memory Networks
Graves et al. introduced the "Neural Turing Machine", which combines neural networks with an external memory bank to extend their capability. The external memory is jointly trained with the main branch, and the combined architecture uses an attention process to selectively read from and write to the memory. Due to its flexibility, this concept has been adapted to a variety of tasks, such as few-shot learning [41, 42], video summarization [43], image captioning [44], and anomaly detection [45]. In our method, a memory module is exploited for semantic segmentation to deal with unexpected illumination changes in off-road, unstructured environments by storing the significant representations of training images and recalling them to correct significant variances in the embedded features.
In this section, we first provide an overview of the semantic segmentation framework. Then we present the proposed memory module and the loss function.
III-A Overview of Semantic Segmentation Framework
As deep convolutional neural networks are capable of extracting salient features from input images, they are employed as an encoder. To alleviate the loss of detail, they produce high-resolution feature maps to preserve information; this is implemented by replacing some of the last strided convolutions [46, 47] with dilated convolutions correspondingly. The ratio of the input image spatial resolution to the encoder output resolution, denoted as the output stride (OS), is often 8 or 16 (a lower OS improves the segmentation accuracy but requires more computational cost). Starting with an input image of height H and width W, the encoder produces the global contextual feature maps X ∈ R^{C×h×w} at its final layer, where C, h, and w are the number of channels, height, and width, respectively. Then a decoder takes the feature maps as input to produce a segmentation mask. The decoder is often a sub-network, such as the Atrous Spatial Pyramid Pooling (ASPP) module in Deeplabv3, that refines the feature maps for performance improvement and produces the prediction map. At last, the prediction map is bilinearly upsampled to the resolution of the input image for the final segmentation result Y ∈ R^{N×H×W}, where N is the number of categories. Although the decoder often improves the performance, it is ineffective if the encoder output feature maps do not properly represent the input image. To mitigate this issue, we propose a memory module that refines the feature maps using the memory items before feeding them into the decoder, as depicted in Fig. 3. The details of the memory module are presented in the following subsection.
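The shape bookkeeping described above can be sketched in a few lines. This is a minimal illustration under assumed values (OS = 16, C = 512 channels, 24 categories as in RUGD); the function name and defaults are ours, not from the authors' code.

```python
# Illustrative tensor shapes of the encoder-decoder pipeline, assuming
# output stride 16, C = 512 channels, and 24 categories (hypothetical values).
def shapes(H, W, C=512, num_classes=24, output_stride=16):
    h, w = H // output_stride, W // output_stride
    encoder_out = (C, h, w)            # global contextual feature maps from the encoder
    prediction = (num_classes, h, w)   # decoder prediction map
    final = (num_classes, H, W)        # bilinearly upsampled segmentation result
    return encoder_out, prediction, final
```

For a 640×640 crop at OS 16, the encoder output is 512×40×40 and the final mask is upsampled back to 24×640×640.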
III-B Memory Module
The memory module performs read and write operations. The read operation refines the encoder output feature maps using the stored memory items, while the write operation updates the memory items according to the encoder output feature maps. The write operation is conducted only during the training phase. The read operation (Fig. 4) is presented first, followed by the write operation (Fig. 5).
Given the encoder output feature maps X = {x_k | k = 1, …, K} of an image, where K = h × w and each x_k ∈ R^C is an individual feature at a spatial position, they are refined by the memory items M = {m_i | i = 1, …, M}, where each m_i ∈ R^C is a memory item. The read operation is based on addressing weights, obtained by applying a softmax function to the cosine similarities between each individual feature x_k (for all k) and all memory items (the cosine similarity delivers the best performance compared to other similarity functions, such as the Manhattan distance, roughly −0.8% in terms of mIoU under the same experimental settings). Thus, the addressing weight of an individual feature x_k to the memory item m_i is as follows:

w_{k,i} = exp(d(x_k, m_i)) / Σ_{j=1,…,M} exp(d(x_k, m_j)),   (1)
where the cosine similarity is computed as:

d(x_k, m_i) = (x_k · m_i) / (‖x_k‖ ‖m_i‖).   (2)
As the proposed memory module contains a small number of memory items for compactness, each individual feature addresses all items for diverse representations instead of only the single most similar item. For the feature x_k, the memory module refines it through a weighted sum of all memory items with the corresponding addressing weights:

x̂_k = Σ_{i=1,…,M} w_{k,i} m_i.   (3)
Instead of only feeding the refined feature maps X̂ into the decoder, they are multiplied by a scale parameter γ and added to the original feature maps X, as X also contains significant information about the input image. Thus, the feature maps fed into the decoder are given by:

X' = X + γ X̂.   (4)

The parameter γ is a trainable scalar initialized as 0.1.
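The read operation above can be sketched with NumPy: softmax-normalized cosine similarities address all memory items, and the refined features are added back to the originals with the scale parameter. This is a sketch, not the authors' implementation (which operates on PyTorch tensors); the function names and toy sizes are our assumptions.

```python
import numpy as np

def cosine_similarity(X, M):
    # Pairwise cosine similarities between features X (K, C) and items M (num_items, C).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    return Xn @ Mn.T                                          # (K, num_items)

def read(X, M, gamma=0.1):
    sim = cosine_similarity(X, M)
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # addressing weights (softmax over items)
    X_hat = w @ M                                             # weighted sum of all memory items
    return X + gamma * X_hat                                  # residual combination with scale gamma

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # K = 6 individual features, C = 4 channels (toy sizes)
M = rng.standard_normal((3, 4))   # 3 memory items
X_prime = read(X, M)
```

Each row of the addressing-weight matrix sums to one, so every feature draws on all items in proportion to similarity, as described above.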
The write operation is the process of updating the memory items using the individual features x_k. Different from the read operation, in which all memory items are involved in refining each individual feature, the write operation updates each memory item using only the subset of individual features that are most similar to the target memory item. Thus, given a target memory item m_i, we first look for the features that have their highest addressing weight on m_i according to the similarities computed in (2):

U_i = { k | argmax_j w_{k,j} = i },   (5)
where U_i contains the indexes of the features that have their highest addressing weight on m_i. Similar to (1), the update weight of a memory item m_i to an individual feature x_k is computed as:

v_{i,k} = exp(d(m_i, x_k)) / Σ_{k'=1,…,K} exp(d(m_i, x_{k'})),   (6)
where the cosine similarity is computed as:

d(m_i, x_k) = (m_i · x_k) / (‖m_i‖ ‖x_k‖).   (7)
Moreover, the update weight is re-normalized by the maximum weight over the features in the set U_i:

v'_{i,k} = v_{i,k} / max_{k' ∈ U_i} v_{i,k'}.   (8)
At last, the memory item m_i is updated by the features in the set U_i using a weighted sum with the corresponding update weights as follows:

m_i ← f( m_i + Σ_{k ∈ U_i} v'_{i,k} x_k ),   (9)

where f(·) is the L2 normalization function. If all individual features were involved in updating the memory items, the update weights on similar features would be diminished, since small weights are also assigned to uncorrelated features. As a result, the memory items could be improperly updated. Thus, we adopt the above updating strategy to write the memory items.
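The write operation above admits a similarly short sketch: each item is updated only from the features whose highest addressing weight falls on it, with update weights normalized over features, re-normalized by the maximum weight in the assigned set, and the result L2-normalized. Again a NumPy illustration with names and toy sizes of our choosing, not the authors' code.

```python
import numpy as np

def write(M, X):
    # Update memory items M (num_items, C) from features X (K, C); training phase only.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    sim = Xn @ Mn.T                                           # (K, num_items) cosine similarities
    v = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)  # update weights (softmax over features)
    nearest = sim.argmax(axis=1)                              # item each feature addresses most
    M_new = M.copy()
    for i in range(M.shape[0]):
        U = np.where(nearest == i)[0]                         # features assigned to item i
        if U.size == 0:
            continue                                          # no assigned features: item unchanged
        v_i = v[U, i] / v[U, i].max()                         # re-normalize by the max weight in the set
        m = M[i] + v_i @ X[U]                                 # weighted sum of the assigned features
        M_new[i] = m / np.linalg.norm(m)                      # L2 normalization
    return M_new

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4))
M = rng.standard_normal((3, 4))
M_new = write(M, X)
```

Restricting the update to the assigned subset keeps the weights on similar features from being diluted by uncorrelated ones, matching the rationale above.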
To train our proposed model, we exploit the 2D multi-class cross-entropy loss for semantic segmentation:

L_seg = − Σ_p Σ_{n=1,…,N} y_{p,n} log(ŷ_{p,n}),   (10)

where y_{p,n} and ŷ_{p,n} are the true category label and the predicted segmentation probability for pixel p. In addition, in order to reduce the redundancy of the memory items, the triplet loss is used to push the items apart:

L_trip = Σ_{k=1,…,K} max( ‖x_k − m_p‖² − ‖x_k − m_n‖² + α, 0 ),   (11)
where m_p and m_n are the first and second most similar memory items to the feature x_k according to the addressing weights computed in (1), and α, set as 1.0 in our experiments, is the margin between the two items. To minimize the triplet loss, the feature x_k should be close to m_p while far away from m_n. Thus, the overall loss function consists of the two loss terms:

L = L_seg + λ L_trip.   (12)
The weight λ is set as 0.05, and the model is trained end-to-end to minimize the overall loss function.
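The triplet term and the overall objective can be sketched as follows: each feature is pulled toward its most similar item and pushed from the second most similar with margin α = 1.0, and the term is weighted by λ = 0.05. The helper names, toy sizes, and the sum-over-features reduction are our assumptions.

```python
import numpy as np

def triplet_loss(X, M, alpha=1.0):
    # For each feature, find its first (positive) and second (negative) most
    # similar memory items via cosine-similarity addressing.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    sim = Xn @ Mn.T
    order = np.argsort(-sim, axis=1)          # items sorted by similarity, descending
    p, n = order[:, 0], order[:, 1]           # first and second most similar items
    d_p = ((X - M[p]) ** 2).sum(axis=1)       # squared distance to the positive item
    d_n = ((X - M[n]) ** 2).sum(axis=1)       # squared distance to the negative item
    return np.maximum(d_p - d_n + alpha, 0.0).sum()

def total_loss(seg_loss, X, M, lam=0.05):
    # Overall objective: cross-entropy segmentation loss plus weighted triplet loss.
    return seg_loss + lam * triplet_loss(X, M)

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 4))
M = rng.standard_normal((3, 4))
loss = total_loss(1.2, X, M)   # 1.2 stands in for a computed cross-entropy value
```

Since the hinge term is non-negative, the triplet loss can only add to the segmentation loss; minimizing it spreads the items apart as intended.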
To evaluate the proposed method, a series of experiments is conducted. The dataset and implementation details are introduced first. Then, ablation studies are presented to find the experimental settings that achieve the best performance. At last, quantitative and qualitative results are reported.
The RUGD dataset is tailored for semantic segmentation in unstructured environments and focuses on off-road autonomous navigation scenarios. It was collected from a Clearpath Husky ground robot traversing a variety of natural, unstructured, and semi-urban areas. The scenes contain no discernible geometric edges or vanishing points, and semantic boundaries are highly irregular and convoluted; as such, off-road driving scenarios present a variety of challenges. The dataset contains 4,759 and 1,964 images for the training and testing sets, respectively, covering 24 categories including vehicle, building, sky, grass, etc. The resolution of the images is 688×500. The ablation studies are conducted on these training and testing sets.
The RELLIS dataset is another dataset tailored for semantic segmentation in off-road environments. It was collected on the RELLIS Campus of Texas A&M University and presents challenges to existing algorithms related to class imbalance and environmental topography. It contains 3,302 and 1,672 images for the training and testing sets, respectively, covering 19 categories including fence, vehicle, rubble, etc. The resolution of the images is 1920×1200. Due to the limitation of computing resources, the images were randomly cropped to 640×640 during training.
IV-B Implementation Details
IV-B1 Training settings
Following common practice, a poly learning rate policy is adopted. The initial learning rate is set to 0.01, and the learning rate at each iteration is the initial learning rate multiplied by (1 − iter/max_iter)^0.9. The momentum and weight decay are set to 0.9 and 0.0001, respectively. The networks are trained for 150 epochs with a mini-batch size of 8 per GPU using stochastic gradient descent (SGD). As in existing methods, the parameters of the encoder are initialized with weights pretrained on ImageNet, while those of the decoder and the memory module are randomly initialized. To avoid overfitting, data augmentation is exploited during training, including horizontal flipping, scaling (from 0.5 to 2.0), and rotation (from −10° to 10°).
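The poly learning rate policy described above fits in one line; the function name is ours, with the base rate and power taken from the training settings.

```python
# Poly learning rate policy: base_lr decayed by (1 - iter/max_iter)^power.
def poly_lr(cur_iter, max_iter, base_lr=0.01, power=0.9):
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

The rate starts at 0.01, decays smoothly, and reaches exactly zero at the final iteration.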
To verify the effectiveness of our proposed memory module, it is adopted on a variety of networks, such as PSPNet, Deeplabv3, and DANet, with encoders of various depths, such as MobileNetv2, ResNet18, and HRNet. The decoder denoted as 'Upsampling' consists of a single convolutional layer producing a prediction map and a bilinear upsampling operation resizing the prediction map to the resolution of the input image for the final segmentation result. We conduct the ablation studies on Deeplabv3 with the lightweight encoder ResNet18 at OS = 16.
IV-C Ablation Study
IV-C1 Effect of the Number of Memory Items
To find the optimal number of memory items, we conduct experiments with different numbers of memory items; the results are shown in Fig. 6. We observe that the memory with 24 items, identical to the number of categories, yields the best performance, outperforming the baseline by 1.59% in terms of mIoU. Although a memory module with fewer than 24 items still outperforms the baseline, it cannot deliver representations diverse enough to cover a wide range of scenes and all categories. With too many items, the module struggles to focus on the relevant items, which leads to performance degradation.
IV-C2 Effect of the Triplet Loss
As the triplet loss contributes to the separateness of features, we control the separateness of the memory items, so that they store discriminative representations, by weighting the triplet loss with a constant scalar λ. As shown in Table I, we vary λ from 0 to 0.2, and λ = 0.05 achieves the best performance.
To analyze the effectiveness of the triplet loss more clearly, we visualize the cosine similarities of all pairs of items with and without the triplet loss in Fig. 7. We observe that the triplet loss makes the items less similar to each other: the average cosine similarity over all pairs of items is 0.50 without the triplet loss and 0.19 with it. These results indicate that the triplet loss allows the memory module to reduce redundancy and store discriminative representations, which improves the segmentation accuracy.
To verify the effectiveness of our memory module, it is applied to diverse decoders with either compact (e.g., ResNet18 and MobileNetv2) or complex (e.g., HRNet and ResNet50) encoders. Table II presents the segmentation performance (mIoU), the number of network parameters (#Param), and the computational cost (GFLOPs, computed with the PyTorch code at https://github.com/sovrasov/flops-counter.pytorch). We observe that our memory module improves the performance over different networks, regardless of whether they are compact or complex. As the memory module is compact and non-parametric, the baselines with our memory module keep the same number of network parameters and equivalent GFLOPs as the baselines alone. More importantly, "ResNet18 + Deeplabv3" with our memory module outperforms the heavier networks "HRNet + Upsampling" and "ResNet50 + Deeplabv3" without the memory module. This demonstrates that our proposed memory module contributes a significant performance improvement and makes lighter networks perform as well as more complex ones.
Fig. 8 and Fig. 9 show some visualization results from a compact network (MobileNetv2 + Upsampling) and a complex network (ResNet50 + Deeplabv3), respectively. The results demonstrate the effectiveness of our method in capturing unclear regions and objects, highlighted in the images, over various off-road, unstructured natural environments.
The effectiveness of our method is further validated on the RELLIS dataset. Compared to the RUGD dataset, the RELLIS dataset does not contain frame sequences with significant illumination changes. Thus the overall quality of images captured is better than that of the RUGD dataset. However, the RELLIS images contain scenes of wide unobstructed views, resulting in distant objects captured by a small number of pixels. As such, accurate semantic segmentation of such objects is difficult. Table III summarizes the test results on a variety of networks. It clearly demonstrates that our method resulted in improvement on each of the network tests. Visualization results from the network “ResNet18 + Deeplabv3” on RELLIS testing images are shown in Fig. 10. While the network without the memory module has difficulties accurately segmenting the fence-post and the distant vehicles (especially the one on the left), the network with our proposed memory module accurately segmented those distant objects.
In this paper, a built-in memory module was proposed to improve the semantic segmentation performance on off-road unstructured natural environments by refining global contextual feature maps. The memory module stored the significant representations of the training images as memory items. Then, the memory items were recalled to cluster instances of the same class together within the learned embedding space even when there were significant variances in embedded features from the encoder. Thus, the memory module contributed to handling the unexpected illumination changes which made objects unclear. Considering real-time navigation of an autonomous platform, the memory module contains a small number of memory items in order not to affect the computational cost (GFLOPs). To make the best use of the memory module, the triplet loss was employed to minimize redundancy, and the memory module stored discriminative representations. We demonstrated the effectiveness of the proposed memory module by applying it to several existing networks. It improved performance while rarely affecting efficiency, and the qualitative results showed that our memory module contributed to capturing unclear objects over various off-road, unstructured natural environments. As the proposed method can be integrated into compact networks, it presents a viable approach for resource-limited small autonomous platforms.
-  K. Asadi, P. Chen, K. Han, T. Wu, and E. Lobaton, “Real-time scene segmentation using a light deep neural network architecture for autonomous robot navigation on construction sites,” in Computing in Civil Engineering 2019: Data, Sensing, and Analytics. American Society of Civil Engineers Reston, VA, 2019, pp. 320–327.
-  M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, M. Jagersand, and H. Zhang, “A comparative study of real-time semantic segmentation for autonomous driving,” in
-  F. Zhao and X. Xie, “An overview of interactive medical image segmentation,” Annals of the BMVA, vol. 2013, no. 7, pp. 1–22, 2013.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
-  L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
-  H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “Psanet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
-  J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” CoRR, vol. abs/1809.02983, 2018. [Online]. Available: http://arxiv.org/abs/1809.02983
-  Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 603–612.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International conference on machine learning, 2015, pp. 2048–2057.
-  R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 891–898.
-  B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ADE20K dataset,” CoRR, vol. abs/1608.05442, 2016.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
-  G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in European conference on computer vision. Springer, 2008, pp. 44–57.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” CoRR, vol. abs/1604.01685, 2016.
-  X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
-  M. Wigness, S. Eum, J. G. Rogers, D. Han, and H. Kwon, “A rugd dataset for autonomous navigation and visual perception in unstructured outdoor environments,” in International Conference on Intelligent Robots and Systems (IROS), 2019.
-  K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification.” Journal of Machine Learning Research, vol. 10, no. 2, 2009.
-  P. Jiang, P. Osteen, M. Wigness, and S. Saripalli, “Rellis-3d dataset: Data, benchmarks and analysis,” 2020.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” CoRR, vol. abs/1611.05431, 2016. [Online]. Available: http://arxiv.org/abs/1611.05431
-  R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellapa, “Gaussian conditional random field network for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3224–3233.
-  H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015. [Online]. Available: http://arxiv.org/abs/1505.04597
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
-  P. Bilinski and V. Prisacariu, “Dense decoder shortcut connections for single-pass semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6596–6605.
-  M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
-  X. Li, H. Zhao, L. Han, Y. Tong, S. Tan, and K. Yang, "Gated fully fusion for semantic segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11418–11425.
-  Y. Yuan and J. Wang, "Ocnet: Object context network for scene parsing," arXiv preprint arXiv:1809.00916, 2018.
-  H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
-  Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai, “Asymmetric non-local neural networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 593–602.
-  Y. Jin, D. Han, and H. Ko, “Trseg: Transformer for semantic segmentation,” Pattern Recognition Letters, vol. 148, pp. 29–35, 2021.
-  K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” CoRR, vol. abs/1406.1078, 2014. [Online]. Available: http://arxiv.org/abs/1406.1078
-  S. Kong and C. C. Fowlkes, “Recurrent scene parsing with perspective understanding in the loop,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 956–965.
-  H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang, “Context contrasted feature and gated multi-scale aggregation for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2393–2402.
-  T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-scnn: Gated shape cnns for semantic segmentation,” arXiv preprint arXiv:1907.05740, 2019.
-  Y. Jin, S. Eum, D. Han, and H. Ko, “Sketch-and-fill network for semantic segmentation,” IEEE Access, 2021.
-  A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.
-  A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in International conference on machine learning, 2016, pp. 1842–1850.
-  Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio, “Learning to remember rare events,” arXiv preprint arXiv:1703.03129, 2017.
-  S. Lee, J. Sung, Y. Yu, and G. Kim, “A memory network approach for story-based temporal summarization of 360 videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1410–1419.
-  C. Chunseong Park, B. Kim, and G. Kim, “Attend to you: Personalized image captioning with context sequence memory networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 895–903.
-  H. Park, J. Noh, and B. Ham, “Learning memory-guided normality for anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 372–14 381.
-  C. Kong and S. Lucey, “Take it in your stride: Do we need striding in cnns?” arXiv preprint arXiv:1712.02502, 2017.
-  F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 472–480.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
-  J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, "Deep high-resolution representation learning for visual recognition," 2019.