Garment recognition is a necessary capability not only for the automation of manipulation tasks within robotic systems, such as garment folding, but for many other applications as well: online e-commerce platforms that make suggestions based on image information, intelligent surveillance systems that track people based on clothing descriptions, etc. However, the recognition of clothes, or of (highly) deformable objects in general, is a challenging task due to the many poses and deformations that a flexible object may exhibit.
We consider the scenario wherein an image contains a single piece of clothing, flat, wrinkled or semi-folded, on a clean background, and a robotic system wants to find the garment and good grasping points. Our goal is then to perceive the garment in the image at the global level, by localizing and classifying it, and at the local level, by identifying and localizing its landmarks, e.g., neckline-left, right-armpit, right-sleeve-inner; a landmark is an image point. Each garment class has a different number and set of landmark types, e.g., towels have four, whereas t-shirts and long t-shirts both have 12. Because of such variance, garment+landmark detection can be formulated using two different approaches: (1) garment finding as object localization, followed by conditional, class-specific landmark finding, also as object localization; or (2) finding all landmarks present in the image, independently of the garment class, as object detection, and the garment piece as object localization. Although approach 1 is a simpler solution to build using off-the-shelf models, and is commonly seen in the current literature, it is less efficient because it requires a different sub-model for each garment category, even though many landmarks are shared between garment categories, e.g., a sleeve of a hoody is similar to a sleeve of a jacket.
On the other hand, Neural Networks work by building increasingly complex representations as their depth increases, through the combinatorial effect of chaining multiple layers together. Therefore, a network that recognizes a piece of clothing should, in principle, recognize some of its landmarks somewhere in its hidden layers. This means that in approach 1 multiple redundant hidden features are learned, representing extra parameters to be stored in memory and more operations to be computed at execution time. In addition, in a robotics context, e.g., a top view of a laundry bin, global garment perception might not be possible with good accuracy, but recognizing some local landmarks might be just as valuable for the robot: it could grasp the piece of clothing by one good landmark and then perform further recognition using the same model.
Therefore, we address the detection of landmarks and the classification+localization of the garment simultaneously, with a Convolutional Neural Network (CNN) composed of one common trunk and two separate branches. We then introduce a bridge connection that feeds the landmark detection output into the garment localizer branch, resulting in a decrease from 56.7% to 32.0% in the error rate, which demonstrates the advantages of considering both tasks together. We balance and augment our dataset with Gaussian and hue noise, and perform one last training run achieving: 0% and 17.8% error rate on classification and classification+localization, respectively; and 36.2% mean Average Precision (mAP) on landmark detection.
2 Related Work
Early work on handling clothes in Robotics addresses the folding task, where the problem domain is constrained enough to avoid the need for classification, i.e., only one type of clothing is considered. In , towels are considered, and depth discontinuities are explored to detect their borders and corners. With that information, a PR2 robot is able to pick them up from a randomly dropped position and, following a predefined sequence of steps, fold them.
Machine Learning methods are later used, not only in robotic tasks but in other software applications as well, e.g., in , real-time classification and segmentation of people's clothing appearing in a camera video feed are addressed. However, using the raw image, i.e., treating each pixel as a feature, would result in poor performance due to the curse of dimensionality that many Machine Learning (ML) methods suffer from. To overcome this challenge, one common approach is the Bag of Visual Words (BoW), which extracts handcrafted features, e.g., Scale-Invariant Feature Transform (SIFT) or Histogram of Oriented Gradients (HOG), and feeds these into a classifier, e.g., a Support Vector Machine (SVM) or k-Nearest Neighbours (k-NN). In , the authors use this approach to design a two-layer hierarchical classifier to determine the category and grasping point (equivalent to pose, in this work) of a piece of clothing. Other works address the extraction of domain-specific features from images. In , a set of Gabor filters is used to extract edge magnitudes and orientations that are representative of wrinkling and overlaps, and with that information the authors propose three types of features: position and orientation distribution density; cloth fabric and wrinkle density; and existence of cloth overlaps.
Deep Learning (DL) methods.
In 2012, AlexNet  achieved a notable improvement of 11% in the ILSVRC2012 image classification competition, compared with the next best solution. In , to address object detection, region proposal techniques are combined with CNNs, resulting in R-CNN, a hybrid method that combines a CNN with SVMs. Then, in , two main improvements are made: the RoI pooling layer is introduced, and the SVM is replaced by a softmax classifier. The improved Fast R-CNN model is then a two-headed CNN, optimized using a multi-task loss, quicker to train and faster at test time. In , the authors further introduce Region Proposal Networks (RPNs), finally making this architecture an end-to-end trainable model. Another, simpler architecture, YOLO, is introduced in  and improved in , consisting of only two direct branches on top of a regular fully convolutional CNN: one for position coordinate regression and another for class attribution. The system is capable of real-time object detection.
DL in garment perception.
After the successes of DL methods, some works addressing garment perception also explore their potential, mainly considering classification problems. One example is , which addresses the same problem as : pose and category recognition. The solution is also similar, a two-layer hierarchical classifier, but here, instead of BoW and SVMs, one CNN is used to determine the garment class, followed by a class-specific CNN that determines the garment pose index. The authors compare the model against others using hand-engineered features and report gains of 3% in accuracy. Similarly, in , a robotic system composed of two arms and an RGB-D camera uses a hierarchy of CNNs with three specialization levels: first one CNN classifies the garment piece, and then two others are used to find the first and second grasping points.
In contrast to these approaches, ours leverages the intrinsic characteristics of CNNs and the architecture patterns from both classification/localization and detection models to perform global and local perception in one step with a single CNN. Our model can also exploit prior knowledge gained while detecting landmarks to enhance the global garment perception. Our approach is therefore more memory- and processing-efficient than the hierarchical solutions presented above, because the lower-level layers are shared between the global and local perception components, as discussed in section 1. Also exploring these intrinsic characteristics of CNNs, in , the authors propose Hierarchical Convolutional Neural Networks (H-CNN) to address hierarchical classification, with coarse/top-level classes being extracted from hidden layers while finer/more-specific classes are predicted by the last layers of the network.
3 Network architecture
Our network, GarmNet, at a macro level can be summarized into three blocks: Feature extractor, Landmark detector and Garment localizer.
We implement the feature extraction module with a Fully Convolutional Neural Network (FCNN), a 50-layer ResNet . The model is pre-trained on ImageNet, from which we remove the last Fully Connected (FC) layers, resulting in a 7×7×2048 output tensor. Yet, because in some cases we have multiple landmarks close to each other, we prefer a larger output size, which results in a higher number of anchors in the landmark detector. We achieve this by probing the ResNet at the end of the conv4_x block, which has an output size of 14×14×1024.
Responsible for classifying and localizing all the landmarks present in the image, this module is a small sliding FCNN, similar to the Region Proposal Network (RPN) introduced in , with an intermediate layer with ReLU activation, followed by two heads: one for localization and the other for classification, see Fig. 2.
The localizer head is a Multi-Layer Perceptron (MLP) that outputs the predicted landmark relative coordinates, such that

$(x, y) = (j, i) \cdot s + (\Delta x, \Delta y)$

where $(x, y)$ is a predicted landmark location, $(i, j)$ is the sliding network position, $(\Delta x, \Delta y)$ is the localization head output, and $s$ is a stride value that we define to spread the base referentials (or anchor points, as introduced in ) evenly across the input image (we set it to 18, which, together with a 26-pixel landmark area, matches the 224-pixel input size).
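The anchor-based coordinate mapping above can be sketched as follows. This is a minimal NumPy illustration under our own naming (the function `decode_landmarks` and the offset layout are assumptions, not taken from the released code): each sliding position contributes an anchor point spaced `stride` pixels apart, to which the head's relative offsets are added.

```python
import numpy as np

STRIDE = 18  # anchor spacing in pixels (the paper's stride s)

def decode_landmarks(offsets, stride=STRIDE):
    """Map per-anchor relative offsets (dx, dy) from the localizer head
    to absolute image coordinates: the sliding position (i, j) anchors
    a point at (j * stride, i * stride)."""
    h, w, _ = offsets.shape
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))  # jj: column, ii: row
    x = jj * stride + offsets[..., 0]
    y = ii * stride + offsets[..., 1]
    return np.stack([x, y], axis=-1)
```

With zero offsets, the decoded points fall exactly on the regular anchor grid, which is the behaviour the stride is chosen to produce.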
The classifier head is a convolutional layer with $K+1$ filters, where $K$ is the number of landmark classes and the extra one accounts for background (non-positive) landmarks. We apply a softmax activation along the depth dimension and, therefore, the output of this layer can be interpreted, at each position, as the probability of the associated landmark being of a certain class, or background.
To localize the piece of clothing present in the image, we use a two-headed, three-layer fully connected network, similar to the sliding window used for landmark detection. The intermediate layer is a 512-d FC layer with ReLU activation, followed by the regression and classification heads. The regression head outputs four values: x, y, width and height; and the classification head outputs $C$ values, where $C$ is the number of garment classes, which we remap to probabilities with a softmax activation. We retrieve the predicted landmarks by computing the argmax over the depth dimension of the classifier output tensor and associating it with the point predicted at the same spatial position by the regression head. We further discard all the landmarks whose confidence value, i.e., the value that motivated the argmax, is lower than 0.5.
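The argmax-and-threshold decoding described above can be sketched as follows; function and argument names are ours, chosen for illustration, and we assume class index 0 is the background class.

```python
import numpy as np

def select_landmarks(class_probs, locations, background=0, threshold=0.5):
    """class_probs: (H, W, K+1) depth-wise softmax output;
    locations: (H, W, 2) predicted points. Keep, per position, the
    argmax class when its probability reaches the threshold and it is
    not the background class."""
    labels = class_probs.argmax(axis=-1)
    confidence = class_probs.max(axis=-1)
    keep = (labels != background) & (confidence >= threshold)
    ii, jj = np.nonzero(keep)
    return [(int(labels[i, j]), float(confidence[i, j]),
             tuple(locations[i, j]))
            for i, j in zip(ii, jj)]
```

Positions whose strongest class is background, or whose confidence falls below 0.5, are simply dropped, matching the filtering step described in the text.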
Our implementation uses the Keras framework (https://keras.io/) with the TensorFlow (https://www.tensorflow.org/) back-end. All experiments were carried out on a laptop with GPU support enabled, equipped with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 16GB DDR4 RAM, and an Nvidia GeForce GTX 1060 6GB GDDR5 GPU. We initialize kernels with random values drawn from a normal distribution, and biases with ones. Optimization is performed using Adadelta with a learning rate of 1.0, a batch size of 30, and 40 epochs per experiment. At test time our model runs at roughly 30 FPS. For classification and localization evaluation we use the error rate, while for detection we use the mean Average Precision (mAP) metric as proposed in . The source code has been made publicly available at https://github.com/danfergo/garment.
We adapt the CTU Color and Depth Image Dataset of Spread Garments . This dataset is divided into two groups: “Flat and wrinkled”, with 2050 examples, and “Folded”, with 1280 examples. Each example contains one image of a piece of clothing placed on a wooden floor, annotated with the stereo disparity; garment category; interest point coordinates and categories; and other meta-information. We merge both groups and, because the dataset only contains information regarding each landmark position, we extend its annotation with the garment bounding box as follows:
$(x_{1}, y_{1}) = \Big(\min_{l \in L} x_{l},\; \min_{l \in L} y_{l}\Big), \qquad (x_{2}, y_{2}) = \Big(\max_{l \in L} x_{l},\; \max_{l \in L} y_{l}\Big)$

where $L$ is the set of landmarks, and $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ are the top-left and bottom-right corners of the bounding box. There is a total of 27 landmark categories, distributed among 9 types of garment; some landmark categories are shared among classes. We then create two splits, with 300 randomly chosen images for validation; the remaining 2318 images make the training split. Results are reported over the validation split.
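The bounding-box extension amounts to taking the tight box around all annotated landmark points; a minimal sketch (the helper name `garment_bbox` is ours):

```python
def garment_bbox(landmarks):
    """Tight bounding box around a set of landmark points (x, y):
    the top-left corner is the per-axis minimum, the bottom-right
    corner the per-axis maximum."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return (min(xs), min(ys)), (max(xs), max(ys))
```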
4.2 Landmark detection anchors
For training the landmark detector heads, we transform the landmark locations into small squared areas and follow a strategy similar to the anchor boxes described in . Every anchor box that intersects a landmark box with sufficient overlap is considered a positive for the respective class; if it does not sufficiently intersect any landmark box, we set it to background. Because background anchor boxes vastly outnumber positive ones, we create a binary mask that filters the anchors effectively considered in the loss function. This mask selects all the class-positive anchors and 10 randomly chosen background anchors.
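The masking step can be sketched as follows; this is an illustrative NumPy version (the function name `loss_mask` and the label encoding, with 0 as background, are our assumptions), selecting every positive anchor plus 10 random background anchors.

```python
import numpy as np

def loss_mask(labels, background=0, n_negatives=10, rng=None):
    """Binary mask over the anchor grid: every positive anchor enters
    the loss, plus a small fixed sample of background anchors, to
    counter the positive/negative class imbalance."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = labels != background
    negatives = np.flatnonzero(~mask)
    chosen = rng.choice(negatives, size=min(n_negatives, negatives.size),
                        replace=False)
    mask.flat[chosen] = True
    return mask
```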
4.3 Loss functions
For the classification head, we apply the cross-entropy loss function to all active anchors, with the landmark class at each anchor being one-hot encoded.
With the garment classes represented as a one-hot encoded vector, we use the cross-entropy loss on the classification head and, for the regression head, the mean squared error.
4.4 One landmark class per sample constraint
One important peculiarity of landmark detection to consider is that, per image, at most one landmark of each class exists. Therefore, we can introduce this constraint into the loss function and promote parameter combinations that tend to predict only one landmark per class. We implement this constraint by also applying cross-entropy over the spatial dimension. However, because cross-entropy expects its input to be a probability distribution, we must first apply softmax accordingly. We therefore place two softmax activations after the last convolutional layer: the first, (regular) depth-wise; the second, spatial-wise; and pass each output to the corresponding cross-entropy. At test time, we average the two softmax outputs. Because for each garment category only a few landmarks are active, we further create a second mask, computed as the ground-truth maximum over the spatial dimensions, and use it to ignore the spatial loss component for the landmark classes that are not applicable.
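The two softmax directions, and their test-time average, can be sketched as follows; this is a NumPy illustration under our own naming (`dual_softmax`), not the Keras implementation itself.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(logits):
    """logits: (H, W, K+1). The depth-wise softmax gives per-anchor
    class probabilities; the spatial softmax (over all H*W positions,
    per class) pushes towards a single strong response per class.
    At test time the two are averaged."""
    depth = softmax(logits, axis=-1)
    h, w, k = logits.shape
    spatial = softmax(logits.reshape(h * w, k), axis=0).reshape(h, w, k)
    return (depth + spatial) / 2.0
```

During training, each softmax output would be passed to its own cross-entropy term; only the averaging shown here is used at test time.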
With the addition of the spatial constraint loss, we obtain a 2.1% lower detection mAP score, dropping from 37.8% to 35.7%. Yet, we achieve fewer duplicated predictions, as illustrated in Fig. 4.
4.5 Using landmarks within garment localization
We investigate the gains of feeding the landmark detector output features into the garment localizer intermediate layer, expecting that these help better frame the garment bounding box. We flatten the tensor output by the classifier block of the landmark detector branch and concatenate it with the flattened Feature Extractor output, before feeding it to the 512-d intermediate layer, resulting in GarmNet-B, represented in Fig. 5. With this bridge connection, the network achieves a 32.0% classification+localization error rate, a 24.7 percentage-point improvement when compared with the individual garment detector training.
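The bridge connection reduces to a flatten-and-concatenate step; a minimal sketch (the function name and array shapes are illustrative assumptions):

```python
import numpy as np

def bridge_features(landmark_probs, backbone_features):
    """GarmNet-B bridge: flatten the landmark classifier output and
    the backbone feature map, and concatenate them into the vector fed
    to the garment localizer's 512-d intermediate layer."""
    return np.concatenate([landmark_probs.ravel(),
                           backbone_features.ravel()])
```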
4.6 Final optimization with Augmented Data
We perform one last training run, without loading any previously learned parameters (with the exception of the feature extractor's ImageNet parameters) and using augmented data. The data augmentation is achieved by repeating examples of the less numerous classes and adding Gaussian and hue noise. The obtained results are: 0% classification and 17.8% classification+localization error rate, and 36.2% landmark detection mAP. The perfect classification accuracy can be explained by the almost constant background and the few, often differently colored, garments per class.
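The balancing-by-repetition and Gaussian-noise steps can be sketched as follows; function names and the noise magnitude `sigma` are our illustrative choices, and the hue noise (an analogous random offset on the HSV hue channel) is only indicated in a comment rather than implemented.

```python
import numpy as np
from collections import Counter

def balance_by_repetition(samples, labels, rng):
    """Oversample the less numerous classes by repetition so that every
    class reaches the size of the largest one; noise augmentation is
    then applied to the repeated copies."""
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        idx = [i for i, l in enumerate(labels) if l == cls]
        extra = rng.choice(idx, size=target - n, replace=True)
        out_samples += [samples[i] for i in extra]
        out_labels += [cls] * (target - n)
    return out_samples, out_labels

def gaussian_noise(image, rng, sigma=8.0):
    """Additive Gaussian pixel noise, clipped back to the uint8 range.
    Hue noise would be applied analogously, as a random offset on the
    hue channel after an RGB-to-HSV conversion."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```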
In this work, we proposed a novel deep neural network model named GarmNet that can be optimized in an end-to-end manner to perform simultaneous garment local and global perception. Approaches such as ours are important for robotics applications, as they offer solutions that scale to many classes and are memory- and processing-efficient, enabling real-time perception capabilities. We evaluated our solution using an augmented dataset assembled from the two collected by CTU during the CloPeMa project . The experiments showed the effectiveness and side effects of introducing domain-specific knowledge into the loss function being optimized, at both quantitative and qualitative levels. We finally demonstrated the improvements in garment localization obtained by considering landmark detection as an intermediate step.
In future work, more experiments will be done to further optimise the network architecture and its hyper-parameter configuration. A more challenging dataset, with a higher number of images and more variability, e.g., , will also be used. Within the context of garment perception for robotic laundry folding, this work will be extended to garment folding with a robot arm-hand setup, supported by the garment perception presented in this paper and possibly assisted by tactile sensors [13, 14, 10].
This work was supported by the EPSRC project “Robotics and Artificial Intelligence for Nuclear (RAIN)” (EP/R026084/1).
-  Corona, E., Alenyà, G., Gabas, A., Torras, C.: Active garment recognition and target grasping point detection using deep learning. Pattern Recognition 74, 629–641 (2018). https://doi.org/10.1016/j.patcog.2017.09.042, http://www.sciencedirect.com/science/article/pii/S0031320317303941
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09 (2009)
-  Engels, G., Heckel, R., Sauer, S.: Uml - a universal modeling language? LNCS (10 2000). https://doi.org/10.1007/3-540-44988-4_3
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (Jun 2010). https://doi.org/10.1007/s11263-009-0275-4
-  Girshick, R.B.: Fast R-CNN. CoRR abs/1504.08083 (2015), http://arxiv.org/abs/1504.08083
-  Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013), http://arxiv.org/abs/1311.2524
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015), http://arxiv.org/abs/1512.03385
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
-  Lecun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. The Handbook of Brain Theory and Neural Networks (01 1995)
-  Lee, J.T., Bollegala, D., Luo, S.: ”Touching to See” and” Seeing to Feel”: Robotic Cross-modal Sensory Data Generation for Visual-Tactile Perception. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2019)
-  Li, Y., Chen, C.F., Allen, P.K.: Recognition of deformable object category and pose. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2014)
-  Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
-  Luo, S., Bimbo, J., Dahiya, R., Liu, H.: Robotic tactile perception of object properties: A review. Mechatronics 48, 54–67 (2017)
-  Luo, S., Mou, W., Althoefer, K., Liu, H.: iCLAP: Shape recognition by combining proprioception and touch sensing. Autonomous Robots pp. 1–12 (2018)
-  Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., Abbeel, P.: Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In: 2010 IEEE International Conference on Robotics and Automation. pp. 2308–2315 (May 2010). https://doi.org/10.1109/ROBOT.2010.5509439
-  Mariolis, I., Peleka, G., Kargakos, A., Malassiotis, S.: Pose and category recognition of highly deformable objects using deep learning. In: 2015 International Conference on Advanced Robotics (ICAR). pp. 655–662. IEEE (jul 2015). https://doi.org/10.1109/ICAR.2015.7251526, http://ieeexplore.ieee.org/document/7251526/
-  Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. CoRR abs/1506.02640 (2015), http://arxiv.org/abs/1506.02640
-  Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. CoRR abs/1612.08242 (2016), http://arxiv.org/abs/1612.08242
-  Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015), http://arxiv.org/abs/1506.01497
-  Seo, Y., Shin, K.S.: Hierarchical convolutional neural networks for fashion image classification. Expert Systems with Applications 116, 328–339 (2019). https://doi.org/10.1016/j.eswa.2018.09.022, http://www.sciencedirect.com/science/article/pii/S0957417418305992
-  Wagner, L., K.D., Smutný, V.: Ctu color and depth image dataset of spread garments. Tech. Rep. CTU–CMP–2013–25, Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic (September 2013)
-  Yamazaki, K.: Instance recognition of clumped clothing using image features focusing on clothing fabrics and wrinkles. 2015 IEEE International Conference on Robotics and Biomimetics, IEEE-ROBIO 2015 pp. 1102–1108 (2016). https://doi.org/10.1109/ROBIO.2015.7418919, http://dx.doi.org/10.1007/s10514-016-9559-z
-  Yang, M., Yu, K.: Real-time clothing recognition in surveillance videos. In: Macq, B., Schelkens, P. (eds.) ICIP. pp. 2937–2940. IEEE (2011), http://dblp.uni-trier.de/db/conf/icip/icip2011.html#YangY11
| Left leg outer     | 38.5 | 33.7 | 33.0 | 38.2 |
| Left leg inner     | 47.6 | 39.5 | 0    | 44.1 |
| Right leg inner    | 46.0 | 42.6 | 50.5 | 39.9 |
| Left leg inner     | 45.3 | 41.4 | 43.7 | 40.9 |
| Right sleeve inner | 44.3 | 38.7 | 60.2 | 37.3 |
| Right sleeve outer | 38.2 | 30.4 | 53.6 | 42.2 |
| Left sleeve inner  | 43.9 | 41.5 | 46.5 | 48.3 |
| Left sleeve outer  | 34.5 | 30.7 | 46.0 | 35.7 |