As the information society advances, images have become increasingly important. Skin detection, in particular, plays an important role in a wide range of image processing applications, from face tracking, gesture analysis and content-based image retrieval systems to various human-computer interaction domains [1, 2, 3, 4, 5, 6]. In these applications, the search space for objects of interest, such as hands, can be reduced by detecting skin regions. One of the simplest and most commonly used human skin detection methods is to define a fixed decision boundary for different colour space components [7, 8, 9]. Single or multiple ranges of threshold values are defined for each colour space component, and the image pixel values that fall within these pre-defined range(s) are selected as skin pixels. In this approach, for any given colour space, skin colour occupies a part of that space, which might be a compact or a large region. Other approaches include multilayer perceptrons [10, 11, 12], Bayesian classifiers [13, 14, 15] and random forests.
Although these solutions, which use single features, have been successfully applied to human skin detection, they still suffer from the following problems:
Low accuracy: False skin detection is a common problem when an image contains a wide variety of skin colours across different ethnicities, complex backgrounds or high illumination.
Luminance-invariant space: Some robustness may be achieved through the use of a luminance-invariant colour space [1, 17]; however, such an approach can withstand only changes in the skin colour distribution within a narrow set of conditions, and it also degrades performance.
Large training sample required: To define the threshold value(s) for detecting human skin, most state-of-the-art work requires a training stage, and there is a tradeoff between the size of the training set and classifier performance. For example, Jones and Rehg required 2 billion pixels collected from 18,696 web images to achieve optimal performance.
In this paper, we propose a novel approach: a fusion framework that applies the product rule to two features, the smoothed 2D histogram and the Gaussian model, to perform automatic skin detection. First, we employ an online dynamic approach as in  to calculate the skin threshold value(s), so our proposed method does not require any training stage beforehand. Secondly, a 2D histogram with smoothed densities and a Gaussian model are used to model the skin and non-skin distributions, respectively. Finally, a fusion strategy framework using the product of the two features is employed to perform automatic skin detection. To the best of our knowledge, this is the first attempt to employ a fusion strategy to detect skin in colour image(s).
Representing the image pixels in a suitable colour space is the primary step in skin segmentation in colour images. A comprehensive survey of different colour spaces (e.g. RGB, YCbCr, HSV, CIE Lab, CIE Luv and normalised RGB) for skin colour representation and of skin-pixel segmentation methods is given by Kakumanu et al. In our approach, we do not employ a luminance-invariant space. Instead, we choose the log opponent chromaticity (LO) space. The reasons are twofold: first, colour opponency is perceptually relevant, as it has been shown that the human visual system uses an opponent colour encoding [22, 23]; and secondly, in the LO colour space, the use of logarithms renders illumination change to a simple translation of coordinates. Most of the aforementioned solutions claimed that illumination variation is one of the contributing factors that degrade the performance of skin detection systems. However, our empirical results and  showed that the absence of the luminance component does not affect system performance.
The remainder of the paper is structured as follows: Section II gives a brief description of related work in human skin segmentation. Section III derives our proposed fusion strategy. Section IV presents the experimental results using three different datasets, and Section V concludes the paper with discussions and future work.
II Related Work
Skin detection is the process of finding skin-colour pixels and regions in an image or video. In images and videos, skin colour indicates the presence of humans. In one of the early applications, detecting skin-colour regions was used to identify nude pictures on the Internet for content filtering. In another early application, skin detection was used to detect anchors in TV news videos for automatic video annotation, archival and retrieval. Interested readers are encouraged to refer to [20, 24] for a detailed background review.
A skin detector typically transforms a given pixel into an appropriate colour space and then uses a skin classifier to label the pixel as skin or non-skin. A skin classifier defines a decision boundary for the skin colour class in the colour space based on a training database of skin-colour pixels. For example, Sobottka and Pitas used fixed range values on the HS colour space, where skin pixels fall in the range H = [0, 50] and S = [0.23, 0.68]. Wang and Yuan used threshold values in the rg space and the HSV space, set to the ranges r = [0.36, 0.465], g = [0.28, 0.363], H = [0, 50], S = [0.20, 0.68] and V = [0.35, 1.0], to differentiate skin and non-skin pixels. In these approaches, high false skin detection is a common problem when there is a wide variety of skin colours across different ethnicities, complex backgrounds and high illumination. Fig. 1 shows that the skin colours of people belonging to Asian, African and Caucasian groups differ from one another and range from white and yellow to dark. Some robustness may be achieved through the use of luminance-invariant colour spaces [1, 17]; however, such an approach can cope only if the change in the skin colour distribution stays within a narrow set of conditions.
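To make the fixed-boundary idea concrete, the following minimal sketch classifies pixels with the two ranges quoted above for Sobottka and Pitas. It assumes that the first range is a hue in degrees and the second a saturation in [0, 1]; the function names `is_skin_fixed_range` and `skin_mask` are illustrative and do not come from the cited work.

```python
import colorsys

# Fixed ranges quoted in the text. Reading them as hue in degrees and
# saturation in [0, 1] is an assumption of this sketch.
HUE_RANGE = (0.0, 50.0)
SAT_RANGE = (0.23, 0.68)


def is_skin_fixed_range(r, g, b):
    """Return True if an 8-bit RGB pixel falls inside both fixed ranges."""
    h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue_deg = h * 360.0  # colorsys returns hue in [0, 1)
    return (HUE_RANGE[0] <= hue_deg <= HUE_RANGE[1]
            and SAT_RANGE[0] <= s <= SAT_RANGE[1])


def skin_mask(image):
    """Classify every pixel of an image given as rows of (r, g, b) tuples."""
    return [[is_skin_fixed_range(*pixel) for pixel in row] for row in image]
```

Because the boundary never adapts to the image, any background pixel whose hue and saturation happen to fall inside the box is also labelled as skin, which is the false-detection problem discussed above.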
In multilayer perceptron-based skin classification, a neural network is trained to learn the complex class-conditional distributions of the skin and non-skin pixels. Brown et al. proposed a Kohonen network-based skin detector in which two Kohonen networks, a skin-only detector and a skin-plus-non-skin detector, were trained from a set of about 500 manually labelled images to obtain an optimal result. Sebe et al. used a Bayesian network with training data of 60,000 samples for skin modelling and classification. Friedman et al. proposed the use of tree-augmented naive Bayes classifiers for skin detection. The Bayesian decision rule for minimum cost is a well-established technique in statistical pattern classification. Jones and Rehg used the Bayesian decision rule with a 3D histogram model built from 2 billion pixels collected from 18,696 web images to perform skin detection. Readers are encouraged to read [20, 24] for a detailed state-of-the-art review. Although these solutions have been very successful, they suffer from a tradeoff between precision and computational complexity.
In summary, our proposed method has two advantages in comparison to the state-of-the-art solutions. First of all, our proposed skin detection method employs an online dynamic threshold approach. With this, a training stage can be eliminated. Secondly, we select a fusion strategy for our skin detector. To the best of our knowledge, this is the first attempt that employs a fusion strategy to detect skin in colour image(s).
III Our Method
Fig. 2 shows the proposed framework for automatic skin detection. First, an approach similar to that of Fasel et al. is adopted to obtain the face(s) in a given image. Secondly, a dynamic method is employed to calculate the skin threshold value(s) on the detected face region(s). Thirdly, two features, the 2D histogram with smoothed densities and the Gaussian model, are introduced to represent the skin and non-skin distributions, respectively. Finally, a fusion framework that applies the product rule to the two features is employed to obtain better skin detection results. In this paper, the RGB colour space is converted to the log opponent chromaticity space to mimic human visual perception.
In the pre-processing step, for any given image(s), where is the number of images, we first locate the human eyes as in Fasel et al. Then, an elliptical mask model, as illustrated in Fig. 3, is used to generate the elliptical face region in the image(s). Here, is the centre of the ellipse as well as the eyes' symmetry point. The minor and major axes of the ellipse are represented by and respectively, where is the distance between the two eyes. For a more detailed description, interested readers are encouraged to read .
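The elliptical mask can be illustrated with a short sketch. It assumes an upright (axis-aligned) ellipse centred on the midpoint between the two eyes, with semi-axes proportional to the eye distance. The scale factors `minor_scale` and `major_scale` are placeholders chosen for illustration only and are not the values used in this paper.

```python
import math


def elliptical_face_mask(height, width, left_eye, right_eye,
                         minor_scale=1.0, major_scale=1.5):
    """Boolean mask of an upright ellipse centred between the eyes.

    left_eye and right_eye are (x, y) pixel coordinates. The semi-axes are
    minor_scale * D (horizontal) and major_scale * D (vertical), where D is
    the eye distance. Both scale factors are hypothetical placeholders.
    """
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    d = math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])
    if d == 0:
        raise ValueError("eye positions must be distinct")
    a = minor_scale * d  # horizontal semi-axis
    b = major_scale * d  # vertical semi-axis
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Standard ellipse inequality; points on the boundary count as inside.
            inside = ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0
            row.append(inside)
        mask.append(row)
    return mask
```

A tilted head would need the ellipse rotated by the angle of the line joining the eyes; that refinement is omitted here for brevity.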
The detected face regions include smooth (i.e. skin) and non-smooth (i.e. eyes, eyebrows, mouth, etc.) textures. As we are interested only in the smooth regions, Sobel edge detection is employed to remove the non-smooth regions. The Sobel method is chosen for its computational simplicity. The detected edge pixels are then dilated using a dilation operation to obtain the optimal non-smooth regions. Finally, we obtain new image(s) that consist only of face regions.
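The edge-removal step described above can be sketched as follows, using only the Python standard library on a greyscale image stored as a list of rows. The gradient threshold and the dilation radius are assumptions of this sketch; the paper does not state the values it uses.

```python
def sobel_edges(gray, threshold):
    """Mark interior pixels whose Sobel gradient magnitude exceeds threshold."""
    h, w = len(gray), len(gray[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            edges[y][x] = (gx * gx + gy * gy) ** 0.5 > threshold
    return edges


def dilate(mask, radius=1):
    """Binary dilation with a square window of side 2 * radius + 1."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = True
    return out


def smooth_region_mask(gray, face_mask, threshold=100.0, radius=1):
    """Keep face pixels that are not within `radius` of a detected edge."""
    non_smooth = dilate(sobel_edges(gray, threshold), radius)
    return [[face_mask[y][x] and not non_smooth[y][x]
             for x in range(len(gray[0]))]
            for y in range(len(gray))]
```

Only the pixels that survive this mask would then contribute to the skin threshold estimation.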
III-B Colour Space
It is well established that the distribution of colours in an image is often a useful cue. An image can be represented in a number of different colour space models (i.e. , , ). These are some colour space models available in image processing. Therefore, it is important to choose the appropriate colour space for modelling human skin colour.
In this paper, we propose the use of the log opponent chromaticity colour space. The reasons are twofold: first, colour opponency is perceptually relevant, as it has been shown that the human visual system uses an opponent colour encoding [22, 23]; and secondly, in this colour space, the use of logarithms renders illumination change to a simple translation of coordinates.
III-B1 Log Opponent Chromaticity Space
The theory of opponent colours was first studied by Hering in 1892. He observed that certain colours are never perceived together in the human visual system. For instance, we never see yellowish-blue or reddish-green. Based on this theory, the log opponent chromaticity (LO) space represents colour information by applying logarithms to the opponency model, so that illumination changes are simple to model. As illumination changes, log opponent chromaticity distributions undergo a simple translation. These distributions are coded by using means and first -moments found using principal component analysis.
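The translation property can be checked numerically. The sketch below uses one common log-opponent form (a red-green channel and a blue-yellow channel built from log RGB); this particular definition is an assumption, since the exact coordinates of the LO space are not reproduced here. Illumination change is modelled as independent per-channel gains, which is also an assumption.

```python
import math


def log_opponent(r, g, b, eps=1e-6):
    """Map positive RGB values to two log-opponent coordinates.

    The channel definitions are an assumed, commonly used form:
    rg = ln R - ln G and by = ln B - (ln R + ln G) / 2.
    """
    lr, lg, lb = (math.log(max(c, eps)) for c in (r, g, b))
    rg = lr - lg                 # red-green opponent channel
    by = lb - 0.5 * (lr + lg)    # blue-yellow opponent channel
    return rg, by


def illumination_shift(gain_r, gain_g, gain_b):
    """Translation of (rg, by) caused by per-channel gains.

    Since ln(gain * value) = ln(gain) + ln(value), the shift is simply the
    log-opponent coordinates of the gains and does not depend on the pixel.
    """
    return log_opponent(gain_r, gain_g, gain_b)
```

Under this model every pixel moves by the same vector when the illuminant changes, so the shape of the skin distribution is preserved and only its position changes.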
III-C Skin Detection
III-C1 Dynamic threshold with smoothed 2D histogram
Human skin colour varies greatly across ethnicities. Moreover, skin appearance in colour image(s) can also be affected by illumination, background, camera characteristics, etc. Therefore, a fixed or pre-learned threshold for detecting skin boundaries is not a feasible solution.
In our approach, we employ an online dynamic approach as in  to calculate the skin threshold value(s) on the face images. The assumption is that the face and body of a person always share the same colours. However, instead of using the 1D histogram illustrated in Fig. 4(a), we introduce a 2D histogram (Fig. 4(b)) with smoothed densities.
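As a rough illustration of how a threshold can be derived online from the face pixels, the sketch below builds a 2D histogram of two chromaticity coordinates, smooths it, and keeps the densest cells. Several choices are assumptions and not the paper's method: a 3x3 mean filter stands in for the smoothed-density estimator, the cells are selected by keeping a fixed fraction `keep` of the smoothed mass, and the bin count and coordinate range are arbitrary.

```python
def bin_index(value, bins, lo, hi):
    """Clamp a coordinate into one of `bins` equal-width cells on [lo, hi)."""
    idx = int((value - lo) * bins / (hi - lo))
    return min(bins - 1, max(0, idx))


def histogram_2d(samples, bins, lo, hi):
    """Count (u, v) samples on a bins x bins grid."""
    hist = [[0.0] * bins for _ in range(bins)]
    for u, v in samples:
        hist[bin_index(u, bins, lo, hi)][bin_index(v, bins, lo, hi)] += 1.0
    return hist


def box_smooth(hist):
    """3x3 mean filter with zero padding (a simple stand-in smoother)."""
    n = len(hist)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < n and 0 <= jj < n:
                        total += hist[ii][jj]
            out[i][j] = total / 9.0
    return out


def dynamic_skin_cells(face_samples, bins=32, lo=-2.0, hi=2.0, keep=0.9):
    """Select the densest smoothed cells that hold `keep` of the face mass."""
    smoothed = box_smooth(histogram_2d(face_samples, bins, lo, hi))
    cells = sorted(((smoothed[i][j], i, j)
                    for i in range(bins) for j in range(bins)
                    if smoothed[i][j] > 0.0), reverse=True)
    target = keep * sum(val for val, _, _ in cells)
    selected, acc = set(), 0.0
    for val, i, j in cells:
        if acc >= target:
            break
        selected.add((i, j))
        acc += val
    return selected


def is_skin(sample, selected, bins=32, lo=-2.0, hi=2.0):
    """Label a (u, v) sample as skin if it falls in a selected cell."""
    u, v = sample
    return (bin_index(u, bins, lo, hi), bin_index(v, bins, lo, hi)) in selected
```

The selected cells are computed per image from the detected face, so no training set is involved; the same cells are then applied to every pixel of that image.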
In this paper, the feature vector for the smoothed 2D histogram,is represented by the combination of and . The smoothed 2D histogram-based skin segmentation, , at pixel is given as:
III-C2 Gaussian Model
The Gaussian model is a sophisticated model that is capable of describing complex-shaped distributions and is popular for modelling skin colour distributions. The threshold skin colour distribution in the 2D histogram is modelled through elliptical Gaussian joint probability distribution functions defined as:
where is the colour vector of , , is the mean vector and is the diagonal covariance matrix, respectively. refers to the mixing weights, which satisfy the constraint .
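For reference, a Gaussian mixture with diagonal covariance matrices and mixing weights that sum to one can be evaluated as in the sketch below. This is the standard textbook density rather than a reproduction of the paper's equation, and the function names are illustrative.

```python
import math


def diag_gaussian_pdf(c, mean, var):
    """Density of a Gaussian with diagonal covariance at colour vector c.

    c, mean and var are equal-length sequences; var holds the diagonal
    entries of the covariance matrix (all strictly positive).
    """
    log_p = 0.0
    for x, m, v in zip(c, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
    return math.exp(log_p)


def mixture_pdf(c, weights, means, variances):
    """Weighted sum of diagonal Gaussians; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")
    return sum(w * diag_gaussian_pdf(c, m, v)
               for w, m, v in zip(weights, means, variances))
```

With a diagonal covariance, the contours of equal density are ellipses aligned with the coordinate axes, which matches the elliptical boundary discussed in the following paragraphs.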
The result of Gaussian model-based skin detection, , can be obtained by using Fig. 5. is the center of the Gaussian model, while is the angle between x-axis and line . Let be the coordinate of pixel and is position on the red dot along line . Distance of , and angle, are calculated as Eq. 3 and 4.
and are the distances between and center, at x-axis and y-axis, respectively. and are the coordinate of at x-axis and y-axis, respectively. Distance between the boundary and center of the Gaussian model at x-axis and y-axis, and at given angle, are:
are the variance of x-axis and y-axis for Gaussian model. Distance,is represented as:
Therefore, is given:
III-D Fusion Strategy
To increase the effectiveness and robustness of the skin detection algorithm, a fusion strategy is proposed that integrates the two single features into a combined single representation. Both models vote on the classification of skin and non-skin pixels. This is done by applying the product rule to both models. Let , and denote the matching results produced by the smoothed 2D histogram and the Gaussian model, respectively. The combined matching results using the fusion rule can be obtained as follows:
where is the selected fusion rule, which represents the product . In order to make the fusion issue tractable, the individual features are assumed to be independent of each other.
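A minimal sketch of the product rule on two per-pixel result maps is given below. It assumes that both maps have the same size and hold either binary labels (1 for skin, 0 for non-skin) or scores in [0, 1]; the function name `fuse_product` is illustrative.

```python
def fuse_product(hist_result, gauss_result):
    """Element-wise product of two per-pixel result maps of equal size."""
    if len(hist_result) != len(gauss_result):
        raise ValueError("result maps must have the same number of rows")
    fused = []
    for row_h, row_g in zip(hist_result, gauss_result):
        if len(row_h) != len(row_g):
            raise ValueError("result maps must have the same number of columns")
        fused.append([a * b for a, b in zip(row_h, row_g)])
    return fused
```

For binary maps the product is equivalent to a logical AND: a pixel is kept as skin only when both features agree, which is why the fused result can suppress false positives that appear in only one of the two maps.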
In this section, we evaluate the performance of the proposed approach under different conditions, such as the fusion strategy and the colour space, and compare it with state-of-the-art methods both qualitatively and quantitatively. We perform quantitative analysis only on the Stottinger dataset, as ground-truth videos are available only for this dataset.
Experiments are conducted using three public databases: Pratheepan’s dataset, the ETHZ PASCAL dataset and the Stottinger dataset. Pratheepan’s dataset consists of a set of images downloaded randomly from Google. These images are captured with a range of different cameras using different colour enhancements and under different illuminations. The ETHZ PASCAL dataset consists of 549 images from PASCAL VOC 2009, mainly amateur photographs with difficult illumination and low image quality. Finally, the Stottinger dataset consists of popular public video clips from web platforms. These are chosen from the community (top-rated) and cover a large variety of skin colours, illuminations, image quality and difficulty levels.
IV-B Results and Analysis
The detection results for each dataset are illustrated in Figs. 6 to 9. When no face is detected in an image, the method returns a blank (black) image. Therefore, for testing purposes, we assume that true face(s) are detected in the image. The following conclusions are drawn. First, qualitatively, the proposed method has better detection accuracy than  and . For instance, in the image sample (row 4) of Fig. 6 and of Fig. 7, both the  and  methods fail to detect the skin region accurately. The image sample in Fig. 7, in particular, shows a fairly complex environment with high illumination and comprises humans of different ethnicities. Nevertheless, our proposed approach is still able to capture almost all skin regions and has the least noise. Secondly, the proposed approach is more robust to illumination, background, camera characteristics and ethnicity. The images in the Pratheepan dataset are captured with a range of different cameras using different colour enhancements, yet it can be seen qualitatively that our proposed approach is the least affected by these factors. In the other datasets, which are highly complex and vary in illumination, our approach also achieves better discrimination power than  and . Finally, the proposed approach does not require any training stage, and hence is more efficient in terms of computational cost than the approaches listed in .
IV-C Comparison between different colour spaces
In this section, we analyse 7 different combinations of feature vectors: , , , , , and . The results for each feature vector are presented in Fig. 8 using images from the Pratheepan and ETHZ datasets. It can be noticed that shows a better true positive rate than the rest. A quantitative analysis is shown in Table I using the Stottinger dataset, as it is the only dataset that provides ground truth. It can be seen from Table I that the results from and HS are comparable. However, we selected as our colour space as it shows a higher true positive rate and a lower false negative rate than HS. In addition, it has been shown that the human visual system uses an opponent colour coding.
IV-D Fusion strategy results
In this section, we compare the use of a single feature, either smoothed 2D histograms (s2D) or Gaussian mixture models (GMM) only, with the use of multiple features (the fusion of s2D and GMM). The results are illustrated in Fig. 10 and Table II. The fusion approach has the highest accuracy and F-score. Moreover, it can be seen that the fusion strategy has a lower false positive rate than the single-feature approaches. For instance, the smoothed 2D histogram is able to detect most of the skin regions, but its output is heavily corrupted by noise. The Gaussian model, meanwhile, has a very high false positive rate (e.g. clothing that is classified as a skin region).
IV-E Quantitative Analysis
Table III compares our method with other state-of-the-art methods on the Stottinger dataset. Apart from Random Forest, the other methods do not include training. A total of 2985 frames were extracted from 7 videos for our experiment. For Random Forest, 1990 image frames were randomly chosen for training and the remaining images were used for testing. From those 1990 images, around 3 million pixels were randomly chosen and 15 trees were trained. Each tree draws 70% of the pixels for training. Figure 11 shows the time in minutes required to train the 15 trees in the random forest solution; it took around 17 minutes.
Figure 12 compares random forest and our method on the Pratheepan dataset. Here, Random Forest is trained on the Stottinger dataset and tested on the Pratheepan dataset. The results show that random forest does not work well in this setting. To increase the accuracy of random forest on this dataset, a much larger training sample and/or more trees would be needed, which increases the computational cost and the training time as the number of trees grows. Our method, in contrast, is able to maintain the quality of skin segmentation.
|Method||Accuracy||F-score||Precision||Recall|
|Random Forest||0.9305||0.7307||0.7661||0.6984|
|Our Proposed Method||0.9039||0.6490||0.6403||0.6580|
|Dynamic Threshold||0.8935||0.5922||0.6133||0.5725|
|Static Threshold||0.8334||0.4745||0.4133||0.5570|
A quantitative analysis is shown in Table III using the Stottinger dataset. The accuracy is a score that uses the sum of true positives and true negatives as its measurement. While, is the product of and . Table III shows that our proposed method achieves an acceptable score compared to Random Forest even though no training stage is required. Moreover, it is able to cope with large variations in illumination and with complex backgrounds.
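The scores can be computed from pixel-level confusion counts as in the sketch below. It assumes that the four value columns of Table III are accuracy, F-score, precision and recall, and it uses the standard F-score (the harmonic mean of precision and recall), with which the reported values are numerically consistent. The function names are illustrative.

```python
def f_score(precision, recall):
    """Standard F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-score from pixel counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
    }
```

For example, `f_score(0.7661, 0.6984)` evaluates to about 0.7307 and `f_score(0.6403, 0.6580)` to about 0.6490, matching the second value column for Random Forest and for the proposed method.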
In our experiments, we noticed that the final result of our work depends greatly on the outcome of the pre-processing phase (the eye detector). If the algorithm detects a false face region, a poor result will be returned. Fig. 13 shows the result of skin segmentation when a false face region is detected. When a false face region is obtained, false dynamic thresholds are generated, and non-skin regions are consequently classified as skin regions. In our future work, we will investigate the face detector algorithm to overcome this problem.
V Concluding Remarks
In this paper, a fusion framework based on the smoothed 2D histogram and the Gaussian model has been proposed to automatically detect human skin in image(s). As shown in the experiments, the proposed method outperforms state-of-the-art methods in terms of accuracy under different conditions: background model, illumination and ethnicity. It therefore shows the potential to be applied to a range of applications such as gesture analysis. One drawback of the proposed approach is that its success relies on the eye detector algorithm. However, this is a general problem faced by all researchers who work in this domain. Our future work will focus on building a better pre-processing method and on using field programmable gate arrays to implement a hardware scheme.
References
-  P. Vadakkepat, P. Lim, L. De Silva, L. Jing, and L. L. Ling, “Multimodal approach to human-face detection and tracking,” Industrial Electronics, IEEE Transactions on, vol. 55, pp. 1385–1393, March 2008.
-  C. S. Chan, H. Liu, and D. J. Brown, “Recognition of human motion from qualitative normalised templates,” Journal of Intelligent and Robotic Systems, vol. 48, no. 1, pp. 79–95, 2007.
-  N. Kubota and K. Nishida, “Perceptual control based on prediction for natural communication of a partner robot,” IEEE Transactions on Industrial Electronics, vol. 54, pp. 866–877, April 2007.
-  O. Linda and M. Manic, “Fuzzy force-feedback augmentation for manual control of multi-robot system,” IEEE Transactions on Industrial Electronics, pp. 1–1, August 2010.
-  C. S. Chan, H. Liu, and D. J. Brown, “Human arm-motion classification using qualitative normalized templates,” Lecture Notes in Artificial Intelligence (LNAI), vol. 4251, no. Part I, pp. 639–646, 2006.
-  G. Pratl, D. Dietrich, G. P. Hancke, and W. T. Penzhorn, “A new model for autonomous, networked control systems,” Industrial Informatics, IEEE Transactions on, vol. 3, no. 1, pp. 21–32, 2007.
-  K. Sobottka and I. Pitas, “A novel method for automatic face segmentation, facial feature extraction and tracking,” Signal Processing: Image Communication, vol. 12, no. 3, pp. 263–281, 1998.
-  H. Bae, S. Kim, B. Wang, M. H. Lee, and F. Harashima, “Flame detection for the steam boiler using neural networks and image information in the ulsan steam power generation plant,” Industrial Electronics, IEEE Transactions on, vol. 53, no. 1, pp. 338–348, 2005.
-  Y. Wang and B. Yuan, “A novel approach for human face detection from color images under complex background,” Pattern Recognition, vol. 34, no. 10, pp. 1983 – 1992, 2001.
-  D. Brown, I. Craw, and J. Lewthwaite, “A som based approach to skin detection with application in real time systems,” in BMVC’01, 2001.
-  M.-J. Seow, D. Valaparla, and V. K. Asari, “Neural network based skin color model for face detection,” Applied Image Pattern Recognition Workshop, vol. 0, p. 141, 2003.
-  S. L. Phung, D. Chai, and A. Bouzerdoum, “A universal and robust human skin colour model using neural network,” in IJCNN ’01, vol. 4, 2001, pp. 2844–2849.
-  N. Sebe, I. Cohen, T. S. Huang, and T. Gevers, “Skin detection: A bayesian network approach,” in ICPR’04, 2004, pp. 903–906.
-  N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Mach. Learn., vol. 29, pp. 131–163, November 1997.
-  M. J. Jones and J. M. Rehg, “Statistical color models with application to skin detection,” Int. J. Comput. Vision (IJCV), vol. 46, no. 1, pp. 81–96, 2002.
-  R. Khan, A. Hanbury, and J. Stoettinger, “Skin detection: A random forest approach,” in ICIP, Hong Kong, 2010, pp. 4613–4616.
-  U. Yang, B. Kim, and K. Sohn, “Illumination invariant skin color segmentation,” in Industrial Electronics and Applications, 2009. ICIEA 2009. 4th IEEE Conference on, May 2009, pp. 636–641.
-  S. Jayaram, S. Schmugge, M. C. Shin, and L. V. Tsap, “Effect of colorspace transformation, the illuminance component, and color modeling on skin detection,” in CVPR’04, vol. 2, 2004, pp. 813–818.
-  P. Yogarajah, A. Cheddad, J. Condell, K. Curran, and P. McKevitt, “A dynamic threshold approach for skin segmentation in color images,” in ICIP’10, 2010, pp. 2225–2228.
-  P. Kakumanu, S. Makrogiannis, and N. Bourbakis, “A survey of skin-color modeling and detection methods,” Pattern Recognition, vol. 40, no. 3, pp. 1106–1122, 2007.
-  J. Berens and G. Finlayson, “Log-opponent chromaticity coding of colour space,” in ICPR’00, vol. 1, Barcelona, Spain, 2000, pp. 206–211.
-  E. Hering, Outlines of a Theory of the Light Sense. Harvard University Press, 1964.
-  L. M. Hurvich and D. Jameson, “An opponent-process theory of color vision,” Psychological Review, vol. 64, pp. 384–404, Nov 1957.
-  S. Mitra and T. Acharya, “Gesture recognition: A survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews (TSMC-C), vol. 37, no. 3, pp. 311 –324, May 2007.
-  A. M. Elgammal, C. Muang, and D. Hu, “Skin detection,” in Encyclopedia of Biometrics, 2009, pp. 1218–1224.
-  I. Fasel, B. Fortenberry, and J. Movellan, “A generative framework for real time object detection and classification,” Comput. Vis. Image Underst., vol. 98, pp. 182–210, April 2005.
-  C. Kumar and A. Bindu, “An efficient skin illumination compensation model for efficient face detection,” in IEEE Industrial Electronics, IECON 2006 - 32nd Annual Conference on, 2006, pp. 3444–3449.
-  D. A. Forsyth and M. M. Fleck, “Automatic detection of human nudes,” Int. J. Comput. Vision (IJCV), vol. 32, pp. 63–77, August 1999.
-  P. H. Eilers and J. J. Goeman, “Enhancing scatterplots with smoothed densities,” Bioinformatics, vol. 20, no. 5, pp. 623–628, 2004.
-  J. Stottinger, A. Hanbury, C. Liensberger, and R. Khan, “Skin paths for contextual flagging adult video,” in ISVC’09, 2009, pp. 903–906.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2009 (voc2009),” 2009.
-  A. Cheddad, J. Condell, K. Curran, and P. McKevitt, “A skin tone detection algorithm for an adaptive approach to steganography,” Journal of Signal Processing, 2009.