Interest in monitoring the health status of computer users has grown over the last decade, especially as laptops with embedded web cameras have become increasingly popular. Every day, employees, students, and IT workers spend most of their time sitting stiffly in front of computers and staring continuously at the screen. Under such circumstances, very few of them take planned, periodic mini-breaks. Consequently, a large number of computer users suffer from musculoskeletal Repetitive Stress Injuries (RSIs). To address this problem, an effective mechanism is needed to remind computer users before injury occurs. People generally feel tired after long periods of intensive work, and this tiredness acts as an early warning. Therefore, fatigue can serve as a good trigger for such reminders.
This work focuses on fatigue detection and recommendation making for computer users by means of a single web camera. To this end, a non-rigid face tracking method is first applied to the real-time camera video, yielding position data for the eye, mouth and head areas. Second, blink detection, yawn detection and 3D head pose analysis are performed on each frame to obtain three fatigue features. Finally, a fuzzy logic system fuses the three features and gives health recommendations to the user.
There is ongoing research on assistive robots to help elders and people with special needs. Instead of building a robot, the computer itself can serve as the robot for posture correction and improvement for people who suffer from repetitive stress injuries, since the injuries stem from using computers. The goal of our system is to reduce health risks by providing suggestions that improve users' posture and productivity in the short and long term. An additional advantage is the strong computational power of computers compared with the embedded systems in robots. Our work has the potential to turn the computer into a personalized caretaker for its user.
Our contributions are: 1) We propose a three-layer system to protect the user from potential health issues. 2) In the tracking layer, we improve the non-rigid face tracking algorithm by removing landmark jitter and reinitializing the tracking automatically. 3) In the feature layer, we propose a self-adaptive blink detection method. 4) In the recommendation layer, our inference framework uses fuzzy logic to combine expert rules with users' feedback, giving users dynamic and personalized recommendations.
The paper is organized as follows: Section 2 outlines recent work in vision-based face tracking and assistive systems. Our system and algorithms are presented in Sections 3, 4 and 5. Experimental evaluations are shown in Section 6. Finally, we conclude the paper and discuss future work in Section 7.
2 Related Work
The proposed system may serve as an intelligent assistant for computer users. This is realized by detecting the users’ fatigue status using a web camera and making health recommendations for them. The related work involves the estimations of 3D head pose, blink rate and yawn frequency respectively.
3D head pose estimation. Gaze direction is closely related to a user's attention. Therefore, accurate estimation of 3D head pose helps the system judge whether a user is in a correct position when using the computer and give recommendations accordingly. However, there are few generic solutions for identity-invariant head pose estimation. The inherent difficulties and the evolution of this field were surveyed by Murphy-Chutorian and Trivedi in 2009.
Blink rate estimation. To estimate the blink rate, blinks must first be identified. Global template matching is a common method for this. First, the eye regions are detected. A global template is then trained on the open-eye state and compared with the observed eye appearance, so that a blink can be identified. In 2002, Morris et al. proposed a real-time blink detection system in which variance maps were used to find eye-feature points. In 2005, Chau and Betke proposed a blink detection system using correlation with an online template. Yang et al. modeled the shape of the eyes using a pair of parameterized parabolic curves and fit the model globally to find potential eye regions. Besides template matching, other blink detection methods include the statistical algorithm by Pan et al., the optical-flow-based algorithm by Divjak and Bischof, and the facial-feature-based algorithm by Moriyama et al.
Yawn frequency estimation. Yawning is an important factor in judging fatigue when combined with 3D head pose and blink rate. Mohanty et al. proposed a non-rigid motion estimation algorithm for yawn detection, in which the degree of lip shape deformation was quantified. Du et al. proposed a yawn detection algorithm based on kernelized fuzzy rough sets. In the work of Omidyeganeh et al., a yawn was detected by comparing the aspect ratio of the extracted mouth area with an experimentally tuned threshold. The algorithm of Abtahi et al. focused on calculating changes in the geometric features of the mouth.
3 Tracking Layer
3.1 Hardware and Software Fundamentals
Web cameras are widely available on personal computers. Our system uses one to track facial expression features, and a non-rigid tracking algorithm is proposed based on the Active Appearance Model (AAM). An overview of our tracking system is shown in Fig. 1. Seventy-six landmarks are annotated manually on each image, from which a linear shape model, a correlation patch model and a face detection model are trained. By combining these three models, the tracking model is obtained and applied to track faces.
To improve generalization and tracking accuracy, 385 face images covering different age, expression and ethnicity groups are used for training. Fifty-eight of them were captured in our lab, and the remaining 327 images were selected from the Biwi 3D Audiovisual Corpus of Affective Communication database from the Swiss Federal Institute of Technology Zurich (ETHZ). As shown in Fig. 2, 76 facial landmarks are located for tracking, of which 15 lie on the chin, 6 on each eyebrow, 9 on each eye, 12 on the nose and 19 on the mouth.
3.2 Shape and Patch Models
As shown in Fig. 1, shape models use a linear representation of the facial geometry to describe how landmarks vary across different people and expressions. The goal of linear modeling is to find a low-dimensional subspace within the $2n$-dimensional space of landmark coordinates ($n$ represents the total number of landmarks). Principal Component Analysis (PCA) is applied to find the best subspace. As for the correlation-based patch models, we generate from the annotated dataset a group of image patches that produce strong responses at the exact locations of the landmarks. Cross-correlation with the patch models is then used to estimate the feature locations and correct the facial shape model.
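As an illustration of the linear shape model, the following is a minimal sketch, assuming that each annotated shape is flattened into one row of (x, y) coordinates and that PCA is computed with an SVD of the mean-centered data. The function names `fit_shape_model` and `project_shape`, the number of retained components, and the synthetic shapes are hypothetical and serve only to show the idea.

```python
import numpy as np

def fit_shape_model(shapes, k):
    """Fit a k-dimensional linear shape model with PCA.

    shapes: array of shape (num_images, 2 * n); each row holds the
            flattened (x, y) coordinates of n landmarks.
    Returns the mean shape and a (2 * n, k) orthonormal basis.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T

def project_shape(shape, mean, basis):
    """Project a shape onto the subspace and reconstruct it."""
    params = basis.T @ (shape - mean)
    return mean + basis @ params

# Synthetic data (assumption): 50 shapes of 76 landmarks that vary
# along exactly two directions around a base shape.
rng = np.random.default_rng(0)
n_landmarks = 76
base = rng.normal(size=2 * n_landmarks)
modes = rng.normal(size=(2, 2 * n_landmarks))
coeffs = rng.normal(size=(50, 2))
shapes = base + coeffs @ modes

mean, basis = fit_shape_model(shapes, k=2)
recon = project_shape(shapes[0], mean, basis)
max_error = float(np.abs(recon - shapes[0]).max())
```

Because the synthetic shapes vary along only two directions, two components reconstruct them up to floating-point error; with real annotated faces the number of components would be chosen from the retained variance.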
The tracking procedure suffers from landmark jitter, which makes it difficult to accurately detect facial movements such as blinks and yawns. Occlusion is another problem: once part of the face is occluded, the tracker cannot reinitialize automatically.
3.3 Jittering Removal
Landmark jitter is very common during tracking, so its pattern can be learned by means of Maximum Likelihood Estimation (MLE) and Bayesian Minimum Error Estimation (BMEE). To eliminate the ambiguity between facial movement and jitter, the face should keep still for about 0.5 seconds during the learning period in our system.
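Let $J_i$ be the event that the $i$th landmark jitters, and let $d_i$ represent the offset of the $i$th landmark between consecutive frames ($i = 1, \dots, n$), where $n$ is the total number of landmarks. Suppose that landmark jitter follows a Gaussian distribution, that is, $p(d_i \mid J_i) \sim \mathcal{N}(\mu_i, \sigma_i^2)$. We have
$$\hat{\mu}_i = \frac{1}{M} \sum_{k=1}^{M} d_i^{(k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{M} \sum_{k=1}^{M} \left( d_i^{(k)} - \hat{\mu}_i \right)^2,$$
where $M$ stands for the total number of training frames and $d_i^{(k)}$ is the offset observed in the $k$th training frame.
According to Bayesian Minimum Error Estimation, instead of comparing the values of the posterior probability densities $p(J_i \mid d_i)$ and $p(\bar{J}_i \mid d_i)$, we can compare the values of $p(d_i \mid J_i) P(J_i)$ and $p(d_i \mid \bar{J}_i) P(\bar{J}_i)$, which is equivalent to comparing the values of $p(d_i \mid J_i)$ with $p(d_i \mid \bar{J}_i) P(\bar{J}_i) / P(J_i)$. Given that $P(J_i)$ and $P(\bar{J}_i)$ are relatively fixed, and it is difficult to get $p(d_i \mid \bar{J}_i)$, we can replace $p(d_i \mid \bar{J}_i) P(\bar{J}_i) / P(J_i)$ with a fixed threshold $\tau$. Therefore, the classifier can be defined by the following rules:
R1: IF $p(d_i \mid J_i) < \tau$, THEN the offset is caused by facial movement.
R2: IF $p(d_i \mid J_i) \geq \tau$, THEN the $i$th tracking point jitters.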
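The two rules can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the offset of a landmark is treated as a scalar magnitude in pixels, the Gaussian parameters are fitted by MLE on offsets recorded while the face is still, and the threshold value, the sample offsets, and the names `fit_jitter_model` and `is_jitter` are hypothetical.

```python
import math

def fit_jitter_model(offsets):
    """MLE of a 1-D Gaussian from offsets recorded while the face is still."""
    m = len(offsets)
    mu = sum(offsets) / m
    # The MLE of the variance divides by m (not m - 1).
    var = sum((d - mu) ** 2 for d in offsets) / m
    return mu, var

def gaussian_pdf(d, mu, var):
    """Density of N(mu, var) evaluated at d."""
    return math.exp(-((d - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def is_jitter(d, mu, var, threshold):
    """True if the offset is likely under the jitter model.

    A likelihood at or above the threshold is attributed to jitter;
    a likelihood below it is attributed to facial movement.
    """
    return gaussian_pdf(d, mu, var) >= threshold

# Hypothetical offsets (pixels) of one landmark during the still period.
still_offsets = [0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.4, 0.6]
mu, var = fit_jitter_model(still_offsets)
threshold = 0.05  # assumed value, to be tuned per system

small_offset_is_jitter = is_jitter(0.55, mu, var, threshold)
large_offset_is_jitter = is_jitter(6.0, mu, var, threshold)
```

An offset classified as jitter can then be suppressed (for example by keeping the previous landmark position), while an offset classified as facial movement is passed on to the feature layer.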
3.4 Automatic Reinitiating of Face Tracking
Fig. 3(a) shows that once the tracker fails in detection, tracking cannot be reinitiated automatically, which causes landmark drift. To solve this problem, a Support Vector Machine (SVM) is used to find the best moment for reinitiating. Two factors reflect the tracking accuracy: the positions of the 76 landmarks, and the response values obtained from template matching. Consequently, the response value and the relative ($x$, $y$) coordinates of each tracking point together compose a 228-dimensional feature vector ($76 \times 3$), in which the relative coordinates are calculated by subtracting the spatial coordinates of the tracking point from the coordinates of the landmarks' mass center. As shown in Fig. 3(b), when the tracker loses the face, cascade face detection is restarted and runs until the face is tracked successfully again. Moreover, model training and automatic reinitiating can be performed simultaneously in real time. The performance is reported in the evaluation section.
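The construction of the 228-dimensional feature vector can be illustrated as follows. This is a minimal sketch, assuming the vector interleaves one response value and two relative coordinates per landmark; the ordering of the entries, the function name `build_feature_vector`, and the sample data are hypothetical, and the SVM training itself is not shown.

```python
def build_feature_vector(points, responses):
    """Concatenate (response, dx, dy) for every landmark.

    points:    list of (x, y) landmark positions in image coordinates.
    responses: list of template-matching response values, one per landmark.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    features = []
    for (x, y), r in zip(points, responses):
        # Relative coordinates: mass center minus landmark position.
        features.extend([r, cx - x, cy - y])
    return features

# Hypothetical landmark positions and responses for 76 tracking points.
points = [(float(i), float(2 * i)) for i in range(76)]
responses = [0.9] * 76
vec = build_feature_vector(points, responses)
```

Using coordinates relative to the mass center makes the vector independent of where the face sits in the frame, so the classifier reacts to the landmark configuration and the matching responses rather than to the face position.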
4 Feature Layer
4.1 3D Head Pose Classification
The pin-hole camera model is applied in our system. Ten facial landmarks (No. 38-44, 46, 47, 67) are selected to estimate the 3D head pose, because their movement is mostly rigid. Taking the 67th landmark as the origin in 3D space, the relative coordinates of the other landmarks are obtained. Then the EPnP algorithm is used to estimate the camera rotation and translation parameters. The correspondence between 3D space and the image plane is defined by
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix},$$
where $s$ is a fixed coefficient, $K$ is the camera intrinsic parameter matrix, which can be computed through calibration, and $[R \mid t]$ is the camera extrinsic parameter matrix, with $R$ the rotation matrix and $t$ the translation vector. $(u, v)$ represents the coordinates in the image plane, and $(X, Y, Z)$ represents the coordinates in 3D space. The 3D head pose is indicated by the extrinsic parameter matrix.
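The pin-hole relationship can be checked numerically with a short sketch. The intrinsic values, the pose, and the two 3D points below are assumptions chosen for illustration, and `project_points` is a hypothetical helper that only evaluates the forward projection; it does not estimate the pose.

```python
import numpy as np

def project_points(K, R, t, points_3d):
    """Project 3D points of shape (N, 3) to pixel coordinates of shape (N, 2)."""
    extrinsic = np.hstack([R, t.reshape(3, 1)])                 # 3x4 matrix [R | t]
    ones = np.ones((len(points_3d), 1))
    homogeneous = np.hstack([points_3d, ones])                  # (N, 4)
    projected = (K @ extrinsic @ homogeneous.T).T               # rows are s * (u, v, 1)
    return projected[:, :2] / projected[:, 2:3]                 # divide by the scale s

# Assumed intrinsics: focal length 600 px, principal point (320, 240).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # head facing the camera
t = np.array([0.0, 0.0, 500.0])      # 500 units in front of the camera
landmarks_3d = np.array([[0.0, 0.0, 0.0],
                         [50.0, -25.0, 0.0]])
pixels = project_points(K, R, t, landmarks_3d)
```

Pose estimation is the inverse problem: given the 3D landmark coordinates and their tracked image positions, find the rotation and translation. If OpenCV is used, `cv2.solvePnP` with the `cv2.SOLVEPNP_EPNP` flag is one available solver for this step.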
It has been found in our experiment that the intrinsic parameters of most web cameras are approximately the same. Therefore, the camera intrinsic parameters are set to fixed values in our system instead of repeated calibration.
An SVM classifier is applied to determine whether the user is in a correct pose while using the computer. The following classes are defined:
Pose1: The user is not looking at or not in front of the computer.
Pose2: The user is in a correct pose.
Pose3: The user is too close to the screen.
Pose4: The user's head is tilted to the left.
Pose5: The user's head is tilted to the right.
The 6-dimensional feature vector contains the rotation and translation parameters. The classification results obtained in this layer are used in the recommendation layer to make recommendations for the users.
4.2 Self-Adaptive Blink Detection
A self-adaptive algorithm is designed to detect blinks more reliably under different conditions, as shown in Algorithm 1. The idea comes from the intuition that the eye-closed state occupies only a small proportion of the working time, and that the color of the eyeball patch differs markedly between the closed and open states. Therefore, the average color $c$ of the eyeball patch can be obtained and its changes monitored in real time. In addition, $c$ is normalized as $\hat{c} = (c - E(c)) / \sqrt{D(c)}$ to minimize environmental disturbances during tracking, where $E(c)$ and $D(c)$ represent the expectation and variance of $c$. In this way, a fixed threshold can be applied to $\hat{c}$ to predict whether the eyes are open or closed. If the system finds that the user's eye state has switched from open to closed, a blink is detected.
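The idea of normalizing the eyeball patch color before thresholding can be sketched as follows. This is not Algorithm 1 itself but a minimal sketch under assumptions: the expectation and variance are tracked as running statistics (Welford's update), the normalized value is compared in absolute value with a fixed threshold, and the class name `BlinkDetector`, the threshold of 2.0, the warm-up length, and the sample frame values are hypothetical.

```python
import math

class BlinkDetector:
    """Detect blinks from the average color of the eyeball patch."""

    def __init__(self, threshold=2.0, warmup=10):
        self.threshold = threshold  # fixed threshold on the normalized color
        self.warmup = warmup        # frames needed before statistics are trusted
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations
        self.closed = False
        self.blinks = 0

    def update(self, color):
        """Feed one frame; return True when an open-to-closed switch occurs."""
        # Welford's online update of mean and variance.
        self.count += 1
        delta = color - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (color - self.mean)
        if self.count < self.warmup:
            return False
        std = math.sqrt(self.m2 / self.count)
        if std == 0.0:
            return False
        normalized = (color - self.mean) / std
        closed_now = abs(normalized) > self.threshold
        blink = closed_now and not self.closed
        self.closed = closed_now
        if blink:
            self.blinks += 1
        return blink

# Hypothetical sequence: open eye around 50, two short closures at 120.
frames = ([49.0, 51.0] * 15 + [120.0] * 2 +
          [49.0, 51.0] * 5 + [120.0] * 2 + [49.0, 51.0] * 5)
detector = BlinkDetector()
events = [detector.update(c) for c in frames]
```

Because the statistics follow the recent appearance of the eye, the same threshold can be kept when illumination or the user changes, which is the motivation for the normalization described above.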
4.3 Yawn Detection
An SVM classifier is designed to determine whether the mouth is open or closed. The SVM feature vector consists of the coordinates of the mouth landmarks (No. 48-66). If the mouth stays open for longer than a preset time threshold, a yawn is detected.
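The duration rule can be sketched independently of the classifier. The following minimal sketch assumes that a per-frame open/closed decision is already available (for example from the SVM), and that the frame rate and the duration threshold are known; the function name `count_yawns` and the numeric values are hypothetical.

```python
def count_yawns(mouth_open, fps, min_duration):
    """Count yawns in a per-frame sequence of mouth states.

    mouth_open:   iterable of booleans, one per frame.
    fps:          frame rate of the video in frames per second.
    min_duration: seconds the mouth must stay open to count as a yawn.
    """
    min_frames = max(1, int(round(min_duration * fps)))
    yawns = 0
    run = 0  # length of the current run of open-mouth frames
    for is_open in mouth_open:
        if is_open:
            run += 1
            if run == min_frames:  # count each long opening exactly once
                yawns += 1
        else:
            run = 0
    return yawns

# Hypothetical sequence at 10 fps: a short opening (speech) and a long one (yawn).
states = [False] * 5 + [True] * 8 + [False] * 5 + [True] * 30 + [False] * 5
num_yawns = count_yawns(states, fps=10, min_duration=2.0)
```

The time threshold separates yawns from the short mouth openings that occur during speech, so only the 30-frame opening is counted in this example.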
5 Recommendation Layer
5.1 Recommendation Framework
After the features are gathered from the above layer, health recommendations can be made for the users in front of web cameras. Firstly, several rules are defined, which may be obtained from ergonomics experts or doctors. A few examples are given as follows:
$R_1$: IF the user works for more than 30 minutes, THEN take a break.
$R_2$: IF the user stays in a bad pose for more than 10 minutes, THEN raise the alarm.
$R_3$: IF the user yawns more than 5 times in a 10-minute period, THEN take a break.
Each rule consists of a set of premises and a consequence, which is a recommendation generated by the system. The fuzzy logic system is used to formulate the recommendation logic in the following way:
$$f(x \mid \theta) = \frac{\sum_{i=1}^{N} b_i \prod_{j=1}^{m} \mu_{ij}}{\sum_{i=1}^{N} \prod_{j=1}^{m} \mu_{ij}},$$
where $N$ is the number of rules and $m$ is the number of inputs in each rule. We assume that each premise has a score, which is given by the confidence from the feature layer. $f(x \mid \theta)$ is the predicted output given an input data-tuple $x$. $\mu_{ij}$ is the confidence score associated with the $j$th premise in the rule $i$. $b_i$ is the weight associated with the rule $i$. All the weights need to be learned before the recommendation system works. $y$ represents the recommendation; for instance, one value of $y$ means taking a break and another means continuing to work. The weights can be obtained by means of batch least squares estimation.
For a rule with $m$ premises, we define:
$$\xi_i(x) = \frac{\prod_{j=1}^{m} \mu_{ij}}{\sum_{k=1}^{N} \prod_{j=1}^{m} \mu_{kj}}.$$
Therefore, $f(x \mid \theta) = \theta^{\top} \xi(x)$, where $\theta = [b_1, \dots, b_N]^{\top}$ and $\xi(x) = [\xi_1(x), \dots, \xi_N(x)]^{\top}$. The unknown $\theta$ can be easily resolved using least squares estimation. In order to train the recommendation system, $N$ rules and at least $N$ confidence scores and corresponding actions are needed.
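The batch least squares step can be illustrated with a short sketch. It assumes that the premises of a rule are combined by a product of their confidence scores and that the rule activations are normalized to sum to one, which is a common fuzzy-system formulation but an assumption here; the function names, the three single-premise rules, and all numeric values are hypothetical.

```python
import numpy as np

def rule_basis(confidences):
    """Normalized rule activations for one data-tuple.

    confidences: array of shape (num_rules, num_premises) holding the
                 confidence score of every premise, each in [0, 1].
    """
    activation = confidences.prod(axis=1)  # product over the premises of a rule
    return activation / activation.sum()   # assumes at least one rule is active

def fit_rule_weights(samples, targets):
    """Batch least squares estimate of the rule weights."""
    phi = np.vstack([rule_basis(c) for c in samples])   # (num_samples, num_rules)
    theta, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return theta

def recommend(confidences, theta):
    """Predicted recommendation value for one data-tuple."""
    return float(rule_basis(confidences) @ theta)

# Hypothetical training set: three rules with one premise each, four samples.
samples = [np.array([[0.9], [0.1], [0.2]]),
           np.array([[0.2], [0.8], [0.1]]),
           np.array([[0.1], [0.3], [0.9]]),
           np.array([[0.5], [0.5], [0.5]])]
theta_true = np.array([1.0, 0.8, 0.2])  # weights used to generate the targets
targets = np.array([rule_basis(c) @ theta_true for c in samples])

theta = fit_rule_weights(samples, targets)
score = recommend(samples[3], theta)
```

With noise-free targets and at least as many independent samples as rules, the generating weights are recovered; with real feedback the estimate is the least squares fit.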
5.2 Dynamic Adaptation
Users may provide explicit or implicit feedback to the system. Explicit feedback is gathered from users' actions, such as clicking a dislike button. Implicit feedback is learned from visual cues, such as the user continuing to work after the system suggests taking a break. In our system, only explicit feedback is considered. The weight vector is updated by means of Equation 6, which combines the current weights with a weight vector calculated by solving the normal equation on the users' feedback data. The contribution of the feedback is controlled by the adaptation rate, so that the system gradually learns to adapt to the users' preferences.
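One common way to realize such a gradual update is a convex combination of the current weights and the weights fitted on the feedback data. The sketch below uses this form as an assumption; it is not necessarily the exact update of Equation 6, and the function name `adapt_weights`, the rate of 0.5, and the weight values are hypothetical.

```python
def adapt_weights(theta, theta_feedback, rate):
    """Blend the current rule weights with weights fitted on feedback data.

    rate = 0 keeps the current weights; rate = 1 adopts the feedback
    weights immediately.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return [(1.0 - rate) * w + rate * f for w, f in zip(theta, theta_feedback)]

# Hypothetical weights: the user repeatedly dislikes the first rule's advice.
theta = [1.0, 0.8, 0.2]
theta_feedback = [0.0, 0.8, 0.6]
for _ in range(3):
    theta = adapt_weights(theta, theta_feedback, rate=0.5)
```

After three updates the first weight has moved from 1.0 to 0.125, which shows how a small number of consistent feedback events shifts the recommendation without discarding the expert rules at once.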
Experiments are performed on 5 randomly picked volunteers, including 4 males and 1 female. They were asked to do regular computer work in front of their PCs for at least 10 minutes under different conditions. camera resolution is tested, and the volunteers’ faces may be occluded by glasses or hands. Since many people work with computers in dim environments, three of the tests are performed under poorly illuminated conditions, which may blur facial features such as the eyes. The accuracy of blink detection and yawn detection is evaluated quantitatively, and all the data are integrated to provide suggestions in our system.
For blink and yawn detection, we manually calculate the hit rate and false detection rate of the system on all volunteers with the blink threshold and mouth open time threshold seconds for best performance. Ninety-five blinks and twenty-five yawns are taken into consideration, with approximately 20 blinks and 5 yawns from each volunteer. The hit rates are and for blinks and yawns respectively, and the false detection rates are and respectively.
To provide suggestions, the working time is divided into 10-minute periods, and the user's status in each period is determined from the blink count, yawn count and 3D head pose. Figure 6 shows the outcome of our system when tracking a volunteer working for more than 6 hours in front of the computer. Comparing the outcome with the real conditions shows that the user's absences from the computer in the 10:18-10:28, 11:31-11:42 and 13:06-13:27 periods are due to mini-breaks, while the absences in the 12:13-12:55, 15:02-15:12 and 15:23-15:33 periods are due to lunch, paper work and group discussion respectively. Moreover, the right subfigure shows that the user stays in a bad pose most of the time and works continuously for more than 30 minutes, which triggers the system's alarm. It also shows that the user is in a potential fatigue condition during 14:30-15:00 due to an increase in yawn rate. This is because the user usually takes a nap at noon but did not do so that day because of the experiment, which made him feel tired in the afternoon.
In this work, a system is presented which combines non-rigid face tracking with feature analysis to determine the working status of computer users. Jitter removal and automatic reinitiating methods are designed to improve the performance of traditional face tracking algorithms, and statistical learning methods are applied in the feature analysis. By combining blink detection, yawn detection and 3D head pose analysis, the working status of computer users can be predicted and recommendations can be made. Future work will focus on developing a user-based model to generalize the performance of the system, which will improve the accuracy of tracking and feature detection.
-  Arthur Saltzman, “Computer user perception of the effectiveness of exercise mini-breaks,” in Proceedings of the Silicon Valley Ergonomics Conference and Exposition, 1998, pp. 147–151.
-  Yang Wang, Simon Lucey, J Cohn, and Jason Saragih, “Non-rigid face tracking with local appearance consistency constraint,” in IEEE international conference on automatic face and gesture recognition (FG 08), 2008.
-  J. Hoey, P. Poupart, C. Boutilier, and A. Mihailidis, “POMDP models for assistive technology,” Tech. Rep., Proceedings of the AAAI Fall Symposium on Caring Machines, 2005.
-  Yoshio Matsumoto and Alexander Zelinsky, “An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement,” in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on. IEEE, 2000, pp. 499–504.
-  Erik Murphy-Chutorian and Mohan M Trivedi, “Head pose estimation in computer vision: A survey,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 4, pp. 607–626, 2009.
-  Kristen Grauman, Margrit Betke, James Gips, and Gary R Bradski, “Communication via eye blinks-detection and duration analysis in real time,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. IEEE, 2001, vol. 1, pp. I–1010.
-  T Morris, Paul Blenkhorn, and Farhan Zaidi, “Blink detection for real-time eye tracking,” Journal of Network and Computer Applications, vol. 25, no. 2, pp. 129–143, 2002.
-  Michael Chau and Margrit Betke, “Real time eye tracking and blink detection with usb cameras,” Tech. Rep., Boston University Computer Science Department, 2005.
-  Fei Yang, Xiang Yu, Junzhou Huang, Peng Yang, and Dimitris Metaxas, “Robust eyelid tracking for fatigue detection,” in Image Processing (ICIP), 2012 19th IEEE International Conference on. IEEE, 2012, pp. 1829–1832.
-  Gang Pan, Lin Sun, Zhaohui Wu, and Shihong Lao, “Eyeblink-based anti-spoofing in face recognition from a generic webcamera,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
-  Matjaz Divjak and Horst Bischof, “Eye blink based fatigue detection for prevention of computer vision syndrome,” in MVA, 2009, pp. 350–353.
-  Tsuyoshi Moriyama, Takeo Kanade, Jeffrey F Cohn, Jing Xiao, Zara Ambadar, Jiang Gao, and Hiroki Imamura, “Automatic recognition of eye blinking in spontaneously occurring behavior,” in Pattern Recognition, 2002. Proceedings. 16th International Conference on. IEEE, 2002, vol. 4, pp. 78–81.
-  Mihir Mohanty, Aurobinda Mishra, and Aurobinda Routray, “A non-rigid motion estimation algorithm for yawn detection in human drivers,” International Journal of Computational Vision and Robotics, vol. 1, no. 1, pp. 89–109, 2009.
-  Yong Du, Qinghua Hu, Degang Chen, and Peijun Ma, “Kernelized fuzzy rough sets based yawn detection for driver fatigue monitoring,” Fundamenta Informaticae, vol. 111, no. 1, pp. 65–79, 2011.
-  Mona Omidyeganeh, Abbas Javadtalab, and Shervin Shirmohammadi, “Intelligent driver drowsiness detection through fusion of yawning and eye closure,” in Virtual Environments Human-Computer Interfaces and Measurement Systems (VECIMS), 2011 IEEE International Conference on. IEEE, 2011, pp. 1–6.
-  Shabnam Abtahi, Behnoosh Hariri, and Shervin Shirmohammadi, “Driver drowsiness monitoring based on yawning detection,” in Instrumentation and Measurement Technology Conference (I2MTC), 2011 IEEE. IEEE, 2011, pp. 1–4.
-  Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor, “Active appearance models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, no. 6, pp. 681–685, 2001.
-  Daniel Lélis Baggio, Shervin Emami, David Millán Escrivá, Khvedchenia Ievgen, Naureen Mahmood, Jason Saragih, and Roy Shilkrot, Mastering OpenCV with Practical Computer Vision Projects, Packt Pub., 2012.
-  Gabriele Fanelli, Juergen Gall, Harald Romsdorfer, Thibaut Weise, and Luc Van Gool, “A 3-d audio-visual corpus of affective communication,” Multimedia, IEEE Transactions on, vol. 12, no. 6, pp. 591–598, 2010.
-  Paul Viola and Michael Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. IEEE, 2001, vol. 1, pp. I–511.
-  Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua, “Epnp: An accurate o(n) solution to the pnp problem,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.