Report: Dynamic Eye Movement Matching and Visualization Tool in Neuro Gesture

12/27/2017 ∙ by Qiangeng Xu, et al. ∙ Columbia University

In research on the impact of gestures used by a lecturer, one challenging task is to infer the attention of a group of audience members. Two important measurements that can help infer the level of attention are eye movement data and electroencephalography (EEG) data. Under the fundamental assumption that a group of people look at the same place if they all pay attention at the same time, we apply a method, Time Warp Edit Distance (TWED), to calculate the similarity of their eye movement trajectories. We also cluster the eye movement patterns of audience members based on these pairwise similarity metrics. Since we do not have a direct metric for the attention ground truth, a visual assessment is useful for evaluating the gesture-attention relationship, so we also implement a visualization tool.




1 Introduction

It is well established that a person's gaze position reveals the visual stimuli that the person receives. The saliency of an item, defined as the state or quality by which it stands out relative to its neighbors, is considered a key attention mechanism that facilitates many learning tasks [1, 2]. Several recent neuroscience studies have identified specific tissues and neurons that support this mechanism. For example, Snow et al.[3] studied the pulvinar nuclei, which modulate physical/perceptual salience in attentional selection. In addition, Baliki et al.[4] discovered that D1-type and D2-type medium spiny neurons within the NAcc shell assign aversive motivational salience to aversive stimuli.

Several previous studies concentrated on the relation between viewers' attention and visual stimuli, such as Loschky et al.[5], Dewhurst et al.[6] and Valuch et al.[7]. They used eye-tracking data to discover similar eye trajectories across viewers. More interestingly, the study by Burleson-Lesser et al.[8] revealed that video material also synchronizes judgments and that the similarity of eye movement could help predict viewers' preferences.

Thus, under the assumption that higher similarity of eye movement indicates higher attention among a group of audience members, we use eye-tracking data to measure the similarity of eye movement and cluster the audience into subgroups based on these similarities. We believe this effort helps us find the more "attractive moments" in a lecture video that draw higher attention from the audience, indicated by a more "homogeneous" eye movement clustering result. We also expect such clustering to help us find outliers among the audience (e.g. people who do not pay attention to anything).

Another part of our work is a redesign of the visualization tool previously implemented in Visual Studio. To cope with the diverse working environments of our team members, we compared a few popular candidate frameworks. We selected Qt and added an EEG graph to the visualization tool. The contributions of our work to the Neuro Gesture project are twofold:

We apply the Time Warp Edit Distance method to calculate the pairwise similarity between every pair of eye movement trajectories. We then use the similarity matrix to cluster the audience's eye movement patterns with the relevant-communities clustering method introduced by Le Martelot et al.[9].

We implement a visualization tool. The main functionalities include: 1. eye fixations and trails superimposed on the video and 2. a 64-channel EEG graph in sync with the video playback position. After considering several frameworks, we selected Qt for its cross-platform compatibility.

The rest of the report is organized as follows. We introduce the similarity matrix and clustering method in Section 2 and the visualization tool in Section 3. In Section 2, related work and model selection are discussed in 2.1, data preprocessing in 2.2.1, the detailed similarity matrix algorithm in 2.2.2 and parameter tuning in 2.2.3. The clustering algorithm is presented in 2.2.4, and the study of the correlation between the similarity matrix and question answer correctness in 2.2.5. In Section 3, we introduce the visualization tool's design in 3.1 and illustrate the display in 3.2. Finally, we summarize our work and future challenges in Section 4.

2 Eye Movement Similarity Matching and Clustering

Eye-tracking data is essentially multi-dimensional time-series data for multiple objects. Here we discuss several time-series matching methods and their strengths and weaknesses when applied to our task.

2.1 Related Works and Model Selection

2.1.1 Inter-subject correlation of eye movement

Inter-subject correlation of eye movement is an emerging topic that leads toward a better understanding of correlated activity between individuals under stimuli. Some previous studies, such as Burleson-Lesser et al.[8], employed tools from statistical mechanics, which are very powerful for explaining emergent properties of the local interactions between subjects. They demonstrated that the audience's eye movements are distributed under a balance between randomness and alignment. They modeled the distribution of gaze directions as being as random as possible while still reproducing the observed mean and pairwise correlations. The model is also studied by Mora et al.[10]:

Adopting the same measurement used in statistical mechanics studies such as Cavagna et al.[11], Bialek et al.[12] and Tkačik et al.[13], they computed the correlation of eye movement from the movement directions of every pair of viewers:

However, in our experimental setting each viewer watches the lecture video in an individual room, so the homophily of eye movement can only be explained by the video content and the attention of an individual at a given moment. The "interaction" between audience members has no physical meaning and serves only as an analogy. More importantly, for the purpose of inferring the audience's attention to visual stimuli, we choose to rely on fixations instead of eye movement directions, because we found many cases in which viewers have different eye movement directions while actually being attracted by the same subject. For example, when the camera shifts and the scene switches, viewers' gaze positions are scattered at first and then concentrate on the object the camera switches to. The movement directions during this period are very diverse, which hides the fact that the viewers actually have similar "attention". In addition, the method used in [8] provides little tolerance for mismatch in the temporal dimension. From our observation, individual viewers have different response times to visual stimuli, and medical studies such as Jain et al.[14] have reported such variation between individuals. Thus we explore time series matching methods that use fixation data.

2.1.2 Time series similarity measures

Time series similarity measurement is a central task in analyzing, predicting or classifying information that unfolds over time. Given time series A and B, a similarity function that compares each point of A with the corresponding point of B having the same timestamp belongs to the family of lock-step measures. Lock-step measures include Euclidean distance matching introduced by Faloutsos et al.[15] along with its variants using Lp-norms (including the Manhattan norm, the maximum norm, etc.) introduced by Yi et al.[16]. The advantage of these methods is their simplicity. The matching map is shown in Figure 1(a) (Figure 1 is adapted from [17]). However, in the case of eye-tracking data, because of the timing inaccuracy in eye fixation measurement and the different response times among individuals, these methods give a large distance between a viewer with a faster response and a viewer who is relatively slower, even if both trajectories follow the same pattern.

Another rather novel family of approaches is the threshold-based measures. For example, the TQuEST distance introduced by Aßfalg et al.[18] uses a threshold parameter to transform a time series into threshold-crossing time intervals. Each time interval is treated as a point in a two-dimensional space, and the Minkowski sum of the two sequences serves as the similarity. The matching map is shown in Figure 1(b). In general, this approach is better suited to data with similar amplitude values (the fixations of the gaze point) but different temporal values. However, the eye-tracking data we collect shows diverse patterns in both temporal and amplitude values.

Other families of approaches have also been well studied. The feature-based measures, such as Fourier coefficient matching introduced by Oppenheim et al.[19], use discrete Fourier transforms of the raw time series to filter out high-frequency coefficients, which makes them an efficient way to remove rapidly fluctuating signal components. Yet in our case they also cannot adapt to the differences among individual eye responses.

The family of pattern-based measures, including SpADe introduced by Chen et al.[1], matches segments within entire time series by adjusting (shifting or scaling) values in both the temporal and amplitude dimensions. Although this method concentrates on matching patterns between time series, in our case shifting the amplitude value (e.g. the fixation) would be problematic: two audience members who are both attracted by the lecture generally pay attention to the same object (e.g. the lecturer or a picture in the slides) and thus have similar fixations.

The most effective family of matching methods for our data is probably the elastic measures. Methods in this family generally allow one-to-many and one-to-none matching between points. In our case, some fixation data points are missing, and some people's attention stays on an object longer or moves more slowly than others'. The most popular method in this family is dynamic time warping (DTW), introduced by Berndt et al.[20]. The method aligns the time series in the temporal domain so as to minimize the cost of the matching distance after alignment. The problem can be solved recursively with dynamic programming.
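To make the dynamic-programming idea concrete, the following is a minimal DTW sketch in Python. It is an illustration only, not code from the project repository; the function name `dtw_distance` and the representation of a trail as a list of (x, y) gaze points are assumptions.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two trails given as lists of (x, y) points."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            # A point may repeat in either series (one-to-many) or both advance.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# A trail and a copy that lingers one extra frame on the first fixation.
fast = [(0, 0), (1, 0), (2, 0)]
slow = [(0, 0), (0, 0), (1, 0), (2, 0)]
print(dtw_distance(fast, slow))
```

Because the repeated point can be matched for free, the lagged copy gets zero cost, which illustrates both the tolerance to response delay and the fact that plain DTW does not charge for the time lag itself.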

According to the study of Wang et al.[17], very few measures have been proposed that systematically outperform DTW across a number of different data sources. The matching map is shown in Figure 1(c). However, DTW entirely ignores the time lag between eye trails, which may be unwanted behavior. Instead, we study edit-distance-based time warping methods such as Edit Distance on Real sequence (EDR) introduced in [21], Edit Distance with Real Penalty (ERP) introduced in [22] and Time Warp Edit Distance (TWED) introduced by Marteau et al.[23]. Among these extensions, TWED is the most suitable for eye-tracking data since it incorporates both time stamp differences and editable series matching. TWED includes a deletion penalty and a stiffness parameter to handle three different conditions at each step. The whole objective can be summarized as follows:
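For reference, the TWED recursion can be sketched in the notation of Marteau [23]. The symbols below are assumptions taken from that paper rather than from this report: $A_1^{p}$ is the prefix of series $A$ ending at sample $a_p$ with time stamp $t_{a_p}$, $d$ is the point-wise distance, $\lambda$ is the deletion penalty and $\nu$ is the stiffness.

```latex
\delta_{\lambda,\nu}(A_1^{p}, B_1^{q}) = \min
\begin{cases}
\delta_{\lambda,\nu}(A_1^{p-1}, B_1^{q}) + d(a_p, a_{p-1}) + \nu\,(t_{a_p} - t_{a_{p-1}}) + \lambda
  & \text{(delete in } A\text{)} \\[4pt]
\delta_{\lambda,\nu}(A_1^{p-1}, B_1^{q-1}) + d(a_p, b_q) + d(a_{p-1}, b_{q-1})
  + \nu\,\bigl(|t_{a_p} - t_{b_q}| + |t_{a_{p-1}} - t_{b_{q-1}}|\bigr)
  & \text{(match)} \\[4pt]
\delta_{\lambda,\nu}(A_1^{p}, B_1^{q-1}) + d(b_q, b_{q-1}) + \nu\,(t_{b_q} - t_{b_{q-1}}) + \lambda
  & \text{(delete in } B\text{)}
\end{cases}
```

The three branches correspond to the three conditions mentioned above: advancing only in A, advancing in both series, or advancing only in B.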

The matching map of TWED is shown in Figure 1(d). From the form of TWED, it is easy to see that the DTW, ERP and LCSS methods are all special cases of TWED. We illustrate the details of the TWED algorithm applied to eye-tracking data in the next section.

Figure 1: The distance measure is proportional to the length of the gray lines. The method shown in (a) is a lock-step measure, in which a "one-to-one" mapping is enforced. The method illustrated in (b) is a threshold-based measure. In (c), an elastic measure such as DTW is shown. It allows a "one-to-many" mapping of the data points, but every data point must be matched. In (d), a method such as TWED is illustrated. In addition to the "one-to-many" mapping of the data points, it also allows points to remain unmatched.

2.1.3 Clustering based on pair-wise relationship between data points

Clustering based on pairwise relationships between data points is a well-studied problem with various solutions. However, since the number of groups, or the granularity of the clusters, is unknown, adopting a traditional clustering method would force us to determine this number. Searching for the optimal number of clusters is undesirable because we want the clustering criteria to stay consistent through the whole lecture (we split the video into several subsections and cluster each independently). Some parts of the lecture may simply be more attractive, so people pay more attention to similar objects; other parts may not deliver interesting information, so people start looking around. In such scenarios the number of clusters should not be fixed but should vary accordingly. Thus we adopt community detection to solve this problem.

In community detection, in contrast to traditional clustering methods, the number of clusters (communities) is not known in advance. Most methods rank communities using a criterion. As also studied by Simon et al.[24], a dataset often has several levels of granularity, resulting in different clusterings at different resolutions: a fine-scale analysis generates small clusters, while a large-scale analysis results in larger clusters. Many studies, including Le Martelot et al.[25] and Huang et al.[26], have proposed criteria designed for multi-scale analysis, but they are limited in efficiency and accuracy. Instead, we find that the method proposed by Le Martelot et al.[9] enables fast multi-scale community detection on large networks with global and local criteria. Their method has also been effectively adopted for eye-tracking clustering by Burleson-Lesser et al.[8]. The detailed implementation and results are presented in the following sections.

2.2 Method and implementation of eye movement similarity matrix

Our project code, Eye-movement-similarity-clustering, is released at

2.2.1 Data Preprocessing

Our data come from the proprietary software that runs the eye-tracking machine while recording the audience's eye fixations. The output file information and format are described in Appendix A. We adopt the preprocessing implementation of Burleson-Lesser et al.[8]. We filter the fixation data with an 80-millisecond triangular window. Participants are excluded from a video if they have over 20% missing data in that video. All missing data are set to 0. A sparse principal component analysis is run on each dimension; this operation fills missing samples by linear interpolation over viewers. We also divide the data for a video into 30-second data files. A sample file is available in the project's data folder. Each file represents the x or y position for one object in a 30-second session.
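The filtering and exclusion steps can be sketched as follows. This is a hypothetical illustration rather than the preprocessing code of [8]: the function names, the use of `None` to mark a missing sample, the sampling rate argument and the 20% limit expressed as a fraction are all assumptions, and the PCA-based interpolation step is not shown.

```python
def triangular_smooth(samples, sample_rate_hz, window_ms=80.0):
    """Smooth one gaze coordinate with a triangular window (no missing values)."""
    # Half-width of the window in samples, at least one neighbour per side.
    half = max(1, int(round(window_ms / 1000.0 * sample_rate_hz / 2.0)))
    offsets = range(-half, half + 1)
    weights = [half + 1 - abs(k) for k in offsets]  # 1, 2, ..., peak, ..., 2, 1
    smoothed = []
    for i in range(len(samples)):
        num = 0.0
        den = 0.0
        for k, w in zip(offsets, weights):
            idx = i + k
            if 0 <= idx < len(samples):  # shrink the window at the edges
                num += w * samples[idx]
                den += w
        smoothed.append(num / den)
    return smoothed


def missing_fraction(samples):
    """Fraction of samples marked as missing (None)."""
    if not samples:
        return 1.0
    return sum(1 for s in samples if s is None) / len(samples)


def keep_participant(samples, max_missing=0.20):
    """Keep a participant unless more than max_missing of the data is absent."""
    return missing_fraction(samples) <= max_missing


print(triangular_smooth([0, 0, 9, 0, 0], sample_rate_hz=25.0))
print(keep_participant([1, None, None, 4, 5]))
```

A single-frame spike is spread over its neighbours, which is the intended effect of the window on measurement jitter.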

2.2.2 Similarity Matrix

The frame rate of the eye-tracking data can be very high (it can exceed 1000 frames per second). Even after preprocessing, the eye-tracking frame rate is still around 120 frames per second, whereas the frame rate of the video (we use a TED talk as an example) is only 32 frames per second. Thus we first down-sample the eye-tracking data to the same frame rate as the video. After that, we calculate the similarity matrix using TWED. TWED has two hyper-parameters, the deletion penalty λ and the stiffness parameter ν, which handle three different conditions at each step. The "deletion" penalty, or the "mismatch" penalty according to Serra et al.[27], is a constant cost charged whenever one of the series stays at the same point while the other steps forward. The stiffness parameter adds a cost proportional to the temporal distance between the two series at a given step. In the original description of the algorithm in [23], the "elastic cost" has the form ν · |t_{a_i} − t_{b_j}|. Here, since the distance between t_{a_i} and t_{a_{i−1}}, or between t_{a_i} and t_{b_j}, is proportional to the difference of their indices no matter what time unit we choose, we can write the time distance as the distance between frame indices, so that the time unit is absorbed into ν. The simpler equation for each step is then:
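Under the assumption that both trails are sampled at the same frame rate, so that the time stamp of sample $i$ is simply its frame index, the recursion of [23] reduces to the sketch below. The symbols $D$, $d$, $\lambda$ and $\nu$ follow Marteau's notation and are assumptions; the report's own simplified form may group the terms differently.

```latex
D(i, j) = \min
\begin{cases}
D(i-1, j) + d(a_i, a_{i-1}) + \nu + \lambda \\[2pt]
D(i-1, j-1) + d(a_i, b_j) + d(a_{i-1}, b_{j-1}) + 2\nu\,|i - j| \\[2pt]
D(i, j-1) + d(b_j, b_{j-1}) + \nu + \lambda
\end{cases}
```

The deletion branches pay $\nu$ once because consecutive frames differ by one index, and the match branch pays $\nu\,(|i-j| + |(i-1)-(j-1)|) = 2\nu\,|i-j|$.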

Here we fill in all the possible steps from the bottom-left corner to the upper-right corner (see Fig. 2). Each element contains the minimum possible cost to reach that element from the origin.
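The table-filling procedure can be sketched in Python as below. This is a minimal illustration of the standard TWED recursion from [23] with frame indices used as time stamps, not the implementation in the project repository; the function name `twed`, the parameter names `lam` and `nu`, and the dummy origin sample are assumptions.

```python
import math


def twed(a, b, lam, nu):
    """TWED between two trails of (x, y) points sampled at the same frame rate.

    The time stamp of a sample is taken to be its frame index, so the time
    unit is absorbed into the stiffness nu. lam is the deletion penalty.
    """
    # Pad both series with a dummy sample at index 0, as in the usual formulation.
    a = [(0.0, 0.0)] + list(a)
    b = [(0.0, 0.0)] + list(b)
    n, m = len(a), len(b)
    # D[i][j] is the minimal cost of editing a[1..i] into b[1..j].
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            # Advance only in a: pay the step length, one frame of stiffness, lam.
            delete_a = D[i - 1][j] + math.dist(a[i], a[i - 1]) + nu + lam
            # Advance only in b.
            delete_b = D[i][j - 1] + math.dist(b[j], b[j - 1]) + nu + lam
            # Advance in both: pay both point distances and the index lag twice.
            match = (D[i - 1][j - 1]
                     + math.dist(a[i], b[j])
                     + math.dist(a[i - 1], b[j - 1])
                     + 2.0 * nu * abs(i - j))
            D[i][j] = min(delete_a, match, delete_b)
    return D[n - 1][m - 1]


trail_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
trail_b = [(0, 0), (0, 0), (1, 0), (2, 0)]
print(twed(trail_a, trail_b, lam=0.0, nu=0.0))
print(twed(trail_a, trail_b, lam=5.0, nu=1.0))
```

Raising `lam` or `nu` can only increase the cost of a given alignment, so lagged trails become less similar as the parameters grow.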

Figure 2: The path with the lowest distance between series A and B is selected as the result.

The TWED algorithm used in our application is listed in Appendix B. The implementation is inside the file in our project repository.
After enumerating all pairs of eye trails, we obtain an N by N distance matrix in which each cell contains the TWED result for eye trails i and j. To transform the distance matrix into a similarity matrix, we normalize each element by dividing it by the largest value in the matrix and subtract the result from 1. Thus the similarity matrix for a video in a time window contains pairwise similarity values between 0 and 1. The matrix is symmetric with the value 1 on its diagonal (since the ith eye trail has 100% similarity with itself).
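The normalization described above amounts to similarity = 1 − distance / max(distance). A minimal sketch, assuming the distance matrix is a list of lists (the function name is hypothetical):

```python
def distance_to_similarity(dist):
    """Map an N x N distance matrix to similarities in [0, 1]."""
    largest = max(max(row) for row in dist)
    if largest == 0:
        # All trails are identical, so every pair is fully similar.
        return [[1.0 for _ in row] for row in dist]
    return [[1.0 - d / largest for d in row] for row in dist]


example = [[0, 2, 4],
           [2, 0, 1],
           [4, 1, 0]]
similarity = distance_to_similarity(example)
print(similarity)
```

The most distant pair receives similarity 0 and the diagonal stays at 1, so the values are relative to the time window in which the matrix was computed.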

2.2.3 Parameter Selection of TWED

The hyper-parameters of the TWED algorithm affect the importance of temporal step matching. With both set to zero, TWED simply selects the matching between two time series with the minimum distance. In our case, however, such a matching strategy is problematic if it matches two viewers' gaze fixations on an object's position before and after that object shifts in the video scene. Here we examine the impact of the parameter selection by visual evaluation. We pick 6 × 6 combinations of (λ, ν), iterating over the values [0, 1000, 5000, 10000, 20000]. Then we pick several 5-second windows from the TED talk Carol and visualize the movement of different pairs and their values under different hyper-parameter selections. We implement both 3D and 2D visualizations of two eye movement trails; a sample visualization is shown in Fig. 3. We also use the parula color map to better illustrate the similarity matrix. We would like to have a quantitative ground truth to guide the parameter selection, but visual assessment appears to be the only feasible method for now.

Figure 3: Samples of the visualization and the similarity matrix. We select the time window from second 33 to second 38 in the TED talk Carol. The similarity matrix is calculated when setting . We pick object 4 and object 0 and mark their similarity value in the matrix with white dots. (a) shows the 2D trail visualization. The two trails have different colors, and the colors become proportionally stronger along the temporal dimension. (d) shows the 3D trail visualization. The colors also become proportionally stronger along the temporal dimension. The time dimension is explicitly shown as the T axis.

So far, based on visual assessment, we select λ as 5000 and ν as 5000. More comparisons between different parameter settings can be found in Appendix C. With the selected parameters, we run through all the video segments with time windows of 5 seconds as well as 30 seconds. We then save the results under the "result" folder in our project repository.

2.2.4 Eye Movement Clustering

We adopt most of the clustering code in [8]. Besides replacing the vector-direction adjacency matrix with our similarity matrix, we also directly use the as the in [8]. To better visualize the clustering, we map each object's index in the TWED result file to the object's name. The algorithm is shown in Appendix D, and the MATLAB code is located in the "clustering" folder in our project repository. After saving the clustering result to the location clustering/clustering_result, we use Gephi to visualize it. We select one 30-second window starting at second 200 and another starting at second 290 in the TED talk Carol video. The clustering result is shown in Fig. 4.

Figure 4: (a) is the clustering result for 200s to 230s in Carol using the TWED similarity matrix. (b) is the clustering result for 200s to 230s in Carol using the adjacency matrix introduced in [8]. (c) is the clustering result for 290s to 320s in Carol using the TWED similarity matrix. (d) is the clustering result for 290s to 320s in Carol using the adjacency matrix introduced in [8].

2.2.5 Correlation Between Similarity Matrix and Question Answer Correctness

We also examine the relationship between the TWED similarity matrix and question answer correctness.
Because we lack a ground truth for "attention", we explore the possibility of using question answer correctness as the evaluation standard. We select a 30-second window (405s to 435s) in the TED talk Simon video whose content is related to 5 questions (s3C, s1B, s1E, s3F, s3G). The timing of these questions could be found at and the correctness of these questions could be found at
We assume that the more similarly two viewers answer the questions, the more likely they are to pay the same level of attention to the lecture. We calculate an N by N answer distance penalty matrix holding the answer difference penalty between each pair of viewers. If both viewers answer a question correctly, we add 1 to their distance penalty value. If one or both of the viewers answer incorrectly, we add nothing. The higher the distance penalty, the more similar the two viewers' attention is in this time window. We then plot, for each pair of viewers, the answer distance penalty as the x value and the similarity from the similarity matrix of this time window as the y value. The result is shown in Fig. 5.
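The pairwise answer score can be sketched as below. The function name and the input layout (one list of booleans per viewer, one entry per question) are assumptions for illustration; the sample data is made up and is not taken from the experiment.

```python
def answer_agreement_matrix(correct):
    """correct[v][q] is True when viewer v answered question q correctly.

    Returns an N x N matrix whose (i, j) entry counts the questions that
    viewers i and j BOTH answered correctly, as described in the text.
    """
    n = len(correct)
    score = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            score[i][j] = sum(
                1 for ci, cj in zip(correct[i], correct[j]) if ci and cj
            )
    return score


# Three hypothetical viewers and three questions.
answers = [
    [True, True, False],
    [True, False, False],
    [False, False, False],
]
scores = answer_agreement_matrix(answers)
print(scores)
```

Note that under this rule two viewers who are both wrong on every question get a score of 0, the same as a pair who disagree, which is one reason the measure is coarse when only a few questions are available.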

Figure 5: The x axis represents the question answer similarity. The y axis shows the similarity value of the eye movements

As we can see, since the number of questions is too small, it is hard to apply a regression to reveal the correlation. However, from this plot alone we can see that in this period of time the similarities of eye movement between viewers are relatively low, and the similarities of their answers are low at the same time.

3 Eye-Tracking Overlay and EEG Visualization Tool

The code of the visualization tool could be found at The newest developed version is inside its qt5_structuralize folder.

3.1 Design and Framework

Due to compatibility issues across different development environments, we decided to implement the visualization tool with PyQt5. Qt is a widely used framework for cross-platform GUI development. In addition to PyQt5, we rely on pyqtgraph, a library that helps plot dynamic data. The dependencies are detailed in Appendix E.
The requirements of our visualization tool include a video player, an eye fixation canvas superimposed on the video player and an EEG data graph that can dynamically update EEG data for all channels.
An important concept in a Qt GUI is the widget, a development unit that bundles various basic functionalities. After a few structural changes, we customized our own widgets by inheriting from the most basic widget and adding the corresponding functions. The tool includes three components: 1. an eye movement and video playing widget; 2. an EEG widget; 3. a main window widget that is in charge of the GUI layout and the inter-widget communication. We apply the observer (listener) pattern and register several events to keep the eye movement widget and the EEG widget in sync. A widget-level design is shown in Fig. 6.
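The event registration idea can be illustrated with a plain-Python stand-in. This sketch deliberately avoids the Qt API so that it runs without any GUI; in the tool itself Qt's own signal mechanism would play the role of `EventBus`. The class names, method names and the millisecond argument are hypothetical; only the event name `syncSlider` comes from Fig. 6.

```python
class EventBus:
    """Minimal observer registry: listeners subscribe to named events."""

    def __init__(self):
        self._listeners = {}

    def register(self, event, callback):
        self._listeners.setdefault(event, []).append(callback)

    def emit(self, event, *args):
        # Notify every listener registered for this event, in order.
        for callback in self._listeners.get(event, []):
            callback(*args)


class EyeMovementWidget:
    def __init__(self):
        self.position_ms = 0

    def on_sync_slider(self, position_ms):
        self.position_ms = position_ms  # redraw gaze dots and trails here


class EEGWidget:
    def __init__(self):
        self.position_ms = 0

    def on_sync_slider(self, position_ms):
        self.position_ms = position_ms  # refresh the EEG curves here


bus = EventBus()
eye = EyeMovementWidget()
eeg = EEGWidget()
bus.register("syncSlider", eye.on_sync_slider)
bus.register("syncSlider", eeg.on_sync_slider)

# The main window emits one event and both widgets follow it.
bus.emit("syncSlider", 12500)
print(eye.position_ms, eeg.position_ms)
```

Routing all synchronization through one place keeps the two widgets unaware of each other, so further views can be added by registering more listeners.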

Figure 6: The events include: syncSlider, pauseVideo, startVideo, loadVideo, loadEyeTracking, loadEEG, etc.

To support superimposing eye movement, we chose not to create a separate widget but to add a graphics video element that mirrors the video player inside a graphics view, and to dynamically add and delete graphic elements such as dots and lines to illustrate the eye movement. The "open Eye" button reads in several eye-tracking files, preprocesses the fixation data and generates a dictionary. For each eye movement record, the dictionary holds every object's gaze dot, the eye trail lines of previous steps and a designated color for drawing these elements. Whenever the video slider moves, drawing is triggered to add, delete, move or transform those graphic elements (gaze dots, trail lines). Moreover, the gaze dots change their size depending on the gaze duration of each record.
When audience members are selected in the comboBox, the corresponding eye movement graphic elements are added to or removed from the graphics view. The fixation positions are normalized according to the original screen's resolution and the video player's resolution.
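The position normalization can be sketched as a simple rescaling. The function name is hypothetical, and the sketch assumes the video fills the player area without letterboxing; the actual tool may handle aspect ratio differently.

```python
def screen_to_player(x, y, screen_size, player_size):
    """Rescale a fixation from recording-screen pixels to player pixels."""
    screen_w, screen_h = screen_size
    player_w, player_h = player_size
    return x * player_w / screen_w, y * player_h / screen_h


# A fixation at the centre of a 1024 x 768 screen, shown in a 640 x 480 player.
print(screen_to_player(512, 384, (1024, 768), (640, 480)))
```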

We use the pyqtgraph library to develop the EEG graph widget. The EEG data contains 64 channels, but we allow the user to select any number of channels to show on the canvas; their plots are added or removed accordingly. If the video slider changes its position, the EEG widget is also triggered to update the corresponding EEG graph. We display all EEG values between 5 seconds before and 5 seconds after the current video time. The tool's functions are detailed in Appendix F.
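Selecting the samples to display around the current video time can be sketched as below. The function name and the sampling rate are assumptions (the report does not state the EEG sampling rate), and the sketch covers only the slicing, not the pyqtgraph drawing.

```python
def eeg_window(samples, sample_rate_hz, current_s, half_width_s=5.0):
    """Return the samples within half_width_s seconds of current_s."""
    start = max(0, int(round((current_s - half_width_s) * sample_rate_hz)))
    stop = min(len(samples), int(round((current_s + half_width_s) * sample_rate_hz)))
    return samples[start:stop]


# One hypothetical channel: 20 seconds of data at 100 Hz.
channel = list(range(2000))
window = eeg_window(channel, sample_rate_hz=100.0, current_s=10.0)
print(len(window), window[0], window[-1])
```

Near the start or end of the recording the window is clipped, so fewer than 10 seconds of data are returned.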

3.2 The Display of the Visualization Tool

In this section, we show several screenshots covering all functionalities and views. To better present the eye trail lines, we select a moment when viewers are reading the slides. Similarly, we choose a moment when the viewers are gazing at the speaker to show the fixations. Two screenshots are shown in Fig. 7.

Figure 7: (a) is the display when viewers are busy reading the slides. (b) is the display when viewers are only gazing at the speaker

We also show the combobox when the user selects the audience's eye movements and the EEG channels in Fig. 8.

Figure 8: (a) is the display when selecting audiences’ eye data. (b) is the display when selecting the channels

4 Conclusions and Future Works

We have compared different methods for calculating the similarity of eye movement between pairs of viewers. After studying several approaches, including statistical mechanics, we decided to use time series matching methods. By adopting the TWED method, we are able to obtain a similarity matrix for any specified time window. However, because we do not have a reliable attention ground truth, the hyper-parameter selection and the assessment of the similarity matrix remain a challenge. We also tried to use question answer correctness as the attention ground truth but found it unreliable because there are too few questions. Nevertheless, the visual assessment based on the eye movement paths shows a promising trend. Using the similarity matrix, we are able to cluster the audience with a community detection algorithm. Finding the attention ground truth is a critical task for future study. By tackling this problem, we would be able to better select the model and hyper-parameters and to assess the similarity matrix and clustering result. Another part of our work is the development of the visualization tool. Although its functionality is relatively basic, having explored a suitable architecture and pattern, we expect to spend much less effort when adding clustering visualization or other functionalities in the future.


  • [1] Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Chua, T.S.: Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. arXiv preprint arXiv:1611.05594 (2016)
  • [2] Xu, Q., Qin, Z., Wan, T.: Generative cooperative net for image generation and data augmentation. arXiv preprint arXiv:1705.02887 (2017)
  • [3] Snow, J.C., Allen, H.A., Rafal, R.D., Humphreys, G.W.: Impaired attentional selection following lesions to human pulvinar: evidence for homology between human and monkey. Proceedings of the National Academy of Sciences 106(10) (2009) 4054–4059
  • [4] Baliki, M.N., Mansour, A., Baria, A.T., Huang, L., Berger, S.E., Fields, H.L., Apkarian, A.V.: Parceling human accumbens into putative core and shell dissociates encoding of values for reward and pain. Journal of Neuroscience 33(41) (2013) 16383–16393
  • [5] Loschky, L.C., Larson, A.M., Magliano, J.P., Smith, T.J.: What would jaws do? the tyranny of film and the relationship between gaze and higher-level narrative film comprehension. PloS one 10(11) (2015) e0142474
  • [6] Dewhurst, R., Nyström, M., Jarodzka, H., Foulsham, T., Johansson, R., Holmqvist, K.: It depends on how you look at it: Scanpath comparison in multiple dimensions with multimatch, a vector-based approach. Behavior research methods 44(4) (2012) 1079–1100
  • [7] Valuch, C., Ansorge, U.: The influence of color during continuity cuts in edited movies: an eye-tracking study. Multimedia Tools and Applications 74(22) (2015) 10161–10176
  • [8] Burleson-Lesser, K., Morone, F., DeGuzman, P., Parra, L.C., Makse, H.A.: Collective behaviour in video viewing: A thermodynamic analysis of gaze position. PloS one 12(1) (2017) e0168995
  • [9] Le Martelot, E., Hankin, C.: Fast multi-scale detection of relevant communities in large-scale networks. The Computer Journal 56(9) (2013) 1136–1150
  • [10] Mora, T., Bialek, W.: Are biological systems poised at criticality? Journal of Statistical Physics 144(2) (2011) 268–302
  • [11] Cavagna, A., Cimarelli, A., Giardina, I., Parisi, G., Santagati, R., Stefanini, F., Viale, M.: Scale-free correlations in starling flocks. Proceedings of the National Academy of Sciences 107(26) (2010) 11865–11870
  • [12] Bialek, W., Cavagna, A., Giardina, I., Mora, T., Silvestri, E., Viale, M., Walczak, A.M.: Statistical mechanics for natural flocks of birds. Proceedings of the National Academy of Sciences 109(13) (2012) 4786–4791
  • [13] Tkačik, G., Mora, T., Marre, O., Amodei, D., Palmer, S.E., Berry, M.J., Bialek, W.: Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences 112(37) (2015) 11508–11513
  • [14] Jain, A., Bansal, R., Kumar, A., Singh, K.: A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels of medical first year students. International Journal of Applied and Basic Medical Research 5(2) (2015) 124
  • [15] Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. Volume 23. ACM (1994)
  • [16] Yi, B.K., Faloutsos, C.: Fast time sequence indexing for arbitrary lp norms, VLDB (2000)
  • [17] Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.: Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery (2013) 1–35
  • [18] Aßfalg, J., Kriegel, H.P., Kröger, P., Kunath, P., Pryakhin, A., Renz, M.: Similarity search on time series based on threshold queries. In: EDBT, Springer (2006) 276–294
  • [19] Oppenheim, A.V.: Discrete-time signal processing. Pearson Education India (1999)
  • [20] Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: KDD workshop. Volume 10., Seattle, WA (1994) 359–370
  • [21] Chen, L., Özsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, ACM (2005) 491–502
  • [22] Chen, L., Ng, R.: On the marriage of lp-norms and edit distance. In: Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, VLDB Endowment (2004) 792–803
  • [23] Marteau, P.F.: Time warp edit distance with stiffness adjustment for time series matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2) (2009) 306–318
  • [24] Simon, H.A.: The architecture of complexity. In: Facets of systems science. Springer (1991) 457–476
  • [25] Le Martelot, E., Hankin, C.: Multi-scale community detection using stability optimisation. International Journal of Web Based Communities 9(3) (2013) 323–348
  • [26] Huang, J., Sun, H., Liu, Y., Song, Q., Weninger, T.: Towards online multiresolution community detection in large-scale networks. PloS one 6(8) (2011) e23829
  • [27] Serra, J., Arcos, J.L.: An empirical evaluation of similarity measures for time series classification. Knowledge-Based Systems 67 (2014) 305–314

Appendix A Output Format of Eye-tracking data

The files are located in
Two kinds of files come out of the eye-tracking machine:

1. The asc files, which contain the raw data from the machine.
Each audience member has his or her own file for each lecture.
The file name is usually of the form {objname}_{lecturename}.asc, for example, msb_c.asc.
The file can be opened as a plain text file. Although this kind of file contains all the raw information, most of the useful information is also included in the xls file in a better-aggregated form. The most useful line for us in the asc file is MSG 0 DISPLAY_COORDS 0 0 {WIDTH} {HEIGHT}. For example, 0 DISPLAY_COORDS 0 0 1023 767.

2. The xls files, which contain data aggregated from the raw data.
Each audience member has his or her own file for each lecture.
The file name is usually of the form {objname}_{lecturename}.xls, for example, an1_carol.xls. The file can be opened as an ordinary Excel file.
The file contains 222 columns. Many of them hold event data, such as CURRENT_FIX_BUTTON_0_PRESS, and are empty in most cases. Among these columns, CURRENT_FIX_DURATION, CURRENT_FIX_START, CURRENT_FIX_END, CURRENT_FIX_X and CURRENT_FIX_Y are the most important to us. The columns CURRENT_FIX_DURATION, CURRENT_FIX_START and CURRENT_FIX_END are all in milliseconds. The position columns CURRENT_FIX_X and CURRENT_FIX_Y are pixel locations relative to the upper-left corner of the screen.
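
As a minimal sketch of how the screen size could be recovered from an asc file, the snippet below scans plain-text lines for the DISPLAY_COORDS message quoted above. The function names are hypothetical and are not part of the project code, and the regular expression assumes only the four-integer layout shown in this appendix.

```python
import re

# Matches e.g. "MSG 0 DISPLAY_COORDS 0 0 1023 767" (layout as quoted in this appendix).
_DISPLAY_COORDS = re.compile(r"DISPLAY_COORDS\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def find_display_coords(lines):
    """Return the four DISPLAY_COORDS integers of the first matching line, or None."""
    for line in lines:
        match = _DISPLAY_COORDS.search(line)
        if match:
            return tuple(int(value) for value in match.groups())
    return None


def read_display_coords(asc_path):
    """Hypothetical helper: scan an asc file opened as a plain text file."""
    with open(asc_path, "r") as handle:
        return find_display_coords(handle)
```

The last two integers are the values this appendix labels {WIDTH} and {HEIGHT}; they can be used to check that CURRENT_FIX_X and CURRENT_FIX_Y fall inside the screen.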

Appendix B TWED Algorithm in our application

Modified iterative implementation of the TWED distance

float TWED(t1_data[1 to n], t2_data[1 to m], lam, nu):
    # t1_data[0] and t2_data[0] are padding samples, read when p = 1 or q = 1
    float result[0 to n, 0 to m]
    for i := 1 to n
        result[i][0] := infinity
    for i := 1 to m
        result[0][i] := infinity
    result[0][0] := 0
    for p := 1 to n
        for q := 1 to m
            # edit along t1_data only
            insertion := result[p - 1][q]
                         + Dist(t1_data[p - 1], t1_data[p]) + nu + lam
            # edit along t2_data only
            deletion := result[p][q - 1]
                        + Dist(t2_data[q - 1], t2_data[q]) + nu + lam
            # align sample p of t1_data with sample q of t2_data
            match := result[p - 1][q - 1]
                     + Dist(t1_data[p], t2_data[q])
                     + Dist(t1_data[p - 1], t2_data[q - 1])
                     + 2 * nu * abs(p - q)
            result[p][q] := min(insertion, deletion, match)
    return result[n][m]

Code modified from the "Iterative implementation of the TWED distance" in Marteau [23].
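
The pseudocode above can be sketched in runnable Python as follows. This is an illustration under stated assumptions, not the project's implementation: samples are taken to be (x, y) pairs at uniform time steps, Dist is assumed to be the Euclidean distance, and an origin sample (0, 0) is prepended to each series so that index 0 exists. The names `twed` and `euclidean` are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance between two (x, y) gaze samples; an assumed choice for Dist."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def twed(t1, t2, lam, nu, dist=euclidean):
    """TWED between two uniformly sampled trajectories given as lists of (x, y)."""
    # Assumption: pad each series with an origin sample so index 0 can be read.
    a = [(0.0, 0.0)] + list(t1)
    b = [(0.0, 0.0)] + list(t2)
    n, m = len(t1), len(t2)
    inf = float("inf")
    # result[p][q] holds the distance between the first p and first q samples.
    result = [[inf] * (m + 1) for _ in range(n + 1)]
    result[0][0] = 0.0
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            insertion = result[p - 1][q] + dist(a[p - 1], a[p]) + nu + lam
            deletion = result[p][q - 1] + dist(b[q - 1], b[q]) + nu + lam
            match = (result[p - 1][q - 1]
                     + dist(a[p], b[q])
                     + dist(a[p - 1], b[q - 1])
                     + 2 * nu * abs(p - q))
            result[p][q] = min(insertion, deletion, match)
    return result[n][m]
```

Because every cost term is non-negative and the diagonal path costs nothing for identical inputs, `twed(x, x, lam, nu)` is 0 for any trajectory `x`. Larger values of `nu` penalise matches between samples that are far apart in time, and `lam` sets the fixed cost of each insertion or deletion.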

Appendix C TWED Hyperparameter Setting Comparison

Figure 8: (a) shows the eye trails of audiences 0 and 5 during 45s-50s in TED Talk Carol, (b) shows the eye trails of audiences 1 and 8 during 45s-50s in TED Talk Carol. (c) and (d) are the similarity matrices under the corresponding setting when having and . (e) and (f) are the similarity matrices under the corresponding setting when having and . (g) and (h) are the similarity matrices under the corresponding setting when having and

Appendix D Fast Multi-Scale Community Detection Algorithm

Initialise current community partition with a node per community:
    com = list of all nodes
for all scale parameters p do
    Compute initial Q value given p: Q = computeQ(com, p)
    while changes can be made do
        while nodes can be moved do
            nlist = list of all nodes
            while nlist is not empty do
                n = pick a random node in nlist
                ncom = neighbour communities of n
                best_deltaQ = 0
                for all communities nc in ncom do
                    Compute the deltaQ that moving n into nc would produce
                    if deltaQ > best_deltaQ and move does not break a community then
                        best_deltaQ = deltaQ
                        best_c = nc
                    end if
                end for
                if best_deltaQ > 0 then
                    Update com: move node n to community best_c
                    Update total value of Q: Q = Q + best_deltaQ
                end if
            end while
        end while
        while clusters can be merged do
            clist = list of all current communities
            while clist is not empty do
                c = pick a random community in clist
                ncom = neighbour communities of c
                best_deltaQ = 0
                for all communities nc in ncom do
                    Compute the deltaQ that merging c and nc would produce
                    if deltaQ > best_deltaQ then
                        best_deltaQ = deltaQ
                        best_c = nc
                    end if
                end for
                if best_deltaQ > 0 then
                    Update com: merge communities c and best_c
                    Update total value of Q: Q = Q + best_deltaQ
                end if
            end while
        end while
    end while
    Store com and Q for p
end for
return Community sets and associated Qs
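
To make the node-moving phase above concrete, here is a minimal Python sketch of the inner "while nodes can be moved" loop only. Several assumptions are made: `compute_q` is a stand-in that uses modularity with a resolution parameter p rather than the quality function of the original algorithm, the graph is undirected and unweighted, community labels are integers, deltaQ is obtained by recomputing Q in full (simple but slow), and the "move does not break a community" check, the merge phase and the loop over scales are omitted. All names are hypothetical.

```python
import random


def compute_q(adj, com, p):
    """Modularity with resolution p; an assumed stand-in for computeQ.

    adj maps node -> set of neighbours; com maps node -> community label.
    """
    two_m = float(sum(len(nbrs) for nbrs in adj.values()))
    if two_m == 0:
        return 0.0
    inside = {}  # twice the number of edges inside each community
    degree = {}  # summed node degree of each community
    for node, nbrs in adj.items():
        c = com[node]
        degree[c] = degree.get(c, 0) + len(nbrs)
        inside[c] = inside.get(c, 0) + sum(1 for v in nbrs if com[v] == c)
    return sum(inside[c] / two_m - p * (degree[c] / two_m) ** 2 for c in degree)


def move_nodes(adj, com, p, rng):
    """Move nodes between neighbouring communities until Q stops improving."""
    q = compute_q(adj, com, p)
    moved = True
    while moved:
        moved = False
        nlist = list(adj)
        rng.shuffle(nlist)  # visiting a shuffled list stands in for random picks
        for n in nlist:
            old = com[n]
            ncom = {com[v] for v in adj[n]} - {old}
            best_delta, best_c = 0.0, None
            for nc in sorted(ncom):
                com[n] = nc  # try the move, then measure the change in Q
                delta = compute_q(adj, com, p) - q
                if delta > best_delta:
                    best_delta, best_c = delta, nc
            com[n] = old
            # Small tolerance so floating-point noise cannot cause endless moves.
            if best_c is not None and best_delta > 1e-12:
                com[n] = best_c
                q += best_delta
                moved = True
    return q
```

Starting from one node per community, `move_nodes(adj, com, p, random.Random(0))` updates `com` in place and returns the resulting Q. A production version would update deltaQ incrementally instead of calling `compute_q` for every candidate move.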

Appendix E Dependencies of Development

Package name | Version | Notes
Python | 2.7 | 2.6 or 3.4 not guaranteed to work
PyQt5 | 5.6.0 | Anaconda supports up to 5.6.0; higher versions should be compatible
pyqtgraph | 0.10.0 | Supported by Anaconda
numpy | 1.13.3 | Supported by Anaconda; higher versions should be compatible
pandas | 0.20.3 | Supported by Anaconda; higher versions should be compatible
Table 1: Packages that should be installed for development. Using the Anaconda package manager is recommended.

Appendix F GUI Operation

GUI Element | Event | Description
Open | Click | Load new video files; multiple formats are supported.
Open EYE | Click | Pop up a multi-file selection window, then load and preprocess the eye-tracking data. The files should be csv or Excel; sample files are located under the project repo's fixation folder.
Open EEG | Click | Pop up a single-file selection window, then load and preprocess the EEG data. The file should be txt; a sample file is AZZ1_v2_Carol_30sec.txt inside the project repo.
Select Objects | Click | Pop up the audience selection combobox.
Select Objects Combobox | Check options | Add the elements of the checked audience members or remove the elements of the unchecked ones.
Select Channels | Click | Pop up the channel selection combobox.
Select Channels Combobox | Check options | Add the plots of the checked channels or remove the plots of the unchecked channels.
Play | Click | The video starts to play if a video has been loaded. The button's label changes to Pause.
Pause | Click | The video stops if a video has been loaded. The button's label changes to Play.
Video Slider | Drag | The video stops and its playback position moves to the corresponding position.
Video Slider | Auto Update | The video, the eye movement elements and the EEG plots all stay in sync with the slider's position.
Table 2: GUI elements and the functions they trigger