Addressing the Mystery of Population Decline of the Rose-Crested Blue Pipit in a Nature Preserve using Data Visualization

03/03/2019 ∙ by Benyamin Ghojogh, et al.

Two main methods for exploring patterns in data are data visualization and machine learning. The former relies on humans to investigate the patterns, while the latter relies on machine learning algorithms. This paper tries to find the patterns using only data visualization. It addresses the mystery of the population decline of a bird, the Rose-Crested Blue Pipit, in a hypothetical nature preserve. Different visualization techniques are used, and the causes of the problem are found and categorized. Finally, solutions for preventing similar problems in the future are suggested. This paper can serve as an introduction to some data visualization tools and techniques.


1 Introduction

1.1 The Addressed Problem

This paper addresses the VAST (Visual Analytics Science and Technology) 2017 challenge (Visual Analytics Community, 2017). This challenge is about the mystery of the population decline of a specific type of bird, the Rose-Crested Blue Pipit, in a hypothetical nature preserve. This preserve is close to a hypothetical city named Mistford. In this project, we aim to find the possible reasons for this population decline using the provided dataset. The decline has several causes, which we address one by one using pattern exploration by data visualization. This paper can also introduce the reader to some important data visualization tools and techniques. The visualizations in this paper are done in the R programming language.

1.2 Dataset

The dataset provided in (Visual Analytics Community, 2017) includes several data subsets. In this section, we briefly introduce the dataset and its subsets; the thorough explanation of each subset is deferred to the section that analyzes it.

Figure 1: The map of the nature preserve. The entrances, general gates, gates, ranger stops, and camping locations are identified by light green, light blue, light red, yellow, and orange colors. The nine chemical sensors are shown in red color and the factories at the south of the preserve are highlighted with dark green color.
  • Data subset 1: This subset is provided for mini-challenge 1 of the VAST challenge. A pixel map of the preserve is provided, on which five color-coded types of gates are shown. The five types of gates are entrances, general gates, gates, ranger stops, and camping. Moreover, the records of vehicles passing through these gates are reported in a file. These records cover seven types of vehicles, i.e., two-axle car (or motorcycle), two-axle truck, three-axle truck, four-axle (and above) truck, two-axle bus, three-axle bus, and two-axle ranger truck. Each record contains the time/date, car ID, car type, and gate name.

  • Data subset 2: This subset is provided for mini-challenge 2 of the VAST challenge. It contains the data recorded by nine sensors, whose coordinates are shown on a provided map. These sensors have collected samples of four specific chemicals emitted into the air by four factories close to the preserve. The coordinates of the factories are also provided. Moreover, meteorological data are provided, giving the date, wind direction, and wind speed.

  • Data subset 3: This subset is provided for mini-challenge 3 of the VAST challenge. Twelve multi-spectral six-channel images of the preserve are provided, taken in different seasons of the past few years. The channels of the images are blue, green, red, Near Infrared (NIR), Short-Wave Infrared (SWIR) 1, and Short-Wave Infrared (SWIR) 2. All images have the same size in pixels. Moreover, a map of Boonsong Lake, a hypothetical lake within the preserve, is provided to help determine the scale and orientation of the multi-spectral images.

As can be seen in the above explanations, this dataset includes different types of features, including numerical features, categorical features, time series, images, maps, etc, making it suitable for data visualization and pattern exploration.

2 Analysis of Vehicle Activities

The data of vehicle activities and traffic through the preserve are given in order to investigate the possible unusual traffic patterns responsible for the decline of the Blue Pipit.

2.1 The Given Data

Data subset 1 provides the map of the preserve, on which five different types of gates are shown. Moreover, the traffic data of vehicles of various types are given for analysis.

Gate | Allowed Vehicle | Role
Entrance | All vehicles | Entering and leaving the preserve
General Gate | All vehicles | Recording flow of traffic
Gate | Rangers | Ranger activities beyond roadways
Ranger stop | All vehicles | Representing working areas for rangers
Camping | All vehicles | Recording visitors
Ranger base | Rangers | Rangers stay there when not working
Table 1: The different gates in the preserve.
Figure 2: The map and location of gates and roads in the nature preserve.

2.1.1 The Map of Nature Preserve

The map of the nature preserve is shown in Fig. 1. As can be seen in this map, there are five different types of gates: entrances, general gates, gates, ranger stops, and camping. When a vehicle passes one of these gates, the date and information of the passage are recorded. There is one additional gate, the ranger base gate, where the preserve rangers stay when they are not working. Table 1 lists the gates in the preserve and explains their roles. As this table shows, only rangers are allowed to pass through the “gates” and the ranger base.

A bitmap image of the map is also provided with the dataset. The gates are shown by colored pixels, with a different color for each gate type, and the roads have their own color. We read this bitmap image and, by discriminating the colors, find the coordinates of the roads and of the different gates. Note that the origin pixel of this image is at the south-west corner, whereas in the real map the origin is at the north-west corner. Therefore, the coordinates are flipped vertically to correct them. The obtained map is shown in Fig. 2. The gates are also color-coded in this map.
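The color-based extraction and the vertical flip can be illustrated with a minimal Python sketch (the paper's own work was done in R). The function name, the nested-list image representation, and the example colors are assumptions made for illustration, not part of the dataset.

```python
def find_colored_pixels(image, target_rgb):
    """Return flipped (x, y) coordinates of all pixels equal to target_rgb.

    `image` is a list of rows; each row is a list of (r, g, b) tuples.
    This layout is a hypothetical stand-in for the decoded bitmap.
    """
    height = len(image)
    coords = []
    for row_idx, row in enumerate(image):
        for col_idx, pixel in enumerate(row):
            if pixel == target_rgb:
                # Flip the vertical axis so that the origin moves from one
                # corner (top or bottom) to the other.
                coords.append((col_idx, height - 1 - row_idx))
    return coords


WHITE, RED = (255, 255, 255), (255, 0, 0)  # illustrative colors only
tiny_map = [
    [WHITE, RED],
    [WHITE, WHITE],
    [RED, WHITE],
]
gate_pixels = find_colored_pixels(tiny_map, RED)
```

Calling the function once per gate color would give one coordinate list per gate type.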

Figure 3: The number of vehicles in different vehicle types.
Figure 4: The trajectory of a sample type-1 vehicle. The starting and ending points are shown by a circle and a square, respectively.

2.1.2 The Traffic of Different Vehicles

The recorded traffic data are categorized by the type of vehicle. The seven types of vehicles are type 1 (two-axle car or motorcycle), type 2 (two-axle truck), type 3 (three-axle truck), type 4 (four-axle and above truck), type 5 (two-axle bus), type 6 (three-axle bus), and type 7 (two-axle truck for rangers). Figure 3 shows the number of recorded vehicles of each type. Type 1 is the largest group because most regular visitors use this type of vehicle. The other types are mostly used for construction or surveillance.

The sensor records in the dataset report the time, car ID, car type, and gate for the passages during various time periods. For every car ID, we extract the data of that car; therefore, for every car we obtain a time series of passing times and passed gates. If we replace the gate names by their 2D coordinates, this time series can be shown as a trajectory on the map. The trajectory of an example car is shown in Fig. 4. The trajectory connects the passed gates in the order in which they were passed. In the display of a trajectory, we use a circle and a square to indicate the start and end points of the traversed path, respectively. This display gives us valuable information about the gates and regions visited by the vehicle.
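As a rough illustration of this step, the following Python sketch groups records by car ID, sorts them by time, and replaces gate names by coordinates. The record layout, the gate names, and the coordinates are hypothetical; the paper's own implementation is in R.

```python
from collections import defaultdict


def build_trajectories(records, gate_coords):
    """Turn gate-passage records into one ordered (x, y) path per car.

    records: iterable of (timestamp, car_id, car_type, gate_name) tuples,
             with timestamps that sort chronologically (e.g. ISO strings).
    gate_coords: dict mapping gate_name -> (x, y).
    """
    passages_by_car = defaultdict(list)
    for timestamp, car_id, _car_type, gate_name in records:
        passages_by_car[car_id].append((timestamp, gate_name))

    trajectories = {}
    for car_id, passages in passages_by_car.items():
        passages.sort(key=lambda passage: passage[0])  # chronological order
        trajectories[car_id] = [gate_coords[gate] for _, gate in passages]
    return trajectories


# Hypothetical gate names and coordinates, for illustration only.
gate_coords = {"entrance-A": (0, 0), "camping-B": (5, 3)}
records = [
    ("2015-06-01 10:00", "car-1", 1, "camping-B"),
    ("2015-06-01 09:00", "car-1", 1, "entrance-A"),
    ("2015-06-02 08:00", "car-2", 2, "entrance-A"),
]
trajectories = build_trajectories(records, gate_coords)
```

The first and last points of each list correspond to the circle and square markers used in the trajectory plots.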

2.2 Analysis of Vehicle Traffic Based on Vehicle Type

2.2.1 Analysis of Vehicle Trajectories Based on Vehicle Type

Most cars of the same type are expected to behave similarly and traverse similar paths. Therefore, for the cars of each type, we plot the vehicle trajectories with transparency (alpha blending), so that frequently repeated trajectories appear bold. In this way, the patterns of trajectories become obvious. The start and end points are also shown so that we can see where the cars of a type have started and ended their trajectories. These trajectories are shown in Fig. 5.

Figure 5: The trajectories of vehicles in different types. The trajectory of every vehicle is shown by transparent blue color so that patterns and repeats can be obvious by bold blue color. The red trajectory is the mean trajectory of the cars in that type.

In the following, we analyze the trajectories of every vehicle type (the types of vehicles are determined by the dataset as mentioned in Section 1.2):

  • Type 1:

    • Entering and leaving: Most vehicles enter and leave through the entrance gates, as expected. However, some trajectories end at a camping gate or a general gate. This is unusual: no further passage is recorded after the vehicle reached that point. Some suspicious activity might have taken place there.

    • Middle passed gates: The vehicles have passed through camping and general gates. This is expected because most visitors who come for camping use type-1 vehicles. The middle ranger stops are also visited, which is fine; however, one trajectory passes the ranger stop at the top-left corner. This ranger stop is at the end of a road and is usually passed only by rangers. Hence, this activity is also suspicious.

  • Type 2:

    • Entering and leaving: Most vehicles enter and leave through the entrance gates, as expected. However, some trajectories end at camping gates. This is unusual and suspicious: no further passage is recorded after these vehicles have camped.

    • Middle passed gates: The vehicles have passed through camping and general gates which is fine.

  • Type 3:

    • Entering and leaving: Most vehicles enter and leave through the entrance gates, as expected. However, some trajectories end at camping gates. This is unusual and suspicious: no further passage is recorded after these vehicles have camped.

    • Middle passed gates: The vehicles have passed through camping and general gates which is fine.

  • Type 4:

    • Entering and leaving: All vehicles enter and leave from the entrance gates which is expected.

    • Middle passed gates: The vehicles have not passed through camping gates, which is expected because they are heavy vehicles used for construction, etc. Passing through general gates is also fine. However, some vehicles have passed through gates that only rangers are allowed to pass. This is clearly not permitted and is suspicious. Moreover, some trajectories have visited the ranger stop at the end of the road in the top-right corner. Because of its location, this ranger stop is supposed to be passed mostly by rangers, so this is also suspicious.

  • Type 5:

    • Entering and leaving: Most vehicles enter and leave through the entrance gates, as expected. However, some vehicles have ended their trajectories at a general gate in the middle of the preserve. This is highly unusual and suspicious.

    • Middle passed gates: The vehicles have passed through general gates and middle ranger stops which is fine.

  • Type 6:

    • Entering and leaving: All vehicles enter and leave from the entrance gates which is expected.

    • Middle passed gates: The vehicles have passed through general gates and middle ranger stops which is fine.

  • Type 7:

    • Entering and leaving: All vehicles enter and leave from the ranger base which is expected because vehicles of this type are all rangers.

    • Middle passed gates: The rangers are allowed to pass all gates. However, no ranger has visited and inspected two gates and a general gate in the east of the preserve. This gives abusers an opportunity for suspicious activities in that area. The fact that the rangers have missed those areas in their inspections is itself suspicious.

Moreover, the mean trajectory of the vehicles of a type might carry interesting information. However, the lengths of the trajectories are not necessarily equal. Therefore, we first find the longest trajectory of the vehicle type and then pad every trajectory of that type by repeating its last point at its end, so that all trajectories of a type have equal length. Thereafter, we plot the mean trajectory on the map in order to show the overall behaviour of the vehicles of a type (see Fig. 5).
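The padding and averaging described above can be sketched in a few lines of Python; this is an illustrative version under the assumption that each trajectory is a list of (x, y) tuples, not the authors' R code.

```python
def pad_trajectories(trajectories):
    """Repeat each trajectory's last point until all have the longest length."""
    longest = max(len(t) for t in trajectories)
    return [t + [t[-1]] * (longest - len(t)) for t in trajectories]


def mean_trajectory(trajectories):
    """Point-wise mean of the padded trajectories."""
    padded = pad_trajectories(trajectories)
    count = len(padded)
    return [
        (
            sum(t[step][0] for t in padded) / count,
            sum(t[step][1] for t in padded) / count,
        )
        for step in range(len(padded[0]))
    ]


example = [[(0, 0), (2, 2)], [(2, 0)]]
padded_example = pad_trajectories(example)
mean_example = mean_trajectory(example)
```

Because short trajectories are padded with their end point, the tail of the mean trajectory is pulled toward the places where short trips ended.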

We can see that the mean and the bold trajectories of types 1, 2, and 3 are similar to each other, as are those of types 4, 5, and 6. This makes sense because vehicles of types 1, 2, and 3 have fewer axles, while types 4, 5, and 6 are heavy vehicles with more axles. We can also see that the type-7 vehicles visit more places in the preserve, which makes sense because they are supposed to inspect the preserve.

2.2.2 Analysis of Traffic Time Based on Vehicle Type

In order to analyze the data of traffic in terms of time, we plot the histogram of the dates when the vehicles of different types have passed through the preserve. These histograms for different types of vehicles are shown in Fig. 6.

Figure 6: The histogram of the dates when the vehicles of different types have passed through the preserve.

In the following, we list the analysis of the histograms:

  • The histograms of types 1, 2, and 3 are similar to each other, as are the histograms of types 4, 5, and 6. This makes sense because vehicles of types 1, 2, and 3 have fewer axles, while types 4, 5, and 6 are heavy vehicles with more axles.

  • As expected, the rangers (type 7) have visited the preserve regularly at different times because they are supposed to inspect the preserve on a regular schedule.

  • The number of vehicle records for the first types is much larger than for the last types. This makes sense because the preserve is visited more by ordinary visitors than by construction trucks.

  • Around June 2015, the traffic of all types of vehicles has increased a lot. This might be destructive to the nature preserve or might have had bad effects on the quality of life of flora and fauna in the preserve.

2.3 Analysis of Vehicle Traffic Based on Traffic Clusters

Another way of analyzing the data is to cluster the traffic and then analyze the behaviours of the clusters rather than comparing the types of vehicles.

Figure 7: Scree plot for determining the number of principal axes to choose.

2.3.1 Clustering the Traffic

As the given data are huge and processing them takes a significant amount of time, we randomly sample a subset of the data. For a fair sampling, we randomly choose vehicles from every type and extract the data and time series of those cars from the dataset. This is stratified sampling, where the strata are the vehicle types. Stratified sampling is known to be better than simple random sampling in terms of the variance of estimation (Barnett, 1974).
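A minimal Python sketch of stratified sampling by vehicle type is given below. The number of vehicles drawn per type is not stated in the text, so `per_type` is a free parameter here, and the car IDs are invented for the example.

```python
import random


def stratified_sample(car_types, per_type, seed=0):
    """Draw up to `per_type` car IDs from every vehicle type.

    car_types: dict mapping car_id -> vehicle type (the stratum).
    per_type and seed are illustrative parameters, not values from the paper.
    """
    rng = random.Random(seed)
    strata = {}
    for car_id, car_type in car_types.items():
        strata.setdefault(car_type, []).append(car_id)

    sample = []
    for car_type in sorted(strata):
        ids = sorted(strata[car_type])   # fixed order for reproducibility
        size = min(per_type, len(ids))   # a small stratum is taken whole
        sample.extend(rng.sample(ids, size))
    return sample


car_types = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 7, "f": 7}
chosen = stratified_sample(car_types, per_type=2)
```

Every type contributes to the sample, so rare types such as ranger trucks are not lost as they could be under simple random sampling.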

Figure 8: Pairs plot of the first five dimensions of the data projected onto PCA subspace. The data of different clusters are color-coded.

After sampling from the dataset, we make the trajectories of the sampled vehicles equal in length by padding the short trajectories with their last visited point. Then, we concatenate the x and y coordinates of the visited gates so that the trajectory of every vehicle becomes a vector. Putting the trajectory vectors next to each other gives us a matrix.

Afterwards, we apply Principal Component Analysis (PCA) (Friedman et al., 2009) on this matrix-form data. The scree plot of the eigenvalues is shown in Fig. 7. As can be seen, the first few eigenvectors seem to carry most of the variation of the data. Therefore, we keep only these leading principal axes and, after projecting the data onto them, obtain lower-dimensional projected trajectories.
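One common way to read a scree plot is through the fraction of total variance carried by each eigenvalue. The Python sketch below assumes this quantity and a cumulative threshold rule; the paper's exact scree quantity and the number of retained axes are not recoverable from the text, so the eigenvalues and the threshold are purely illustrative.

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of the total variance carried by each eigenvalue."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_for_threshold(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative ratio
    reaches `threshold` (an assumed selection rule)."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio(ordered), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return count
    return len(ordered)


eigenvalues = [6.0, 3.0, 1.0]  # made-up values
ratios = explained_variance_ratio([2.0, 2.0])
kept = components_for_threshold(eigenvalues, threshold=0.8)
```

In practice the cut is often chosen by eye at the "elbow" of the scree plot rather than by a fixed threshold.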

After applying PCA on the data, we cluster the trajectories using the K-means method (Friedman et al., 2009). A reasonable number of clusters is seven, because there are seven types of vehicles. The pairs plot (scatter plot matrix) of the first five dimensions of the data projected onto the PCA subspace is shown in Fig. 8. Every point in a scatter plot is the trajectory of a car. The points in this figure are color-coded according to their assigned clusters. As can be seen in this pairs plot, the clusters contain trajectories with different characteristics and values of dimensions. Hence, the clusters properly divide the trajectories into several groups.
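For readers unfamiliar with K-means, the following is a small, self-contained Python sketch of Lloyd's algorithm. It is not the authors' R code; in particular, initializing the centers with the first k points is a simplifying assumption made here for determinism.

```python
def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples.

    Centers start at the first k points (a simplification; random or
    k-means++ initialization is more usual).
    """
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
labels, centers = kmeans(points, k=2)
```

Here each point would be one projected trajectory, and k would be set to seven as in the text.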

In order to see the proportion of the number of vehicles of different types in every cluster, we plot the heat map of cardinality of each cluster in Fig. 9. As this figure shows, the rangers are mostly in one of the clusters, i.e., cluster 5. This makes sense because the behaviour of rangers, who should inspect in the preserve, is very different from other vehicles. Other types of vehicles are almost spread uniformly in the clusters because of close behaviour. This does not hurt our analysis because even some small variations between these clusters may have important information for us to discover.
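The numbers behind such a heat map can be computed as a type-by-cluster table. The Python sketch below assumes the proportions are normalized within each vehicle type, which is one plausible reading of the figure; the input lists are invented.

```python
from collections import Counter


def type_cluster_proportions(types, clusters):
    """For each (type, cluster) pair, the fraction of that type's cars
    assigned to that cluster (normalization per type is an assumption)."""
    pair_counts = Counter(zip(types, clusters))
    type_counts = Counter(types)
    return {
        pair: count / type_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


types = [7, 7, 7, 1, 1]      # vehicle type of each sampled car
clusters = [5, 5, 2, 1, 2]   # cluster label of each sampled car
proportions = type_cluster_proportions(types, clusters)
```

Pairs that never occur are simply absent from the result and would be drawn as zero cells.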

Figure 9: Heat map of the proportion of cars of every type in different clusters.
Figure 10: The trajectories of vehicles in different clusters. The trajectory of every vehicle is shown by transparent blue color so that patterns and repeats can be obvious by bold blue color. The red trajectory is the mean trajectory of the cars in that cluster.

2.3.2 Analysis of Vehicle Trajectories Based on Traffic Clusters

For the cars of each cluster, we plot the vehicle trajectories with transparency (alpha blending), so that frequently repeated trajectories appear bold and the patterns stand out. The start and end points are also shown so that we can see where the cars of a cluster have started and ended their trajectories. These trajectories are shown in Fig. 10. The mean trajectories of the clusters are also shown in this figure.

Figure 11: The histogram of the dates when the vehicles of different clusters have passed through the preserve.

In the following, we analyze the trajectories of every cluster:

  • Clusters 1, 6, and 7 do not visit ranger base which makes sense because according to Fig. 9, they do not include any ranger.

  • Cluster 5 covers most of the preserve, as is especially visible in its mean trajectory. This makes sense because according to Fig. 9, this cluster mostly includes rangers, who are responsible for inspecting all regions of the preserve.

  • Cluster 5, which mostly includes the rangers, does not visit the regions in the east of the preserve. This is especially visible in the mean trajectory. It gives abusers an opportunity for suspicious activities in the east of the preserve. This coincides with our observation on type-7 vehicles in Fig. 5.

  • Clusters 2 and 3 visit almost all regions, which makes sense because according to Fig. 9, they include vehicles of all types. However, the patterns of clusters 2 and 3 differ from the pattern of cluster 5 in terms of the mean trajectory. The mean trajectories of clusters 2 and 3 do not cover as much area as the mean trajectory of cluster 5. This is because they do not contain only rangers.

  • Except for clusters 2 and 3, no cluster concentrates on the east. This is strange. Also, since the trajectories of cluster 5 (rangers) do not concentrate on the east, we can conclude that the vehicles visiting the east in clusters 2 and 3 are not rangers. This alerts us that some non-ranger vehicles have visited the east a lot while the rangers have not inspected it. Hence, some suspicious activities might have happened there.

2.3.3 Analysis of Traffic Time Based on Traffic Clusters

In order to analyze the data of traffic in terms of time, we plot the histogram of the dates when the vehicles of different clusters have passed through the preserve. These histograms for different clusters are shown in Fig. 11. In the following, we analyze the histograms:

  • Cluster 5 visits the preserve at various times. This makes sense because this cluster includes rangers and the rangers are supposed to visit the preserve regularly.

  • All clusters except cluster 5 mostly have visited the preserve around June 2015. This might be destructive to the nature preserve or might have had bad effects on the quality of life of flora and fauna in the preserve. This observation coincides with our observation from Fig. 6.

Chemical | Characteristics
Appluimonia | Airborne odor, not serious injury, influences life quality
Chlorodinine | Corrodes body tissues, harmful if inhaled or swallowed, used for sterilizing
Methylosmolene | Volatile organic solvent, toxic side effects, must be neutralized before disposal
AGOC-3A | Environment-friendly solvent, less harmful, low Volatile Organic Compound (VOC)
Table 2: Chemicals recorded by the sensors.
Company | Coordinate | Production
Roadrunner Fitness Electronics | (89,27) | Fitness trackers, heart rate monitors, sport-related products
Kasios Office Furniture | (90,21) | Metal and composite-wood office furniture
Radiance ColourTek | (109,26) | Solvent-based metallic flake paints
Indigo Sol Boards | (120,22) | Skate-boards and snow-boards
Table 3: Manufacturing factories near the nature preserve.

3 Analysis of Impact of Surrounding Factories

3.1 Manufacturing Factories and the Sensors

Data subset 2 provides the data of nine sensors recording four different chemicals. The nine sensors are located in the south of the preserve and are depicted in Fig. 1. The four chemicals recorded by these sensors are Appluimonia, Chlorodinine, Methylosmolene, and AGOC-3A. The characteristics of these chemicals are listed in Table 2. As reported in this table, Chlorodinine and Methylosmolene are the more dangerous chemicals.

Four manufacturing factories are also located in the south of the preserve, surrounded by the nine sensors. The locations of these plants are also shown in Fig. 1. The coordinates and the products of these factories are reported in Table 3. Judging by their products, the two factories Kasios Office Furniture and Radiance ColourTek seem to pose a greater risk to the environment than Roadrunner Fitness Electronics and Indigo Sol Boards.

Figure 12: The ridge plot (Joyplot) of the sensor failures in recording from the chemicals.

3.2 Analysis of Performance of Sensors

By carefully observing the sensor data, we find that the sensors have some gaps and failures in their recordings. At every time slot, each sensor is supposed to record the values of all four chemicals. For every sensor, we check whether all the chemicals have been recorded or not. The ridge plot (Joyplot) of the sensor failures in recording the chemicals is shown in Fig. 12. As can be seen in this plot:

  • The missing recordings are mostly of Methylosmolene.

  • Sensor 1 does not have any failure, but the other sensors have failed to record Methylosmolene at some times.

  • More strangely, these failures have occurred mostly in three time periods, i.e., April and June 2016, August and September 2016, and December 2016. The sensors have failed at the same times. This has probably happened because of a problem in the electricity or supplies of the sensors.
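The completeness check described in this section can be sketched as follows in Python. The tuple layout of a reading and the example values are assumptions for illustration; the dataset's actual file format may differ.

```python
def find_missing_readings(readings, sensors, chemicals):
    """List every (timestamp, sensor, chemical) combination with no record.

    readings: iterable of (timestamp, sensor, chemical, value) tuples
              (an assumed layout).
    """
    seen = {(t, s, c) for t, s, c, _value in readings}
    timestamps = sorted({t for t, _s, _c in seen})
    return [
        (t, s, c)
        for t in timestamps
        for s in sensors
        for c in chemicals
        if (t, s, c) not in seen
    ]


readings = [
    ("2016-04-01 00:00", 1, "Methylosmolene", 0.4),
    ("2016-04-01 00:00", 1, "AGOC-3A", 1.1),
    ("2016-04-01 00:00", 2, "AGOC-3A", 0.9),  # sensor 2 lacks Methylosmolene
]
missing = find_missing_readings(
    readings, sensors=[1, 2], chemicals=["Methylosmolene", "AGOC-3A"]
)
```

Plotting the timestamps of the missing combinations per chemical gives the kind of density curves shown in a ridge plot.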

Note that some more analysis of the performance of sensors is done in the following sections.

Figure 13: The quantile plots of the sensor records of the chemicals. Rows and columns correspond to the sensors and chemicals, respectively, in the order given in Fig. 1 and Table 2.

3.3 Analysis of Chemicals

In this section, we analyze the chemicals recorded by the nine sensors. The unit of the records is parts per million.

3.3.1 Distribution of The Recorded Chemicals

The quantile plot of the recordings of every chemical by every sensor is depicted in Fig. 13. In the following, we analyze the quantile plots:

  • For some chemicals and sensors, most of the recordings are very small, as expected under the rules of environmental protection. However, we can see some outliers showing high values of the chemicals. For example, for all sensors, we have very high values (30 to 80 parts per million) of Methylosmolene and AGOC-3A.

  • We also have some outliers (relatively high values) of Appluimonia and Chlorodinine recorded by different sensors. For example, sensor 2 has recorded some high values of Chlorodinine.

  • For most of the sensors, the recordings of Appluimonia and Chlorodinine are heavily concentrated near zero, giving a strongly skewed distribution. This makes sense because the values are supposed to be small.

  • There is some strange granularity in the distribution of Appluimonia and Chlorodinine recorded by sensor 4. This probably indicates either sensor failures in the recording or some strange patterns of chemicals in that region.

  • In the data recorded by sensor 6, we can see some strange knee points where the quantile curve breaks. This again might be caused by failures of the sensor or by odd patterns of chemical values. However, as this has happened for all chemicals, the hypothesis of a sensor problem is stronger.
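A quantile plot pairs each sorted value with the proportion of the data at or below it. The short Python sketch below uses the plotting positions (i + 0.5)/n, which is one common convention and an assumption here; the paper does not state which convention its R plots use.

```python
def quantile_points(values):
    """Return (proportion, value) pairs for a quantile plot."""
    ordered = sorted(values)
    n = len(ordered)
    # (i + 0.5) / n is one common plotting-position convention.
    return [((i + 0.5) / n, value) for i, value in enumerate(ordered)]


sample_readings = [3, 1, 2, 4]  # made-up values
points_to_plot = quantile_points(sample_readings)
```

In such a plot, a distribution concentrated near zero with a few outliers appears as a long flat run followed by a steep rise at the right, and a knee is a sudden change of slope.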

3.3.2 The Recorded Chemicals Over Time

Figure 14: The time series plots of the sensor records of the chemicals. Rows and columns correspond to the sensors and chemicals, respectively, in the order given in Fig. 1 and Table 2. The red curves are the smoothed curves of the time series.

The time interval of the recordings of the sensors is from April 1st, 2016 to December 31st, 2016. The recordings of the four chemicals by the nine sensors are shown as time series in Fig. 14. This figure also shows the smoothed curves of the time series by red color. We analyze the different time series in the following:

  • Again, we see that for all sensors, we have very high values (30 to 80 parts per million) of Methylosmolene and AGOC-3A at some times. These peaks have happened at some specific times and not at all times. A more accurate report of these times is presented in the next sections.

  • We can also see some high values of Appluimonia and Chlorodinine at some specific times. A more accurate report of these times is presented in the next sections.

  • There is some strange granularity in the time series of Appluimonia and Chlorodinine recorded by sensor 4. But this issue does not exist for the records of Methylosmolene and AGOC-3A by this sensor, suggesting that it might not be caused by sensor failure. Given the increasing behaviour of the records of Appluimonia and Chlorodinine by sensor 4, these chemicals appear to have accumulated around the regions covered by sensor 4.

  • The records of sensor 6 show high values more frequently for all four chemicals, hinting that some odd chemical accumulation might have happened in different months of 2016. This might explain the knee points in the quantile plots of this sensor in Fig. 13.
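To illustrate how a smoothed curve damps isolated peaks while preserving slow trends, here is a minimal Python sketch using a centered moving average. This is a stand-in: the smoother and span actually used for the red curves are not stated in the text, and the window size below is arbitrary.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        smoothed.append(sum(values[start:stop]) / (stop - start))
    return smoothed


series = [0, 0, 9, 0, 0]  # a single made-up spike
smoothed_series = moving_average(series, window=3)
```

A single spike is spread out and lowered, whereas a steadily increasing series, like the one described for sensor 4, keeps its upward trend after smoothing.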

3.3.3 Unusually High Values of Chemicals

Using the interactive “loon” package in the R programming language (Waddell & Oldford, 2018), we plot the pairs plot (scatter plot matrix), parallel axes plot, and star plot of the recorded chemicals while the plots are linked together. These plots are shown in figures 15 and 16. For every sensor, the three plots are shown.

Figure 15: The pairs plot, parallel axes plot, and star plot for chemical recordings by sensors (continued in Fig. 16). In the parallel axes and star plots, the data is scaled by variable. The red, green, brown, and orange points indicate the large values of Appluimonia, Chlorodinine, Methylosmolene, and AGOC-3A, respectively. Note that in some cases, a combination of colors shows the large values for a chemical.
Figure 16: Continuation of Fig. 15.

In the plots of every sensor, we highlight the unusually high values of the four chemicals. We use different colors for the different chemicals. The utilized colors are red, green, brown, and orange for large values of Appluimonia, Chlorodinine, Methylosmolene, and AGOC-3A, respectively.

Afterwards, using the manually determined colors, we label the recorded data as usual and high values for each chemical recorded by every sensor. Therefore, we have the indices of high chemical values for every chemical in every sensor record. These indices are used in the following sections where we analyze the high values of chemicals.

The correlation of the recorded chemical values can be discussed here. Here, the chemicals are the dimensions of data. The correlation of the variables (dimensions) can be analyzed using either scatter plots, parallel axes, or the star plot. In scatter plot, if the points almost form a line with positive/negative slope, the correlation is positive/negative. In the parallel axes or the star plot, if the high/low values of a variable correspond to the high/low values of the other variable, those two variables are positively correlated; however, if the high/low values of a variable correspond to the low/high values of another one, we have negative correlation.
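The visual reading of correlation can be complemented by a number. The Python sketch below computes the Pearson correlation coefficient for two variables; it is an illustrative addition, since the paper judges correlation from the plots only, and the input values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # points on a rising line
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # points on a falling line
```

Values near +1 or -1 correspond to points lying close to a line with positive or negative slope, and values near 0 indicate no linear relation.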

From the plots, we can see that:

  • In sensor 1, if we ignore some rare outliers (high values of chemicals), we can say that Appluimonia and Methylosmolene, Chlorodinine and Methylosmolene, and Methylosmolene and AGOC-3A are almost independent. That is because, in the scatter plot, the value of one variable does not change as the other variable varies.

  • In sensor 2, if we ignore some rare outliers (high values of chemicals), we can say that Appluimonia and Chlorodinine, Appluimonia and Methylosmolene, and Chlorodinine and Methylosmolene are almost independent. That is because, in the scatter plot, the value of one variable does not change as the other variable varies.

  • In sensor 3, the correlation of Appluimonia and Chlorodinine, the correlation of Appluimonia and AGOC-3A, and the correlation of Chlorodinine and AGOC-3A are relatively small compared to other correlations because the points of the scatter plot roughly (but not completely) cover many parts of the plot, similar to the scatter plot of two independent variables.

  • In sensor 4, the variables Appluimonia and Chlorodinine, Appluimonia and AGOC-3A, and Chlorodinine and AGOC-3A are positively correlated.

  • In sensor 5, the variables Appluimonia and Chlorodinine, Appluimonia and AGOC-3A, and Chlorodinine and AGOC-3A are negatively correlated. Also, we can see from the pattern of scatter plots that Appluimonia and Methylosmolene, Chlorodinine and Methylosmolene, and Methylosmolene and AGOC-3A have some interesting and specific dependence.

  • In sensor 6, we can see from the pattern of scatter plots that all the chemicals have some interesting and specific dependence.

  • In sensor 7, if we ignore some rare outliers (high values of chemicals), we can say that Appluimonia and Chlorodinine, and Chlorodinine and AGOC-3A are almost independent. Also, we can see that Appluimonia and AGOC-3A are almost negatively correlated.

  • In sensor 8, if we ignore some rare outliers (high values of chemicals), we can say that Appluimonia and Methylosmolene, Chlorodinine and Methylosmolene, and Methylosmolene and AGOC-3A are almost independent.

  • In sensor 9, we can say that Appluimonia and AGOC-3A are almost negatively correlated. Also, we can see from the pattern of scatter plots that Methylosmolene and AGOC-3A have some interesting and specific dependence.
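The observations above are read visually from the scatter plots. As a rough numerical companion, one could compute the Pearson correlation for every pair of chemicals recorded by one sensor. The following is a minimal Python sketch (the visualizations in this paper were done in R); the function names and the input layout are assumptions made for illustration. Note that a correlation near zero only indicates the absence of a linear relationship, so it cannot detect the "specific dependence" patterns mentioned for sensors 5, 6, and 9.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no linear relationship
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(readings):
    """Correlation for every pair of chemicals of one sensor.

    readings: dict mapping a chemical name to its list of readings
    (hypothetical layout; all lists must be aligned in time).
    """
    names = sorted(readings)
    return {
        (a, b): pearson(readings[a], readings[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Values close to +1 or -1 would correspond to the positively or negatively correlated pairs listed above, while values close to 0 would correspond to the pairs described as almost independent.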

3.4 Analysis of Meteorological Data and Factories

3.4.1 Distribution of The Meteorological Data

The dataset provides us the meteorological data for a time period. The given meteorological data include both wind direction and speed. The wind speed is in meters per second and the wind direction is where the wind is originating from, using a north-referenced azimuth bearing (Rutstrum, 2000) where 360/000 is true north. To better explain the wind direction in this standard, note that:

  • Wind direction 0 (or 360) means wind is blowing from north to south.

  • Wind direction 90 means wind is blowing from east to west.

  • Wind direction 180 means wind is blowing from south to north.

  • Wind direction 270 means wind is blowing from west to east.
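The convention above can be made concrete with a short Python sketch. It converts a "blowing from" bearing into the direction the air travels toward. The function names and the eight-point compass rounding are choices made for this illustration and are not part of the dataset.

```python
import math

COMPASS = ["north", "north-east", "east", "south-east",
           "south", "south-west", "west", "north-west"]


def compass_name(bearing_deg):
    """Nearest of eight compass points for a north-referenced bearing."""
    index = int((bearing_deg % 360) / 45 + 0.5) % 8
    return COMPASS[index]


def blowing_toward(from_bearing_deg):
    """Bearing the air moves toward, given the bearing it comes from."""
    return (from_bearing_deg + 180) % 360


def describe(from_bearing_deg):
    """Human-readable description such as 'from north to south'."""
    return "from {} to {}".format(
        compass_name(from_bearing_deg),
        compass_name(blowing_toward(from_bearing_deg)))


def travel_vector(from_bearing_deg):
    """Unit (east, north) vector along which the air travels."""
    theta = math.radians(blowing_toward(from_bearing_deg))
    return (math.sin(theta), math.cos(theta))
```

For example, `describe(270)` gives "from west to east", matching the last item of the list above.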

We plot the scatter plot of the meteorological data versus time in Fig. 17. From this figure, we find out that the given meteorological data contain information for three months, i.e., April 2016, August 2016, and December 2016.

Figure 17: The meteorological data provided for three different months, i.e., April 2016, August 2016, and December 2016.
Figure 18: The hexagonal 2D histograms of meteorological data, i.e., the wind direction versus the wind speed during months April, August, and December 2016.

As the meteorological data contain April 2016, August 2016, and December 2016, we separate the meteorological data of these three months from the dataset. Figure 18 shows the hexagonal 2D histograms of meteorological data, i.e., the wind direction versus the wind speed during the months April, August, and December 2016. From these histograms, we can see the density and concentration of the wind direction and speed in the three months:

  • In April 2016, the wind mostly blows from north to south and the speed is mostly 1 m/s. Moreover, in some cases, we have wind with a speed ranging from less than 1 m/s to 2 m/s blowing from east to west.

  • In August 2016, the wind mostly blows from south-west to north-east and the speed is mostly 1 to 2 m/s.

  • In December 2016, the wind mostly blows either from south to north with speed 1.5 m/s or from north-west to south-east with speed 4 m/s.

3.4.2 The Meteorological Data During Time

Figure 19: The time series of meteorological data, i.e., wind direction and speed during months April 2016 (first column), August 2016 (second column), and December 2016 (third column). The green curves are the smoothed curves of the time series.

It is also useful to show the meteorological data in terms of time because Fig. 18 does not encode the detailed time within the months. Figure 19 shows the time series of meteorological data, including both wind direction and speed, for the three months. We can conclude from this figure that:

  • For April 2016, on average, the wind direction is from north to south but in the middle of April, we have many different directions of wind. The speed is around 1 or 2 m/s but is slowly decreasing.

  • For August 2016, on average, the wind blows from south-west to north-east with speed 1 or 2 m/s.

  • For December 2016, on average, the wind blows from south-west to north-east with speed 2 or 3 m/s. The wind direction and speed have fluctuation in this month.
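A caveat when summarizing wind direction "on average" is that bearings wrap around at 0/360 degrees, so a plain arithmetic mean can be misleading: the mean of 350 and 10 is 180 (south), although both bearings are close to north. The following Python sketch shows one common remedy, the circular mean. It is an illustration only and is not the smoothing used for the green curves in Fig. 19; the function name is an assumption.

```python
import math


def circular_mean_deg(bearings):
    """Mean of compass bearings in degrees, respecting the 0/360 wrap.

    Each bearing is mapped to a unit vector, the vectors are summed,
    and the angle of the resulting vector is returned in [0, 360).
    """
    s = sum(math.sin(math.radians(b)) for b in bearings)
    c = sum(math.cos(math.radians(b)) for b in bearings)
    if math.hypot(s, c) < 1e-9:
        # Opposite bearings cancel, so no mean direction exists.
        raise ValueError("bearings cancel out; mean direction undefined")
    return math.degrees(math.atan2(s, c)) % 360
```

Applied to the bearings 350 and 10, this returns a value at (or within rounding error of) true north, as expected.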

We summarize the meteorological information extracted from Figures 18 and 19 in Table 4.

Month | During the month | On average
April | north to south, 1 to 2 m/s; some fluctuation in mid April | north to south, 1 or 2 m/s
August | south-west to north-east, 1 to 2 m/s | south-west to north-east, 1 to 2 m/s
December | south to north, 1.5 m/s; north-west to south-east, 4 m/s | south-west to north-east, 2 or 3 m/s
Table 4: Meteorological information in April, August, and December 2016.

3.4.3 Analysis of High Chemical Values with the Meteorological Data

Sensor | Time | Wind Direction | Suspect factories
1 | April 26 |  | -
1 | August 1 |  | -
1 | August 20 |  | all
2 | April 4 |  | -
2 | April 14 |  | -
2 | April 29 |  | all
2 | August 1 |  | -
2 | December 7 |  | all
3 | April 6 |  | Roadrunner, Kasios
3 | April 20 |  | -
3 | April 26 |  | -
3 | August 1 |  | Roadrunner, Kasios
3 | August 20 |  | Roadrunner, Kasios
3 | August 25 |  | Roadrunner, Kasios
3 | December 18 |  | Roadrunner, Kasios
4 | August 1 |  | -
4 | December 8 |  | all
5 | August 13 |  | -
5 | December 8 |  | Roadrunner, Kasios
5 | December 28 |  | Roadrunner, Kasios
6 | April 29 |  | Roadrunner, Radiance
6 | August 1 |  | -
6 | December 8 |  | Kasios
7 | April 29 |  | Roadrunner, Kasios
7 | August 1 |  | -
7 | December 22 |  | -
7 | December 29 |  | -
8 | April 15 |  | Roadrunner, Kasios
9 | December 5 |  | all
9 | December 18 |  | Radiance, Indigo
9 | December 25 |  | Radiance, Indigo
Table 5: Analyzing which factories are responsible for high values of Appluimonia chemical.
Sensor | Time | Wind Direction | Suspect factories
1 | April 16 |  | -
1 | August 1 |  | -
2 | August 1 |  | -
2 | August 20 |  | all
3 | August 1 |  | -
3 | December 1 |  | all
3 | December 5 |  | -
3 | December 10 |  | -
3 | December 29 |  | all
4 | April 5 |  | Roadrunner, Kasios
4 | December 5 |  | Radiance, Indigo
4 | December 18 |  | Roadrunner, Kasios
4 | December 25 |  | Roadrunner, Kasios
5 | August 15 |  | all
5 | December 12 |  | Roadrunner, Kasios
5 | December 22 |  | Roadrunner, Kasios
6 | April 4 |  | Roadrunner, Kasios
6 | April 10 |  | Roadrunner, Kasios
6 | April 26 |  | Roadrunner, Kasios
6 | December 22 |  | Roadrunner, Kasios
7 | December 5 |  | -
8 | April 14 |  | Roadrunner, Kasios
8 | April 26 |  | Roadrunner, Kasios
9 | December 18 |  | all
Table 6: Analyzing which factories are responsible for high values of Chlorodinine chemical.
Sensor | Time | Wind Direction | Suspect factories
1 | December 8 |  | -
2 | April 16 |  | Roadrunner, Kasios
2 | August 1 |  | -
2 | August 20 |  | all
3 | August 1 |  | -
3 | December 12 |  | Roadrunner, Kasios
4 | April 6 |  | Roadrunner, Kasios
5 | August 10 |  | Radiance, Indigo
6 | April 3 |  | Roadrunner, Kasios
6 | April 10 |  | Roadrunner, Kasios
6 | December 3 |  | Roadrunner, Kasios
6 | December 9 |  | Kasios
7 | April 15 |  | Roadrunner, Kasios
7 | December 5 |  | -
8 | April 15 |  | Roadrunner, Kasios
9 | April 11 |  | Radiance, Indigo
Table 7: Analyzing which factories are responsible for high values of Methylosmolene chemical.
Sensor | Time | Wind Direction | Suspect factories
1 | December 8 |  | -
2 | April 16 |  | all
2 | August 1 |  | -
2 | August 20 |  | all
2 | December 8 |  | all
3 | August 13 |  | -
3 | August 17 |  | Roadrunner, Kasios
3 | August 29 |  | -
3 | December 12 |  | all
3 | December 26 |  | -
3 | December 28 |  | Roadrunner, Kasios
4 | April 6 |  | Roadrunner, Kasios
4 | December 1 |  | Radiance, Indigo
4 | December 12 |  | Roadrunner, Kasios
4 | December 18 |  | Roadrunner, Kasios
5 | April 6 |  | all
5 | August 1 |  | Roadrunner, Kasios
5 | December 17 |  | all
6 | April 15 |  | Roadrunner, Radiance
7 | April 15 |  | Roadrunner, Kasios
7 | April 19 |  | Roadrunner, Kasios
7 | August 1 |  | -
7 | December 15 |  | -
8 | April 15 |  | Roadrunner, Kasios
9 | August 12 |  | all
9 | August 25 |  | Radiance, Indigo
Table 8: Analyzing which factories are responsible for high values of AGOC-3A chemical.

In this section, we combine the information extracted from the records of chemical values and the meteorological data in order to figure out which manufacturing factories were responsible for high values of which chemicals and when those happened. We plot the scatter plots of the meteorological data in the three months of April, August, and December in Figures 20 and 21. In these figures, we do this plotting for all the nine sensors where the points in the scatter plots are color-coded by whether they are small or unusually high values of chemicals. Hence, for every sensor, we have three scatter plots for the three months where the unusual values of chemicals are shown. The colors of the unusual values of chemicals match the colors used in Figures 15 and 16.

Figure 20: The scatter plot of wind speed and direction, with the unusually high values of chemicals recorded by the sensors highlighted (continued in Fig. 21). The colors of the points match those used in Figures 15 and 16. The opposite red rectangle, the green diamond, the brown rectangle, and the orange triangle correspond to high values of Appluimonia, Chlorodinine, Methylosmolene, and AGOC-3A, respectively. The small blue points correspond to the normal chemical amounts.
Figure 21: Continuation of Fig. 20.

The normal values of chemicals are colored transparent (alpha-blended) blue in Figures 20 and 21. Therefore, the concentration and density of the wind direction and speed can also be observed in these plots. This analysis is the same as the analysis of the concentration of data in Fig. 18.

The detailed time within the month is missing in the plots of Figures 20 and 21. Therefore, it is also useful to have another set of plots where the time is also encoded. Figures 22 and 23 include the time series plots of the meteorological data where the rows and columns correspond to the nine sensors and the three months, respectively. The same color coding as in Figures 20 and 21 is utilized. The question is whether plotting the time series of wind direction or wind speed is better. For figuring out which factory was responsible for which chemical, the wind direction is the more essential information. Therefore, we plot the time series of wind direction.

Figure 22: The time series of wind direction, with the unusually high values of chemicals recorded by the sensors highlighted (continued in Fig. 23). The colors of the points match those used in Figures 15 and 16. The opposite red rectangle, the green diamond, the brown rectangle, and the orange triangle correspond to high values of Appluimonia, Chlorodinine, Methylosmolene, and AGOC-3A, respectively. The small blue points correspond to the normal chemical amounts.
Figure 23: Continuation of Fig. 22.

From Figures 20, 21, 22, and 23, we can infer which factories are responsible for the high values of chemicals. To make the analysis easier, we write the information obtained from these figures in Tables 5, 6, 7, and 8, one table per chemical. For every chemical, we list the sensor, the date, and the wind direction in the tables. For more convenience and better visualization, the wind direction is shown by arrow glyphs.

Using the wind direction and the sensor index and by observing the location of sensors in Fig. 1, we list the suspect factories which might have caused the high values of chemicals. Notice that some wind directions for some sensors do not give us any suspect, while others can implicate all the factories. In the tables, we write a dash and “all” for these two cases, respectively. As can be seen in Tables 5, 6, 7, and 8, for different months and for almost all four chemicals, the two factories Roadrunner and Kasios are suspect. According to Table 3, Roadrunner Fitness Electronics produces fitness trackers, heart rate monitors, and sport-related products, and Kasios Office Furniture makes metal and composite-wood office furniture. Interestingly, a first glance at Table 3 makes us suspicious of Radiance ColourTek and Kasios Office Furniture because producing their products seems to be more harmful to the environment. However, the analysis of the data using data visualization gives us the clue that the two factories Roadrunner Fitness Electronics and Kasios Office Furniture are suspect. It makes sense that Kasios Office Furniture causes some trouble because of its metal products. Roadrunner Fitness Electronics might have caused trouble because of some chemicals used for building its sport-related products.
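The reasoning used to fill the tables can be sketched in code: a factory is a suspect for a reading when it lies upwind of the sensor, i.e., when the bearing from the sensor to the factory is close to the bearing the wind comes from. The Python sketch below is only an illustration of that idea. The planar coordinate layout, the function names, and the 20 degree tolerance are assumptions; in this paper the suspects were identified by eye from Fig. 1.

```python
import math


def bearing_deg(origin, target):
    """North-referenced bearing from origin to target.

    Points are (x_east, y_north) pairs in any consistent unit.
    """
    dx = target[0] - origin[0]
    dy = target[1] - origin[1]
    return math.degrees(math.atan2(dx, dy)) % 360


def angular_gap(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def suspect_factories(sensor_xy, wind_from_deg, factories, tolerance_deg=20.0):
    """Names of factories lying upwind of the sensor.

    factories: dict mapping a name to (x_east, y_north) coordinates
    (hypothetical layout). tolerance_deg is an assumed angular window
    around the wind bearing, not a value from the paper.
    """
    return sorted(
        name for name, xy in factories.items()
        if angular_gap(bearing_deg(sensor_xy, xy), wind_from_deg) <= tolerance_deg
    )
```

An empty result corresponds to the dash case described above, and a result containing every factory corresponds to the “all” case.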

From Tables 5, 6, 7, and 8, we can also see that:

  • The high values of Appluimonia have often happened in August and December 2016.

  • The high values of Chlorodinine have often happened in April and December 2016.

  • The high values of Methylosmolene have often happened in April and December 2016.

  • The high values of AGOC-3A have often happened in all three months, i.e., April, August, and December 2016.

4 Analysis of Aerial Images

4.1 The Multi-Channel Images

There are multi-channel aerial images of the preserve provided in the data subset 3. The sizes of images are pixels and they are taken from the preserve in different seasons of years 2014 to 2016. The channels of the images are blue, green, red, Near Infrared (NIR), Short-Wave Infrared (SWIR) 1, and Short-Wave Infrared (SWIR) 2. These channels include different bands in the electromagnetic spectrum. These bands are reported in Table 9 where their characteristics are also mentioned.

Band | Color | Wavelength (nm) | Useful for Mapping
B1 | Blue | 450-520 | penetrates water, shows thin clouds
B2 | Green | 520-600 | shows different types of plants
B3 | Red | 630-690 | shows vegetation color and mineral deposits
B4 | NIR | 770-900 | partially absorbed by water, shows chlorophyll and vegetation
B5 | SWIR 1 | 1550-1750 | absorbed by liquid water, shows moisture of soil and vegetation
B6 | SWIR 2 | 2090-2350 | insensitive to vegetation, shows differences in soil mineral
Table 9: Different bands of multi-channel images and their characteristics.

4.2 Estimation of Scale and Orientation of Images

Figure 24: The Boonsong lake within the preserve. Left: the provided RGB image of this lake in the dataset. Right: The cropped image of the lake from the RGB image of June 2016.

In the dataset, there is an RGB image of the Boonsong lake, located within the preserve. This RGB image is shown in Fig. 24. Furthermore, this lake was visually found in the RGB image of June 2016 and cropped from the image. The cropped image is also shown in Fig. 24. The size of the cropped image is pixels. The length of this lake is given feet. On the other hand, using Pythagorean theorem, we have:

This gives us the scale and resolution of the provided multi-channel images:

The orientation of the images can also be found using the orientation of the image of Boonsong lake. This lake is oriented north-south; hence, the images are oriented from north-east at top to south-west at bottom.
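The scale estimation described in this section can be illustrated with a short Python sketch. It assumes that the lake's known length runs along the diagonal of the cropped image, which is our reading of the Pythagorean step; the function name is hypothetical, and the actual pixel sizes and lake length must be supplied by the reader, since the sketch contains no values from the dataset.

```python
import math


def feet_per_pixel(width_px, height_px, lake_length_ft):
    """Ground distance covered by one pixel.

    Assumes the lake spans the diagonal of a crop that is width_px by
    height_px pixels, so the diagonal length in pixels corresponds to
    lake_length_ft feet on the ground.
    """
    if width_px <= 0 or height_px <= 0 or lake_length_ft <= 0:
        raise ValueError("all inputs must be positive")
    diagonal_px = math.hypot(width_px, height_px)  # Pythagorean theorem
    return lake_length_ft / diagonal_px
```

Multiplying the result by the image width or height in pixels would then give an estimate of the ground extent covered by each aerial image.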

4.3 Analysis of RGB Images

Using the first three channels, we can form the RGB images shown in Fig. 25. As can be seen in this figure, for every year, we have four images in the order of late winter, summer, early fall, and late fall. The images of winter (first column) and late fall (last column) show snow, ice, and a frozen lake, as expected.

Two of the images which are images of November 2014 and November 2015 are covered by clouds completely. Some small clouds also exist in images of August 2014, September 2015, March 2016, June 2016, and September 2016.

From the RGB images, it can be seen that in summer and early fall, we have more vegetation as expected. However, in winter and late fall, the vegetation declines slightly because of the cold.

4.4 Analysis of Changes In Vegetation Health

According to Table 9, the bands B3 and B4 represent the vegetation and chlorophyll. Therefore, if we make false-color images using the three channels of [B4, B3, B2], we can show the changes in plant health. These images are shown in Fig. 26. The red colors in these false-color images show the good plant health.

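Another way to assess the vegetation health is to plot the Normalized Difference Vegetation Index (NDVI). The NDVI can be calculated as this ratio (NASA Earth Observatory, 2000):

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}} = \frac{B4 - B3}{B4 + B3}$$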

This ratio contains both the B3 and B4 bands, which are sensitive to vegetation and plant health. In Fig. 27, we plot this ratio across the image frame in order to assess the vegetation. In this figure, NDVI values above the threshold of 0.1 (out of 1) are highlighted in green to show good vegetation health.
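As an illustration, the NDVI computation and the thresholding behind Fig. 27 can be sketched in Python as follows. The NDVI is the standard normalized difference of the NIR (B4) and red (B3) bands, and the default threshold of 0.1 is the one stated in the caption of Fig. 27. The function names, the nested-list image layout, and the handling of pixels where both bands are zero are assumptions of this sketch.

```python
def ndvi(nir, red):
    """NDVI of one pixel from its NIR (B4) and red (B3) values."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention: no signal in either band
    return (nir - red) / total


def ndvi_mask(b4, b3, threshold=0.1):
    """Per-pixel True where NDVI exceeds the threshold.

    b4 and b3 are 2D lists (rows of pixel values) of the same shape.
    """
    return [
        [ndvi(n, r) > threshold for n, r in zip(row4, row3)]
        for row4, row3 in zip(b4, b3)
    ]


def healthy_fraction(mask):
    """Share of pixels flagged as healthy vegetation."""
    flat = [v for row in mask for v in row]
    return sum(flat) / len(flat) if flat else 0.0
```

Comparing `healthy_fraction` between two dates would give a single number to accompany the visual comparison of the highlighted regions, provided that cloudy images are excluded first.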

Note that in Figures 26 and 27, images of November 2014 and November 2015 might not be valid for consideration because of existence of thick clouds. As can be seen in images of Figures 26 and 27:

  • In summer and early fall, the vegetation health is better which is expected because of suitable weather conditions.

  • By noticing March 2014 and February 2015, the vegetation of winter 2015 has declined compared to winter 2014.

  • By noticing August 2014 and June 2015, the vegetation of summer 2015 has improved compared to summer 2014.

  • By noticing images of years 2015 and 2016, the overall vegetation of 2016 has improved compared to 2015.

  • By noticing images of June 2015 and June 2016, the vegetation in the upper-left corner of the image has suspiciously declined. According to the orientation of the images, this means that the vegetation in the north and north-west of the preserve has declined.

  • Noticing all images of 2014, 2015, and 2016, we see that the overall vegetation of preserve is declining across the years.

4.5 Analysis of Dry Areas

According to Table 9, the bands B4 and B5 represent the dry areas because they are absorbed by liquid water. Therefore, if we make false-color images using the three channels of [B5, B4, B2], we can show the dry regions of the preserve. These images are shown in Fig. 28. The red colors in these false-color images show the dry or burned regions.

Another way to observe the dry regions of the preserve is to merely consider the band B5, which is completely absorbed by liquid water (see Table 9). In Fig. 29, we plot this band across the image frame in order to observe the dry regions. In this figure, values of B5 above the threshold of 0.7 (out of 1) are highlighted in red to show the dry regions. In the images of this figure, some areas are falsely highlighted because of the clouds. These areas are marked by rectangles and are excluded from our analysis.
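The thresholding with excluded rectangles can be sketched in Python as follows. The default threshold of 0.7 is the one stated in the caption of Fig. 29, while the function name, the nested-list layout, and the rectangle format are assumptions of this illustration. In this paper the rectangles were placed by visual inspection.

```python
def dry_mask(b5, threshold=0.7, excluded=()):
    """Per-pixel True where B5 exceeds the threshold.

    b5 is a 2D list of values scaled to [0, 1]. Each excluded rectangle
    is (row_start, row_stop, col_start, col_stop) with exclusive stops;
    pixels inside any rectangle are never flagged, which removes false
    positives such as clouds.
    """
    def is_excluded(r, c):
        return any(r0 <= r < r1 and c0 <= c < c1
                   for r0, r1, c0, c1 in excluded)

    return [
        [value > threshold and not is_excluded(r, c)
         for c, value in enumerate(row)]
        for r, row in enumerate(b5)
    ]
```

The same function could be applied to band B6 to highlight mineral-rich soil, since that analysis uses the same kind of threshold and exclusion rectangles.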

As can be seen in Figures 28 and 29:

  • Noticing images of August 2014 and June 2015, we can say that 2015 has been drier than 2014.

  • Noticing images of June 2015 and June 2016, we can say that 2016 has had more moisture than 2015 in summer.

  • Noticing images of September 2015 and September 2016, we can say that 2016 has been drier than 2015 in fall.

  • Noticing all images of 2014, 2015, and 2016, we see that overall, the preserve is becoming drier and drier across the years.

4.6 Analysis of Soil Mineral Content

According to Table 9, the band B6 is sensitive to the soil mineral content. In Fig. 30, we plot this band across the image frame in order to observe the mineral content of the soil. In this figure, values of B6 above the threshold of 0.7 (out of 1) are highlighted in yellow to show mineral-rich soil. Note that soil minerals help plants to grow. In the images of this figure, some areas are falsely highlighted because of the clouds. These areas are marked by rectangles and are excluded from our analysis.

As can be seen in Fig. 30:

  • Noticing images of August 2014 and June 2015, we can say that 2015 has been slightly richer in soil mineral than 2014 in summer.

  • Noticing images of June 2015 and June 2016, we can say that 2016 has been poorer in soil mineral than 2015 in summer.

  • Noticing images of September 2015 and September 2016, we can say that 2016 has been slightly richer in soil mineral than 2015 in fall.

  • Noticing all images of 2014, 2015, and 2016, we see that overall, the preserve is becoming slightly richer in soil mineral.

If we combine the observations of Figures 27 and 30, we see that the vegetation is declining but the soil mineral content is improving across the years. This shows that the vegetation decline is not caused by the soil; it must be caused by the chemicals in the air. This points back to the analysis of the chemicals in the previous section.

5 Conclusion

5.1 Summary of The Findings

In the previous sections, we mentioned various large and small suspicious activities and events which may have happened in the preserve. Here, we summarize the most important findings and do not re-mention the small details or not very important conclusions.

The main conclusions from the analysis of vehicle traffic through the nature preserve were:

  • The data of some two-axle cars, two-axle trucks, three-axle trucks, and two-axle buses have not been recorded after some camping and general gates.

  • Some four-axle trucks have passed through the gates illegally. These gates are supposed to be passed only by rangers.

  • Rangers have missed inspecting in the east of the preserve mostly. However, non-ranger vehicles have visited the east significantly.

  • Around June 2015, the traffic of all types of vehicles has increased a lot. This might be destructive to the nature preserve or might have had bad effects on the quality of life of flora and fauna in the preserve.

The main conclusions from the analysis of surrounding factories were:

  • Some sensor failures have occurred mostly at three time periods, i.e., April and June 2016, August and September 2016, and December 2016.

  • At some periods of time, we have had very high values (30 to 80 parts per million) of Methylosmolene and AGOC-3A. These two chemicals are very dangerous and toxic so they are very destructive to the life of flora and fauna in the preserve.

  • Some odd accumulations of Appluimonia and Chlorodinine were recorded by sensor 6 in different months of 2016. The values of these two chemicals have increased in 2016. Sensor 6 is close to all the factories.

  • In different months of 2016, the two factories Roadrunner and Kasios are suspected of generating the four chemicals Appluimonia, Chlorodinine, Methylosmolene, and AGOC-3A at levels higher than the acceptable standards.

The main conclusions from the analysis of aerial multi-channel images were:

  • Overall, the nature preserve is becoming drier and drier across the years 2014 to 2016.

  • Overall, the nature preserve is becoming poorer in vegetation across the years 2014 to 2016.

  • The vegetation of the preserve is declining but the soil mineral is improving across the years. This shows that the reason of vegetation decline is not because of the soil but it must be the chemicals in the air.

5.2 Combining The Findings

If we see the different findings mentioned in the previous section, we can come up with some combined findings which might be useful for final analysis:

  • The significant number of visits to the preserve in June 2015 might have had destructive influence on the preserve resulting in drier soil and decline in vegetation.

  • The increase of toxic chemicals generated by the two factories Roadrunner and Kasios in 2016 might have had a bad impact on the respiration of plants, resulting in a decline of vegetation.

  • The increase of toxic chemicals might also have had destructive effect on the life of animals and birds including the Blue Pipit.

  • The odd accumulation of the chemicals recorded by sensor 6 and the strange failures in the performance of sensors might have some relations.

5.3 The Possible Reasons of Population Decline of The Blue Pipit

Finally, we can list the possible reasons of the decline in population of the Blue Pipit:

  • The high number of visits to the preserve in June 2015.

  • Some strange activities in the east of the preserve while the inspection of rangers was not satisfactory there.

  • Some illegal passes through the gates. The destructive activities might have happened in those cases.

  • Some strange ending points of trajectories for some vehicles. The destructive activities might have happened in those cases.

  • Some odd problems in performance of the sensors at the south of the preserve in different months of 2016. The sensors might have been manipulated for evil reasons by either the visitors or the factory managers.

  • Generation of toxic chemicals by the two factories Roadrunner and Kasios in different months of 2016.

  • The soil of the preserve has become dry and the vegetation health has declined.

5.4 Suggested Solutions For Preventing Further Problems

As a conclusion to the paper, we suggest some solutions for preventing similar problems in the preserve:

  • Controlling the number of visitors.

  • Permanent investigation in “all” regions of the preserve and in “various times”.

  • Increasing security at the gates.

  • Controlling whether the visitors have left the preserve or not after a reasonable time.

  • Controlling and investigating the chemicals generated by the factories more carefully.

  • Using fertilizers regularly in order to make the soil of the preserve richer for better vegetation. Better vegetation and flora helps to have better life for fauna.

Figure 25: The RGB aerial images from the preserve during years 2014 to 2016.
Figure 26: The images obtained from bands [B4, B3, B2] from the multi-channel images of the preserve during years 2014 to 2016.
Figure 27: The Normalized Difference Vegetation Index (NDVI) of the preserve during years 2014 to 2016. The threshold for highlighting the healthy vegetated areas is 0.1 out of 1.
Figure 28: The images obtained from bands [B5, B4, B2] from the multi-channel images of the preserve during years 2014 to 2016.
Figure 29: The images obtained from the band B5 of the multi-channel images of the preserve during years 2014 to 2016. The threshold for highlighting the dry or burned areas is 0.7 out of 1. The areas which are falsely colored red because of clouds, lake, or ice are shown by rectangles.
Figure 30: The images obtained from the band B6 of the multi-channel images of the preserve during years 2014 to 2016. The threshold for highlighting the rich areas in minerals is 0.7 out of 1. The areas which are falsely colored yellow because of clouds, lake, or ice are shown by rectangles.

Acknowledgment

The authors hugely thank Prof. Wayne Oldford, professor of Statistics and Actuarial Science at the University of Waterloo, for his helpful guides, fruitful discussions, and his course “Data Visualization”.

References

  • Barnett (1974) Barnett, Vick. Elements of sampling theory. English Universities Press, London, 1974.
  • Friedman et al. (2009) Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The elements of statistical learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.
  • NASA Earth Observatory (2000) NASA Earth Observatory. Measuring vegetation. https://earthobservatory.nasa.gov/features/MeasuringVegetation/measuring_vegetation_2.php, 2000. Accessed: 2018-06-10.
  • Rutstrum (2000) Rutstrum, Calvin. The wilderness route finder: the classic guide to finding your way in the wild. University of Minnesota Press, 2000.
  • Visual Analytics Community (2017) Visual Analytics Community. Vast challenge 2017. http://www.vacommunity.org/VAST+Challenge+2017, 2017. Accessed: 2018-06-10.
  • Waddell & Oldford (2018) Waddell, Adrian and Oldford, Wayne. Loon: An interactive statistical visualization toolkit. http://waddella.github.io/loon/, 2018. Accessed: 2018-06-10.