I Introduction
During the months between October 2015 and March 2016, Greece has witnessed an unprecedented influx of refugees across the sea borders with Turkey. Tens or even hundreds of boats, each carrying 50 people or more, were landing on a daily basis throughout the Greek islands that are closest to the Turkish coastline. One of the most important factors in forecasting the influx rate within a short time frame of 2448 hours is the weather conditions in these sea passages, specifically wind intensity and direction, as these are directly associated with the severity of sea condition and, hence, the danger involved in the crossing. The most lethal events of boats capsizing and sinking happened mostly in days and nights when winds were growing in power or shortly after steady strong winds had already produced high waves in these areas.
Unfortunately, weather conditions is only one of the factors affecting the forecasting of influx rates. Compared to longer trips, e.g. between Libya and Italy, the case of Greece involves a relatively short trip of 57 n.m. and a couple of hours at most, for a fully loaded boat and working engine. This means that even in bad weather, some groups of refugees dared to come across (sometimes forced to by the smugglers) even in very risky conditions. The result was 3,771 registered deaths and many more missing from hundreds of capsizing and sinking events during 2015 alone. According to the United Nations High Commissioner for Refugees (UNHCR) [1], the International Organization for Migration (IOM) [2] and the Medecins Sans Frontieres (MSF) [3], during 2015 more than a million people reached Europe from Turkey and North Africa, seeking safety and asylum.
In order to better organize Search & Rescue (SAR) sea operations and logistical support for the humanitarian relief teams at the first reception islands, it is crucial that some level of influx forecasting is available. Indeed, previous works have shown that this is a realistic goal that can be addressed with datadriven methods in the shortterm [4, 5, 6], regardless of the more general factors and political decisions that affect the refugee flows in the longterm. Hence, it is extremely important to have appropriate tools for uptodate weather forecasting that uses local sources and weather stations, specifically for wind speed and direction at these sea passages. The idea is to have these forecasts available as inputs for the influx prediction modules, together with proper timeseries analysis on both the weather data as well as the influx data itself.
It should be noted that the alternative of using full weather data from sources like the Greek Meteorological Service (EMY) or NOAA is good for postmortem analysis and training datasets, but inappropriate in actual deployment mode for three main reasons: (a) these data are usually massive, small mobile devices cannot parse them, (b) they require reliable Internet access for large downloads, (c) they are available mostly as forecasts 36 hours beforehand, instead of processing realtime data from local weather stations. Therefore, having simple prediction models for winds that can run fast and offline is extremely useful if they are to be deployed in actual SAR operations.
In this study, five areas of firstreception refugee influx ‘hotspots’ is used as the baseline for simple predictive modeling in the Aegean Sea. More specifically, local weather stations in the islands of Lesvos (Petra and Thermi), Chios, Samos and Kos, the five most active sea passages and landing areas during the specific highpeak period, are used as input data for analyzing the statistics of winds (average speed, gust, direction). Furthermore, these five input sources are used in designing simple linear regressors for covering other intermediate areas of interest via crosssite interpolation. Experimental results are presented separately for each location, as well as test cases of two linear regressors, one for nearspot and one for farspot wind prediction.
Ii Material and Data Overview
Iia Weather data series
The data used in this study are a subset of the 20152016 weather data feeds from Meteo.gr, an openaccess Internet portal [7] maintained by the National Observatory of Greece for providing weather data from ground stations. These data were collected and aggregated in the special dataset GRRWL1O15J16 [8], already used in other works for training predictive models that combine localized weather with refugee influx time series.
As mentioned above, five locations with local weather stations were used as the testbed for this study, specifically in the islands of Chios, Kos, Lesvos (Petra and Thermi) and Samos. These were the five most active sea passages and landing areas during the specific highpeak period (1Oct2015 to 31Jan2016), associated with more than 85% of the total influx in the Aegean Sea, and they are used as input data for analyzing the statistics of winds, namely the daily values of average speed, gust and direction. Figure 1 shows the islands and the approximate locations (center of circle) of the local weather stations used as data feeds in this study.
IiB Software packages and hardware
The main software packages that were used in this study were:

Mathworks MATLAB v8.6 (R2015b), including: Signal Processing Toolbox, System Identification Toolbox, Statistics & Machine Learning Toolbox
[9]. 
Additional toolboxes for MATLAB (own & thirdparty) for specific algorithms, as referenced later on in the corresponding sections.

WEKA v3.7.13 (x64). OpenSource Machine Learning Suite [10].

Spreadsheet applications: Microsoft Excel 2007, LibreOffice Sheet 5.1 (x64).

Custombuilt programming tools in Java and C/C++ for data manipulation (import/export).
The data experiments and processing were conducted using: (a) Intel i7 quadcore @ 2.0 GHz / 8 GB RAM / MSWindows 8.1 (x64), and (b) Intel Atom N270 dualcore @ 1.6 GHz / 2 GB RAM / Ubuntu Linux 16.04 LTS (x32).
Iii Methods and Algorithms
The statistical and frequency properties of the daily arrivals data series were analyzed via pairwise correlation and full system identification, specifically by AutoRegressive Moving Average (ARMA) approximations, as described below.
Iiia Autocorrelation analysis
Pairwise correlation produces a quantitative metric for the statistical dependencies between values of two data series at different lags. In the case when a single data series is compared to itself, the autocorrelation corresponds to the statistical dependencies between subsequent values of the same series [11, 12]. Hence, value pairs with high correlation correspond to regular patterns in the series, i.e., encode periodicity at smaller or larger scales, according to the selected lag:
(1) 
where is the time series, and
are the mean and standard deviation,
is the lag andis the correrponding autocorrelation vector of length
. In this study, the wind variables of average speed and gust were analyzed via autocorrelation with a lag limit of against the current day, i.e., within the full 123days time frame available (four months).IiiB ARMA system identification
More generic and powerful than autocorrelation or standard linear regression alone, the
AutoRegressive Moving Average (ARMA) model is the standard approach for describing any linear digital filter or signal generator in the time domain. It is essentially a combination of an autocorrelation component that relates the current outputs to previous ones and a smoothing component than averages the inputs over a fixedsize window.(2) 
where is the input vector of size at time step , is the output vector of size (i.e., the current plus the previous ones), is the convolution kernel for the inputs, is the convolution kernel for the outputs and is the residual model error. Normally, and are vectors of scalar coefficients that can be fixed, if the model is static, or variable, if the model is adaptive (constantly “retrained”).
Both coefficient vectors, as well as their sizes, are subject to optimization of the model design according to some criterion, which typically is the minimization of the residual error . In practice, this is defined as , where is the standard Euclidean norm, is the ARMAapproximated output and is the true (measured) process output. The sizes and are the orders of the model and they are usually estimated either by informationtheoretic algorithms [12, 13, 15] or by exploiting known properties (if any) of the generating process, e.g. with regard to its periodicity. Such a model is described as ARMA(,), where AR() is the autoregressive component and MA() is the movingaverage component.
In approximation form, expanding the convolutions and estimating the current output , the analytical form of Eq.2 is:
(3) 
The error term in Eq.3 can also be expanded to multiple terms of a separate convolution kernel, similarly to and , but it is most commonly grouped into one scalar factor, i.e., with an order of one. In such cases, the model and be described as ARMA(,,) where is the order of the convolution kernel for the error term.
When applied to a signal generated by a process of unknown statistical properties, an ARMA approximation of it reveals a variety of important properties regarding this process. In practice, the (estimated) order of the AR component shows how strong the statistical coupling is between subsequent outputs, while the order of the MA component shows the “memory” of the process, i.e., how far in the past inputs the process “sees” in order to produce the current output.
IiiC Interpolation for missing values
In the selected time frame, the weather dataset for Samos contained one day of missing data (10Jan2016) due to maintenance of the weather station there. In order to fillin the missing data, interpolation was applied separately for each target wind variable, i.e., average speed, gust and direction.
More specifically, moving average of two and four points (missing point in the middle) were tested, as well as cubic spline interpolation (QS) [16], as shown in Table I for the ‘average wind speed’ parameter. The results from leaveoneout crossvalidation tests across the entire Samos data series perdimension confirmed, as expected, that QS is the most resilient to interpolation errors (RMSE), compared to the MA(2) and MA(4) methods. Moreover, using more than two points immediately adjacent to the missing one produces slightly larger interpolation error when MA is employed.
Wind Speed (km/h)  Method  Value  RMSE 

average  MA(2)  3.00  3.25 
MA(4)  3.85  3.52  
QS  2.57  2.97  
gust  MA(2)  25.75  9.81 
MA(4)  35.80  10.67  
QS  21.73  9.33 
These results hint that there may be large shortterm fluctuations present in the data series and higherorder polynomial, QS or other, has to be employed for accurate interpolation of the missing values. This is also evident by the fact that the actual interpolated values from MA(2) in Table I are much closer to the QS values, measured as overall the most accurate w.r.t. RMSE (see Eq.5), than the ones produced by MA(4). Hence, using the simple ruleofthumb that assumes only small changes from the previous 24hour time frame seems to be unsafe in practice for accurate forecasting.
The QS interpolation used in this case throughout he study is expected to be adequately accurate, namely km/h for average wind speed and about km/h for wind gust, which for the winds scale for Samos (see next section) translates to less than B for both ‘average’ and ‘gust’ parameters. Therefore, this singleday interpolated instance in the Samos data series can be considered as nonintrusive for the inherent statistics and the validity of the main experimental protocol.
Iv Experiments and Results
The following subsection present the analysis and modeling of the wind data series for the five sites, including (a) statistics per location, (b) correlation analysis and (c) ARMA for predictive modeling.
Iva Statistics per location
The following figures present the statistics of wind direction, average speed and gusts for the five sites. Figures 2, 5, 8, 11 and 14 illustrate the normalized polar histograms of dominant winds (daily average). Figures 3, 6, 9, 12 and 15 illustrate the normalized histograms of the average speed. Figures 4, 7, 10, 13 and 16 illustrate the normalized histogram of gusts. All statistics are for the time frame used in this study, i.e., 1Oct2015 to 31Jan2016 (123 days).
IvB Correlation analysis
The following figures present the pairwise autocorrelation analysis of wind average speed and gust for the five sites. In each plot, the entire dates range is used (1Oct2015 to 31Jan2016) in onestep lags. The ‘urand’ diagonal (dotted) is displayed as guideline for comparison against a Gaussian random walk with mean and standard deviation same as the corresponding gust data series.
Table II presents the pairwise correlations between the three wind data series per site. Statistical significance confirmation is indicated with bold numbers for and plain numbers for , otherwise italics indicate nonsignificant result for correlation. The same notation is used in Table III, presenting the crosssite correlations for wind average speed.
Site  avg/gust  avg/dir  gust/dir 

Chios  0.775  0.519  0.086 
Kos  0.642  0.410  0.685 
Lesvos/P  0.856  0.022  0.319 
Lesvos/T  0.739  0.837  0.452 
Samos  0.830  0.027  0.176 
Chios  Kos  Lesvos/P  Lesvos/T  Samos 

1  0.250  0.636  0.884  0.168 
1  0.277  0.225  0.216  
1  0.672  0.166  
1  0.204  
1 
IvC ARMA for predictive modeling
As describe above, ARMA can provide model identification for analysis and/or prediction. In this study, various AR, MA and ARMA designs were employed for the three wind variables (speed average, gust, direction) with a focus on forecasting future values from a historic time frame.
More specifically, in the case of AR a sequence of previous data points from the same series were used to predict future values; in the case of MA a sequence of previous and current data points from the corresponding series of other sites were used to predict future values; and in the case of ARMA these two approaches are combined. In practice, the MA part functions as input aggregator, i.e., regression against the other sites, and the AR part functions as output filter, i.e., regression against the same site’s history.
Figure 22 illustrates the results of ARMA predictive modeling for Kos’ wind average speed data series, for various AR and MA configurations. Even when using an AR component of order 7 or 10 and a MA component of 5 or 10 (historic values from the four other sites), no model seems to converge accurately in term of prediction. In contrast, Figure 23 illustrates how a similar ARMA (7,5) model achieves very efficient approximation for the Lesvos/Thermi site.
The fitness is a measure of accuracy of the approximation model and for a specific data series is defined analytically as:
(4) 
where is the true data point, is the model’s approximation, is the data series mean and is the standard Euclidean norm. For comparison, RMSE is defined as:
(5) 
From Figure 23 it is clearly evident that the ARMA(7,5) approximation is very efficient and successfully tracks the wind average speed data series very closely, using 7 previous values from the same site and 4 previous plus the current one from the other four sites. In other words, the model implements a spatiotemporal linear regressor with a total of 27 free parameters (trained) to make a 1day lookahead prediction. The exact performance of the model is , FPE=5.064 (Akaike’ final prediction error [17]) and RMSE=1.884 (km/h). The trained ARMA(7,5) model is described below:
(6) 
and:
(7) 
(8) 
(9) 
(10) 
In these polynomials, is the delay factor of the kernel, as described by the analytical form of Eq.3. Hence, the coefficient of in is essentially , i.e., the magnitude for the autoregressive factor (output) days back. The second index in refers to the corresponding site, in the same order as presented above. Since Lesvos/Thermi () is the target site, all the other sites are associated to their corresponding polynomials, i.e., Chios with Eq.7, Kos with Eq.8, Lesvos/Petra with Eq.9 and Samos with Eq.10.
V Discussion
The methods presented in this study fall under the general concept of regression, i.e., using available values from single or combined spatiotemporal data series to predict their future evolution. In this sense, even the missing data interpolation described earlier for Samos could be used as a very simple and fast approach to do this in practice. However, it is clear that this is not the optimal approach in terms of accuracy, as the results from ARMA illustrate later on. Thus, it is imperative to conduct indepth statistical and correlation analysis in each data series separately and in combination, in order to estimate their spatiotemporal dependencies and inherent complexity, which are to be used as guidelines for the design of efficient ARMA or other similar model approximations.
The polar histograms of wind speeds reveal that there are clearly dominant wind directions in all five sites: for Chios it is North, for Kos it is West, for Lesvos/Petra it is SouthEast, for Lesvos/Thermi it is West/NorthWest and for Samos it is South/SouthEast. In the three later cases the evidence is very strong, showing a very compact peak towards the source of the dominant wind direction. This is extremely useful when the wind models are to be associated statistically with corresponding sea condition models, for example the average wave height, since the local geographical morphology becomes more or less irrelevant in these approximations.
In all the sites, the histograms of the wind average speeds present an almostGaussian probability distribution function (pdf) and, thus, the corresponding mean, standard deviation, etc, for each case can be considered statistically valid. On the other hand, wind gusts in all sites is highly skewed if the same wind speed margins are used. These pdfs may also be approximated by Gaussians, but further investigation is required in terms of descriptive statistics and parametric distributions, as it is clearly evident that they include multiple peaks, large ‘flat’ ranges, etc.
The autocorrelation plots reveal the inherent dependencies between subsequent values in the data series, separated by specific time lags. In general, it is expected that nonrandom signals present a triangularshaped plot centered at the zerolag peak, which exhibits the maximum autocorrelation value. The larger the deviation from this triangular shape, the larger the stochastic factors in the signal. In other words, for a perfectly random signal of pure white noise, the corresponding plot should be ‘flat’ instead of triangularshaped. In the plots illustrated here, the wind average speeds in the five sites present smaller or larger stochastic properties. In particular, Samos seems to be the most unpredictable case, with large drops immediately before/after the zerolag peak, as well as multiple small peaks in other lag values. on the other hand, Kos seems to be the most deterministic case, as the plot is almost perfectly triangularshaped. It is expected that data series with large deviations from the ‘ideal’ triangular shape, if approximated by linear models like ARMA, will require larger convolution kernels, i.e., higherorder AR components, in order to capture longer historic sequences.
The correlations between the wind variables, illustrated in Table II, reveal that there is verified statistical dependency () between average speed and gusts, as expected, in all five sites. Furthermore, there is strong dependency () between dominant wind direction and (a) average speed in Lesvos/Thermi and (b) gust in Kos, as well as (c) average speed in Chios at a lower level (). In these cases, it is expected that corresponding sea condition models, especially average wave height and direction, will be more accurate and useful than in the other sites.
Furthermore, the crosscorrelation results, illustrated in Table III, reveal that there is a strong association () between the sites that are geographically adjacent or adequately near to each other. In particular, the average wind speed at Chios is strongly correlated with the two Lesvos sites, especially the one at Thermi. Looking at the map in Figure 1, it is clear why: the two islands are near to each other with open sea between them, especially between Lesvos/Thermi and Chios, which exhibits the largest crosscorrelation value in the Table. There is also a very strong correlation between the two sites at Lesvos, as expected. Nevertheless, the link between the two Lesvos sites is still weaker than the one between Lesvos/Thermi and Chios, probably because in the first case there is a large geographical obstacle between them  the large mount of the island of Lesvos itself. It is worth noticing that the distance across the two sites at Lesvos is about 34 km, while the distance across the Lesvos/Thermi and Chios sites is about 92 km, more than x2.7 times larger, but entirely over open sea or flat land. These quantified observations are extremely important when designing crosssite interpolation models, as only this kind of statistical analysis can reveal such unexpected and counterintuitive results regarding the weight that needs to be assigned to each spatial data node.
The results from the ARMA model examples more or less confirm the conclusions drawn from the crosscorrelations between sites. The average wind speed at Kos is difficult to predict from the values of the other four sites, as Figure 22 illustrates, because their crosscorrelations are marginally around 0.25 even for the closest one (Samos). This example was selected specifically to show that the most remote sites in terms of spatial distribution is the most difficult to forecast via linear predictive modeling like ARMA. There is still some correlation present between the sites over large geographical areas, primarily because the Aegean Sea is enclosed from three sides and spans over a basin of roughly 280 x 500 km (main area). This makes it very rare to have drastically different wind conditions over its islands, except in areas with specific geographical features, as in the case of the narrow ‘closed’ sea passage north of the Lesvos/Petra site, compared to the more ‘open’ passage east of the Lesvos/Thermi site.
The ARMA(7,5) model detailed for the Lesvos/Thermi site is a typical example of how spatiotemporal linear regressors can achieve very accurate predictions in the short term. In particular, the average wind speed that is modeled can be approximated with an expected error (RMSE) of less than km/h, which is almost 37% better than the error achieved by the cubic spline interpolation (QS) as described for the Samos data series, despite the fact that in the second case the interpolation is for an intermediate point (i.e., both sides bounded by true values) instead of 1day lookahead extrapolation (i.e., only one side bounded). Given the fact that such ARMA models are very simple to implement, based on vector operations over 30 or less parameters (trained), it is clearly evident that they can be extremely useful when analytical weather forecasts like NOAA or computationally intensive simulationbased models cannot be used in practice, e.g. as part of a mobile or web application.
Vi Conclusion
In this study, the three main wind variables (average speed, gust, direction) were investigated for five main locations that have been the most active ‘hotspots’ in terms of refugee influx in the Aegean Sea islands during the Oct/2015  Jan/2016 period.
The analysis of the threepersite data series included standard statistical analysis and parametric distributions (Gaussians), autocorrelation analysis, as well as crosscorrelation analysis between the sites. Various ARMA models were designed and trained in order to estimate the feasibility and accuracy of such spatiotemporal linear regressors for predictive analytics, compared also with standard moving average and cubic spline interpolation used for missing values.
The results proved that such datadriven statistical approaches are extremely useful in identifying unexpected and sometimes counterintuitive associations between the available spatial data nodes. It is worth noticing that such discoveries verify and quantify important semantic information that is related to special geographical features, such as narrow sea passages and large obstacles to wind flows. This is very important when designing corresponding models for shortterm forecasting of sea condition, especially average wave height and direction, which is in fact what defines the associated weather risk of crossing these passages in refugee influx patterns.
References
 [1] UNHCR, “Unhcr: Mediterranean sea arrivals – 2015 data by location, country of arrival (data).” [Online]. Available: http://data.unhcr.org/mediterranean/download.php?id=490
 [2] IOM, “Iom: Missing migrants project  mediterranean, update 12 april 2016.” [Online]. Available: http://migration.iom.int/Europe/
 [3] MSF, “Msf: Obstacle course to europe (report).” [Online]. Available: http://www.msf.gr/publications/obstaclecourseeurope
 [4] “Modeling the behavior of smuggling networks in turkey and northern Africa via refugee influx analysis for better search and rescue readiness,” (preprint), 2016. [Online]. Available: http://doi.org/10.5281/zenodo.185257
 [5] “Identification of refugee influx patterns in greece via modeltheoretic analysis of daily arrivals,” arXiv:1605.02784 [stat.ML], 2016. [Online]. Available: http://arxiv.org/abs/1605.02784
 [6] H. Georgiou, I. Kiomourtzis, and F. Alexakos, “Refugee influx analysis for ‘smart’ earlywarning systems for the rescue/relief operations in the firstreception islands (in Greek),” in New technologies for Civil Protection services (SafeEvros 2016), 2225 June 2016, Alexandroupolis, Greece, 2016. [Online]. Available: http://dx.doi.org/10.5281/zenodo.59728
 [7] N. O. of Greece, “Meteo.gr – Data feeds from local weather stations in Greece.” [Online]. Available: http://meteosearch.meteo.gr/
 [8] H. Georgiou, “GRRWL1O15J16: Daily refugee arrivals and weather data, Greece, Oct.2015  Jan.2016.” [Online]. Available: http://dx.doi.org/10.5281/zenodo.58987
 [9] MathWorks, “Mathworks matlab (r2015b).” [Online]. Available: http://www.mathworks.com/products/matlab/
 [10] U. o. W. Machine Learning Group, “Weka (v3.7.13).” [Online]. Available: http://www.cs.waikato.ac.nz/~ml/weka/
 [11] H. Hsu, Signals and Systems. McGrawHill: 1995, 1995.
 [12] R. Hamming, Digital Filters, 3rd/Ed. Dover Publications: 1989, 1989.
 [13] B. Porat, Digital Processing of Random Signals  Theory and Methods. Dover Publications: 1994, 1994.
 [14] C. Therrien, Discrete Random Signals and Statistical Signal Processing. PrenticeHall: 1992, 1992.
 [15] E. Hannan, “The estimation of the order of an arma process,” Ann. Statistics, vol. 5, pp. 170–181, 1993.
 [16] M. Spiegel, J. Liu, and S. Lipschutz, Mathematical Handbook of Formulas and Tables (4th/Ed.). McGrawHill: 2012, 2012.
 [17] L. Ljung, System Identification: Theory for the User. PrenticeHall, NJ, 1999.
Comments
There are no comments yet.