1 Introduction
Floods are the most common and deadly natural disaster in the world. Every year, floods cause thousands to tens of thousands of fatalities [cred; jonkman2003loss; unisdr; jonkman2005global; doocy2013human], affect hundreds of millions of people [doocy2013human; jonkman2005global; unisdr], and cause tens of billions of dollars in economic damages [cred; unisdr]. Sadly, these numbers have only been increasing in recent decades [loster1999flood]. Indeed, the UN charter notes floods to be one of the key motivators for the formulation of the sustainable development goals (SDGs), and directly challenges us: "They knew that earthquakes and floods were inevitable, but that the high death tolls were not" [undp].
Early warning systems, even with limited lead time and imperfect accuracy, have been shown to reduce both fatalities and economic damages by more than a third, and in some cases by almost half [who; pilon1998guidelines; worldbank]. Unfortunately, most of the human cost of flooding is concentrated in developing countries [doocy2013human], which often lack effective and actionable early warning systems due to limited data collection, funding, or professional expertise [stromberg2007natural]. The result is that, across multiple countries, thousands die on average every year, and relief and mitigation efforts have very limited information to rely on.
In this work, as part of our broader efforts in flood forecasting [nevo2018ml], we focus on riverine floods, which are responsible for much of the effect on human life. Existing hydrologic methodology for building flood prediction models relies heavily on in-situ infrastructure such as costly, extensive gauging systems [worldbank], and on local adaptation of the models, which requires highly trained professionals [anderson2002calibration]. Providing value where it matters most thus requires overcoming several challenges. First, we would like to reduce reliance on in-situ measurements such as extensive gauging sites constructed along the modeled river. Relevant data is constantly being produced at immense scale across the globe, but the vast majority of it comes not from in-situ measurements but from sources such as satellite imagery. Leveraging even small parts of this data has the potential to substantially improve flood prediction models. Second, to cover large areas in developing regions, we must automate and scale up the model-building methodology and reduce its reliance on the human factor. Third, it is a sad paradox that populations in low-means areas cannot afford to respond to a low-precision system, and thus, to make a positive impact in such areas, we require improved predictive power.
The field of machine learning (ML) has transformed many aspects of our lives, and is naturally geared to cope with the above challenges. Improving prediction, leveraging multiple signals that are difficult for a human expert to grasp, and automating human processes are all characteristics of effective ML systems. The first critical step toward building such systems is to provide global-scale estimates of the water discharge (volume per second) through the cross sections of a river, which can then be used to train early warning predictive models. As noted, such in-situ measurements are unavailable more often than not, and thus our first goal is to perform remote discharge estimation, i.e., estimating the discharge based on remote measurements (usually satellite data) [smith1997satellite].
Our concrete goal is thus to create a prediction model that, using few measurements from a set of river locations, will be able to generalize to all locations. Intuitively, this should be possible because the multiple prediction problems (one for each location) are related: the underlying physical mechanism that relates satellite measurements to water level is identical, and each local measurement, where it exists, gives us a "clue" as to the nature of this shared mechanism. This general setting of leveraging information about some tasks to assist in the learning of models for other tasks has a long history in machine learning: inductive transfer, transfer learning and multitask learning are all closely related variants of the framework (see, e.g., [Thrun:1996; InductiveTransfer:1997; Caruana:1997; Baxter:2000] for some of the early influential works and [Pan:2010] for a more recent survey).
We consider a simple but powerful regression model where the coefficients are composed of two components: a local one that allows us to adapt to the characteristics of the local site, and a shared one that allows us to capture the global water discharge mechanism. It is this shared component that can benefit from transfer learning. A similar formal setting is explored in the highly cited work of [ando2005framework], where the task is called structural learning, pointing to the common shared structure learned (footnote 1: the focus of that work, on transferring from unlabeled to labeled tasks, is different from ours, but the formal underpinning is identical). As they show, using the empirical risk minimization (ERM) principle, it is provably beneficial to learn from multiple tasks, from a statistical sample complexity perspective. A recent work [yuan2017spectral] also shows empirically that this shared regression approach can be useful for multispectral imagery classification. The computational and optimization question, "Can we efficiently learn such a model?", is left unanswered. In this work, we show that the answer to this question, at least from an optimization viewpoint, is in the affirmative.
The target objective of this common mechanism regression (CMR) is nonconvex and may have spurious local minima. Our main contribution is to show that, given enough independent tasks, we can efficiently find its global optimum. For this purpose, we extend the ideas in [netrapalli2013phase; candes2015phase] to CMR with multiple regressions. We begin with a spectral initialization with provable near-optimal accuracy, and then refine it using standard descent methods.
In the context of remote discharge estimation, our learning goal is to capture the common discharge mechanism that relates satellite measurements from multiple spectral bands to water levels. Naturally, we do not have access to the true mechanism (or we would not need to learn it). However, we can simulate such mechanisms and assess the merit of our approach when the ground truth is known. Using such simulations, we demonstrate the effectiveness of our approach for transfer learning: sharing measurements from individual sites allows us to jointly improve the average predictive performance across all of them.
2 Common Mechanism Regression (CMR)
Our model consists of independent regressions that share a common mechanism. For simplicity, we assume that each regression has exactly pairs of labels and features
(1) 
where are scalar labels, and are matrix observations (footnote 2: we use for bands and for pixels in the context of discharge estimation, but the setting is general). Our common mechanism regression (CMR) involves a two-phase approach: a common mechanism parameterized by
followed by decoupled local linear regressions denoted by
:(2) 
Note that the overall structure is linear in the features, but has a bilinear parameterization. Our main goal is to recover the common parameter and, if possible, we would also like to identify the local ’s. In particular, we are interested in the scenario when is large but is small, so that we have many regression problems but few observations for each one. Each regression, if estimated independently, requires at least samples. By introducing a common mechanism where is shared across the different sites, we allow , and also address the case of where exact recovery of is impossible.
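As a reading aid, the bilinear structure described above can be written in one possible notation. Every symbol name below is an assumption introduced for illustration: $n$ sites, $m$ samples per site, feature matrices $X_{ij}$ of size $b \times p$ (bands by pixels), a shared vector $u$ of length $b$, and local vectors $v_i$ of length $p$.

```latex
% Assumed notation (illustrative only): n sites, m samples per site,
% X_{ij} \in \mathbb{R}^{b \times p}, shared u \in \mathbb{R}^{b}, local v_i \in \mathbb{R}^{p}.
y_{ij} \;\approx\; u^{\top} X_{ij}\, v_i
       \;=\; \langle u v_i^{\top},\, X_{ij} \rangle ,
\qquad i = 1,\dots,n, \quad j = 1,\dots,m .
```

Under this reading, the prediction is linear in $X_{ij}$ for fixed parameters and bilinear in $(u, v_i)$. The pair is determined only up to scale, since $(c\,u,\; v_i / c)$ gives the same prediction for any nonzero $c$. The shared model has $b + np$ parameters, compared with $nbp$ for $n$ unconstrained linear regressions on the vectorized features.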
The CMR model is natural for river discharge estimation using remote sensing. Specifically, in multispectral imaging, the data matrices are defined by spectral and spatial dimensions. A reasonable approach to discharge estimation is thus to use the spectral information to identify water pixels and then apply spatial regression. The classical technique for water identification is via a common nonlinear spectral feature known as the Normalized Difference Water Index (NDWI) [mcfeeters1996use] (footnote 3: more advanced indices are reviewed in [isikdogan2017surface]). This index is the motivation for CMR, which automatically learns a data-driven feature defined by the weights of . In what follows, we will show that linear CMR outperforms the nonlinear NDWI.
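For reference, the NDWI of McFeeters is the normalized difference between the green and near-infrared (NIR) reflectances. The sketch below computes it with NumPy. The function names, the small `eps` guard against division by zero, and the use of a zero threshold to flag water pixels are illustrative assumptions, not details given in this paper.

```python
import numpy as np

def ndwi(green, nir, eps=1e-12):
    """McFeeters NDWI: (green - nir) / (green + nir), computed per pixel."""
    green = np.asarray(green, dtype=float)
    nir = np.asarray(nir, dtype=float)
    # eps is an assumption added here to avoid 0/0 on empty pixels.
    return (green - nir) / (green + nir + eps)

def water_fraction(green, nir, threshold=0.0):
    """Fraction of pixels whose NDWI exceeds the threshold (hypothetical helper)."""
    return float(np.mean(ndwi(green, nir) > threshold))

# Toy 2x2 patch: the left column looks like water, the right column like land.
green = np.array([[0.3, 0.1], [0.3, 0.1]])
nir = np.array([[0.1, 0.4], [0.1, 0.4]])
print(ndwi(green, nir))
print(water_fraction(green, nir))
```

A scalar such as the water fraction could then serve as the single input of a per-site regression, which is the kind of fixed, hand-designed spectral feature that CMR replaces with learned band weights.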
We propose to recover the parameters as the solution to the following regularized bilinear least squares optimization:
(3) 
Due to its bilinear structure, CMR involves a nonconvex minimization. Naive descent techniques may therefore converge to spurious local minima. Interestingly, CMR is similar to phase retrieval problems where it was recently shown that these bad critical points can be avoided via clever initialization schemes [netrapalli2013phase; candes2015phase]. Adaptation of these ideas to CMR leads to the following common spectral initialization:
where
is the eigenvector corresponding to the largest eigenvalue. From here, we continue with standard descent methods, e.g., gradient descent or alternating least squares, until convergence. Together, the computational complexity of this approach is linear in
.
Under standard assumptions, the proposed spectral initialization can recover the true with high accuracy. Like [hardt2016identity], we consider the realizable case with normal features and assume an exact CMR model with no noise. We also assume random local regressors, i.e., we model
as i.i.d. realizations of an arbitrary probability distribution. This last assumption is specific to our work and is required in order to model multiple regression problems with common characteristics.
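To make the idea of a spectral initialization concrete, here is a minimal sketch under stated assumptions. It assumes the noise-free bilinear model y_ij = u^T X_ij v_i with standard normal features, and it uses one natural estimator: the top eigenvector of the averaged outer products of the per-site label-weighted feature means. This specific formula, along with all names (`simulate_cmr`, `spectral_init`, `squared_correlation`) and problem sizes, is an assumption for illustration and may differ from the initialization used by the authors.

```python
import numpy as np

def simulate_cmr(n, m, b, p, rng):
    """Draw noise-free data from the assumed model y_ij = u^T X_ij v_i."""
    u = rng.standard_normal(b)
    u /= np.linalg.norm(u)                    # shared unit-norm vector
    V = rng.standard_normal((n, p))           # one random local vector per site
    X = rng.standard_normal((n, m, b, p))     # Gaussian feature matrices
    y = np.einsum("b,ijbp,ip->ij", u, X, V)   # labels, shape (n, m)
    return X, y, u, V

def spectral_init(X, y):
    """Top eigenvector of (1/n) * sum_i W_i W_i^T, with W_i = (1/m) * sum_j y_ij X_ij."""
    n, m = y.shape
    W = np.einsum("ij,ijbp->ibp", y, X) / m   # (n, b, p); its mean is u v_i^T
    M = np.einsum("ibp,icp->bc", W, W) / n    # (b, b) symmetric matrix
    _, eigvecs = np.linalg.eigh(M)            # eigenvalues come back in ascending order
    return eigvecs[:, -1]

def squared_correlation(a, c):
    """Squared cosine between two vectors; invariant to sign and scale."""
    return float(np.dot(a, c) ** 2 / (np.dot(a, a) * np.dot(c, c)))

rng = np.random.default_rng(0)
# Fewer samples per site (m = 6) than local parameters (p = 8).
X, y, u_true, _ = simulate_cmr(n=1000, m=6, b=5, p=8, rng=rng)
u0 = spectral_init(X, y)
print(squared_correlation(u0, u_true))
```

For Gaussian features, the expectation of y_ij X_ij is the rank-one matrix u v_i^T, so each W_i W_i^T has a component along u u^T plus a term that shrinks as m grows. This is why the estimate is defined only up to sign and why the squared correlation is a convenient quality measure.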
Theorem Under the above assumptions, there exists a constant such that if , then with probability of at least .
The theorem quantifies the improved performance when increasing or via their product. To prove the theorem, we show that
(4) 
where and are positive constants that depend on the distribution of
. Thus, its principal eigenvector is the true parameter. Using the fact that the variance of
decays with , we show that concentrates around its mean as and increase.
3 Numerical experiments
We start by assessing the merit of our CMR approach for discharge estimation using synthetic simulations. Recall that our goal is to leverage measurements from many locations to improve prediction. Thus, we consider the performance of CMR for a range of values of (the number of sites) and (the number of samples per site). For each set of values, we repeat the following 50 times: choose a random and , run the CMR algorithm, and declare success if the squared correlation between the true and its estimate exceeds . We do this with and without the spectral initialization. The results for and are summarized in the figure below.
As expected, the results demonstrate that CMR recovers with few samples for many sites, i.e., when . Interestingly, we also succeed in recovering when , a setting where it is impossible to recover . The left and right panels illustrate the importance of the initialization, which substantially widens the range of settings for which CMR succeeds with high probability.
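The success-rate protocol described above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the noise-free model y_ij = u^T X_ij v_i with Gaussian features, uses plain unregularized alternating least squares as the descent method, and takes a success threshold of 0.9 as a placeholder. All function names and sizes are hypothetical. Because the refinement here has no regularization, it is only meaningful when each site has more samples than local parameters (m > p); with m <= p each site can generically be fit exactly for any shared vector, so the unregularized objective no longer identifies it.

```python
import numpy as np

def simulate(n, m, b, p, rng):
    """Noise-free data from the assumed bilinear model."""
    u = rng.standard_normal(b)
    u /= np.linalg.norm(u)
    V = rng.standard_normal((n, p))
    X = rng.standard_normal((n, m, b, p))
    y = np.einsum("b,ijbp,ip->ij", u, X, V)
    return X, y, u

def spectral_init(X, y):
    """Assumed initializer: top eigenvector of sum_i W_i W_i^T, W_i = mean_j y_ij X_ij."""
    W = np.einsum("ij,ijbp->ibp", y, X) / y.shape[1]
    M = np.einsum("ibp,icp->bc", W, W)
    return np.linalg.eigh(M)[1][:, -1]

def als(X, y, u, iters=20):
    """Alternate exact least squares over the local vectors and the shared vector."""
    n, m, b, p = X.shape
    for _ in range(iters):
        Z = np.einsum("ijbp,b->ijp", X, u)                     # per-site features given u
        V = np.stack([np.linalg.lstsq(Z[i], y[i], rcond=None)[0] for i in range(n)])
        A = np.einsum("ijbp,ip->ijb", X, V).reshape(n * m, b)  # pooled features given V
        u = np.linalg.lstsq(A, y.reshape(-1), rcond=None)[0]
        u /= np.linalg.norm(u)                                 # fix the scale ambiguity
    return u

def success_rate(n, m, b, p, use_spectral, trials=50, threshold=0.9, seed=0):
    """Fraction of random trials in which the squared correlation exceeds the threshold."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(trials):
        X, y, u_true = simulate(n, m, b, p, rng)
        u0 = spectral_init(X, y) if use_spectral else rng.standard_normal(b)
        u_hat = als(X, y, u0)
        wins += (u_hat @ u_true) ** 2 > threshold              # both vectors have unit norm
    return wins / trials
```

Calling `success_rate` over a grid of `n` and `m`, once with `use_spectral=True` and once with `use_spectral=False`, gives the two panels of a comparison like the one described above.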
We now evaluate the merit of our CMR approach for the predictive task of discharge estimation in a real-world setting. We use images from the Landsat-8 mission [roy2014landsat], which include 11 spectral bands each, and ground truth labels from the United States Geological Survey (USGS) website. The results were generated using river gauge sites with temporal samples each. For every cross-validation fold, the temporal samples were split into train and test, and the CMR results were compared with the NDWI per-site regression. The average mean squared errors, normalized per site, of randomly shuffled fold cross validation repeated 4 times are given in the following table:
Method  Train  Test
NDWI    0.54   0.70
CMR     0.47   0.65
As can be seen, there is a clear advantage to learning the shared component of the CMR model. Appealingly, the advantage also holds on held-out test data, despite the expressiveness of the CMR model, which also allows for local components.
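The exact per-site normalization behind the table is not specified here. One plausible reading, shown below purely as an assumption, divides each site's mean squared error by the variance of that site's labels and then averages over sites, so that a predictor that always outputs the site mean scores 1 and a perfect predictor scores 0. The function name and array layout are hypothetical.

```python
import numpy as np

def normalized_site_mse(y_true, y_pred):
    """Average over sites of (site MSE) / (site label variance).

    y_true, y_pred: arrays of shape (n_sites, n_samples).
    The variance normalization is an assumed convention, not one stated in the text.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2, axis=1)   # one error per site
    var = np.var(y_true, axis=1)                    # one scale per site
    return float(np.mean(mse / var))

# Two sites with very different discharge scales contribute equally after normalization.
y_true = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
site_means = np.repeat(y_true.mean(axis=1, keepdims=True), 3, axis=1)
print(normalized_site_mse(y_true, site_means))
print(normalized_site_mse(y_true, y_true))
```

Normalizing per site keeps rivers with large discharge from dominating the average, which matters when errors are pooled across many gauges.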
4 Summary and Future Directions
In this work, we proved that, despite the nonconvex nature of the learning objective, the common mechanism regression (CMR) model can be globally optimized using a spectral initialization combined with standard descent. We also demonstrated the efficacy of the approach for the challenge of discharge estimation where we have few measurements for many river sites.
On the modeling front, it would be useful to generalize CMR so as to allow for robust and task-normalized loss functions. Another interesting direction is to inject nonlinearity into CMR to make it even more competitive with the nonlinear, physically motivated NDWI approach. On the practical discharge estimation front, we plan to aggregate multiple data sources (e.g., additional types of satellites, weather data) within the CMR framework.
References
 (1) The Centre for Research on the Epidemiology of Disasters (CRED)  Natural Disasters 2017. https://cred.be/sites/default/files/adsr_2017.pdf, 2017. [Online; accessed 30-09-2018].
 (2) United Nations Development Programme (UNDP)  Sustainable Development Goals. https://shop.undp.org/pages/thesustainabledevelopmentgoals, 2015. [Online; accessed 30-09-2018].
 (3) United Nations Office for Disaster Risk Reduction (UNISDR)  The Human Cost of Weather Related Disasters. https://www.unisdr.org/2015/docs/climatechange/COP21_WeatherDisastersReport_2015_FINAL.pdf, 2015. [Online; accessed 30-09-2018].
 (4) World Health Organization (WHO)  Global Report on Drowning. http://www.who.int/violence_injury_prevention/global_report_drowning/Final_report_full_web.pdf, 2014. [Online; accessed 30-09-2018].
 (5) World Bank  Global Assessment Report on Costs and Benefits of Early Warning Systems. https://www.preventionweb.net/english/hyogo/gar/2011/en/bgdocs/Rogers_&_Tsirkunov_2011.pdf, 2011. [Online; accessed 30-09-2018].
 (6) Eric A Anderson. Calibration of conceptual hydrologic models for use in river forecasting. Office of Hydrologic Development, US National Weather Service, Silver Spring, MD, 2002.
 (7) Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
 (8) Jonathan Baxter. A model of inductive bias learning. J. Artif. Int. Res., 12(1):149–198, 2000.
 (9) Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
 (10) Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
 (11) Thomas G. Dietterich, Lorien Pratt, and Sebastian Thrun. Special issue on inductive transfer. Machine Learning, 28(1), 1997.
 (12) Shannon Doocy, Amy Daniels, Catherine Packer, Anna Dick, and Thomas D Kirsch. The human impact of earthquakes: a historical review of events 1980-2009 and systematic literature review. PLoS currents, 5, 2013.
 (13) Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

 (14) Furkan Isikdogan, Alan C Bovik, and Paola Passalacqua. Surface water mapping by deep learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(11):4909–4918, 2017.
 (15) Sebastiaan N Jonkman. Global perspectives on loss of human life caused by floods. Natural hazards, 34(2):151–175, 2005.
 (16) SN Jonkman. Loss of life caused by floods: an overview of mortality statistics for worldwide floods. DC12336, 2003.
 (17) Thomas Loster. Flood trends and global change. In Proceedings IIASA Conf on Global Change and Catastrophe Management: Flood Risks in Europe, 1999.
 (18) Stuart K McFeeters. The use of the normalized difference water index (ndwi) in the delineation of open water features. International journal of remote sensing, 17(7):1425–1432, 1996.
 (19) Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796–2804, 2013.
 (20) Sella Nevo, Vova Anisimov, Gal Elidan, Ran El-Yaniv, Pete Giencke, Yotam Gigi, Avinatan Hassidim, Zach Moshe, More Schlesinger, Guy Shalev, Ajai Tirumali, Ami Weisel, Oleg Zlydenko, and Yossi Matias. ML for flood forecasting at scale. In Proceedings of the NeurIPS AI for Social Good Workshop, 2018.
 (21) Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 2010.
 (22) Paul J Pilon et al. Guidelines for reducing flood losses. In Guidelines for reducing flood losses. Naciones Unidas, 1998.
 (23) David P Roy, MA Wulder, Thomas R Loveland, CE Woodcock, RG Allen, MC Anderson, D Helder, JR Irons, DM Johnson, R Kennedy, et al. Landsat-8: Science and product vision for terrestrial global change research. Remote sensing of Environment, 145:154–172, 2014.
 (24) Laurence C Smith. Satellite remote sensing of river inundation area, stage, and discharge: A review. Hydrological processes, 11(10):1427–1439, 1997.
 (25) David Strömberg. Natural disasters, economic development, and humanitarian aid. Journal of Economic perspectives, 21(3):199–222, 2007.
 (26) Sebastian Thrun. Is learning the nth thing any easier than learning the first? In Advances in Neural Information Processing Systems, pages 640–646, 1996.
 (27) Haoliang Yuan and Yuan Yan Tang. Spectral–spatial shared linear regression for hyperspectral image classification. IEEE transactions on cybernetics, 47(4):934–945, 2017.