The rise of Machine Learning in hydrology and other natural sciences

by: Xiang Li, Ankush Khandelwal, Christopher Duffy, Vipin Kumar, John L. Nieber, and Michael Steinbach

In 2016, AlphaGo and its successor programs defeated human Go professionals using artificial intelligence (AI) (“AlphaGo,” n.d.). The tremendous growth in “AI,” “machine learning (ML),” and “big data” has ushered in a new era, sometimes called the “fourth industrial revolution,” which has fundamentally changed the way we live and work. Customers are targeted with more effective advertisements. Live captions in the media are more semantically accurate. Behind these scenes is the advent of machine learning.

Although the unreasonably effective predictive performance of ML models may make them appear mysterious to some (Sejnowski, 2020), they are not unintelligible to practitioners. In the simplest terms, any applicable ML model can be broken down into three components: a general model architecture, a purpose-oriented loss function, and an optimization algorithm. These components can be customized and redesigned for a specific problem. With appropriate modification, an ML algorithm can be adapted to solve a well-defined problem even in specialized science and engineering domains, provided that sufficient data are available.
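The three components named above can be made concrete with a deliberately tiny sketch. Everything in it is an illustrative assumption rather than a method from any cited study: the architecture is a one-variable linear model, the loss is the mean squared error, the optimizer is plain gradient descent, and the data are synthetic.

```python
import numpy as np

def predict(params, x):
    """Model architecture: a one-variable linear model y = w * x + b."""
    w, b = params
    return w * x + b

def mse_loss(params, x, y):
    """Loss function: mean squared error between prediction and target."""
    return np.mean((predict(params, x) - y) ** 2)

def gradient_descent(x, y, lr=0.5, steps=2000):
    """Optimization algorithm: plain gradient descent on the MSE loss."""
    params = np.zeros(2)
    for _ in range(steps):
        residual = predict(params, x) - y
        # Analytic gradient of the MSE with respect to w and b.
        grad = np.array([2.0 * np.mean(residual * x), 2.0 * np.mean(residual)])
        params = params - lr * grad
    return params

# Synthetic, noise-free data generated from y = 2x + 1 (hypothetical).
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0

w, b = gradient_descent(x, y)
print(w, b, mse_loss((w, b), x, y))
```

Swapping any one of the three functions (a deeper architecture, a different loss, a different optimizer) leaves the other two untouched, which is the sense in which the components can be customized independently.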

Indeed, in the natural sciences, ML is already having an enormous impact. For example, ML was relevant to thirteen percent of all STEM papers submitted in 2019 (“arXiv submission rate statistics,” 2020). While ML can discover complex patterns in data, it is quite distinct from the traditional scientific discovery paradigm. The data-driven pathway may not follow scientific guidelines and usually ignores the wealth of accumulated scientific knowledge. On the other hand, the knowledge-based pathway does not fully leverage the information hidden in data, because not every interesting pattern in the data can be comprehensively explained. To bridge this gap between ML and knowledge-based discovery, there is an emerging research direction named “Knowledge Guided Machine Learning” (KGML) that has captured the interest of both academia and industry. With the guidance of scientific knowledge from domain experts, the KGML framework accelerates the scientific discovery process.

In August 2020, the University of Minnesota Twin Cities held an inaugural 3-day virtual workshop, which engaged researchers worldwide to discuss the KGML framework (“1st Workshop on Knowledge Guided Machine Learning (KGML),” 2020). Hydrology was one of the natural science sessions. In that session, one presentation demonstrated a successful application of KGML to streamflow prediction. The implementation had some success at emulating the streamflow mechanism of the well-known hydrologic model SWAT (Khandelwal et al., 2020). In one small till-dominated watershed in southeast Minnesota, the SWAT-generated discharge was emulated satisfactorily when the ML model adopted concepts of hydrologic system memory, such as soil moisture and snow accumulation. As shown in Figure 1 (one year of data, for visualization), KGML improves streamflow prediction compared with an ML model that includes no physics. When the same time series are shown as a scatter plot, the KGML predictions clearly match the SWAT synthetic data more consistently. Over the whole testing period, the NSE score improves from 0.57 to 0.76 when KGML is implemented.
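For readers unfamiliar with the metric, NSE here presumably refers to the Nash-Sutcliffe efficiency, a standard goodness-of-fit score in hydrology: one minus the ratio of the squared prediction error to the variance of the reference series. A score of 1 is a perfect match, and a score of 0 means the prediction is no better than the mean of the reference. The sketch below is a minimal implementation under that assumption; the function name and the sample values are hypothetical and are not data from the study.

```python
import numpy as np

def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual_ss = np.sum((observed - simulated) ** 2)
    variance_ss = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual_ss / variance_ss

# Hypothetical reference discharge series (arbitrary units).
reference = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A perfect emulation scores 1; predicting the mean everywhere scores 0.
perfect_score = nash_sutcliffe_efficiency(reference, reference)
mean_only = np.full_like(reference, reference.mean())
mean_score = nash_sutcliffe_efficiency(reference, mean_only)
print(perfect_score, mean_score)
```

In the emulation setting described above, the "observed" argument would be the SWAT synthetic discharge and the "simulated" argument the ML or KGML prediction.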


Emulation of the SWAT model in the South Branch of the Root River at Garden Meadow in SE MN. KGML (blue solid line) improves on the performance of the pure ML model (orange dashed line). Note that ‘observation’ is the SWAT synthetic data.

Although still at an early stage, both ML and KGML show remarkable potential in hydrology. Scientific discovery and the understanding of complex hydrologic systems await help from these epoch-making data-driven methods. In addition to this result, broad application successes of ML across domains were also discussed at KGML 2020, including applications in weather forecasting, lake modeling, and cancer diagnosis. This leads to the question: “When so many disciplines embrace the rise of ML, how should hydrologists adapt to this burgeoning trend in such a short time?”

The International Association of Hydrological Sciences (IAHS) proposed dedicating 2012-2022 to “Prediction Under Change” (PUC) (Sivapalan, 2011). Traditional physically based hydrologic models do not yield satisfactory performance in every basin. Although the growing use of ML in hydrology has been recognized, it bears emphasis that pure ML will not replace process-based models, because ML is ill-suited to situations where data are scarce. However, incorporating principles of hydrology (e.g., conservation of mass and conservation of energy) could potentially overcome this disadvantage. Consequently, KGML, which implements ML under the guidance of hydrologic knowledge, is an appropriate candidate approach, although it requires significant collaborative research between data scientists and hydrologists. This collaboration will become more effective and beneficial when hydrology graduate students and researchers grasp the fundamentals of ML, because coupling physically based hydrologic models with data-driven ML models will be a future research direction for understanding complicated watershed systems. At the same time, as interdisciplinary research continues to advance the modeling of complex hydrologic systems with the assistance of ML, it is anticipated that more definitive answers to the PUC theme will gradually develop within this decade.
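One common way to bring a principle such as conservation of mass into an ML model is to add a penalty to the loss function whenever a prediction violates the water balance. The sketch below illustrates that general idea only; it is not the formulation used in the study cited above. The function name, the simplified balance (precipitation minus evapotranspiration minus discharge equals the change in storage), the penalty weight, and the numbers are all assumptions made for illustration.

```python
import numpy as np

def knowledge_guided_loss(q_pred, q_obs, precip, evap, storage_change, weight=1.0):
    """Data misfit plus a penalty on violations of a simple water balance."""
    # Data term: ordinary mean squared error on discharge.
    data_term = np.mean((q_pred - q_obs) ** 2)
    # Knowledge term: residual of precip - evap - discharge = storage change.
    imbalance = precip - evap - q_pred - storage_change
    physics_term = np.mean(imbalance ** 2)
    return data_term + weight * physics_term

# Hypothetical daily values in consistent depth units (for example, mm).
precip = np.array([5.0, 0.0, 2.0])
evap = np.array([1.0, 1.0, 1.0])
storage_change = np.array([1.0, -2.0, 0.0])

# Discharge that closes the water balance exactly.
q_balanced = precip - evap - storage_change

loss_balanced = knowledge_guided_loss(
    q_balanced, q_balanced, precip, evap, storage_change)
loss_biased = knowledge_guided_loss(
    q_balanced + 1.0, q_balanced, precip, evap, storage_change, weight=0.5)
print(loss_balanced, loss_biased)
```

Because the penalty term needs no discharge observations, it can in principle also be evaluated on periods or basins where observations are missing, which is one reason such constraints are attractive when data are scarce.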

Full published study and references in the summer bulletin of the American Institute of Hydrology