Canonical Variable Selection for Ecological Modeling of Fecal Indicators


More than 270,000 km of rivers and streams are impaired due to fecal pathogens, creating an economic and public health burden. Fecal indicator organisms such as Escherichia coli are used to determine whether surface waters are pathogen-impaired, but they do not identify human health risks, do not provide source information, and lack unique fate and transport processes. Statistical and machine learning models can overcome some of these weaknesses, including by identifying ecological mechanisms that influence fecal pollution. In this study, canonical correlation analysis (CCorA) was used to select parameters for the machine learning model Maxent, which was then used to identify how chemical and microbial parameters can predict E. coli impairment and F+-somatic bacteriophage detections. Models were validated using bootstrap cross-validation. Three suites of models were developed: initial models using all parameters, models using only the parameters identified by CCorA, and optimized models produced by further sensitivity analysis. For E. coli, CCorA reduced the number of parameters while matching the accuracy of the initial model (84.7%), and sensitivity analysis improved accuracy to 86.1%. Bacteriophage model accuracies were 79.2%, 70.8%, and 69.4% for the initial, CCorA, and optimized models, respectively, which suggests that the complex ecological interactions of bacteriophages are not captured by CCorA. Results indicate that the ecological drivers of impairment differ depending on the fecal indicator organism used. Escherichia coli impairment is driven by increased hardness and microbial activity, whereas bacteriophage detection is inhibited by high levels of coliforms in sediment. Both indicators were influenced by organic pollution and phosphorus limitation.
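The abstract does not say how the bootstrap validation was set up, so the following is only a minimal sketch of one common variant: train on a resample drawn with replacement and score accuracy on the samples left out of that resample (out-of-bag). The function names, the toy data, and the threshold classifier are all hypothetical stand-ins; they are not Maxent and not the study's data.

```python
import random


def bootstrap_accuracy(samples, fit, predict, n_rounds=200, seed=0):
    """Mean out-of-bag accuracy over bootstrap resamples (assumed scheme).

    samples: list of (feature, label) pairs
    fit:     callable taking a list of samples and returning a model
    predict: callable taking (model, feature) and returning a label
    """
    rng = random.Random(seed)
    n = len(samples)
    scores = []
    for _ in range(n_rounds):
        # Draw n training indices with replacement.
        train_idx = [rng.randrange(n) for _ in range(n)]
        held_out = set(range(n)) - set(train_idx)
        if not held_out:
            continue  # nothing left to test on in this round
        model = fit([samples[i] for i in train_idx])
        hits = sum(predict(model, samples[i][0]) == samples[i][1]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


def fit_threshold(train):
    """Toy classifier: threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        # Resample contained one class only; fall back to the overall mean.
        xs = [x for x, _ in train]
        return sum(xs) / len(xs)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    """Label a sample as impaired (1) when its feature exceeds the threshold."""
    return 1 if x > threshold else 0


# Made-up, well-separated data: one feature, label 1 = "impaired".
toy = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(20, 30)]
accuracy = bootstrap_accuracy(toy, fit_threshold, predict_threshold)
print(f"mean out-of-bag accuracy: {accuracy:.3f}")
```

Passing `fit` and `predict` as callables keeps the resampling loop independent of the model, so the same loop could in principle wrap any presence/absence classifier when comparing model suites built on different parameter sets.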