References

The references below list prior publications and useful reading, both on analysis in general and on each algorithm.

Understanding the algorithms in H2O is an integral part of using the platform correctly and getting the most out of an analysis.

Below are citations of seminal articles, along with articles that demonstrate rigorous application of the algorithms in H2O. This list is not meant to be exhaustive; rather, it provides an abbreviated syllabus to help develop a strong understanding.

GLM

Breslow, N E. “Generalized Linear Models: Checking Assumptions and Strengthening Conclusions.” Statistica Applicata 8 (1996): 23-41.

Goldberger, Arthur S. “Best Linear Unbiased Prediction in the Generalized Linear Regression Model.” Journal of the American Statistical Association 57.298 (1962): 369-375. http://people.umass.edu/~bioep740/yr2009/topics/goldberger-jasa1962-369.pdf

Guisan, Antoine, Thomas C Edwards Jr, and Trevor Hastie. “Generalized Linear and Generalized Additive Models in Studies of Species Distributions: Setting the Scene.” Ecological Modelling 157.2 (2002): 89-100. http://www.stanford.edu/~hastie/Papers/GuisanEtAl_EcolModel-2003.pdf

Nelder, John A, and Robert WM Wedderburn. “Generalized Linear Models.” Journal of the Royal Statistical Society. Series A (General) (1972): 370-384. http://biecek.pl/MIMUW/uploads/Nelder_GLM.pdf

Snee, Ronald D. “Validation of Regression Models: Methods and Examples.” Technometrics 19.4 (1977): 415-428.

Poisson

Frome, E L. “The Analysis of Rates Using Poisson Regression Models.” Biometrics (1983): 665-674. http://www.csm.ornl.gov/~frome/BE/FP/FromeBiometrics83.pdf

Logistic (binomial and multinomial)

Press, S James, and Sandra Wilson. “Choosing Between Logistic Regression and Discriminant Analysis.” Journal of the American Statistical Association 73.364 (1978): 699-705. http://www.statpt.com/logistic/press_1978.pdf

Pearce, Jennie, and Simon Ferrier. “Evaluating the Predictive Performance of Habitat Models Developed Using Logistic Regression.” Ecological Modelling 133.3 (2000): 225-245. http://www.whoi.edu/cms/files/Ecological_Modelling_2000_Pearce_53557.pdf

GBM

Dietterich, Thomas G, and Eun Bae Kong. “Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms.” ML-95 255 (1995).

Elith, Jane, John R Leathwick, and Trevor Hastie. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77.4 (2008): 802-813.

Friedman, Jerome H. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics (2001): 1189-1232.

Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of Boosting Papers.” The Annals of Statistics 32 (2004): 102-107.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “Additive Logistic Regression: A Statistical View of Boosting (With Discussion and a Rejoinder by the Authors).” The Annals of Statistics 28.2 (2000): 337-407. http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1016218223

Neural Networks

Baldi, Pierre, and Kurt Hornik. “Neural Networks and Principal Component Analysis: Learning From Examples Without Local Minima.” Neural Networks 2.1 (1989): 53-58.

Coolen, A C C. Concepts for Neural Networks. N.p.: Springer, 1998. 13-70.

Tweedie

Dunn, Peter K. “Occurrence and Quantity of Precipitation Can Be Modelled Simultaneously.” International Journal of Climatology 24.10 (2004): 1231-1239.

K-Means

Napoleon, D, and S Pavalakodi. “A New Method for Dimensionality Reduction Using KMeans Clustering Algorithm for High Dimensional Data Set.” International Journal of Computer Applications 13.7 (2011): 41-46.

Xiong, Hui, Junjie Wu, and Jian Chen. “K-means Clustering Versus Validation Measures: A Data-distribution Perspective.” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39.2 (2009): 318-331.