Honors Program

University Honors

Date of Award

5-2025

Thesis Professor(s)

JeanMarie Hendrickson

Thesis Professor Department

Mathematics and Statistics

Thesis Reader(s)

Michael Garrett

Abstract

Machine learning employs statistical algorithms to identify patterns in data and make predictions from them. This study applies machine learning techniques to data from Major League Baseball (MLB) teams between 1998 and 2024, with two goals: determining which factors most strongly influence a team's likelihood of reaching the postseason, and accurately predicting which teams do and do not qualify. Data exploration and unsupervised machine learning methods, specifically hierarchical and K-Means clustering, were used to identify underlying patterns in team performance metrics, group similarly performing teams, and flag the variables with the greatest influence on team performance. Several supervised learning methods were then used to develop predictive models. The dataset was randomly divided into a training set, used to fit the predictive models, and a test set, used to evaluate their accuracy. Logistic Regression, KNN Classification, Linear Discriminant Analysis, Non-linear Functions, Decision Trees, Bagging, and Random Forest models were constructed to classify teams and evaluate the importance of the various predictors. Results indicate that ERA+, Runs Allowed, OPS+, and Runs Scored are the most significant contributors to postseason qualification, which underscores the importance of balancing offensive and defensive strength. A Random Forest model was selected and used to predict the outcomes of the 2025 season based on early-season data. By identifying key performance indicators, this research offers insight into the critical contributors to reaching the postseason in MLB. It also demonstrates the value of machine learning in sports analytics and highlights how different methods can handle complex data to support strategic decision-making and forecasting in professional baseball.
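
The train/test workflow described in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the data are synthetic, the qualification rule is invented for illustration, and the helper names (make_synthetic_teams, train_test_split, knn_predict, accuracy) are hypothetical. It uses KNN classification, one of the methods the abstract lists, on two made-up predictors standing in for ERA+ and OPS+.

```python
import math
import random


def make_synthetic_teams(n, rng):
    """Generate hypothetical ((era_plus, ops_plus), made_postseason) rows."""
    rows = []
    for _ in range(n):
        era_plus = rng.gauss(100, 10)
        ops_plus = rng.gauss(100, 10)
        # Invented rule (an assumption, not a finding): teams that are
        # above average on the combined index qualify, with some noise.
        label = 1 if era_plus + ops_plus + rng.gauss(0, 5) > 205 else 0
        rows.append(((era_plus, ops_plus), label))
    return rows


def train_test_split(rows, test_fraction, rng):
    """Randomly divide rows into a training set and a test set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training rows."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0


def accuracy(train, test, k):
    """Fraction of test rows whose label the classifier predicts correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


rng = random.Random(0)  # fixed seed so the split is reproducible
teams = make_synthetic_teams(300, rng)
train, test = train_test_split(teams, 0.25, rng)
test_accuracy = accuracy(train, test, k=5)
print(f"test accuracy: {test_accuracy:.2f}")
```

Evaluating on rows held out of training is what lets the reported accuracy estimate performance on seasons the model has not seen. The same split could be reused to compare the other classifiers named in the abstract.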

Publisher

East Tennessee State University

Document Type

Honors Thesis - Open Access

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

Copyright

Copyright by the authors.

Included in

Data Science Commons
