Handling Non-Normal Data in Sport Analytics: The Application of Box-Cox Transformation to MLB and LPGA Data

Abstract

Advancements in Information and Communication Technology (ICT) and big data have significantly transformed sports analytics, enabling the collection of complex, multidimensional datasets. However, sports data often exhibit non-normal distributions, skewness, and outliers, which pose challenges for linear models used in association analysis. This study evaluated the effectiveness of the Box–Cox transformation in addressing these issues using ICT-based sports datasets from Major League Baseball (MLB) and the Ladies Professional Golf Association (LPGA). Dependent variable distributions, regression model performance, and residual patterns were compared before and after the transformation. The Box–Cox transformation effectively reduced skewness and improved normality, ensuring that key regression assumptions such as homoscedasticity and linearity were satisfied. Model fit improved across both datasets, as evidenced by higher R² values, lower Akaike Information Criterion (AIC) scores, and more evenly distributed residuals. These findings demonstrate that the Box–Cox transformation enhances the reliability and interpretability of regression models in sports analytics, particularly for non-normal data, by addressing both distributional characteristics and residual behaviors.
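The before-and-after comparison described above can be illustrated with a minimal sketch. It assumes synthetic, right-skewed data in place of the MLB and LPGA datasets, which are not reproduced here. The variable names and the helper `r_squared` are hypothetical, and the sketch reports only skewness and R², not the full set of diagnostics used in the study.

```python
import numpy as np
from scipy import stats


def r_squared(x, y):
    """R^2 of a simple least-squares line y ~ a + b*x (hypothetical helper)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()


# Synthetic stand-in data (an assumption, not the study's data):
# a strictly positive, right-skewed response that is log-linear in x.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=500)
y = np.exp(0.5 + 0.8 * x + rng.normal(0.0, 0.4, size=500))

# Box-Cox: (y**lam - 1) / lam for lam != 0, log(y) for lam == 0.
# With no lambda supplied, scipy estimates it by maximum likelihood.
y_bc, lam = stats.boxcox(y)

# Distribution of the dependent variable before and after.
skew_before = stats.skew(y)
skew_after = stats.skew(y_bc)

# Fit of the same linear model before and after.
r2_before = r_squared(x, y)
r2_after = r_squared(x, y_bc)

print(f"lambda = {lam:.3f}")
print(f"skewness: {skew_before:.3f} -> {skew_after:.3f}")
print(f"R^2:      {r2_before:.3f} -> {r2_after:.3f}")
```

Note that Box-Cox requires a strictly positive response. Note also that likelihood-based criteria such as AIC are only comparable across the original and transformed scales once the Jacobian of the transformation is accounted for, which is why this sketch does not compute AIC.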

Keywords
sports data transformation; sports ICT; sports big data; sports analytics; regression assumption validity
Submission Date
2025-11-26
Revised Date
2025-12-23
Accepted Date
2025-12-29
