Tools Used: Python, MariaDB, SQL, Docker

Period: Aug 2022 - Dec 2022

Objective: The objective of this project is to predict whether the home team wins a baseball game by analyzing statistical features of the teams and players using SQL and Python.

Data Collection: The data for this project will be drawn from 31 tables of baseball statistics stored in a database. Features built from these tables include Pythagorean expectation, batting average, and starting pitcher stats.
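
As a reference for one of the features named above, here is a minimal sketch of the Pythagorean expectation, which estimates a team's winning percentage from runs scored and runs allowed. The function name, the default exponent of 2 (the classic form of the formula), and the handling of the zero-run case are assumptions for illustration; in the project itself this feature is computed from the database tables.

```python
def pythagorean_expectation(runs_scored: float, runs_allowed: float,
                            exponent: float = 2.0) -> float:
    """Estimate winning percentage as RS^k / (RS^k + RA^k)."""
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    if rs + ra == 0:
        # Assumption: with no runs on either side, treat the team as average.
        return 0.5
    return rs / (rs + ra)
```

A team that scores exactly as many runs as it allows gets an expectation of 0.5; scoring more than it allows pushes the value above 0.5.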

Data Preprocessing: The data will be preprocessed in SQL, and the final table of relevant features will be imported into Python. In Python, several ranking techniques will be applied to the features, such as RFVIP, difference with mean of response, and correlation analysis.
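
The difference with mean of response ranking can be sketched roughly as follows. This is an illustrative version, not the project's actual script: it assumes equal-width bins, a numeric predictor, and a 0/1 response (home team win), and the function name and `n_bins` parameter are hypothetical.

```python
from statistics import mean


def mean_of_response_diff(predictor, response, n_bins=10):
    """Return (unweighted, weighted) mean squared difference between each
    bin's mean response and the overall mean response."""
    lo, hi = min(predictor), max(predictor)
    # Fall back to a width of 1.0 when the predictor is constant.
    width = (hi - lo) / n_bins or 1.0

    # Group the response values by the bin their predictor value falls in.
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(predictor, response):
        idx = min(int((x - lo) / width), n_bins - 1)
        bins[idx].append(y)

    pop_mean = mean(response)
    n = len(response)
    unweighted = 0.0
    weighted = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        squared_diff = (mean(members) - pop_mean) ** 2
        unweighted += squared_diff / n_bins
        weighted += squared_diff * len(members) / n
    return unweighted, weighted
```

A predictor whose bins all share the overall win rate scores 0, while a predictor that separates wins from losses scores higher, so the scores can be used to order the features.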

Feature Engineering: Based on the results of the ranking techniques, new features will be created using a brute force analysis to identify the best predictor combinations.
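
One way to picture the brute force analysis is to score every pair of predictors on a two-dimensional grid of bins and rank the pairs by how far the cell means sit from the overall mean response. The sketch below follows that idea under stated assumptions: the helper names, the equal-width binning, and the weighted scoring are illustrative choices, not taken from the project code.

```python
from itertools import combinations
from statistics import mean


def bin_index(value, lo, hi, n_bins):
    """Map a value to an equal-width bin index in [0, n_bins - 1]."""
    width = (hi - lo) / n_bins or 1.0  # constant column: single bin
    return min(int((value - lo) / width), n_bins - 1)


def pair_score(x1, x2, y, n_bins=5):
    """Weighted mean squared difference of cell means from the overall mean."""
    lo1, hi1 = min(x1), max(x1)
    lo2, hi2 = min(x2), max(x2)
    cells = {}
    for a, b, target in zip(x1, x2, y):
        key = (bin_index(a, lo1, hi1, n_bins), bin_index(b, lo2, hi2, n_bins))
        cells.setdefault(key, []).append(target)
    pop_mean = mean(y)
    return sum(len(v) / len(y) * (mean(v) - pop_mean) ** 2
               for v in cells.values())


def rank_pairs(features, y, n_bins=5):
    """Score every pair of named predictors and sort best first."""
    scores = {
        (n1, n2): pair_score(features[n1], features[n2], y, n_bins)
        for n1, n2 in combinations(features, 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Here `features` is assumed to be a dict mapping a predictor name to its list of values. Scoring pairs jointly can surface combinations that look uninformative one at a time, which is the motivation for trying every pair.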

Data Cleaning and Modeling: The data will be cleaned and then used to train and test machine learning models such as regression, decision trees, and random forests, with hyperparameter tuning to improve the accuracy of the predictions.

Deployment: The entire process, including the SQL and Python scripts, will be run in a Docker environment using a MariaDB image and a custom Docker image. The results of the analysis will be stored in a mounted volume as an HTML page for easy access and interpretation. 
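
The final step, writing the results as an HTML page into the mounted volume, might look something like the sketch below. The function name, the table layout, and the output directory are hypothetical; the only assumption about the container is that some directory inside it is mounted from the host.

```python
import html
from pathlib import Path


def write_html_report(rows, out_dir, filename="report.html"):
    """Write (predictor name, score) rows as an HTML table in out_dir."""
    out_path = Path(out_dir) / filename
    # Create the output directory if the volume is empty.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Escape names so characters like '<' cannot break the markup.
    body = "\n".join(
        f"<tr><td>{html.escape(str(name))}</td><td>{score:.4f}</td></tr>"
        for name, score in rows
    )
    out_path.write_text(
        "<!DOCTYPE html>\n<html><body><table>\n"
        "<tr><th>Predictor</th><th>Score</th></tr>\n"
        f"{body}\n</table></body></html>\n",
        encoding="utf-8",
    )
    return out_path
```

Inside the container the script would pass the mounted directory as `out_dir`, so the page remains on the host after the container exits.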

Conclusion: The results of this project will show which predictors are most informative for home team wins in baseball games, supporting better-informed predictions and decisions. Running the whole pipeline in a Docker environment will keep the results reproducible.

The images show the brute force analysis and the mean square difference analysis of the predictors. To learn more about the project, please visit the GitHub wiki.