Digging Deep [Learning] to Win March Madness
I built a neural network to make my NCAA men’s tournament picks for my office pool this year. As of this writing, I need an Auburn win and a Michigan State loss in the Final Four to win bragging rights for a year! I took this approach because I had watched a grand total of zero college basketball games this year and, sadly, went to a college that isn’t what you’d call an “athletic powerhouse”.
I decided to do this on March 20th, so I had a little more than 24 hours before the 12 noon cutoff on March 21st. This is the story of what I was able to get done in that time.
Getting the Data
To build a dataset, I used data from Sports-Reference, which provides a variety of statistics for every Division I team. In the spirit of neural networks, I didn’t want to assume a function or any type of relationship between the inputs and the outcome. The site provides Basic and Advanced stats for each team and for its opponents. The basic stats are what you would expect; the advanced stats are more “rate-based”, which allows for better comparisons between teams with different styles. I ended up using:
Basic Stats: [School, Games, Wins, Losses, Win/Loss %, Simple Rating, Strength of Schedule, Conference Wins, Conference Losses, Home Wins, Home Losses, Away Wins, Away Losses, Total Points, Total Opponent Points, Minutes Played, Field Goals, Field Goal Attempts, Field Goal Percentage, 3-point Field Goals, 3-point Attempts, 3-Point Field Goal Percentage, Free Throws, Free Throw Attempts, Free-Throw Percentage, Offensive Rebounds, Total Rebounds, Assists, Steals, Blocks, Turnovers, Personal Fouls]
Advanced Stats: [Pace Factor, Offensive Rating, Free Throw Attempt Rate, 3-Point Attempt Rate, True Shooting Percentage, Total Rebound Percentage, Assist Percentage, Steal Percentage, Block Percentage, Effective Field Goal Percentage, Turnover Percentage, Offensive Rebound Percentage, Free Throws/Field Goal Attempt]
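To make “rate-based” concrete, here is a minimal sketch of three of the advanced stats computed from basic season totals. The formulas are the standard definitions of Effective Field Goal Percentage, 3-Point Attempt Rate, and Free Throw Attempt Rate; the function names and the two sample teams are made up for illustration and are not taken from my dataset.

```python
def effective_fg_pct(fg, fg3, fga):
    """eFG% = (FG + 0.5 * 3P) / FGA: a made three counts as 1.5 field goals."""
    return (fg + 0.5 * fg3) / fga


def three_point_attempt_rate(fg3a, fga):
    """3PAr = 3PA / FGA: share of field goal attempts taken from three."""
    return fg3a / fga


def free_throw_attempt_rate(fta, fga):
    """FTr = FTA / FGA: free throw attempts per field goal attempt."""
    return fta / fga


# Hypothetical season totals for a slow-paced and a fast-paced team.
teams = {
    "slow": {"fg": 750, "fg3": 200, "fg3a": 560, "fga": 1650, "fta": 600},
    "fast": {"fg": 950, "fg3": 300, "fg3a": 850, "fga": 2150, "fta": 700},
}

for name, t in teams.items():
    efg = effective_fg_pct(t["fg"], t["fg3"], t["fga"])
    par = three_point_attempt_rate(t["fg3a"], t["fga"])
    ftr = free_throw_attempt_rate(t["fta"], t["fga"])
    print(f"{name}: eFG% = {efg:.3f}, 3PAr = {par:.3f}, FTr = {ftr:.3f}")
```

The raw totals of the two teams differ a lot because one simply plays more possessions, while the rates land on a common scale, which is why they compare more fairly across styles.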
Building the Dataset
I made the following additional decisions to build the dataset:
- Selected Teams: I scraped data only for the 64 teams that qualified for the main tournament in a given year (excluded play-in games)
- Opponent Data: For each team, I added the above Basic & Advanced Stats for their opponents
- School Name: I added each school’s name, which made sense to me given the potential to add a feature related to long-term program success and stature (e.g. Duke, Kentucky)
- Years: I scraped data from 2012-2018
- Seeding: I added each team’s seeding for the tournament. I did not take the time to differentiate between same seeds (i.e. rank order the four #1 seeds)
- Game Outcome: Win or loss by the higher seeded team was the dependent variable, and I assigned 1 for a Win and 2 for a Loss
To create the dataset, I paired the higher-seeded team’s Basic, Advanced, OpponentBasic, and OpponentAdvanced stats with the same group of stats for the lower-seeded team it was playing in a given game. As mentioned above, I added team/school name, seeding, and the game outcome. This complete set of data became a single record. For each tournament, I ended up with 63 records (one per game), for 441 records in total across the seven tournaments from 2012 to 2018.
I knew this was not a lot of data, particularly for training a deep neural network. Nonetheless, the clock was ticking!
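The pairing described above can be sketched roughly as follows. This is an illustration, not my actual scraping code: the helper `build_record`, the short stat keys, and the two sample teams are hypothetical. Only the column names higherschoolname, lowerschoolname, and higheroutcome, and the 1/2 outcome encoding, come from the post.

```python
def build_record(higher, lower, higher_won):
    """Flatten two teams' stat dicts into one training record."""
    record = {}
    for prefix, team in (("higher", higher), ("lower", lower)):
        for stat, value in team.items():
            # e.g. "schoolname" becomes "higherschoolname" / "lowerschoolname"
            record[f"{prefix}{stat}"] = value
    # Encoding from the post: 1 = higher seed won, 2 = higher seed lost.
    record["higheroutcome"] = 1 if higher_won else 2
    return record


# Hypothetical teams with only a handful of stats each.
higher = {"schoolname": "Team A", "seed": 1, "srs": 25.1, "pace": 68.4}
lower = {"schoolname": "Team B", "seed": 16, "srs": -2.3, "pace": 71.0}

record = build_record(higher, lower, higher_won=True)
print(record)

# A 64-team single-elimination bracket has 63 games, since each game
# eliminates exactly one team.
games_per_tournament = 64 - 1
seasons = len(range(2012, 2018 + 1))
total_records = games_per_tournament * seasons
print(total_records)
```

The count at the end is the same arithmetic behind the 441 records: 63 games per tournament over seven tournaments.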
Training the Model
For the sake of time, I opted to use the libraries and factory methods built by fast.ai that sit atop PyTorch.
Step 1: I loaded my data (a CSV file scraped from Sports-Reference) into a pandas DataFrame, the format required by the fast.ai factory methods.
Step 2: I defined my dependent variable, “higheroutcome”; my categorical variables (higherschoolname and lowerschoolname); and my continuous variables (everything else).
Step 3: I used the factory method TabularList.from_df to create a databunch object that would be recognized by the fast.ai/PyTorch learner. I fed it the pandas dataframe, the path where the source data sat, and the names of the categorical and continuous variables. The method creates a validation set for you and ultimately creates a usable databunch object.
Step 4: I checked the first ten rows of data to make sure they looked right, which they did (note: many columns are cut off in the picture below).
Step 5: One line of code built the learner, using the tabular_learner factory method to create a TabularModel. The basic architecture of the model is below:
(0): Embedding(77, 18)
(1): Embedding(161, 28)
(bn_cont): BatchNorm1d(180, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(0): Linear(in_features=226, out_features=200, bias=True)
(2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Linear(in_features=200, out_features=100, bias=True)
(5): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Linear(in_features=100, out_features=2, bias=True)
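The layer sizes in the printout above fit together in a way that is easy to check: the first Linear layer takes the two embedding outputs concatenated with the 180 batch-normalized continuous inputs. The sketch below also reproduces the embedding widths using the heuristic min(600, round(1.6 * n^0.56)), which is, as far as I know, the default rule in fast.ai v1; treat that rule as an assumption about the library's defaults rather than something confirmed here.

```python
def emb_sz_rule(n_cat):
    """Heuristic embedding width for a categorical variable with n_cat levels.

    Assumed to match the fast.ai v1 default; not confirmed by the post.
    """
    return min(600, round(1.6 * n_cat ** 0.56))


# Sizes read off the printed model.
embedding_levels = [77, 161]  # first argument of the two Embedding layers
n_continuous = 180            # BatchNorm1d(180) over the continuous inputs

embedding_widths = [emb_sz_rule(n) for n in embedding_levels]

# The first Linear layer sees every embedding output plus every continuous input.
first_layer_inputs = sum(embedding_widths) + n_continuous

print(embedding_widths)
print(first_layer_inputs)
```

The two widths and their sum with the continuous inputs should line up with Embedding(77, 18), Embedding(161, 28), and Linear(in_features=226, ...) in the printout.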
Step 6: Another line of code runs the
Step 7: I ran the learning rate finder function, but it indicated the same learning rate, so I simply ran 3 epochs this time, and got slightly better results, about 80%. I decided to go with that model.
Step 8: I went through and inferred outcomes for 2019 based on the saved model, one round at a time.
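Filling in the bracket one round at a time, as in Step 8, can be sketched like this. The model is replaced by a stand-in function, and the helper names, the team dicts, and the 8-team region are all hypothetical; in practice the predictor would wrap the saved learner and the full paired stat record for each game.

```python
def play_round(teams, predict_higher_seed_wins):
    """Pair adjacent teams in bracket order and return the predicted winners."""
    winners = []
    for a, b in zip(teams[0::2], teams[1::2]):
        # A smaller seed number means a higher seed; ties keep bracket order.
        higher, lower = (a, b) if a["seed"] <= b["seed"] else (b, a)
        if predict_higher_seed_wins(higher, lower):
            winners.append(higher)
        else:
            winners.append(lower)
    return winners


def fill_bracket(teams, predict_higher_seed_wins):
    """Advance winners round by round until a single team remains."""
    rounds = []
    while len(teams) > 1:
        teams = play_round(teams, predict_higher_seed_wins)
        rounds.append([t["name"] for t in teams])
    return rounds


def chalk(higher, lower):
    """Stand-in for the trained model: always picks the higher seed."""
    return True


# Hypothetical 8-team region listed in bracket order.
region = [{"name": f"Seed {s}", "seed": s} for s in (1, 8, 4, 5, 3, 6, 2, 7)]

picks = fill_bracket(region, chalk)
for round_number, winners in enumerate(picks, start=1):
    print(f"Round {round_number}: {winners}")
```

Because each round's matchups depend on the previous round's predicted winners, the predictions have to be made sequentially rather than for all games at once.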
I had modest expectations for the results, and my bracket has been about average. I picked 38 of 60 games correctly, including a blistering 13-2 in the Midwest region, where I correctly had Auburn beating Kansas, UNC, and Kentucky. At some point, I will compare the neural network model to results obtained from a different machine learning method, such as a random forest. I am sure I would have done better with more data, including every regular season game. Nonetheless, I have a legitimate chance to win my office pool, so mission nearly accomplished. Go Tigers!!!
Any opinions or forecasts contained herein reflect the personal and subjective judgments and assumptions of the author only. There can be no assurance that developments will transpire as forecasted and actual results will be different. The accuracy of data is not guaranteed but represents the author’s best judgment and can be derived from a variety of sources. The information is subject to change at any time without notice.