Can you solve this $1B challenge with machine learning?
We’ve just passed the middle of March. For folks worldwide, this means gearing up for autumn or spring festivities and traditions, religious and cultural celebrations like St. Patrick’s Day, as well as more humorous events like Pi Day. For sports fans in the U.S., March is the unofficial month of basketball, and it’s when basketball gets a little crazy. Here’s the story of how sports and basketball connect with passion, madness, analytics, machine learning, and a billion dollars (or potentially at least a few million).
The love of the game
In an era in which professional sports have been vastly commercialized, from outrageous ticket prices to digital collectibles, collegiate sports still have a soft spot in our hearts as well as high viewership and engagement by fans. In fact, it might surprise some international readers to find that top college football and college basketball matches are consistently among the most watched and televised sporting events in the U.S. Most collegiate athletes will never make any money playing professionally, and choose to play for the privilege of a good education, for the thrill of performing for their enthusiastic home crowds as well as on the national stage, and for the love of the game. It’s a fun mix of elite athleticism, competitiveness, passion, and school pride.
Why so mad?
The collegiate system is too complex to unpack in full, but if you’re new to it, here it is in brief: schools (i.e., universities and colleges) are divided into conferences, largely for historic reasons of geographic proximity and long-standing rivalry. Over the course of the regular season, teams compete in a mix of in-conference and out-of-conference matches, leading into the month of March. The regular season is followed by conference-level tournaments, and it all culminates in a single-elimination tournament of 68 teams that compete in seven rounds for the national championship. It’s called March Madness because of the ridiculous number of games packed into a tight schedule, as well as the fact that historically it is full of upsets and surprise wins.
That madness is what makes it so exciting - for sports fans, aficionados, stats nerds, and data scientists. Roughly 70 million fans fill out tournament brackets each year in anticipation of the tournament, aiming to accurately predict the outcome of each of the 63 games in the main bracket (the four play-in games are typically left out). The problem? No one has ever even come close to getting it right.
The Buffett challenge
Warren Buffett, CEO of Berkshire Hathaway, famously teamed up with Dan Gilbert of Quicken Loans in 2014 to offer a prize of $1 billion to anyone who correctly predicted the outcome of every game in that year’s tournament, and $100,000 to the top 20 imperfect brackets if no one won the billion. Needless to say, nobody won the billion-dollar prize. While Buffett's challenge is only open to Berkshire Hathaway employees for the 2021 tournament, it’s fascinating to think we are not much closer to seeing a perfect bracket than we were back in 2014.
We should be getting closer though, right? We have copious amounts of data about each team’s players, previous game outcomes, offensive and defensive statistics, historical bracket outcomes, and much more. So why is it still so challenging to correctly predict each game's results? Let's break it down. With 63 games played in the tournament, the odds of getting every outcome correct, if each game is treated as a coin flip, are 1 in 9.2 quintillion. If you are a more savvy basketball fan who is “in the know” about matchups, your odds improve to roughly 1 in 120 billion. So just how close have we come to being perfect? The longest a verified March Madness bracket has stayed correct is 49 games straight... which is far from perfection, but an amazing achievement nonetheless.
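To make the arithmetic above concrete, here is a small Python sketch. It assumes every game is an independent pick with the same chance of being right, which is a simplification and not how real matchups behave. The function names are hypothetical, chosen only for this illustration.

```python
GAMES = 63  # number of games in a bracket, as stated above


def perfect_bracket_one_in(p_correct: float, games: int = GAMES) -> float:
    """Return N such that a perfect bracket is a 1-in-N event,
    assuming each pick is independently correct with probability p_correct."""
    return 1.0 / (p_correct ** games)


def implied_accuracy(one_in_n: float, games: int = GAMES) -> float:
    """Invert the formula: the per-game accuracy that yields 1-in-N odds."""
    return one_in_n ** (-1.0 / games)


# Coin-flip case: 2^63 possible brackets, computed as an exact integer.
coin_flip_brackets = 2 ** GAMES

# Per-game accuracy implied by the "1 in 120 billion" figure quoted above.
savvy_accuracy = implied_accuracy(120e9)

print(f"Coin flip: 1 in {coin_flip_brackets:,}")
print(f"Implied per-game accuracy at 1 in 120 billion: {savvy_accuracy:.3f}")
```

Under these assumptions, the 1-in-120-billion figure corresponds to picking roughly two out of every three games correctly, which shows how quickly even a decent per-game hit rate compounds into long odds over 63 games.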
This sounds like an interesting (and potentially lucrative) problem for machine learning and data science enthusiasts! But how much of it is scientific? How much of it is random?
Basketball data science
Sports analytics have come a long way over the last 20 years, thanks to sports betting, the Moneyball movement, and big sports franchises spending millions on getting smarter. We are long past the point of tracking only rudimentary stats like field goals, rebounds, and steals. We now have advanced statistics such as ball deflections and a breakdown of field goals by distance from the hoop, as well as advanced metrics such as True Shooting Percentage and Value over Replacement Player (VORP). Furthermore, the world of basketball has moved from box-score analytics (i.e., just counting things) to complete video and optical analyses that enable real player tracking (i.e., what a player actually does on court, with and without the ball). We can now simulate matches at the possession level. The result is a real surplus of information to leverage, and making good data-driven decisions can be worth a fortune.
This may sound underwhelming to anyone who did not live through this evolution, but it amounts to a huge digital transformation and opens the door to some exciting data science work, sifting through thousands of potential features that may affect the game in non-obvious ways. Within that universe sits the specific problem of predicting March Madness results, a challenge that may or may not be adequately approached with common methods like logistic regression and decision trees. Think you have a shot? Then put your skills to the test.
Calling all Kagglers!
If there is one elite group of scientists and engineers ready to take on the challenge of predicting results in a complex system, it’s Kagglers. The March Machine Learning Mania competition, which has taken place since 2014, lets data scientists put their skills to that 1-in-9.2-quintillion test. And while collegiate-level data does not match the quality and richness of pro sports data, it offers plenty of ways to integrate interesting data points, such as past tournament performances and rating systems like Elo or the NCAA’s RPI, and makes for an exciting matchup on the Kaggle leaderboards.
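For readers who have not met Elo before, here is a minimal sketch of the standard Elo formulas in Python. The scale of 400 and the K-factor of 20 are conventional defaults assumed here for illustration, not values taken from any particular basketball rating system, and the team ratings in the usage lines are made up.

```python
def elo_expected(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score (roughly, win probability) of team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 20.0) -> tuple[float, float]:
    """Return both teams' new ratings after one game.

    k (assumed value) controls how strongly a single result moves the ratings.
    """
    expected_a = elo_expected(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical ratings for two teams, purely for illustration.
favorite, underdog = 1700.0, 1500.0
print(f"Favorite win probability: {elo_expected(favorite, underdog):.3f}")
print("Ratings after an upset:", elo_update(favorite, underdog, a_won=False))
```

A win probability like this could serve as a single input feature, or even as a baseline prediction on its own, before reaching for heavier models.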
You can join this year’s Kaggle competitions for the men’s and women’s tournaments. There’s still some time left to participate if you’re up for it.
Bracket challenges are also everywhere, whether at your workplace or hosted on platforms like Yahoo! Fantasy or ESPN. There is still time to submit your own bracket, individually or as part of a pool.
And most importantly, enjoy the games and the thrill of a tournament where anything can happen. At Mona, we celebrate diversity and exciting data science, but we are particularly excited about Florida and Michigan’s chances this year. Go Gators and Go Blue!