The Baseball Futures Market
The underlying conflict that sparked the recent Supreme Court case Murphy v. NCAA (formerly Christie v. NCAA) was less about the legality of sports gambling and more about how money and power are divided between leagues and casinos--a power struggle between business behemoths, with stakes worth the value of almost every NFL, MLB, and NBA franchise combined. The leagues want to keep legislation in Washington, where they're heavily invested, and away from state governments. Negotiating with one government is much easier than negotiating with fifty.
As the leagues' lawyer Paul Clement stated in oral arguments, the aim of the Professional and Amateur Sports Protection Act (PASPA) is to ensure "no state-sponsored or -operated gambling taking place by either individuals or by the state." Joining the NCAA are the NFL, MLB, and NBA. Adam Silver, commissioner of the NBA and the most prominent sports figure to support legalization, estimates the black market moves around $400 billion per year (conservatively $150 billion) and, correctly, sees a new way to increase the value of the leagues, which already partner with Daily Fantasy companies, brokers of what can only be described as gambling with extra steps. Yet the leagues simultaneously supported government legislation to the contrary. Odd bedfellows, indeed.
Much like this case’s machinations, gambling can be logical, calculated, driven by a complex mixture of converging factors. Gambling is also nearly never that. It’s emotional, delusional, hopeful, and hateful. It’s clouded by a noisy and wholly irrational sports media, an unfortunate victim of the 24-hour news cycle. But in all of that chaos lies an opportunity to exploit these glaring inefficiencies. As the old gambling proverb goes: In every bet, there is a fool and a thief.
Sports gambling is not much different from financial markets. The odds are the price the futures market gives for an outcome, and you test your analysis against somebody else's. Most participants are suckers who get lucky every once in a while and slowly give their money away. The dopamine release can’t be much different, and both offer incentives to rig the market, though sports organizations will actually be punished for fixing the outcome.
With that in mind, I started a project to find opportunity in the Baseball Futures Market. Baseball makes sense for a number of reasons, the biggest being the sheer volume of games to bet on and the wealth of data available on Retrosheet and Lahman’s Baseball Database. Retrosheet, while comprehensive, is actually pretty difficult to extract data from--shout out to FanGraphs for making that process so much easier.
The bigger hurdle, however, was getting historical data for casinos’ odds for those games. Casinos do not have an incentive to simply give you data to help build a model. I was able to find a third party that collects and formats historical odds for a number of sports, including MLB data going back to 2009. It’s missing some games, but gives Pinnacle’s prices for the money line and Over/Unders for around 14,000 games. Pinnacle, known for setting the sportsbook market, is on the cutting edge of gambling and data science.
Calculations
Line-drive Rate
Ground Ball Rate
Fly Ball Rate
HR-Fly Ball Ratio
Fielding Independent Pitching
Weighted On-Base Average
Batting Average on Balls in Play
Pythagorean Win-Loss
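Several of the stats listed above have widely published definitions. The following is a minimal sketch assuming those standard public formulas, not necessarily the exact calculations used in this project. The function names are hypothetical, the FIP constant changes every season (the default here is only a placeholder), and the batted-ball rates assume a denominator of line drives plus ground balls plus fly balls. Weighted On-Base Average is left out because its weights are season-specific.

```python
def batted_ball_rates(line_drives, ground_balls, fly_balls):
    """Share of batted balls by type (assumed denominator: LD + GB + FB)."""
    total = line_drives + ground_balls + fly_balls
    return {
        "ld_rate": line_drives / total,
        "gb_rate": ground_balls / total,
        "fb_rate": fly_balls / total,
    }


def hr_per_fly_ball(home_runs, fly_balls):
    """HR-Fly Ball Ratio: home runs divided by fly balls."""
    return home_runs / fly_balls


def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting Average on Balls in Play: (H - HR) / (AB - K - HR + SF)."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)


def fip(home_runs, walks, hit_by_pitch, strikeouts, innings, constant=3.10):
    """Fielding Independent Pitching.

    `constant` is a season-specific league adjustment; 3.10 is a placeholder
    assumption, not a value taken from the project.
    """
    raw = 13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts
    return raw / innings + constant


def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean Win-Loss expectation: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)
```

For example, a team that has scored 800 runs and allowed 700 has a Pythagorean expectation of about 0.566 with the classic exponent of 2.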
I didn’t have the bandwidth to start analyzing each player individually. While totally possible with Retrosheet’s data, I thought it better to take a more macro approach, so early on I adopted the philosophy that I would calculate stats by team units: the offense, the bullpen, and the starting pitcher. In addition to these (relatively) raw calculations, I analyzed matchups by taking the average of a team’s pitching units and comparing it to the opposing team’s batting unit. I then took it one step further by measuring the overall advantage a team had between its offense and defense. I made features measuring each unit's cumulative performance and its recent performance in rolling five- and ten-game windows, giving a nice historical foundation with a nod to recency.
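The cumulative and rolling features described above can be sketched with pandas. This is an illustration under assumptions: the column names (`team`, `date`, `runs`) and the function name are hypothetical, and shifting each stat by one game so that a row only sees earlier games is my own guard against leakage, not something the project is confirmed to do.

```python
import pandas as pd


def add_form_features(games, stat="runs", windows=(5, 10)):
    """Add cumulative and rolling means of `stat` per team.

    Every feature uses only games played BEFORE the current row, so the
    first game of each team has no history and gets NaN.
    """
    games = games.sort_values(["team", "date"]).copy()
    # Shift within each team so the current game is excluded from its own features.
    prior = games.groupby("team")[stat].shift(1)
    by_team = prior.groupby(games["team"])
    # Cumulative (season-to-date) average.
    games[f"{stat}_cum"] = by_team.transform(lambda s: s.expanding().mean())
    # Recent form over the last w games.
    for w in windows:
        games[f"{stat}_last{w}"] = by_team.transform(
            lambda s: s.rolling(w, min_periods=1).mean()
        )
    return games
```

The same function could be applied once per unit (offense, bullpen, starting pitcher) and per stat, producing the long-run and recent-form columns side by side.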
There are a few ways to approach predictions and a few ways to measure success. Clearly, sports gambling in general is a tough gig. Nearly everything is normally distributed and, while there are differences between winners and losers, there are many exceptions that make the ultimate outcome far from obvious. Even the best teams lose forty percent of the time. I don't expect to get a consistently high prediction rate, but accuracy isn't the most important element here. We're working in a futures market with different prices for each matchup--the prices are what will make this project profitable. If I can get a sixty percent prediction rate and scale bets based on confidence in that outcome, I'll have a good shot at profitability.
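To see why price matters more than raw accuracy, it helps to convert a money line into the win probability it implies. The sketch below assumes American-style money lines (for example -150 or +130), which may not match the format of the historical odds data; the function names are hypothetical.

```python
def implied_probability(moneyline):
    """Break-even win probability implied by an American money line."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def profit_if_win(moneyline, stake):
    """Profit (excluding the returned stake) on a winning bet."""
    if moneyline < 0:
        return stake * 100 / -moneyline
    return stake * moneyline / 100


def expected_value(win_prob, moneyline, stake):
    """Average profit per bet given your own estimate of the win probability."""
    return win_prob * profit_if_win(moneyline, stake) - (1 - win_prob) * stake
```

Under these assumptions, a -150 favorite implies a 60% break-even rate, so `expected_value(0.6, -150, 150)` is zero: picking sixty percent of winners earns nothing if every pick is priced at -150. A +150 underdog that wins half the time, by contrast, returns an average of $25 per $100 staked.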
Using the scikit-learn package in Python, I implemented machine learning techniques to predict the game winner and the Over/Unders for 6, 7, and 8 runs (when odds were available). First, I fit the model on a random half of the 2009 data and tested on data from 2010. I then used 2009 and 2010 to predict 2011, and so on through every season, refitting the model after each.
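The season-by-season refitting can be sketched as an expanding-window loop. This is a simplified illustration, not the project's code: it trains on every earlier season in full (it skips the random-half split of 2009), the function name is hypothetical, and the feature matrix `X`, labels `y`, and `seasons` array are assumed inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def walk_forward_accuracy(X, y, seasons):
    """Refit on all prior seasons, then score on the next one.

    Returns {test_season: accuracy}. The earliest season is used only
    for training, so it never appears as a key.
    """
    X, y, seasons = np.asarray(X), np.asarray(y), np.asarray(seasons)
    ordered = sorted(set(seasons.tolist()))
    results = {}
    for test_season in ordered[1:]:
        train_mask = seasons < test_season
        test_mask = seasons == test_season
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_mask], y[train_mask])
        # score() reports plain accuracy: the share of games predicted correctly.
        results[test_season] = model.score(X[test_mask], y[test_mask])
    return results
```

Because each season is predicted only from seasons before it, the accuracy figures mimic what would have been available at the time of each bet.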
Percent Correctly Predicted
I used two general approaches. One approach picked the winner by using logistic regression to predict a binary outcome: home win or not home win. It did the same for the Over/Unders. The second approach individually predicted the home and visiting scores by using linear regression. From there I was able to predict an Over/Under by taking the sum.
This second approach also informed my scaling system. A large difference between the predicted runs for home and away teams led to a higher confidence level and therefore a bigger bet. Similarly, the difference between the predicted total runs and the benchmark for Over/Unders led to more confidence and more money risked.
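A minimal sketch of the second approach, assuming one linear regression per side trained on the same feature matrix. The function names and inputs are hypothetical, and a predicted total that lands exactly on the line is treated as "under" here purely for simplicity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def predict_scores(X_train, home_runs, away_runs, X_new):
    """Fit separate models for home and visiting runs, then predict both."""
    home_model = LinearRegression().fit(X_train, home_runs)
    away_model = LinearRegression().fit(X_train, away_runs)
    home_pred = home_model.predict(X_new)
    away_pred = away_model.predict(X_new)
    # The predicted total is simply the sum of the two predicted scores.
    return home_pred, away_pred, home_pred + away_pred


def over_under_pick(total_pred, line):
    """Pick 'over' when the predicted total exceeds the posted line."""
    return np.where(np.asarray(total_pred) > line, "over", "under")
```

The gap `home_pred - away_pred` gives the confidence signal for the winner, and `total_pred - line` gives the equivalent signal for the Over/Under.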
Establishing a wager necessitates a bit of thought. As anyone with the gambling bug would know, $10 won't get you up in the morning. To do this seriously, you need a healthy risk appetite, just as you do when playing the stock market. Given how suppressed the gambling market currently is, one problem (of many) is the control Vegas has over whether to take your bet at all. A "Sharp," Vegas parlance for a knowledgeable gambler, can be arbitrarily kept from placing a wager or, at the very least, limited in the amount they put down. It could be a maximum of $5,000 per bet or $5,000 per day. I went with a modest starting wager of $100 and scaled up to $500.
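The exact mapping from confidence to bet size isn't spelled out here, so the following is only one plausible sketch: a linear ramp from the $100 base wager to the $500 cap as the predicted margin grows. The `full_margin` parameter (the run gap at which the stake tops out) is a hypothetical value chosen for illustration.

```python
def scaled_stake(margin, min_stake=100.0, max_stake=500.0, full_margin=3.0):
    """Scale the wager linearly with the size of the predicted margin.

    `margin` is either predicted home runs minus away runs, or predicted
    total minus the Over/Under line. Only its magnitude matters.
    """
    confidence = min(abs(margin) / full_margin, 1.0)
    return min_stake + confidence * (max_stake - min_stake)
```

With these defaults, a predicted gap of 1.5 runs yields a $300 wager, and anything at or beyond 3 runs is capped at $500.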
Below are two options to explore the results. The first is an interactive plot using Dash by Plotly. Hovering over the points will reveal details about that game: who was in it, runs scored, the payout, and my rolling total. As you zoom in, the points become clearer. This plot displays the best performing strategies for the 2017 season--the free version of Plotly has file size limits, so I could include only a portion of my final data set. The second option is a gallery of outputs from Python showing how the strategies worked over each season. You can also view the code and (some) data on my GitHub account.
The results are incredibly promising. When anybody claims they’ve beaten the house, there’s reason to be skeptical, but the analysis is pretty straightforward. I calculated stats that captured multiple raw fields and summarized them in a few ways. My goal was not to predict a high percentage of outcomes (though I did well in that regard as well); it was to capture a profitable mix of favorites and underdogs. And I think I have a good foundation to build on and further develop serious analyses of the Baseball Futures Market.
The peaks and valleys throughout are fascinating. You can imagine the anxiety when the model loses $10,000 in a week and the adrenaline when it goes the other way. The key is consistency in volume, strict adherence to the predictions, and knowing the difference between having fun and being smart. There'll be ups and downs, so it's best to have a healthy amount of liquidity. I can see this model as a source of some serious passive income or, at the very least, a 401(k) with conspicuously better returns.
As of now, this is purely academic. To implement this during the season, I’d need to scrape up-to-date data for each unit, and I anticipate that there might be issues in getting stats like fly ball rate. This analysis can also go much further. I’d like to eventually calculate a strength of schedule stat and somehow account for injuries. If I did this daily, I’d also look at some prop bets and do a quick analysis to exploit bad odds. If a home run hitter is in a slump but facing pitchers with a high fly ball rate, you’ll probably find good odds, risking $20 to grab something upwards of $80. And even if this model doesn't work out long term (check on me in a couple of years), it still tells me that there is an objective, business-like approach possible in this market littered with uninformed and irrational investors.
We're going to have the pleasure of watching a black market transition out of the shadows and towards regulation. Questions remain: How will leagues combat game fixing? What's the role of Las Vegas sportsbooks? Will addiction be monitored? Will the NCAA finally pay its athletes? ("They get a free education and that's priceless.") What's the actual size of this market? Is this a job for governments or corporations? Will regulation allow for a new mutual fund market centered on gambling? And, most importantly, when can I enter the game?
Gamblers have won and lost fortunes for millennia, since before humanity thought to begin recording its own history. Gambling is a reflection of our very human selves--sometimes grim, sometimes glorious, always entertaining. Indeed, American history is littered with gamblers masquerading as investors and inventors, businessmen and explorers, artists and athletes.
Sports captivate us with an unadulterated version of competition. A legendary performance is inspiring not just because we can appreciate greatness, but because we want to know it in ourselves, to apply it in our domain. In both victory and defeat, we see shades of the audacity and vulnerability it will take to reach for our respective greatness and be thus distinguished from those "that neither know victory nor defeat."
Success inherently requires an unflinching understanding of risk and an insatiable desire to beat the odds. In gambling, there is no metaphor--just the real thing.