## A Theoretical Model for The Probability of Winning a Basketball Game – Part 2

*This is the second in a 3 part series where I will present a theoretical model for the probability of winning a basketball game. The 3 parts will break this model down at the team, unit, and player level.*

In part one of this series I presented a way to break down the probability of winning a basketball game from the (theoretically) known probability of winning down to the **points scored per 5-player unit per play**.

Recall that at the 5-player unit level, the total points scored for any given play is:

Points Scored = FTM x 1 + 2FGM x 2 + 3FGM x 3

**Expected Points Scored per 5-player Unit per Play**

To break this down further we must now use mathematical expectation. In totality, the expected number of points scored for a 5-player unit on a given play is given by:

E(Points) = Pr(FTM = 0) x 0 + Pr(FTM = 1) x 1 + Pr(FTM = 2) x 2 + Pr(FTM = 3) x 3

+ Pr(2FGM = 0) x 0 + Pr(2FGM = 1) x 2 + Pr(3FGM = 0) x 0 + Pr(3FGM = 1) x 3

Where E() denotes expectation and Pr() denotes probability.
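To make the formula concrete, here is a minimal Python sketch that evaluates it term by term. The probabilities are made-up placeholders rather than estimates from data, and the names `pr_ftm`, `pr_2fgm`, `pr_3fgm`, and `expected_points` are hypothetical. Because expectation is linear, the three pieces can be added together even when the underlying events depend on each other.

```python
# Hypothetical per-play probabilities for one 5-player unit (illustrative only).
pr_ftm = {0: 0.90, 1: 0.04, 2: 0.05, 3: 0.01}   # Pr(FTM = k)
pr_2fgm = {0: 0.70, 1: 0.30}                    # Pr(2FGM = k)
pr_3fgm = {0: 0.90, 1: 0.10}                    # Pr(3FGM = k)


def expected_points(pr_ftm, pr_2fgm, pr_3fgm):
    """Sum Pr(count) x count x point value over every term of E(Points)."""
    e_ft = sum(p * made * 1 for made, p in pr_ftm.items())
    e_2 = sum(p * made * 2 for made, p in pr_2fgm.items())
    e_3 = sum(p * made * 3 for made, p in pr_3fgm.items())
    return e_ft + e_2 + e_3


print(round(expected_points(pr_ftm, pr_2fgm, pr_3fgm), 3))  # 1.07
```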

This formula now allows us to think of the number of points scored per 5-player unit per play in terms of expected value, but these probabilities are not independent. There are a lot of underlying probabilities involved that must be decomposed.

**Breaking Down the Probabilities of E(Points)**

First recognize that E(Points) can also be written as:

E(Points) = Pr(No Turnover) x E(Points | No Turnover)

+ Pr(Turnover) x E(Points | Turnover)

It is obvious that Pr(Turnover) x E(Points | Turnover) always equals 0 because the conditional expectation E(Points | Turnover) always equals 0. As such, we can simply focus on the composition of Pr(No Turnover) x E(Points | No Turnover).
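As a small sketch of this step, the turnover term can be written out explicitly and seen to vanish. The function name and the input numbers are hypothetical.

```python
def expected_points_per_play(pr_turnover, e_points_no_turnover):
    """E(Points) split on whether the play ends in a turnover."""
    if not 0.0 <= pr_turnover <= 1.0:
        raise ValueError("pr_turnover must be between 0 and 1")
    e_points_turnover = 0.0  # a turnover never scores, so this term drops out
    return ((1.0 - pr_turnover) * e_points_no_turnover
            + pr_turnover * e_points_turnover)


# Made-up inputs: 15% turnover rate, 1.2 expected points when there is no turnover.
print(round(expected_points_per_play(0.15, 1.2), 3))  # 1.02
```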

**Focusing on E(Points | No Turnover)**

Now it’s time to identify the structure of E(Points | No Turnover).

For brevity I will try to keep this as simple as possible, as there are a lot of possible combinations.

E(Points | No Turnover) = Pr(2FGA) x E(Points | 2FGA) + Pr(3FGA) x E(Points | 3FGA)

+ Pr(Non-Shooting Foul) x E(Points | Non-Shooting Foul)

+ Pr(All Other Events) x E(Points | All Other Events)

I have separated out **all other events** here to cover events that cause the shot clock to reset and start a new play (such as jump ball situations). These always give us 0 points, so we will ignore them for now and focus on E(Points | 2FGA), E(Points | 3FGA), and E(Points | Non-Shooting Foul).
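The decomposition of E(Points | No Turnover) is a probability-weighted sum over the four event types. A minimal sketch follows, with placeholder numbers and hypothetical names (`pr_event`, `e_points_given_event`). It also checks that the event probabilities sum to 1, which the decomposition requires.

```python
import math


def e_points_no_turnover(pr_event, e_points_given_event):
    """Weighted sum of conditional expectations over the event types."""
    if not math.isclose(sum(pr_event.values()), 1.0):
        raise ValueError("event probabilities must sum to 1")
    return sum(pr_event[e] * e_points_given_event[e] for e in pr_event)


# Made-up values, conditional on no turnover.
pr_event = {
    "2FGA": 0.60,
    "3FGA": 0.25,
    "Non-Shooting Foul": 0.10,
    "All Other Events": 0.05,
}
e_points_given_event = {
    "2FGA": 1.00,
    "3FGA": 1.10,
    "Non-Shooting Foul": 0.30,
    "All Other Events": 0.0,  # these events never score
}

print(round(e_points_no_turnover(pr_event, e_points_given_event), 3))  # 0.905
```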

**Definition of E(Points | 2FGA)**

When a team attempts a 2 point shot the following events can happen:

- No Foul and Shot Made
- No Foul and Shot Missed
- Foul and Shot Made
- Foul and Shot Missed

This means E(Points | 2FGA) can be written as:

E(Points | 2FGA) = Pr(No Foul and Shot Made) x E(Points | No Foul and Shot Made)

+ Pr(No Foul and Shot Missed) x E(Points | No Foul and Shot Missed)

+ Pr(Foul and Shot Made) x E(Points | Foul and Shot Made)

+ Pr(Foul and Shot Missed) x E(Points | Foul and Shot Missed)

Clearly:

E(Points | No Foul and Shot Made) = 2

E(Points | No Foul and Shot Missed) = 0

The other two expectations rely on the foul shots. Therefore:

E(Points | Foul and Shot Made) = 2 + E(One FTA)

E(Points | Foul and Shot Missed) = E(Two FTA)

**Definition of E(Points | 3FGA)**

This is very similar to E(Points | 2FGA), so I’ll simply state the differences:

E(Points | No Foul and Shot Made) = 3

E(Points | Foul and Shot Made) = 3 + E(One FTA)

E(Points | Foul and Shot Missed) = E(Three FTA)
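The 2FGA and 3FGA cases share the same structure, so one function can cover both. This is only a sketch: it assumes every free throw is made with the same probability `ft_pct`, so that E(n FTA) = n x `ft_pct`, and the outcome probabilities are made up. The function and argument names are hypothetical.

```python
import math


def e_points_given_fga(shot_value, pr_outcome, ft_pct):
    """E(Points | FGA) for a 2- or 3-point attempt.

    pr_outcome maps (fouled, made) -> probability, covering all four outcomes.
    """
    if not math.isclose(sum(pr_outcome.values()), 1.0):
        raise ValueError("outcome probabilities must sum to 1")
    e_one_fta = 1 * ft_pct              # extra free throw after a made shot
    e_missed_fta = shot_value * ft_pct  # two FTA on a missed 2, three on a missed 3
    e_given_outcome = {
        (False, True): shot_value,              # No Foul and Shot Made
        (False, False): 0.0,                    # No Foul and Shot Missed
        (True, True): shot_value + e_one_fta,   # Foul and Shot Made
        (True, False): e_missed_fta,            # Foul and Shot Missed
    }
    return sum(pr_outcome[k] * e_given_outcome[k] for k in e_given_outcome)


# Made-up outcome probabilities, keyed by (fouled, made).
pr_2fga = {(False, True): 0.42, (False, False): 0.46,
           (True, True): 0.03, (True, False): 0.09}
pr_3fga = {(False, True): 0.35, (False, False): 0.62,
           (True, True): 0.01, (True, False): 0.02}

print(round(e_points_given_fga(2, pr_2fga, ft_pct=0.75), 4))  # 1.0575
print(round(e_points_given_fga(3, pr_3fga, ft_pct=0.75), 4))  # 1.1325
```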

**Definition of E(Points | Non-Shooting Foul)**

This expectation can be defined as:

E(Points | Non-Shooting Foul) = Pr(Bonus Situation) x E(Two FTA)

+ Pr(Non-Bonus Situation) x 0

Where Pr(Bonus Situation) is always 1 or 0 depending, of course, on the bonus situation. Pr(Non-Bonus Situation) is the complement, 1 – Pr(Bonus Situation). Also, unless the shot clock resets for some reason, the foul in a non-bonus situation **does not** lead to a new play.
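Since Pr(Bonus Situation) is either 1 or 0, this expectation reduces to a simple switch. The sketch below again assumes each free throw is made with the same probability `ft_pct`. The function name is hypothetical.

```python
def e_points_non_shooting_foul(in_bonus, ft_pct):
    """E(Points | Non-Shooting Foul) given whether the team is in the bonus."""
    pr_bonus = 1.0 if in_bonus else 0.0
    e_two_fta = 2 * ft_pct
    return pr_bonus * e_two_fta + (1.0 - pr_bonus) * 0.0


print(e_points_non_shooting_foul(True, 0.8))   # 1.6
print(e_points_non_shooting_foul(False, 0.8))  # 0.0
```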

**The Player Level**

With the underlying expectations for E(Points) defined at the 5-player unit level, let me go back and define E(Points) in terms of players:

E(Points) = Pr(O1) x E(Points | O1) + Pr(O2) x E(Points | O2) + Pr(O3) x E(Points | O3)

+ Pr(O4) x E(Points | O4) + Pr(O5) x E(Points | O5)

So I’ve decided to end this part of the series hanging on that last definition of E(Points) in terms of players. In the last part of this series I will expand on this definition of E(Points) and build the picture of what the theoretical model looks like at the player level.

**Summary**

I’ll admit that this part of the series is pretty boring (it’s mostly a bunch of definitions with ugly notation). I wanted to define the basic structure of the points scored at the 5-player unit level so that I didn’t lose anything when creating the theoretical model for each player. This also helps remind us that we have to worry about 5 players together, and the player piece of this theoretical model will be cognizant of that.

Oh, and in case you’re wondering, I haven’t lost sight of defense, importance of shot location, etc. These factors will clearly affect the underlying player probabilities and will be defined in the player part of this model.

## A Theoretical Model for The Probability of Winning a Basketball Game – Part 1

*This is the first in a 3 part series where I will present a theoretical model for the probability of winning a basketball game. The 3 parts will break this model down at the team, unit, and player level.*

Before diving right into the data and trying to build new models of the game, I feel it will be worthwhile to try and present theoretical models of the game as best I see them.

I see this as being beneficial for two reasons:

- The first, and in my mind most important, reason for doing this is that it will clearly show that all models are wrong. By having a theoretical model as a guideline, we can better understand what a practical model does and (more importantly) does not capture. This way we can better understand the inferences and predictions we’re making with any specific model that’s created.
- The second reason has to do with the fact that **I know what I don’t know**. And what I don’t know is everything that goes into every possible theoretical model of the game of basketball. I am hoping obvious mistakes in these theoretical models will be pointed out by informed readers.

**The Purpose**

The purpose of this theoretical model is to show how the probability of winning a basketball game is derived. Instead of trying to start at the player level (the most basic components of a team), I will instead start with the proportion of wins and losses between two imaginary basketball teams. This will allow me to construct a top down view of the model. The end goal is to work on a player level to understand how a team’s probability of winning changes based on these most basic components.

It’s worth noting that you can’t start at the player level when constructing this model. Well you can, but you are probably going to miss out on the big picture if you do. Starting from the top will allow us to better understand how the components (players) interact.

**The Long Run Proportion**

In a perfect world we would know how often Team A beats Team B in an imaginary game. With our current level of technology, however, we can’t just conjure up games between Team A and Team B until the law of large numbers gives us an idea as to the true probability of Team A beating Team B.

Theoretically, however, we can say that we have this proportion. I know that Team A will beat Team B with probability **p**, and I know that Team B will beat Team A with probability **q**.

Now I will define a distribution that allows us to calculate these probabilities.

**The Margin of Victory Distribution**

One way to calculate these probabilities of winning would be with a discrete distribution I call the margin of victory distribution. This distribution represents the probability Team A wins by 1,2,3,…,n points (where n is Team A’s largest margin of victory). It also represents the probability Team B wins by 1,2,3,…,m points (where m is Team B’s largest margin of victory).

This is a good distribution to use as it fully represents the probabilities of winning, p and q.
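To illustrate how p and q fall out of the distribution, here is a minimal sketch with a made-up margin of victory distribution. The sign convention is my assumption for the example: positive margins mean Team A wins, negative margins mean Team B wins. There is no margin of 0 because a tied game goes to overtime.

```python
import math

# Hypothetical distribution: margin (Team A points - Team B points) -> probability.
margin_of_victory = {-10: 0.05, -5: 0.15, -2: 0.20, 1: 0.10, 3: 0.25, 8: 0.15, 15: 0.10}


def win_probabilities(margin_dist):
    """Return (p, q): Pr(Team A wins) and Pr(Team B wins)."""
    if 0 in margin_dist:
        raise ValueError("a margin of 0 is not allowed; games do not end in ties")
    if not math.isclose(sum(margin_dist.values()), 1.0):
        raise ValueError("probabilities must sum to 1")
    p = sum(pr for margin, pr in margin_dist.items() if margin > 0)
    q = sum(pr for margin, pr in margin_dist.items() if margin < 0)
    return p, q


p, q = win_probabilities(margin_of_victory)
print(round(p, 2), round(q, 2))  # 0.6 0.4
```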

These margins of victory come from the obvious: points scored by each team in every possible game.

**Points Scored in Each Game**

These margins of victory come from what should be familiar to everyone. They come from the actual difference between Team A’s points scored and Team B’s points scored. Because you can only score points by making free throws, 2pt shots, and 3pt shots, the formula for points scored is easy to calculate:

Points Scored = FTM x 1 + 2FGM x 2 + 3FGM x 3

This formula will always give you the number of points each team scored in the game.

We’re getting closer to identifying the next layer, the 5-player units, but the points scored formula must be broken down into its most basic component: points scored per play.

**Points Scored per Play**

Points scored per play gets to the heart of how teams score points. I would like to note that I leave out points scored by way of technical free throws made from the plays themselves. This is done to limit the theoretical maximum points per play to 5 (a made 3 point field goal followed by two made free throws that are the result of a flagrant foul; I don’t think this has ever happened, but it is theoretically possible). It is also done because of the nature of technical fouls that can be called on either the offensive or defensive team for a variety of reasons.

For the sake of clarity, the points scored formula becomes:

Points Scored = (Σ FTM_{i} x 1 + 2FGM_{i} x 2 + 3FGM_{i} x 3) + (Technical FTM x 1)

Where the first part is summed over the index i that starts at 1 and ends at the total number of plays. With points scored broken down in this fashion, we can finally break down points scored into the 5-player unit level.

**Points Scored per 5-player Unit per Play**

That’s a mouthful, but we can now look at the points scored per 5-player unit per play. This turns the point scored formula into:

Points Scored = Σ_{i} [ (Σ_{j} FTM_{ij} x 1 + 2FGM_{ij} x 2 + 3FGM_{ij} x 3) + (Technical FTM_{i} x 1) ]

The points scored formula now has two summations that are performed as follows: i is the index corresponding to the 5-player unit and j is the index corresponding to each play of the 5-player unit.
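The double summation can be sketched in a few lines of Python. The `Play` class, the `(plays, technical_ftm)` pairing, and the sample numbers are all hypothetical, chosen only to mirror the indices i (5-player unit) and j (play) described above.

```python
from dataclasses import dataclass


@dataclass
class Play:
    ftm: int = 0    # free throws made on this play
    fg2m: int = 0   # 2pt field goals made on this play
    fg3m: int = 0   # 3pt field goals made on this play


def play_points(play):
    return play.ftm * 1 + play.fg2m * 2 + play.fg3m * 3


def points_scored(units):
    """units is a list of (plays, technical_ftm) pairs, one per 5-player unit."""
    return sum(
        sum(play_points(play) for play in plays) + technical_ftm * 1
        for plays, technical_ftm in units
    )


# Two made-up units: the first also made one technical free throw.
units = [
    ([Play(fg2m=1), Play(fg3m=1, ftm=1), Play()], 1),   # 2 + 4 + 0 + 1 = 7
    ([Play(ftm=2), Play(fg2m=1, ftm=1)], 0),            # 2 + 3 + 0 = 5
]
print(points_scored(units))  # 12
```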

**Summary**

It took a few steps to get here, but we’ve now approached a point where we can begin looking at how a 5-player unit scores points on a play. That will be the subject of part 2 of this 3 part series.

I have tried my best to be as technically accurate as possible with the derivation from the probability of winning down to the points scored per 5-player unit per play. I appreciate any comments and feedback that would help refine this theoretical model.

## Tracking the 2008 NBA Playoffs: What the Data Represents

In my post where I detail my data collection goals for the 2008 NBA playoffs, I spell out the sort of data that I’m tracking and adding to the play-by-play. This post will expand on that and describe exactly what you’ll see in the data.

First, a quick reminder of the four types of events I’m adding data to:

- Shots
- Turnovers
- Rebounds
- Fouls

With that in mind, here is what the fields mean for each event type:

**Shots**

**assist** – I’m not changing much here except that I’m awarding assists even if the shot was missed. Therefore, if the shot was missed, you could say the assist field represents a **potential assist**. For the curious, I use the 82games definition:

Basically if the player after receiving the pass pauses or dribbles around for a while before taking action it’s not an assist, but otherwise if the player takes the pass and immediately shoots (catch and shoot), drives to the basket, or has a little pump fake type move to throw off the defense and then goes up for the shot (with perhaps one small dribble even) then you’re talking assist.

**opponent** – This field is used to track contested shots. I have tried to be **as consistent as possible** while tracking this data, so let me state my goal for tracking defenders: My goal is to understand the difference between contested and uncontested shots. So the rule of thumb is: **If the defender appears to contest the shot then they are tracked as the opponent**. Understand that on some shots the opponent *tries* to contest the shot, but in reality they are too far away and/or come in at a bad angle to get in the shooter’s way. These examples **are not** counted as contested shots. This field can also be made up of multiple players, in which case the players’ names are separated by the pipe ‘|’ character.

**pick_assist** – This is similar to an assist except this player screened an opponent to allow the shooter to obtain spacing, an uncontested shot, or a better matchup. If the screen leads to a shot then that player is credited here, in very much the same way an assist is defined. It is rare, but in the case where two teammates are setting screens, both players’ names are recorded and are separated by the pipe ‘|’ character.

**x** and **y** – These integers represent the shot location in feet. Although I’m not actually tracking this (as that was already done by the great folks that score these games), I have merged the location of the shot into the play-by-play. The translation into the court is simple: If you are standing behind the offensive team’s hoop, then x goes from left to right, and y starts at the baseline behind the hoop all the way up to the baseline behind the opponent’s hoop.
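For anyone reading these fields programmatically, here is a minimal sketch of handling the pipe-separated player fields. The record layout, the sample values, and the helper name `split_players` are hypothetical, since the actual file format is not described here.

```python
def split_players(field):
    """Split a pipe-separated player field into a list; an empty field gives []."""
    if not field:
        return []
    return [name.strip() for name in field.split("|") if name.strip()]


# Hypothetical shot record with made-up values.
shot = {
    "assist": "Player A",
    "opponent": "Player B|Player C",
    "pick_assist": "",
    "x": "12",
    "y": "20",
}

defenders = split_players(shot["opponent"])
is_contested = len(defenders) > 0           # at least one defender contested the shot
location = (int(shot["x"]), int(shot["y"]))  # shot location in feet

print(defenders, is_contested, location)
```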

**Turnovers**

**x** and **y** – These (x,y) coordinates represent the location where **possession was gained**. These values are translated in the same way as shot location (x,y) coordinates. Note, however, that the team gaining possession is defined as the **defensive team**. As such, the team losing possession is the **offensive team**. So keep this in mind when working with these coordinates.

**Rebounds**

First I want to say that this could be done better (or at least done in a way that will help achieve a specific goal). One of my goals is to try and understand the probability of gaining a rebound when two opponents battle for it. I’m not so sure I’m meeting this goal.

The other goal I have, however, is to understand the likelihood a shot will be rebounded in a specific floor location given the location the shot was taken from. I’m more confident in reaching this goal (or at least understanding it better).

**opponent** – This field holds the opponent(s) (separated by the pipe ‘|’ character) that directly contested the rebounding player. I could go into all sorts of examples, but basically you have to either attempt to get the rebound (say jump in the air and put your arm in the area) or be actively blocked out by the rebounder when the rebound heads in your direction. Opponents just standing around the rebounder are not tracked.

**teammate** – This field holds the teammate(s) (separated by the pipe ‘|’ character) that tried to get the rebound when instead their teammate gained possession. See the definition above for how players get put into this field.

**x** and **y** – These (x,y) coordinates represent the location where the **ball was rebounded**. It’s common for a ball to bounce around while the players try to get the rebound, so understand that this coordinate represents the location where the ball was finally controlled, not where it first landed (or first made contact with a player). Also, for coordinate translations, the **shooting team** is always considered the offensive team.

**Fouls**

The only other data being tracked in association with fouls in addition to the (x,y) coordinates is shot-related information for shooting fouls (like assists and pick assists).

**x**and**y**– These (x,y) coordinates represent the location where the foul took place.

**Summary**

The types above represent what I am tracking. There is a lot more data in the standard play-by-play than the list above, but hopefully that data is straightforward (everyone should know what a block is, for example).

At some point I will create a definition for each field, but for now you will at least understand the intent behind the data I’m tracking.

Please use the comments below to help clarify any questions you might have about the data I’m tracking.

## Analytics is the End Game

Analytics, at the end of the day, is … going to be the end game.

– MC Hammer

The quote above is taken from this video, where MC Hammer lays it all out in the open about analytics.

If you just replace the music references with those from the basketball world, then you end up with MC Hammer laying down the law on basketball analytics.

I can think of no better spokesperson. The basketball analytics community should sign him to do a PSA about where we need to go with quantitative analysis in basketball.

If you need me I’ll be trying on my hammer pants.

[Thanks FlowingData]

## Data Pet Peeve #2: Offensive Rebounds & In-Air Shots

It didn’t take long for me to find my 2nd data pet peeve. This pet peeve is related to the first data pet peeve in that it involves rebounds and play-by-play ordering.

This peeve is because offensive rebounds are credited after in-air shots. This makes me sad.

**An Example**

With 7:54 left in the 4th quarter in game 1 of the first round series between Toronto and Orlando, Dwight Howard gets an offensive rebound and slams down a dunk without ever touching the ground. (He is superman, after all.) In the play-by-play, however, you’ll see that the offensive rebound is credited after the putback dunk. I’m sure this would frustrate superman as much as it does me.

My cut & paste skills are going to be expert soon enough. This also makes me sad.
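The manual reordering could in principle be automated. The sketch below is only one possible heuristic, not how the play-by-play is actually processed: a rebound can never follow a made field goal, so an offensive rebound listed right after a made shot by the same player is assumed to be misordered and the two events are swapped. The event layout and field names are hypothetical, and the missed shot that precedes the rebound is made up for the example.

```python
def fix_putback_order(events):
    """Move an offensive rebound ahead of the made putback it was credited after."""
    fixed = list(events)  # work on a copy so the input list is left untouched
    i = 0
    while i < len(fixed) - 1:
        shot, rebound = fixed[i], fixed[i + 1]
        if (shot["type"] == "shot" and shot.get("made")
                and rebound["type"] == "offensive_rebound"
                and rebound["player"] == shot["player"]):
            fixed[i], fixed[i + 1] = rebound, shot
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return fixed


# Hypothetical events in the order the play-by-play lists them.
events = [
    {"clock": "7:54", "type": "shot", "player": "Teammate", "made": False},
    {"clock": "7:54", "type": "shot", "player": "Dwight Howard", "made": True},
    {"clock": "7:54", "type": "offensive_rebound", "player": "Dwight Howard"},
]

for event in fix_putback_order(events):
    print(event["type"], event["player"])
```

Restricting the swap to made shots is deliberate: a missed tip followed by the same player's offensive rebound is a legitimate ordering and should be left alone.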