I’m always running into recurring issues with existing basketball data that cause me to exercise my obsessive compulsiveness, so I’m going to start keeping track of those issues here. If they frustrate you as much as they frustrate me, then maybe someone will do something about it one day. Or not.
First on the list of my data pet peeves is the way loose ball fouls that lead to rebounds are recorded in play-by-play data. I’m on game #5 of 80 in the 2008 NBA playoffs tracking project, and I already cringe every time I see a loose ball foul that results in the opponent being awarded a rebound.
I cringe because the rebound is credited before the loose ball foul. Clearly this doesn’t make sense. If the ball is loose, then how can one team obtain possession before the foul? If PlayerX had actually secured the rebound, then we’d give him credit for it and call a personal (instead of loose ball) foul on PlayerY.
Why does this matter?
This is a huge deal to me because of the way I’m keeping track of (X,Y) coordinates of event locations. Everything is recorded relative to the offensive team, so if PlayerX actually had possession of a defensive rebound, then the coordinates of the foul would be very different. Therefore, I’m forced to change every single one by hand. Each one is resolved by a simple cut and paste, but those add up over time. It is a bane to my tracking efficiency.
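The cut and paste described above could in principle be automated. The following is only a minimal sketch, not the tooling actually used for this project: the `Event` record, its field names, and the event labels are all hypothetical, and it assumes the rebound and the foul share the same game clock and sit on adjacent lines.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical fields; real play-by-play feeds name these differently.
    clock: str  # game clock when the event was recorded, e.g. "7:40"
    team: str   # team credited with the event
    kind: str   # "rebound", "loose_ball_foul", "shot", ...


def move_loose_ball_fouls_first(events):
    """Return a copy in which a loose ball foul precedes the rebound it caused."""
    fixed = list(events)
    i = 0
    while i < len(fixed) - 1:
        first, second = fixed[i], fixed[i + 1]
        if (
            first.kind == "rebound"
            and second.kind == "loose_ball_foul"
            and first.clock == second.clock  # assumed: both logged at the same time
            and first.team != second.team    # the foul is on the opposing team
        ):
            fixed[i], fixed[i + 1] = second, first
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return fixed


plays = [
    Event("7:42", "BOS", "shot"),
    Event("7:40", "ATL", "rebound"),
    Event("7:40", "BOS", "loose_ball_foul"),
]
reordered = move_loose_ball_fouls_first(plays)
print([e.kind for e in reordered])  # ['shot', 'loose_ball_foul', 'rebound']
```

The function returns a new list rather than editing in place, so the original ordering stays available for comparison.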
Thanks a lot, scoring software.
After writing roughly 4,000 lines of Perl code to get the existing play-by-play data into the format I want to work with, I’m finally ready to track data from the 2008 NBA playoffs. This data collection project (along with the motivation to perform this data collection on a larger scale for the regular season) has been fueled by my desire to build better offensive and defensive models of basketball.
My end goal is to use these models as inputs to a simulation engine to attempt to understand how a given lineup will play together. Therefore, it is necessary to understand as much context as possible around the events that take place on the basketball court. In addition to what is already in the play-by-play data, I will be specifically tracking the following data:
- For Shots: Defenders, Pick Assists, Potential Assists, and Potential Pick Assists.
- For Turnovers: Defender(s) that forced the turnover along with the (X,Y) location of the turnover.
- For Rebounds: Opponent(s) and teammate(s) near the rebounding player along with the (X,Y) location of the rebound.
- For Fouls: The (X,Y) location of the foul along with shot data (like assists, etc.) for shooting fouls.
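To make the list above concrete, here is one way the extra tracked fields could be laid out. This is a hedged sketch only: the class names, field names, and the use of player-name strings are assumptions for illustration, not the actual file format of the tracking data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Location = Tuple[float, float]  # (X, Y); units and origin are assumptions


@dataclass
class ShotContext:
    defenders: List[str] = field(default_factory=list)
    pick_assists: List[str] = field(default_factory=list)
    potential_assists: List[str] = field(default_factory=list)
    potential_pick_assists: List[str] = field(default_factory=list)


@dataclass
class TurnoverContext:
    location: Location
    forced_by: List[str] = field(default_factory=list)  # defender(s) who forced it


@dataclass
class ReboundContext:
    location: Location
    nearby_opponents: List[str] = field(default_factory=list)
    nearby_teammates: List[str] = field(default_factory=list)


@dataclass
class FoulContext:
    location: Location
    shot: Optional[ShotContext] = None  # filled in only for shooting fouls


# A shooting foul carries shot data; any other foul leaves `shot` as None.
foul = FoulContext(location=(5.0, 25.0), shot=ShotContext(defenders=["PlayerY"]))
print(foul.shot.defenders)  # ['PlayerY']
```

Keeping the shot data as an optional field on the foul mirrors the last bullet: only shooting fouls need it.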
I will be collecting this data for 80 of the 2008 NBA playoff games. The games outside this 80-game sample have incomplete play-by-play data that I will need to rectify using another source of play-by-play data. I’m hoping 80 games will be plenty, although I am fairly obsessive compulsive, so if I can finish the data collection early enough before the start of the regular season, I will try to get data for the rest of the games.
As I described in my welcome post, the data will be open for all to use. To achieve this, I have set up a Google site called NBA Game Tracking, where I will upload the data. This is an open Google site that allows anyone to view the content published to the site. Once the regular season rolls around, I’m hoping for trackers to contribute their own work to the site. More on that when the time comes. For now, just know that is where I’ll be uploading the data files as I complete them.
That’s all for now. I’m hoping my setup will allow me to complete this data collection efficiently, so that I’ll have a solid set of game data to work with a week from today.
Welcome to BasketballGeek.com. My name is Ryan J. Parker, and I am one of the many basketball geeks out there.
I have started this website for a couple of reasons:
Reason #1: Play-By-Play Data
First, and most importantly, I believe that, for the average researcher, basketball research has reached a ceiling due to a lack of data. With the exception of measures such as adjusted plus-minus and the charting work done at 82games.com, most basketball research revolves around the box score. There isn’t a lot of knowledge left to squeeze out of the box score, mostly because the statistics it records about a player depend on context. Questions raised about these measures include:
- Who is on the court with the player?
- What is the player’s role for each unit he plays with?
- Where is the player shooting the ball from?
- What are the player’s defensive responsibilities?
- Who is contesting the player’s shots?
Without play-by-play data, these questions are impossible to answer. Even with play-by-play data, some of these questions can’t be answered.
Thus my highest priority is to provide an open source of play-by-play data that is accessible and modifiable by game charters. A lot of event location and defensive information is not tracked in play-by-play data, and this is the sort of data that will open new insight into the game. This website will provide a forum for contributing to and accessing this data.
If all goes as planned, then a year from now an open source of data will be available for analyzing the 2008-2009 season.
Reason #2: Sharing My Opinions
Over the past couple of months I have become fully devoted to developing better models of offense and defense in basketball. I am interested in other areas as well, but everything always leads me back to creating these models.
I am fascinated to know what a given lineup’s optimal usage is, and I would also like to better quantify what is likely to happen when adding players to an already defined lineup (such as Brand in Philadelphia or Artest in Houston). Thus this website will provide me with a forum for airing my thoughts and research.
I’m also planning on writing some software and tools that dive into areas of basketball I’m interested in, and those will be contained here on the site.
Welcome to the site, and thanks for stopping by!