
Report on Estimating NBA Players’ Point per Game

Chenning Yang

December 2021


Introduction

The NBA is the National Basketball Association, one of the most watched basketball leagues in the world. This research uses the official 2021-2022 data published by the NBA to estimate players' points per game from the other available statistics. Among the large body of NBA research, there is very little work in this particular area. The importance of this research is to help professionals better adjust which players are on the floor and the strategies they apply, which would produce better outcomes.

Method

The entire data set is divided into a training set and a test set, and we first work with the training set. A model is fitted to the data in the training set. We then check conditions 1 and 2: if both conditions are satisfied, the residual plots we generate will reflect which assumptions are being violated; in contrast, if either condition does not hold, the patterns we see cannot tell us where the problem is. As long as conditions 1 and 2 are satisfied, we can produce residual plots that let us see any violations clearly. The three main types of residual plots we use are: residuals versus predictor plots, residuals versus fitted values plots, and Normal Quantile-Quantile (QQ) plots. Depending on the pattern shown in the residual plots, we can detect different violations. A fanning pattern indicates non-constant variance; to fix this, we apply a variance-stabilizing transformation to Y (the specific transformation depends on the context). Skew in the residuals versus predictor plot indicates a problem with linearity, normality, or both; similarly, we can apply various transformations to the predictors or to Y, depending on the specific situation. Once a new model with the transformation is obtained, we recheck conditions 1 and 2. As before, if both conditions are satisfied, we generate the residual plots, and if they show no discernible pattern, the assumptions hold.
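As an illustration of this workflow, here is a minimal sketch in Python with pandas, statsmodels, and matplotlib (the original analysis may have been done in other software); the file name, the 50/50 split, and the column names Point, MPG, Trueshoot, AST, and TO are assumptions made for the example, not the actual names in the data set.

```python
# Minimal sketch of the residual diagnostics described above.
# Assumptions: the data live in "nba_2021_22.csv" (hypothetical file name) with
# columns Point, MPG, Trueshoot, AST, TO, and a simple 50/50 train/test split.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

nba = pd.read_csv("nba_2021_22.csv")
train = nba.sample(frac=0.5, random_state=1)   # training set
test = nba.drop(train.index)                   # test set

X = sm.add_constant(train[["MPG", "Trueshoot", "AST", "TO"]])
model = sm.OLS(train["Point"], X).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
# Residuals vs fitted values: fanning suggests non-constant variance,
# skew or curvature suggests a linearity/normality problem.
axes[0].scatter(model.fittedvalues, model.resid)
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
# Residuals vs one predictor (repeat for each predictor in the model).
axes[1].scatter(train["MPG"], model.resid)
axes[1].set(xlabel="MPG", ylabel="Residuals")
# Normal QQ plot of the residuals.
sm.qqplot(model.resid, line="45", fit=True, ax=axes[2])
plt.tight_layout()
plt.show()

# If a violation appears, try a variance-stabilizing / power transformation of Y
# (here log of the square root, assuming strictly positive points) and recheck.
y_t = np.log(np.sqrt(train["Point"]))
model_t = sm.OLS(y_t, X).fit()
```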

 

After that, we derive the best model for each number of predictors and compute the VIF for each of them (except the model containing only one predictor). VIF stands for variance inflation factor: a high VIF indicates that the independent variable of interest has a high degree of collinearity with the other variables in the model, while a low VIF indicates a low degree of collinearity. Low values are what we expect. If a high VIF shows up between two predictors, either of two actions can be taken: collect more data or respecify the model. Each has its own advantages and disadvantages, and the choice needs to be made based on the context.
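A sketch of the VIF calculation, continuing the hypothetical data frame from the sketch above; the predictor set shown here, including APG, is illustrative only.

```python
# Sketch: VIF for the predictors of a candidate model (train as in the sketch above;
# the predictor set, including APG, is illustrative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = train[["MPG", "Trueshoot", "AST", "APG", "TO"]]
X = sm.add_constant(predictors)

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=predictors.columns,
)
print(vif)   # values well above ~5 flag predictors that are collinear with the rest
```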

 

Model selection is carried out after we fix any high VIFs. Adjusted R^2 is a good criterion when comparing models containing different numbers of predictors: a higher adjusted R^2 indicates that more of the variation can be explained by the model, so we pick the one with the highest adjusted R^2. We can also select a model based on AIC (Akaike's Information Criterion); the better-fitting model has the lower AIC, which is what we prefer. Alternatively, we can apply automatic (stepwise) AIC selection; various selection strategies are possible. Once the above criteria give us the best model, we check again whether it satisfies the linear assumptions.
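For concreteness, below is a sketch of comparing candidate models by adjusted R^2 and AIC, with a simple forward-stepwise search on AIC standing in for automatic selection; the candidate formulas are illustrative, and train and the transformed response y_t follow the earlier sketch.

```python
# Sketch: model selection by adjusted R^2, AIC, and a simple forward-stepwise
# search on AIC (train and the transformed response y_t as in the earlier sketch).
import statsmodels.api as sm

def fit(cols):
    X = sm.add_constant(train[list(cols)])
    return sm.OLS(y_t, X).fit()

candidates = [["MPG"],
              ["MPG", "Trueshoot"],
              ["MPG", "Trueshoot", "AST"],
              ["MPG", "Trueshoot", "AST", "TO"]]
for cols in candidates:
    m = fit(cols)
    print(cols, f"adj R^2 = {m.rsquared_adj:.3f}", f"AIC = {m.aic:.1f}")

# Forward stepwise: add the predictor that lowers AIC the most, stop when none helps.
remaining, selected, best_aic = {"MPG", "Trueshoot", "AST", "TO"}, [], float("inf")
while remaining:
    scores = {p: fit(selected + [p]).aic for p in remaining}
    p, aic = min(scores.items(), key=lambda kv: kv[1])
    if aic >= best_aic:
        break
    selected.append(p)
    remaining.remove(p)
    best_aic = aic
print("Selected predictors:", selected)
```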

 

We then check for problematic observations in the model, such as leverage points, outliers, and influential points. In general, no observation should be removed unless there is a contextual reason. No model is perfect, so understanding how each type of problematic observation affects the model is necessary.
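A sketch of these checks using the usual rules of thumb; the fitted model model_t and the cut-offs shown are assumptions carried over from the earlier sketches.

```python
# Sketch: leverage, outlier, and influence diagnostics for the fitted model
# (model_t as in the earlier sketch; cut-offs are the common rules of thumb).
import numpy as np

infl = model_t.get_influence()
n = model_t.nobs
p = model_t.df_model + 1                         # number of estimated coefficients

leverage = infl.hat_matrix_diag                  # h_ii > 2p/n flags high leverage
studentized = infl.resid_studentized_external    # |r_i| > 2 flags potential outliers
cooks_d = infl.cooks_distance[0]                 # large D_i flags influential points

print("High-leverage points:", np.where(leverage > 2 * p / n)[0])
print("Potential outliers:", np.where(np.abs(studentized) > 2)[0])
print("Influential points (Cook's D > 4/n):", np.where(cooks_d > 4 / n)[0])
```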

 

All the above findings are then validated on the test set. We fit a new model on the test set using all the predictors selected on the training set. We expect the resulting model to have similar coefficients, all of its predictors to be significant, and the model to be consistent with the linear assumptions.
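A sketch of this validation step, refitting the chosen model on the test set and comparing it with the training fit; test, the response transformation, and model_t follow the earlier sketches.

```python
# Sketch: refit the selected model on the test set and compare with the training fit
# (test and model_t as in the earlier sketches).
import numpy as np
import pandas as pd
import statsmodels.api as sm

cols = ["MPG", "Trueshoot", "AST", "TO"]
X_test = sm.add_constant(test[cols])
y_test = np.log(np.sqrt(test["Point"]))          # same transformation as in training
model_test = sm.OLS(y_test, X_test).fit()

comparison = pd.DataFrame({
    "train coef": model_t.params,
    "test coef": model_test.params,
    "test p-value": model_test.pvalues,
})
print(comparison)   # coefficients should be similar, all predictors significant
```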

Results

Final model:
                   log(sqrtPoint) ~ MPG + Trueshoot + AST + TO

  • Data description:

    • sqrtPoint: Players' points per game.

    • MPG: Minutes per game - The average number of minutes a player has played per game.

    • Trueshoot: True shooting percentage - It measures a player's efficiency at shooting the ball.

    • AST: Assist percentage - An estimate of the percentage of teammate field goals a player assisted while he was on the floor.

    • TO: Turnovers - The number of turnovers forced by the defensive player or team.

Steps with Explanation:

  • Transformation on Y:

    • When we use residual plots to check the linear assumptions, the residuals versus fitted values plot is skewed (Figure 1) and there is a large deviation at the larger values of the Normal QQ plot. Therefore, a power transformation is applied to Y.

[Figure 1: Residual plots for the initial model]
  • VIF of each model with different numbers of predictors:

    • When we apply VIF to the models, I found that the VIF of AST jumps from approximately 1.044950 to 4.919139 when APG is added, and at that point APG has a VIF of 6.793859, while all the other values in the chart are around 1.0. This suggests a high correlation between AST and APG. To fix this, I remove APG from the data.

[Table: VIF values for each candidate model]
  • Model Selection:

    • To select the model, I used three different approaches: adjusted R^2, AIC, and automatic AIC. As Model 4 gives the largest adjusted R^2 and the lowest AIC, we say it is the best model under these criteria. Automatic AIC selection gives us:

sqrtPoint ~ Trueshoot + MPG + AST + TO (Model 4)

as a result. Therefore, Model 4 is the best model.

[Table 1: Comparison of the models fitted on the training and test sets]
  • Validation on test set:

    • From Table 1 we can see that, comparing the models fitted on the training set and the test set, the differences between the coefficients are small and all predictors are significant.

In summary, we obtain the final model:

log(sqrtPoint) ~ MPG + Trueshoot + AST + TO

Discussion

We finally obtained a model consisting of MPG, Trueshoot, AST, and TO. The model suggests that the longer a player is on the court, the more points he tends to score, and it is very reasonable in practice that time on the court is proportional to points scored. Like true shooting percentage, assist percentage (AST) also contributes a lot, probably because players with more assists tend to be more efficient, better organized, and read the game better. It is also reasonable that the number of turnovers has a negative impact on points per game.

 

However, this research still has many limitations. For example, the difference between home and away atmospheres may make players' performance unstable. Moreover, the stability of each player's performance is itself an uncertainty; if more research is done on players' stability in the future, the existing model could likely be improved. Our data also do not consider the nature of each game, where players may choose to save their strength depending on how important the game is. In addition, we need to consider players' injury history and current physical condition, which greatly affect their future potential. Research that studies each player's scoring curve from the beginning of his NBA career to date, combined with his physical condition, might uncover more useful information about him. Finally, there are many phenomena on NBA courts that limit some players' own development. For example, in the well-known "James system", star players almost monopolize the ball while the other players mainly play supporting roles, and so their individual scoring is very low.
