NHL Equivalency and Prospect Projection Models: Building the Prospect Projection Model (Part 3)

Bringing Together the Best of Both Worlds

The second step of this project was not unlike the first in that it took heavy inspiration from the established work of one of hockey’s top public data scientists: Byron Bader. Bader runs Hockey Prospecting, an NHL-level analytics tool which he describes below in his own words:

Hockey Prospecting standardizes player scoring across the board and uses historical performances to chart how prospects will perform in the NHL.

You’re probably familiar with Bader and/or Hockey Prospecting by name, but even if you aren’t, you’ve almost certainly seen one of his trademark prospect comparison cards (unless you’ve been living under a rock):

Image from Byron Bader

The essence of Byron’s model is a logistic regression that uses scoring rates to determine the probability that a prospect will become an NHL player and/or an NHL star. A player is considered an NHLer if they play at least 200 games, and a star if they score at or above a certain rate in their career.

Going into this project, my philosophy for building a prospect model remained very similar to Byron’s up until the way I chose to define a star at the NHL level, at which point it diverged heavily. As I mentioned earlier, the NHL provides the public with a wide array of information that can be used to build metrics that are more valuable than points, and since my goal here is to use the best information available wherever I can, I chose to use the outputs of my WAR Model to define both target variables.

Defining an NHLer:

This process was simple: An NHLer is a player who has met the classic criteria of at least 200 NHL games played and provided a positive career WAR. I don’t care how many games somebody plays if they aren’t adding more than the next player that could be claimed off waivers, so I don’t believe any measure of success should be assigned to players who fail to do so.

Defining an NHL Star:

This was a slightly more complicated process because star is a completely arbitrary term that varies heavily depending on the context of the discussion, and the textbook definition of star probably has more to do with marketing and exposure than with how good a player actually is. At the conclusion of the 2019–2020 regular season, Marc-Andre Fleury was a 36-year-old in the midst of his second straight mediocre season who had just been relegated to the role of his team’s backup for the playoffs. But he was also indisputably a star on the basis of name recognition and jersey sales.

I was obviously not about to define a star using name recognition, as I wanted to use a metric better than points, not a significantly worse one. I knew that I would use my WAR model and use the term “star” strictly as a measure of player quality, but I wasn’t sure how to do this until I thought back to some reading I had done some time ago on The Pareto Principle, which you may know as the 80/20 rule.

The Pareto Principle states that 80% of consequences come from 20% of the causes.

In other words, the “vital few” make up only 20% of a given population, yet are responsible for 80% of its outcomes. To give a practical example, while researching the principle I found a study which showed that the top 15% of MLB players contribute roughly 85% of the WAR. The split is not always exactly 80/20, but there generally tends to be a small percentage of the population responsible for a disproportionately large percentage of the outcomes. My first thought upon reading this was that something similar was probably true for my WAR model, but I didn’t bother to look any further into it at the time.

When it came time to define a star player, I circled back to the Pareto Principle. I felt that if this principle were true for my model, then the top 15% or 20% of WAR producers — those who would be considered the “vital few” according to the Pareto Principle — would be those considered star players. The more accurate terminology for “the probability of a player becoming a star” in this case is actually “the probability of a player becoming a member of the vital few,” but the two are essentially interchangeable, and the term star is much easier to interpret, so I’ll stick with it.

I didn’t know for sure whether the Pareto Principle held for my model, and I couldn’t just assume it did and use the top 15% figure that was true for MLB, or the top 20% figure which Pareto defined. To determine the Pareto Principle value for a given season, I first ranked all skaters by WAR. Then, for each possible number of top skaters (starting with 1), I divided the total WAR contributed by those top skaters by the total WAR added by all skaters in that season, and added that WAR share to the share of all skaters which the group represented. The optimal value for the Pareto Principle was the point at which the sum of those two shares was closest to being exactly 1.
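The search described above can be sketched in a few lines of Python. This is only an illustration of the procedure, with synthetic WAR totals standing in for the real data:

```python
import numpy as np

def optimal_pareto_split(war):
    """Find the top-k cut where (share of skaters) + (share of total WAR)
    is closest to being exactly 1."""
    war = np.sort(np.asarray(war, dtype=float))[::-1]  # rank skaters by WAR, descending
    n, total = len(war), war.sum()
    war_share = np.cumsum(war) / total        # share of league WAR from the top k skaters
    pop_share = np.arange(1, n + 1) / n       # share of all skaters those k represent
    k = int(np.argmin(np.abs(pop_share + war_share - 1)))
    return pop_share[k], war_share[k]

# Toy league: 40 elite producers and a long tail of 160 marginal skaters.
rng = np.random.default_rng(0)
war = np.concatenate([rng.uniform(3, 6, 40), rng.uniform(0, 0.5, 160)])
skater_pct, war_pct = optimal_pareto_split(war)
```

With a population shaped like this, the optimum lands near a top-20%/80% split, mirroring the real seasonal values discussed below.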

This may sound like a complicated process — I certainly found it easier to implement than to describe — so take a look at the following example which uses the top-5 potential percentages for 2020–2021:

For this particular season, the Pareto Principle was optimized at a point where the top 18.84% of skaters provided 81.33% of the total WAR. Here are the optimal values for every season since 2007–2008:

As you can see, the Pareto Principle varies from year to year, and recently the “vital few” have grown to make up a larger percentage of the population than they used to. The gap here is small enough that it could simply be variance, but my hypothesis is that NHL teams are doing a better job of optimizing the lower halves of their lineups than they did from 2007 through 2010, so the gap between replacement-level players and stars is smaller than it used to be. (Note that I already held this assumption, so these results simply serve to confirm it.)

After optimizing the Pareto Principle for each season, I determined that the average percentage of skaters who made up the vital few across all seasons was roughly 18.5%. I then used this value to form my final definition of a star: a skater in the top 18.5% of career WAR-per-game rate among all skaters who have played at least 82 NHL games. This worked out to roughly 1.5 WAR per 82 games played at the career level. That value feels low for a star, and many of the players on the list of established stars are not players to whom I would personally apply the term, but I’m confident enough in the methodology used to define a star that I’m comfortable with the outputs.
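As a sketch, that cut-off is simply a quantile of qualified skaters’ career WAR-per-game rates. The numbers below are toy values, not the real data:

```python
import numpy as np

def star_threshold(career_war, games_played, min_games=82, top_pct=0.185):
    """Career WAR-per-game rate separating the top 18.5% of qualified skaters."""
    war = np.asarray(career_war, dtype=float)
    gp = np.asarray(games_played, dtype=float)
    qualified = gp >= min_games               # only skaters with at least 82 NHL games
    rates = war[qualified] / gp[qualified]    # career WAR per game
    return np.quantile(rates, 1 - top_pct)    # cut-off at the 81.5th percentile

# Toy career totals; the real cut-off worked out to roughly 1.5 WAR per 82 games.
war = [12.0, 8.5, 3.0, 0.5, -1.0, 6.0, 2.0, 0.0]
gp = [400, 350, 500, 200, 300, 150, 90, 60]
cutoff = star_threshold(war, gp)
```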

Projecting NHLers and Stars

I had a clear definition of what made an NHL player and what made a star. Research has generally shown that hockey players peak from ages 22 through 26 and decline thereafter, so I decided that if a player’s D+8 (age 26) season had passed and they hadn’t yet established themselves as an NHLer, they would be considered a bust. Those who had yet to be classified as an NHLer, star, or bust were considered “undetermined” and not used in calculating projections.
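The classification amounts to a short decision rule. Here is an illustrative helper that collapses the NHLer and star definitions from above into one function (the rate threshold is the ~1.5 WAR per 82 games figure derived earlier):

```python
STAR_RATE = 1.5 / 82  # career WAR per game, roughly the top-18.5% cut-off

def classify_outcome(games_played, career_war, seasons_since_draft):
    """Label a skater's career outcome under the definitions above (illustrative)."""
    if games_played >= 200 and career_war > 0:       # the NHLer bar
        if career_war / games_played >= STAR_RATE:   # top-18.5% WAR/game rate
            return "star"
        return "NHLer"
    if seasons_since_draft > 8:                      # D+8 (age-26) season has passed
        return "bust"
    return "undetermined"                            # excluded from training data
```

For example, a skater with 400 games and 10 career WAR classifies as a star, while a 30-game player three years past his draft remains undetermined.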

Once I made this classification, every player who had played their draft season during or after 2006–2007 and either seen their D+8 season pass (even if they were no longer playing hockey at that point) or established themselves as an NHLer and/or star was a data point that could be used to build out the final projection model. Their scoring rates as prospects would work as predictor variables, and their NHL performance (starting in 2007–2008) would be used as target variables.

With a universal measurement of the scoring proficiency demonstrated by prospects alongside the eventual outcomes of these prospects’ careers, I had everything I needed to build out the final model: One which leveraged scoring as prospects (as well as a few other less important variables in height, weight, and age) to predict the outcome of a player’s career at the NHL level.

Due to the small number of predictor variables in my regression and the fairly simple and predictable way in which I expected them to interact with one another, I determined that logistic regression would be sufficient to meet my needs. But one logistic regression model alone is not enough, for the following reasons:

  • Forwards and defensemen score at vastly different rates as prospects; their scoring rates should only be compared to their positional peers. This raises the need for separate models to be built for forwards and defensemen.
  • The probability of making the NHL is not the same as the probability of becoming a star, and different variables may influence these probabilities differently. This raises the need for separate models to be built determining NHLer probability and star probability.
  • Prospects find themselves at different stages of development. A player in his Draft-1 season is in a much different position than a player in his Draft+1 season, but I’d like to be able to use all of the information available to determine the likelihood that both players will succeed in the NHL. This raises the need for separate models to be built for players at different stages of their development.
  • If one regression model is trained on the same sample that it is used on, it runs the risk of becoming overfit. To give a practical example, a model that already knows Brayden Point became a star in the NHL may use the data point of a player with Brayden Point’s profile to overstate the likelihood that he was going to become a star given the information available at the time. This raises the need for separate models trained out-of-sample to be built for each season.

Here is what the process looks like for building out a model for all skaters who played at least 5 games in any one league during their Draft-1 and Draft seasons:

  1. Remove players whose draft years were 2007 from the training data, then split the training data into forwards and defensemen.
  2. Use the forwards in the training set to train a logistic regression model which uses Draft-1 year NHLe, Draft year NHLe, height, weight, and age as predictor variables and making the NHL as the target variable.
  3. Repeat the process with a model that is identical except that it uses becoming an NHL star as the target variable.
  4. Take the two logistic regression models which were trained on forwards and run the models on forwards whose draft year was 2007.
  5. Repeat steps 2, 3, and 4 using the defensemen in the training set.
  6. Repeat steps 1 through 5 for all skaters whose draft year was 2008, and so on until you reach skaters whose draft year is 2021.

This process was repeated for all players at different stages of their development using whatever information was available, starting with players in their Draft-2 year and working up to players in their Draft+5 year. This means that a model was built which used Draft-2 year NHLe through Draft+5 year NHLe, along with a model for everything in between: Draft year through Draft+5, Draft-1 through Draft+3, etc.
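The whole out-of-sample loop can be sketched with scikit-learn as follows. All column names are illustrative and the data is synthetic; this shows the shape of the process, not the actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["nhle_dm1", "nhle_d0", "height", "weight", "age"]

def project_prospects(df):
    out = []
    for year in sorted(df["draft_year"].unique()):
        train = df[df["draft_year"] != year]            # step 1: hold out this draft class
        test = df[df["draft_year"] == year]
        for pos in ["F", "D"]:                          # separate forward/defense models
            tr = train[train["pos"] == pos]
            te = test[test["pos"] == pos].copy()
            for target in ["made_nhl", "became_star"]:  # separate NHLer and star models
                model = LogisticRegression(max_iter=1000)
                model.fit(tr[FEATURES], tr[target])     # steps 2-3: train out-of-sample
                te[f"p_{target}"] = model.predict_proba(te[FEATURES])[:, 1]  # step 4
            out.append(te)
    return pd.concat(out)

# Synthetic data purely to demonstrate the loop's mechanics.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "draft_year": rng.choice([2007, 2008, 2009], n),
    "pos": rng.choice(["F", "D"], n),
    "nhle_dm1": rng.uniform(0, 40, n),
    "nhle_d0": rng.uniform(0, 50, n),
    "height": rng.uniform(68, 78, n),
    "weight": rng.uniform(160, 230, n),
    "age": rng.uniform(17.5, 18.5, n),
})
df["made_nhl"] = (df["nhle_d0"] > 25).astype(int)
df["became_star"] = (df["nhle_d0"] > 42).astype(int)
projections = project_prospects(df)
```

Each player is only ever scored by models that never saw his draft class, which is the point of the whole exercise.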

Remember that the end values of my NHLe model can be interpreted as a very simple equation: 1 point in League A is worth X points in the NHL. One downside of this rudimentary equation appears to be the lower ceiling for the NHLe values of players who play in weaker leagues. The fact that hockey games only last 60 minutes and very few players play more than 30 means there is a cap on how much any player in any league is physically capable of scoring.

Take the GTHL-U16, for example: The league has an NHLe conversion rate of 0.012. While any NHL player would completely dominate it, they would also not score anywhere near the 83 points per game necessary in that league to post an NHLe of 82. The Connor McDavid of today, whose NHLe in 2020–2021 was 153.75, would probably struggle to manage an NHLe of even 40 if he were banished to the depths of the GTHL-U16.
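The arithmetic behind that ceiling is straightforward, assuming the simple NHLe formula of points per game times the league conversion rate times 82 games:

```python
def nhle(points_per_game, conversion, games=82):
    """Simple NHLe: each league point is worth `conversion` NHL points."""
    return points_per_game * conversion * games

# With the GTHL-U16's conversion rate of 0.012, an NHLe of 82 demands
# 1 / 0.012, or roughly 83 points per game.
required_ppg = 82 / (0.012 * 82)
print(round(required_ppg, 1))       # 83.3
print(round(nhle(2.0, 0.012), 2))   # even 2 points per game converts to an NHLe of 1.97
```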

While the McDavid fantasy scenario I laid out is extreme, this issue is legitimately present on a lesser scale with prospects who can only dominate junior leagues to such a degree. Even if the NHLe conversion values are accurate for the entirety of the population, the ceiling for the NHLe which high end prospects can post is much lower for those playing in junior leagues than it is for those playing in men’s leagues where each point is far more valuable.

In addition to this modeling quirk, it should also be kept in mind that the NHLe model already appears to be quite high on professional men’s leagues and relatively low on junior leagues, at least relative to conventional wisdom. Both of these factors mean that the prospect projection model, and especially the star probabilities from the model, are more favorable to European players. This can be seen in the 20 players who had the highest star probability using their Draft-1 and Draft seasons.

The list contains a high proportion of players who spent their draft year in Europe. While the model did a good job of identifying undervalued European players in Tomas Hertl and Vladimir Tarasenko, who weren’t drafted until the second half of the first round, I’d say it was still a bit too confident in those players, as well as in players like Mikael Granlund (who just narrowly meets the star cut-off) and players like Kaapo Kakko and Jesse Puljujarvi whose career outcomes are still up in the air.

While the logistic regression itself is trained out-of-sample, it’s still not entirely fair to treat the outputs of this model as though all of this information was available at the time these picks were made, because roughly half of the information used to inform the logistic regression for each season comes from after that season.

For example, the model which says Patrick Kane had a 98% probability of becoming a star and a 100% probability of making the NHL may not have known that a player with Kane’s exact profile did both things, but it knew everything else about players from every other draft year from 2008 through 2020. These things had yet to happen in 2007, and the information which stemmed from them was therefore not available. In addition, the NHLe model itself — by far the most important component in these regression models — also suffers from this same issue (and was not even trained out-of-sample).

This is all by way of saying that a comparison between what my model would have suggested NHL teams do on the draft floor and what they actually did would be inherently unfair to those NHL teams, who also undoubtedly would have used this new information to inform their future decisions. Nikita Kucherov’s draft year (2011) was one year after Artemi Panarin’s, but if NHL teams somehow knew in 2010 that Nikita Kucherov would become one of the greatest players in the world (like my model for 2010 forwards did), they would likely have viewed Artemi Panarin in a different light even if they didn’t specifically know his future. Until I can devise a fair way to compare the outputs of my model to the draft performance of NHL teams on equal footing, I won’t try to do so.

I can, however, still test the performance of my model according to two metrics: Area Under Curve (AUC), which you can read about here, and Log Loss, which you can read about here. The results of these tests are laid out below:

For whatever reason, the star defenseman model struggles to post a strong AUC when provided with only one season of data, regardless of whether that season is the Draft-2, Draft-1, or Draft Year. Outside of that, every single model posts excellent test metrics: The documentation for AUC which I linked above states that a model which scores between 0.9 and 1.0 is excellent, and log loss values below 0.1 are generally considered excellent as well.
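For reference, both metrics can be computed directly from out-of-sample probabilities with scikit-learn. The labels and probabilities below are made up purely to show the mechanics:

```python
from sklearn.metrics import roc_auc_score, log_loss

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # did the prospect make the NHL?
y_prob = [0.01, 0.02, 0.05, 0.01, 0.10, 0.03, 0.02, 0.85, 0.60, 0.65]

# AUC: 15 of the 16 positive/negative pairs are correctly ordered -> 0.9375.
print(roc_auc_score(y_true, y_prob))
# Log loss penalizes confident misses; lower is better.
print(log_loss(y_true, y_prob))
```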

None of this necessarily means I built a great model, though, because not only do the vast majority of players in this data set fail to make the NHL — much less become stars — but almost all of them exhibit signs that make their failures extremely easy to predict. Take 2010 as an example, where 2,706 players were eligible to be drafted and used in the model: of those, only 462 had at least a 1% probability of making the NHL, and only 126 had at least a 1% probability of becoming a star.

While most of these players are irrelevant, it’s also fair to say that a model deserves some degree of credit for identifying bad players. At what point should you stop giving that credit, though?

I spent some time pondering over this before eventually deciding that if no NHL team would draft a player, a model shouldn’t receive credit for agreeing on such an obvious statement. But I couldn’t just exclude all undrafted players, as it would be very unfair to the model to exclude a player like Artemi Panarin, who had a 78% chance of making the NHL and 32% probability of becoming a star based on data from his Draft-1 and Draft Years.

What I chose to do was simply exclude all players whose NHLe in their Draft-2, Draft-1, or Draft years fell below the lowest NHLe of any player who had been drafted in any year. Unfortunately, this didn’t actually change much: players with an NHLe of 0 in their Draft-2 and Draft-1 years have been drafted, and players with an NHLe below 1 in their Draft year have been drafted. The test results for the model’s performance on only “draftable” players are essentially identical, and for some sets of seasons where the minimum NHLe is 0, they are exactly identical.
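The filter itself is only a few lines of pandas. The column names and threshold values below are illustrative:

```python
import pandas as pd

SEASONS = ["nhle_dm2", "nhle_dm1", "nhle_d0"]  # Draft-2, Draft-1, Draft year NHLe

def draftable_only(players, drafted):
    """Keep players whose NHLe in each season type meets the minimum
    posted by any drafted player in that season type."""
    keep = pd.Series(True, index=players.index)
    for col in SEASONS:
        keep &= players[col] >= drafted[col].min()  # worst drafted NHLe sets the bar
    return players[keep]

# Toy data: the first player's Draft-year NHLe falls below the drafted minimum.
players = pd.DataFrame({"nhle_dm2": [0.0, 5.0], "nhle_dm1": [0.0, 3.0], "nhle_d0": [0.2, 10.0]})
drafted = pd.DataFrame({"nhle_dm2": [0.0, 8.0], "nhle_dm1": [0.0, 6.0], "nhle_d0": [0.8, 20.0]})
filtered = draftable_only(players, drafted)
```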

As you can see, the model still performs excellently according to these test metrics for the most part. The only area where it struggles even slightly is predicting star defensemen early in their careers, which brings me back to the assumption I made in part 1 of this article:

In order for defensemen (and forwards) to succeed at either end of the ice in the NHL, they need some mix of offensive instincts and puck skills that’s enough for them to score at a certain rate against lesser competition.

As it turns out, the hardest thing for prospect scoring rates to predict is whether a defenseman will become an NHL star; the model does a fair job, though not necessarily a good or great one, of predicting this with only one year of data. With multiple years of data, though, the model does a very good job of predicting star defensemen.

In closing, I would say that my research mostly confirms the assumption I made: the rate at which a defenseman scores as a prospect is very important, and I strongly recommend against using high draft picks on defensemen who aren’t proficient scorers.

This concludes part 3 of the series, and the documentation for the model. Chances are that if you’ve already read through the ten thousand or so words I’ve written so far, you’ve got some kind of interest in the 2021 draft. If that’s the case, then I have good news for you: This series has a part 4 which will break this draft class down in detail!