Finally Settled: Baseball’s Best of The Best

Watching Mike Trout today is the equivalent of seeing…Shoeless Joe Jackson?

Payton Soicher
Towards Data Science

Hank Aaron (https://baseballhall.org/discover/hank-aaron-715-hr-ticket)

Baseball fans love to compare current players to the legends of the past: Mike Trout having the same talent as Mickey Mantle, Billy Hamilton being able to run stride for stride with Rickey Henderson, or Javier Baez’s electric style matching Jackie Robinson’s. Older fans make these comparisons, but how would young fans like me know who today’s players remind the lifelong fans of? I didn’t see Hank Aaron’s offensive dominance, Pete Rose railroading catchers at home plate, or Reggie Jackson’s unmatchable power. Old video can show the highlight moments of these careers, but it doesn’t give a true sense of what these players did on a daily basis.

Not to mention that maybe our grandparents didn’t know what they were talking about either. Last I checked, you couldn’t watch the ’27 Yankees or the ’55 Dodgers on the MLB package. We are all biased in favor of our favorite players and think they’re larger than life. Mickey Mantle was considered the greatest center fielder of a generation, but he was the best player on the most popular team. Is it possible that all of his hype came from people just liking him so much?

But what if we could go one step further? Instead of comparing players against each other, we could compare individual player seasons against each other and see which season performances are truly comparable.

Today I will give a lesson on how we can compare the greatest offensive regular season performances in MLB history, and rank the best offensive seasons and players to ever play the game.

For anyone interested in the dataset and code, you can see my project at my GitHub: https://github.com/anchorP34/MLB-Clustering/blob/master/MLB%20Clustering%20Analysis.ipynb

Offensive Statistics Breakdown

First, let’s look at an example from the offensive statistics that are provided. My favorite player, Nolan Arenado of the Colorado Rockies, has seasons from 2013–2018 in the data.

Offensive statistics from Nolan Arenado’s seasons in the MLB

This shows the baseline statistics of his offensive seasons, such as games, at bats, runs, hits, home runs, and RBIs. With my baseball background, I knew of some other metrics that are used to evaluate offensive performance.

Baseball statistics that are common in evaluating a player’s performance

With the inclusion of the new variables, there is a complete offensive breakdown that should give enough insight into a player’s offensive performance for a season.
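For readers who want to reproduce this kind of derived metric, here is a minimal sketch using the standard definitions of batting average, on-base percentage, slugging percentage, OPS, and home runs per game. The exact list of variables added in the project is not shown here, so treat this as an illustration only: the dictionary keys (`AB`, `H`, `2B`, and so on) and the sample season line are assumptions, not columns or values from the dataset.

```python
def derived_stats(s):
    """Compute common rate stats from one season of raw counting stats.

    `s` is a dict of season totals; the key names are assumptions for this sketch.
    """
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    total_bases = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]
    avg = s["H"] / s["AB"]
    obp = (s["H"] + s["BB"] + s["HBP"]) / (s["AB"] + s["BB"] + s["HBP"] + s["SF"])
    slg = total_bases / s["AB"]
    return {
        "AVG": avg,
        "OBP": obp,
        "SLG": slg,
        "OPS": obp + slg,          # on-base plus slugging
        "HR/G": s["HR"] / s["G"],  # home runs per game played
    }

# Made-up season line, for illustration only.
season = {"G": 150, "AB": 600, "H": 180, "2B": 40, "3B": 5, "HR": 35,
          "BB": 50, "HBP": 5, "SF": 5}
stats = derived_stats(season)
print({name: round(value, 3) for name, value in stats.items()})
```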

Offensive Percentiles

In order to compare players from season to season, comparing raw numbers head to head is misleading. For example, this is the time series of the league-average batting average among players with more than 50 at bats in a season:

League average of batting averages from players with more than 50 at bats.
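A series like this can be computed with a simple group-and-average. The sketch below averages the individual batting averages of players with more than 50 at bats, season by season. The row layout (`year`, `AB`, `H`) and the numbers are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict

def league_avg_by_season(rows, min_ab=50):
    """Mean of individual batting averages per season,
    keeping only players with more than `min_ab` at bats."""
    by_year = defaultdict(list)
    for r in rows:
        if r["AB"] > min_ab:
            by_year[r["year"]].append(r["H"] / r["AB"])
    return {year: sum(avgs) / len(avgs) for year, avgs in sorted(by_year.items())}

# Hypothetical player-season rows.
rows = [
    {"year": 1975, "AB": 500, "H": 125},
    {"year": 1975, "AB": 400, "H": 108},
    {"year": 1975, "AB": 20, "H": 10},   # dropped: 50 at bats or fewer
    {"year": 2000, "AB": 500, "H": 140},
    {"year": 2000, "AB": 450, "H": 126},
]
league_avg = league_avg_by_season(rows)
print(league_avg)
```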

If a player hit .300 in 1975, that would be more impressive than a player who hit .300 in the year 2000. So, for each offensive category, we should rank every player against the other players from that same season and express the result as a percentile on a scale from 0 to 1. Someone who hit .300 in 1975 might land at .90 (the 90th percentile), while hitting .300 in 2000 might only land at .75 (the 75th percentile).
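Here is a minimal sketch of that within-season ranking. It defines a player's percentile as the share of players in the same season whose value is less than or equal to his, which is one of several common conventions and may differ from the one used in the project. The players and batting averages are hypothetical.

```python
from collections import defaultdict

def season_percentiles(rows, stat):
    """Return {(player, year): percentile on a 0-1 scale}.

    The percentile is the share of players in the same season whose
    `stat` is less than or equal to this player's value.
    """
    by_year = defaultdict(list)
    for r in rows:
        by_year[r["year"]].append(r)
    result = {}
    for year, group in by_year.items():
        values = [g[stat] for g in group]
        for g in group:
            at_or_below = sum(v <= g[stat] for v in values)
            result[(g["player"], year)] = at_or_below / len(values)
    return result

# Two made-up five-player leagues: the same .300 average ranks differently.
rows = [
    {"player": "A", "year": 1975, "AVG": 0.300},
    {"player": "B", "year": 1975, "AVG": 0.270},
    {"player": "C", "year": 1975, "AVG": 0.260},
    {"player": "D", "year": 1975, "AVG": 0.250},
    {"player": "E", "year": 1975, "AVG": 0.240},
    {"player": "F", "year": 2000, "AVG": 0.300},
    {"player": "G", "year": 2000, "AVG": 0.330},
    {"player": "H", "year": 2000, "AVG": 0.310},
    {"player": "I", "year": 2000, "AVG": 0.290},
    {"player": "J", "year": 2000, "AVG": 0.270},
]
pct = season_percentiles(rows, "AVG")
print(pct[("A", 1975)], pct[("F", 2000)])  # 1.0 0.6
```

The nested loop is quadratic in the number of players per season, which is fine for a sketch but worth replacing with a sort-based rank on the full dataset.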

After making those adjustments, Arenado’s percentiles for each offensive category now look like this:

Offensive percentiles of Nolan Arenado for each season.

Here’s a quick breakdown of how well Nolan has done in each season:

  • In 2018, Nolan was in the 99th percentile for RBIs, meaning that only about 1% of MLB players had more RBIs than him
  • His OBP increased or stayed about the same each year, showing his progress in that offensive category
  • In 2015, he finished in the 99th percentile in Slugging Percentage, Home Runs, and RBIs, which would indicate that he is a very powerful hitter

But does that mean Nolan Arenado was a valuable hitter in 2015?

Nolan is in the upper percentiles for most of the offensive categories, but are those the right categories for identifying a valuable hitter? Do MVP candidates rank highly in those categories as well? A clustering algorithm can help with identifying these fields.

However, we don’t want the clustering algorithm to break each cluster into equal parts. We want the truly valuable players all brought together and separated from the rest of the pack, which should be a smaller percentage compared to the rest of the population.

After clustering, there should be three groups of players:

Group 1: Average or below average players

Group 2: Above average players

Group 3: The best players for that given season

After running a clustering algorithm on the percentile-normalized data, we got the following breakdown:

  • Average or below average players: 40% of MLB population
  • Above average players: 35% of MLB population
  • The best players for that given season: 25% of MLB population
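The clustering algorithm is not named in this write-up, so the sketch below uses plain k-means with three clusters purely as a stand-in to show the mechanics of grouping player-seasons by their percentile profiles. The deterministic starting centroids, the two-category points, and the helper names are all assumptions made for this example, not the project's actual setup.

```python
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

def kmeans(points, k=3, iters=50):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples (k >= 2).

    Starting centroids are spread across the points ordered by coordinate
    sum, which keeps this sketch deterministic.
    """
    ordered = sorted(points, key=sum)
    step = (len(ordered) - 1) / (k - 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical percentile profiles (two categories per player-season).
points = [(0.10, 0.20), (0.20, 0.10), (0.50, 0.60),
          (0.60, 0.50), (0.95, 0.90), (0.90, 0.99)]
labels, centroids = kmeans(points)
shares = {c: n / len(points) for c, n in sorted(Counter(labels).items())}
print(labels, shares)
```

On this toy input the cluster with the highest label holds the strongest profiles, but k-means does not guarantee any ordering of labels in general, so on real data the clusters should be inspected before naming them.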

This distribution is spread a little too evenly. 1 out of 4 players is considered the best in the league? The top group needs to be smaller to truly capture the top-notch players that we’re interested in. Making one more adjustment, flagging whether a player was in the top 5 percent of an offensive category for that season, turns the data into binary values:

Binary outputs showing whether Nolan Arenado was in the top 5 percent of each offensive category for that season.
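The conversion itself is a one-line threshold. In this sketch a category is flagged with 1 when the 0 to 1 percentile is at or above .95 and with 0 otherwise. Whether the boundary value counts as "top 5" is an assumption here, and the percentile values are invented.

```python
def top_flags(percentiles, cutoff=0.95):
    """Map each category's 0-1 percentile to 1 (top 5% that season) or 0."""
    return {category: int(p >= cutoff) for category, p in percentiles.items()}

# Made-up percentiles for one player-season.
season = {"HR": 0.99, "RBI": 0.99, "SLG": 0.99, "OBP": 0.80, "SB": 0.40}
flags = top_flags(season)
top_count = sum(flags.values())
print(flags, top_count)
```

Summing the flags gives the number of categories in which the player-season was elite.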

Now we can see that in 2015, Nolan was in the top 5 percent for 13 of the 18 categories we are interested in. Running the same type of clustering algorithm as before, the new breakdown turns into:

  • Average or below average players: 82% of MLB population
  • Above average players: 10% of MLB population
  • The best players for that given season: 8% of MLB population

Perfect! This carves the fat out of most MLB players to make sure the best seasons are being identified. To double check, we can look at the players who are all in the top tier cluster for the 2015 season:

Some players who were in the top group for the 2015 season

The MVPs of the 2015 season, Josh Donaldson and Bryce Harper, were included, as was Arenado. The algorithm is working just as we expected.

What offensive categories are most dominant in identifying the best players?

With the players now assigned to a cluster for each year, we can see what percentage of the players in each cluster were in the top 5 percent of each offensive category for their season. The heat map below breaks this all down.

Cluster 2 is considered the best players in the MLB. We can see that offensive categories like RBI, OPS, and HR/G are the values that separate them from the rest of the pack.

Walking through the chart, cluster 0 contains the below average players, cluster 1 contains the above average players, and cluster 2 contains the top players. Starting with the RBI category, 86% of the players in cluster 2 were in the top 5 percent of that category in their season. For OPS, 83% of cluster 2 players were in the top 5 percent for their season.
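Each cell of a heat map like this is just the average of a binary flag within a cluster. The sketch below computes those shares. The rows, the two categories, and the cluster labels are hypothetical.

```python
from collections import defaultdict

def flag_rates_by_cluster(rows, categories):
    """For each cluster, the fraction of player-seasons flagged 1 in each category."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["cluster"]].append(r)
    return {
        cluster: {cat: sum(m[cat] for m in members) / len(members) for cat in categories}
        for cluster, members in sorted(grouped.items())
    }

# Hypothetical player-seasons with top-5% flags and an assigned cluster.
rows = [
    {"cluster": 0, "RBI": 0, "OPS": 0},
    {"cluster": 0, "RBI": 0, "OPS": 0},
    {"cluster": 1, "RBI": 1, "OPS": 0},
    {"cluster": 1, "RBI": 0, "OPS": 0},
    {"cluster": 2, "RBI": 1, "OPS": 1},
    {"cluster": 2, "RBI": 1, "OPS": 0},
]
rates = flag_rates_by_cluster(rows, ["RBI", "OPS"])
print(rates)
```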

Overall, the categories RBI, HR, OPS, SLG, and HR/G are the fields that separate cluster 2 from the rest. All of those fields have to do with power numbers. When identifying who the best players in the league are, the first place you should look is the power numbers.

Season Comparison

There are some seasons that are driven by individual star power and others that are defined by prevailing teams. The line chart below shows the number of offensive categories the average player in cluster 2 qualified for. For example, in the 2001 season, players in cluster 2 were on average in the top 5 percent in approximately 12 categories.

This line chart shows the average number of offensive categories a player in cluster 2 qualified for in each season. High points mark seasons dominated by the top players; low points mark seasons in which production was more evenly distributed throughout the league.

From the graph, there are lots of ups and downs in back-to-back years, with the volatility calming down since the 2003 season. The 1980s and 1990s saw an increasing trend, which makes sense given that the steroid era of baseball was creating new superstars every year.
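A series like the one in this chart can be built by counting flagged categories per player-season and averaging within each year. This is a minimal sketch with invented rows and only three categories, with cluster 2 assumed to be the top cluster as in the text.

```python
from collections import defaultdict

def avg_categories_per_season(rows, categories, cluster=2):
    """Average number of flagged categories per player in `cluster`, by season."""
    counts = defaultdict(list)
    for r in rows:
        if r["cluster"] == cluster:
            counts[r["year"]].append(sum(r[c] for c in categories))
    return {year: sum(c) / len(c) for year, c in sorted(counts.items())}

# Hypothetical player-seasons with top-5% flags.
rows = [
    {"year": 2001, "cluster": 2, "HR": 1, "RBI": 1, "OPS": 1},
    {"year": 2001, "cluster": 2, "HR": 1, "RBI": 0, "OPS": 0},
    {"year": 2001, "cluster": 0, "HR": 0, "RBI": 0, "OPS": 0},  # ignored
    {"year": 2002, "cluster": 2, "HR": 1, "RBI": 0, "OPS": 0},
]
per_season = avg_categories_per_season(rows, ["HR", "RBI", "OPS"])
print(per_season)
```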

So, who’s at the top of the all time list?

The best individual offensive seasons in MLB history

But who had the most dominant offensive seasons ever? People usually think of players like Hank Aaron, Barry Bonds, and Babe Ruth when it comes to the most dominant players; however, the chart above has a few other ideas. For starters, Stan Musial and Lou Gehrig are each listed at the top three times, which is remarkable. Many people are all over Mike Trout being the next great generational talent, and it makes sense. He is the only one listed as having a truly dominant season in the last 15 years. Even triple crown winner Miguel Cabrera’s 2012 season finished with a total of 16, not enough to make the list, which shows how incredible these seasons were.

Having the ability to see Mike Trout’s dominance in parallel with Ted Williams in ’46 or Gehrig in ’27 is an amazing way to understand how incredible these seasons were.

Finally, we can look at the players who have been at the top of the MLB charts over careers of at least 5 years.

Players with the highest career Top Tier Percentage who played at least 5 years in the MLB

This data set suggests that Frank Robinson, who by the clustering standards was a top tier player for 19 of the 20 years of his career, is the greatest offensive juggernaut to ever touch a baseball field. Personally, I look at the gold standards of Hank Aaron and Barry Bonds, who both have marvelous numbers, as my top two players to ever play the game, but the algorithm says otherwise. Finally, the best player in the game today, Mike Trout, will at this pace have to be in the conversation as one of the best to ever play. Who is his best comparable player at this moment? None other than the infamous Shoeless Joe Jackson.

What about players like Willie Mays, who people consider to be the greatest player to ever live? The chart above shows the number of seasons in which a player was considered one of the best in the league divided by his total number of career seasons. Mays is punished in this ranking due to his last few years in the league. Of his 22 years in the bigs, he was considered great in 16 of them, giving him a Top Tier percentage of 73%.
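The Top Tier percentage is simple arithmetic, sketched below. The Mays entry only encodes the 16-of-22 split quoted above, and the labels given to his other six seasons are placeholders. "Short Career" is a hypothetical player included to show the 5-season minimum, and cluster 2 is assumed to be the top tier.

```python
def top_tier_percentage(season_clusters, top_cluster=2, min_seasons=5):
    """Share of career seasons spent in the top cluster.

    `season_clusters` maps a player to a list with one cluster label per
    season. Players with fewer than `min_seasons` seasons are left out.
    """
    result = {}
    for player, labels in season_clusters.items():
        if len(labels) >= min_seasons:
            top_seasons = sum(label == top_cluster for label in labels)
            result[player] = top_seasons / len(labels)
    return result

careers = {
    "Willie Mays": [2] * 16 + [1] * 6,  # 16 top tier seasons out of 22
    "Short Career": [2, 2, 2],          # hypothetical, under the 5-season minimum
}
tiers = top_tier_percentage(careers)
print({player: f"{share:.0%}" for player, share in tiers.items()})
```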

With the clustering algorithm, we were able to compare players of the game across seasons to give a true perspective of which players today match up with baseball immortals. Understanding that offensive categories like RBI, HR, OPS, SLG, and HR/G are the driving forces behind these players being considered elite in a season can give more accurate descriptions of what makes players MVP candidates and all-stars.

I’m sorry to anyone who’s frustrated that their favorite player didn’t crack the list. I’m still trying to figure out how to get Arenado to show up too.

But hold on, hold on! Jeff Bagwell made the list but not Derek Jeter? Where is Pete Rose? They weren’t considered offensive bad boys? Something must be wrong, right? Or…should they not belong in this category?
This might take another deep dive…
