More on Home Runs Per Game

In the previous post, I looked at the trend in home runs per game in the Major Leagues and suggested that the recent deviation from the increasing trend might have been due to the development of strong farm systems like the Tampa Bay Rays’. That means that if the same data analysis process is used on data in an otherwise identical league, we should see similar trends but no dropoff around 1995. As usual, for replication purposes I’m going to use Japan’s Pro Baseball leagues, the Pacific and Central Leagues. They’re ideal because, just like the American Major Leagues, one league uses the designated hitter and one does not. There are some differences – the talent pool is a bit smaller because of the lower population base that the leagues draw from, and there are only 6 teams in each league as opposed to MLB’s 14 and 16.

As a reminder, the MLB regression gave us a regression equation of

$\hat{HR} = .957 - .0188 \times t + .0004 \times t^2 + .0911 \times DH$

where $\hat{HR}$ is the predicted number of home runs per game, t is a time variable starting at t=1 in 1955, and DH is a binary variable that takes value 1 if the league uses the designated hitter in the season in question.

Just examining the data on home runs per game from the Japanese leagues, the trend looks significantly different. Instead of the rough U-shape that the MLB data showed, the Japanese data looks almost M-shaped with a maximum around 1984. (Why, I’m not sure – I’m not knowledgeable enough about Japanese baseball to know what might have caused that spike.) It reaches a minimum again and then keeps rising.

After running the same regression with t=1 in 1950, I got these results:

	Estimate	Std. Error	t-value	p-value	Signif
B0	0.2462	0.0992	2.481	0.0148	0.9852
t	0.0478	0.0062	7.64	1.63E-11	1
tsq	-0.0006	0.00009	-7.463	3.82E-11	1
DH	0.0052	0.0359	0.144	0.8855	0.1145

This equation shows two things, one that surprises me and one that doesn’t. The unsurprising factor is the switching of signs for the t variables: we expected that based on the shape of the data. The surprising factor is that the designated hitter rule is insignificant. With a p-value of .8855, we can’t reject the hypothesis that the DH coefficient is zero. In addition, this model explains less of the variation than the MLB version: while that one explained about 56% of the variation, the Japanese model has an R^2 value of .4045, meaning it explains about 40% of the variation in home runs per game.
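To make the sign switch concrete, here is a minimal Python sketch that plugs the two sets of reported coefficients into the quadratic and finds where each fitted curve turns. The dictionary and function names are my own, and the coefficients are the rounded values from the post, so the turning points are only approximate.

```python
# Rounded coefficients as reported in the post (not refitted from data).
MLB = {"b0": 0.957, "t": -0.0188, "tsq": 0.0004, "dh": 0.0911}
JAPAN = {"b0": 0.2462, "t": 0.0478, "tsq": -0.0006, "dh": 0.0052}


def predicted_hr(coefs, t, dh):
    """Predicted home runs per game for time index t and DH indicator dh (0 or 1)."""
    return coefs["b0"] + coefs["t"] * t + coefs["tsq"] * t ** 2 + coefs["dh"] * dh


def turning_point(coefs):
    """Time index where the fitted quadratic stops falling (or rising): -b_t / (2 * b_tsq)."""
    return -coefs["t"] / (2 * coefs["tsq"])


mlb_turn = turning_point(MLB)      # a minimum, because the t^2 coefficient is positive
japan_turn = turning_point(JAPAN)  # a maximum, because the t^2 coefficient is negative

print(f"MLB fitted curve turns at t = {mlb_turn:.1f}")
print(f"Japan fitted curve turns at t = {japan_turn:.1f}")
```

With these rounded inputs the MLB curve bottoms out near t = 23.5 and the Japanese curve peaks near t = 39.8. Since the t^2 coefficients are only reported to four decimal places, those turning points should be read as rough.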

There’s a slightly interesting pattern to the residual home runs per game ($Residual = \hat{HR} - HR$). Although it isn’t as pronounced, this data also shows a spike, but the spike is at t=55, so instead of showing up in 1995, the Japanese leagues spiked in the early 2000s. Clearly the two spikes don’t share a single, simultaneous cause, but why might the Japanese leagues see a similar effect later than the MLB teams? It can’t be an expansion effect, since the Japanese leagues have stayed constant at 6 teams since their inception.

Incidentally, the Japanese league data shows signs of heteroskedasticity (Breusch-Pagan test p-value .0796, significant at the 10% level), so it might be better modeled using generalized least squares, but doing so would have skewed the results of the replication.

To test whether the parameters really are different, the appropriate tool is Chow’s test for structural change. To clean it up, I’m using only the data from 1960 on. (It’s quick and dirty, but it’ll do the job.) The Chow test statistic is

$\frac{(S_C -(S_1+S_2))/(k)}{(S_1+S_2)/(N_1+N_2-2k)} \sim\ F_{k,N_1+N_2-2k}$

where S_C = 6.3666 is the combined sum of squared residuals, S_1 = 1.2074 and S_2 = 2.2983 are the individual (i.e. MLB and Japan) sums of squared residuals, k=4 is the number of parameters, and N_1 = 100 and N_2 = 100 are the numbers of observations in each group.

$\frac{(6.3666 -(1.2074 + 2.2983))/4}{(1.2074 + 2.2983)/(100+100-2\times 4)} \sim\ F_{4,100+100-2 \times 4}$

$\frac{(6.3666 - 3.5057)/4}{3.5057/192} \sim\ F_{4,192}$

$\frac{2.8609/4}{.01826} \sim\ F_{4,192}$

$\frac{.7152}{.01826} \sim\ F_{4,192}$

$39.17 \sim\ F_{4,192}$

The critical value for 90% significance at 4 and 192 degrees of freedom would be 1.974 according to Texas A&M’s F calculator. The test statistic is far above that, so we can reject the hypothesis that the two sets of parameters are the same. The MLB and Japanese data are better described by separate regressions than by a single pooled one.
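As a cross-check on the hand arithmetic, here is a minimal Python sketch of the Chow statistic as defined in the formula above. The function and variable names are my own, the sums of squared residuals are the values quoted in the post, and the critical value is simply the 1.974 figure cited above rather than something computed here.

```python
def chow_statistic(s_combined, s_1, s_2, k, n_1, n_2):
    """Chow F statistic: ((S_C - (S_1 + S_2)) / k) / ((S_1 + S_2) / (N_1 + N_2 - 2k))."""
    s_separate = s_1 + s_2
    numerator = (s_combined - s_separate) / k
    denominator = s_separate / (n_1 + n_2 - 2 * k)
    return numerator / denominator


# Values quoted in the post for the 1960-onward data.
S_C, S_1, S_2 = 6.3666, 1.2074, 2.2983
K, N_1, N_2 = 4, 100, 100

f_stat = chow_statistic(S_C, S_1, S_2, K, N_1, N_2)
degrees_of_freedom = (K, N_1 + N_2 - 2 * K)

# Critical value at 90% significance, taken from the post (not computed here).
CRITICAL_90 = 1.974
reject_equal_parameters = f_stat > CRITICAL_90

print(f"F = {f_stat:.4f} on {degrees_of_freedom} degrees of freedom")
print(f"Critical value: {CRITICAL_90}; reject equal parameters: {reject_equal_parameters}")
```

Running it prints the statistic next to the critical value, which makes it easy to re-check the result if the sums of squares or the sample window change.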
