Modeling Google Search Trend Data: MLB
Authors: Nolan Meyer & Declan Elias
Baseball is a pastime rooted in American sports history and has been one of the most popular sports in the country for many generations. Major League Baseball (MLB) is the top baseball league in the US; its first season was played in 1901, and the sport's popularity has grown ever since, to the point where baseball stars have become household names across the country. Each MLB team plays 162 games during the regular season, which runs from late March/early April until late October/early November. Given the sport's long existence, we are interested in studying how its popularity has changed over time and how it varies within each year.
We are using Google search trend data for the term “MLB” from January 2004 to February 2022. The data give the popularity of the search term for each month in the time period, on a scale of 0 to 100: a value of 100 represents the peak popularity of the search term, a value of 50 means the term was half as popular, and a score of 0 means there was not enough data on the term's popularity at that time.
The variables of interest are the month and the popularity in that given month, which Google measures from the number of searches in each month.
In the original, untransformed data, the trend cycles showed a large increase in variance over time, so we log-transformed the data to reduce this increase in variance.
library(dplyr)
library(lubridate)

# Read in the Google Trends export and construct date and indicator variables
mlb = read.csv("multiTimeline.csv")
colnames(mlb) = c("popularity")
mlb = mlb[-1, , drop = FALSE]  # drop the extra header row from the export

mlb = mlb %>%
  mutate(log.popularity = log(as.numeric(popularity)),
         Month = (1:nrow(mlb) - 1) %% 12 + 1,
         Year = (1:nrow(mlb) - 1) %/% 12 + 2004,
         Date = paste(Year, Month, 1, sep = '-'),
         Date = ymd(Date),
         Date_dec = decimal_date(Date),
         # indicator for the pandemic-disrupted months of 2020
         pandemic = if_else(Month %in% c(3, 4, 5, 6) & Year == 2020, 1, 0))
To estimate and model the trend, we used b-splines of varying degrees to capture the general pattern of the data. We tested b-splines of degrees 2, 3, and 4, each with a knot at 2020, since there is some irregularity in the data that year due to COVID. We selected the model that best captured the trend and produced the least variability in the resulting residuals.
library(splines)

# Fit b-spline trends of degree 2, 3, and 4 (each with a knot at 2020)
# and detrend the log popularity with each fit
mlb = mlb %>%
  mutate(mlbTrend2 = predict(lm(log.popularity ~ bs(Date_dec, knots = c(2020), degree = 2), data = mlb)),
         mlbTrend3 = predict(lm(log.popularity ~ bs(Date_dec, knots = c(2020), degree = 3), data = mlb)),
         mlbTrend4 = predict(lm(log.popularity ~ bs(Date_dec, knots = c(2020), degree = 4), data = mlb)),
         Detrend2 = log.popularity - mlbTrend2,
         Detrend3 = log.popularity - mlbTrend3,
         Detrend4 = log.popularity - mlbTrend4)
To model the seasonality, we used indicator variables for the month to estimate the average monthly deviations due to seasonality. We also added an indicator for the months during the pandemic, as we noticed different behavior during that period. Doing so lets us plot the average monthly deviations and get a sense of the estimated seasonality of the data; a sketch of this step is shown below.
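The fitted seasonal component is used below under the name Season, but the call that creates it does not appear in our code above, so the following is a minimal sketch of how it can be computed under the model just described (month indicators plus the pandemic indicator); the exact formula is our assumption.

# Sketch (assumed, not the verbatim original code): estimate the seasonal
# component from the detrended series using month and pandemic indicators
season_model = lm(Detrend2 ~ factor(Month) + pandemic, data = mlb)
mlb = mlb %>%
  mutate(Season = predict(season_model, newdata = mlb))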
After removing both the trend and the seasonality using the degree-2 b-spline and the month indicator variables, we are left with the remaining errors. To model the errors, we examined their autocorrelation function using the astsa package (Stoffer and Stoffer 2021) to identify patterns that would suggest reasonable models. We considered an AR(15) model, an AR(1) model, and a seasonal AR(1) model. To decide between these models, we compared the SARIMA output, looking for residual ACF plots that resemble white noise and higher p-values for the Ljung-Box statistic.
# Fitting the errors
mlb = mlb %>%
  mutate(Errors = Detrend2 - Season)

# Autocorrelation of the errors
library(astsa)
# acf2(mlb$Errors)

errorTS = ts(mlb$Errors, start = c(2004, 1), frequency = 12)

# Candidate error models:
# sarima(errorTS, p = 15, d = 0, q = 0)
# sarima(errorTS, p = 1, d = 0, q = 0)
# sarima(errorTS, p = 1, d = 0, q = 0, P = 1, D = 0, S = 12)
Lastly, we combined the trend and seasonality models, incorporated them with the seasonal AR(1) error model into a final SARIMA model, and assessed its performance using the same methods as for the previous SARIMA models. With this final model, we also forecasted 24 months into the future to see what the model predicts will happen moving forward.
# Forecasting set-up
Bknots = c(min(mlb$Date_dec), decimal_date(max(mlb$Date) %m+% months(24)))

# Design matrix of trend and seasonality regressors for the observed data
y = lm(log.popularity ~ bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots) +
         factor(Month) + pandemic:I(Month == 3) + pandemic:I(Month == 4), data = mlb)
X = model.matrix(y)[, -1]

# Matching design matrix for the 24 months being forecasted
newdata = data.frame(Date = max(mlb$Date) %m+% months(1:24)) %>%
  mutate(Date_dec = decimal_date(Date), Month = month(Date), pandemic = 0)
newX = model.matrix(~ bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots) +
                      factor(Month) + pandemic:I(Month == 3) + pandemic:I(Month == 4), newdata)[, -1]
# Final SARIMA model; mlb_data is the log-popularity series as a ts object
# (this definition is our assumption, as it did not appear in the original code)
mlb_data = ts(mlb$log.popularity, start = c(2004, 1), frequency = 12)
# sarima(mlb_data, p = 1, d = 0, q = 0, P = 1, D = 0, S = 12, xreg = X)

# Forecasting 24 months into the future:
# sarima.for(mlb_data, 24, p = 1, d = 0, q = 0, P = 1, D = 0, S = 12, xreg = X, newxreg = newX)
Additionally, the following packages were used in our project to work with the data, make visualizations, and perform additional analyses: dplyr (Wickham et al. 2014), lubridate (Grolemund and Wickham 2013), stringr (Wickham and Wickham 2019), ggplot2 (Wickham, Chang, and Wickham 2016), splines (Bates, Venables, and Team 2011), and readr (Wickham et al. 2015).
Our first step in modeling the data was to model the trend. We wanted a model that would estimate the trend without being affected by the seasonality. A linear model was considered, but because the pandemic-affected 2020 season led to a drop in popularity, we decided to go with a b-spline. Using a spline gave us more flexibility by fitting different parts of the data with different polynomial pieces.
We fit splines of degrees 2, 3, and 4, each with a knot at January 2020. All three did a good job of fitting the trend. The residual plots showed that every model estimated the trend well, with degree 4 doing the best. That, however, comes at the expense of a more complex model, and a more complex model leads to higher variance. Because all three models estimate a very similar trend, we chose the least complex one. This left us with a b-spline of degree 2 with a knot at January 2020 as our trend estimate.
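As a quick numeric complement to the residual plots (a sketch on our part, relying on the Detrend columns created earlier), the residual variance left by each fit can be compared directly:

# Compare the residual variance remaining after each trend fit
mlb %>%
  summarize(var_deg2 = var(Detrend2),
            var_deg3 = var(Detrend3),
            var_deg4 = var(Detrend4))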
We estimated the average monthly deviations due to seasonality by regressing the residuals on the month, together with the pandemic indicator variable, which we added because we noticed some unusual behavior during the pandemic. We then plotted the average monthly deviations to get a sense of the estimated seasonality of the data. The peak months for log popularity are April through October, with lows from November through March. In the residuals plot, we see that this method does best at removing the seasonality from roughly 2011 to 2015, while the other years show greater errors and variation.
The residual plot also shows the model does a poor job of capturing the seasonality in 2020 and 2021: in 2020 it severely overestimates the seasonality, and in 2021 it does the opposite.
After removing both the trend and the seasonality using the degree-2 b-spline and the month indicator variables, we are left with the remaining errors. Looking at the autocorrelation function, we see that the ACF decays to 0 and the PACF drops to 0 after lag 1 but is nonzero again around lag 13. This indicates that a seasonal AR(1) will do the best job of modeling the errors.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
ACF 0.63 0.35 0.12 0.04 -0.09 -0.14 -0.19 -0.14 -0.09 0.10 0.30
PACF 0.63 -0.08 -0.10 0.04 -0.16 -0.02 -0.08 0.03 0.02 0.22 0.25
[,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
ACF 0.47 0.29 0.07 -0.05 -0.11 -0.16 -0.19 -0.20 -0.12 -0.03
PACF 0.21 -0.31 -0.20 -0.02 -0.04 0.06 0.03 0.04 0.11 0.01
[,22] [,23] [,24] [,25]
ACF 0.13 0.30 0.36 0.26
PACF 0.02 0.03 -0.04 0.04
The SARIMA output shows that a seasonal AR(1) does an excellent job modeling the noise: the ACF of the residuals resembles white noise, and the p-values for the Ljung-Box statistic are all well above the significance threshold.
initial value -1.421678
iter 2 value -1.829822
iter 3 value -1.831444
iter 4 value -1.831692
iter 5 value -1.831714
iter 6 value -1.831808
iter 7 value -1.831956
iter 8 value -1.832009
iter 9 value -1.832016
iter 10 value -1.832018
iter 10 value -1.832018
iter 10 value -1.832018
final value -1.832018
converged
initial value -1.824807
iter 2 value -1.824863
iter 3 value -1.824889
iter 4 value -1.824905
iter 5 value -1.824954
iter 6 value -1.824994
iter 7 value -1.825014
iter 8 value -1.825017
iter 9 value -1.825017
iter 9 value -1.825017
iter 9 value -1.825017
final value -1.825017
converged
$fit
Call:
arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S),
xreg = xmean, include.mean = FALSE, transform.pars = trans, fixed = fixed,
optim.control = list(trace = trc, REPORT = 1, reltol = tol))
Coefficients:
ar1 sar1 xmean
0.616 0.5573 0.0106
s.e. 0.053 0.0598 0.0592
sigma^2 estimated as 0.02541: log likelihood = 88.53, aic = -169.05
$degrees_of_freedom
[1] 215
$ttable
Estimate SE t.value p.value
ar1 0.6160 0.0530 11.6325 0.0000
sar1 0.5573 0.0598 9.3121 0.0000
xmean 0.0106 0.0592 0.1788 0.8583
$AIC
[1] -0.775459
$AICc
[1] -0.7749445
$BIC
[1] -0.7133582
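The sarima() call produces these diagnostics automatically; for readers reproducing the check by hand, a standalone Ljung-Box test on the residuals of this error model might look like the following sketch (using arima() from base R rather than the original call):

# Fit the seasonal AR(1) error model directly and test its residuals;
# fitdf = 2 accounts for the two estimated AR parameters
fit = arima(errorTS, order = c(1, 0, 0),
            seasonal = list(order = c(1, 0, 0), period = 12))
Box.test(residuals(fit), lag = 12, type = "Ljung-Box", fitdf = 2)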
We forecasted the next two years using the seasonal AR(1) model together with the estimates for the trend and seasonality. The model forecasts a small decrease in popularity in each of the next two years. This decrease makes sense because it follows the general shape of the trend model, which continues to decrease past 2021.
initial value -1.455398
iter 2 value -1.828628
iter 3 value -1.838249
iter 4 value -1.855558
iter 5 value -1.858660
iter 6 value -1.859781
iter 7 value -1.860697
iter 8 value -1.861772
iter 9 value -1.862144
iter 10 value -1.862430
iter 11 value -1.862693
iter 12 value -1.862842
iter 13 value -1.862886
iter 14 value -1.862899
iter 15 value -1.862906
iter 16 value -1.862910
iter 17 value -1.862912
iter 18 value -1.862914
iter 19 value -1.862915
iter 20 value -1.862915
iter 21 value -1.862916
iter 22 value -1.862916
iter 23 value -1.862916
iter 24 value -1.862916
iter 25 value -1.862916
iter 25 value -1.862916
iter 25 value -1.862916
final value -1.862916
converged
initial value -1.839426
iter 2 value -1.841995
iter 3 value -1.843439
iter 4 value -1.844417
iter 5 value -1.845374
iter 6 value -1.846265
iter 7 value -1.846967
iter 8 value -1.847281
iter 9 value -1.847459
iter 10 value -1.847516
iter 11 value -1.847543
iter 12 value -1.847557
iter 13 value -1.847563
iter 14 value -1.847566
iter 15 value -1.847567
iter 16 value -1.847569
iter 17 value -1.847571
iter 18 value -1.847573
iter 19 value -1.847574
iter 20 value -1.847574
iter 21 value -1.847574
iter 21 value -1.847574
iter 21 value -1.847574
final value -1.847574
converged
$fit
Call:
stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
Q), period = S), xreg = xreg, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc,
REPORT = 1, reltol = tol))
Coefficients:
ar1 sar1 intercept
0.6319 0.5795 1.4129
s.e. 0.0613 0.0599 0.1667
bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots)1
0.9454
s.e. 0.3180
bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots)2
1.1516
s.e. 0.2114
bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots)3
0.5667
s.e. 0.8085
factor(Month)2 factor(Month)3 factor(Month)4 factor(Month)5
0.1491 0.5695 1.4007 1.3159
s.e. 0.0878 0.1136 0.1268 0.1338
factor(Month)6 factor(Month)7 factor(Month)8 factor(Month)9
1.4881 1.5561 1.3952 1.4172
s.e. 0.1372 0.1382 0.1370 0.1336
factor(Month)10 factor(Month)11 factor(Month)12
1.6064 0.2003 0.1657
s.e. 0.1268 0.1141 0.0895
pandemic:I(Month == 3)FALSE pandemic:I(Month == 3)TRUE
-1.0834 -0.2020
s.e. 0.1422 0.1375
pandemic:I(Month == 4)TRUE
-0.4044
s.e. 0.1389
sigma^2 estimated as 0.02423: log likelihood = 93.44, aic = -144.89
$degrees_of_freedom
[1] 198
$ttable
                                                                    Estimate     SE  t.value p.value
ar1                                                                   0.6319 0.0613  10.3007  0.0000
sar1                                                                  0.5795 0.0599   9.6735  0.0000
intercept                                                             1.4129 0.1667   8.4744  0.0000
bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots)1   0.9454 0.3180   2.9733  0.0033
bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots)2   1.1516 0.2114   5.4467  0.0000
bs(Date_dec, knots = c(2020), degree = 2, Boundary.knots = Bknots)3   0.5667 0.8085   0.7010  0.4842
factor(Month)2                                                        0.1491 0.0878   1.6976  0.0912
factor(Month)3                                                        0.5695 0.1136   5.0148  0.0000
factor(Month)4                                                        1.4007 0.1268  11.0438  0.0000
factor(Month)5                                                        1.3159 0.1338   9.8339  0.0000
factor(Month)6                                                        1.4881 0.1372  10.8425  0.0000
factor(Month)7                                                        1.5561 0.1382  11.2608  0.0000
factor(Month)8                                                        1.3952 0.1370  10.1850  0.0000
factor(Month)9                                                        1.4172 0.1336  10.6089  0.0000
factor(Month)10                                                       1.6064 0.1268  12.6707  0.0000
factor(Month)11                                                       0.2003 0.1141   1.7559  0.0806
factor(Month)12                                                       0.1657 0.0895   1.8510  0.0657
pandemic:I(Month == 3)FALSE                                          -1.0834 0.1422  -7.6199  0.0000
pandemic:I(Month == 3)TRUE                                           -0.2020 0.1375  -1.4690  0.1434
pandemic:I(Month == 4)TRUE                                           -0.4044 0.1389  -2.9103  0.0040
$AIC
[1] -0.6646108
$AICc
[1] -0.6450514
$BIC
[1] -0.3385815
$pred
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep      Oct      Nov      Dec
2022                    2.553591 3.828661 3.822476 3.937531 4.071877 3.940133 3.964095 4.077213 2.516225 2.051923
2023 1.972754 2.102584 2.544448 3.621640 3.571247 3.698445 3.793138 3.636725 3.647229 3.779863 2.270872 1.974346
2024 1.845217 1.969401

$se
           Jan       Feb       Mar       Apr       May       Jun       Jul       Aug       Sep       Oct       Nov       Dec
2022                     0.1556702 0.1841463 0.1943549 0.1982846 0.1998322 0.2004468 0.2006917 0.2007894 0.2008284 0.2008440
2023 0.2008502 0.2008527 0.2204393 0.2277904 0.2306604 0.2317965 0.2322486 0.2324288 0.2325008 0.2325295 0.2325410 0.2325456
2024 0.2325474 0.2325481
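Since the model is fit to log popularity, the point forecasts and standard errors above are on the log scale. A small sketch (values copied from the $pred and $se output above) of mapping the first two forecasted months back to the original 0-100 scale, with approximate 95% intervals:

# Back-transform forecasts from the log scale; the intervals are
# approximate because simple exponentiation ignores retransformation bias
pred_log = c(2.553591, 3.828661)    # first two forecasted months, from $pred
se_log   = c(0.1556702, 0.1841463)  # matching standard errors, from $se
exp(pred_log)                        # point forecasts on the 0-100 scale
exp(pred_log - 1.96 * se_log)        # approximate 95% lower bounds
exp(pred_log + 1.96 * se_log)        # approximate 95% upper bounds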
Our forecasting method showed some interesting results. It predicts a decrease in popularity in 2022 and 2023, even after an increase in popularity from 2020 to 2021. This stems from how the COVID pandemic forced us to change the way we modeled the data: the trend from 2004 to 2019 would have been fairly easy to estimate and forecast, as it was a roughly linear increase, but 2020 drastically changed the shape of the trend.
We used a b-spline to account for the pandemic-driven decrease in popularity, and the spline modeled the trend well. However, because the pandemic was so recent, we had very limited data on how popularity would rebound, and splines are more variable near the boundaries of the data. As a result, forecasting with the spline led to the prediction that popularity will decrease over the next few years. We cannot say for certain whether our future predictions are good, but we believe them to be a conservative forecast.
On the other hand, the model appears to forecast the seasonality well: the general shape of previous years is preserved in the forecast. This is due in part to the indicator variable for the pandemic, which kept the drastically different 2020 season from having too large an effect on the seasonal shape.
The results show how much an event like the COVID pandemic can affect statistical analysis. Because the pandemic shortened a season, it changed the distribution of search popularity for 2020 and caused a decrease in popularity from 2019 to 2021, even though there was a fairly strong upward trend from 2004 to 2019. The sudden change, and the lack of data on how popularity would recover, made modeling and forecasting very difficult.
Overall, given the context of the data with the pandemic, our model fits the data very well. On the other hand, the use of a spline, coupled with uncertainty about the actual effect of the pandemic on popularity, leads us to conclude that the forecasting method might not be the best. To create a better forecast, other factors would need to be taken into account; for example, the rebound in popularity is closely tied to the nation's economic rebound. In conclusion, the trend and seasonality are well estimated, but we need more data to forecast accurately.
We would like to thank Brianna Heggeseth, our Correlated Data professor, who taught us the skills necessary to perform this analysis and also provided us with constructive feedback and oversight along the way. In addition, the resources found at https://bcheggeseth.github.io/CorrelatedData/ were used during this project.