Forecasting and Analyzing Wine Vintage Scores Part 3
Important Features
The previous time series analysis suggested that specific weather variables—AvgMonthWindSpd and SumMonthSnowDepth for Pinot Noir, and AvgMonthVis and MaxMonthTempHigh for Cabernet Sauvignon—may influence their respective vintage scores. This raises a key question: are these effects specific to those grape varieties, or are they artifacts of the particular ARIMAX/SARIMAX modeling framework used?
To investigate weather impacts more broadly across all wine types and regions in the Northern Hemisphere data, I’ll first fit a multiple linear regression model incorporating all available weather features and interpret its coefficients to assess which factors are significant in explaining vintage scores. Then, I’ll train a Random Forest model, which better captures non-linear patterns, and use its feature importance rankings to identify the weather variables that emerge as the strongest predictors of vintage scores.
What weather factors affect wine vintage scores?
For both the linear regression and the Random Forest models developed in this section, the predictor set includes all available weather features from the current vintage year, as well as those same features lagged by one year. The purpose of incorporating these one-year lagged variables is to investigate potential carry-over effects from the previous year’s conditions. This allows the models to assess whether factors like the vine’s health status entering the season, which would be influenced by the prior year’s weather, have a discernible impact on the current year’s grape quality and resulting vintage score.
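The one-year lagging step described above can be sketched with pandas. This is illustrative only: the frame, region tags, and the two weather columns shown are stand-ins for the full feature set, and the grouping key is assumed to be the region.

```python
import pandas as pd

# Stand-in frame with one row per (region, vintage year); the column names
# mirror the post's features but the values here are made up.
df = pd.DataFrame({
    "RegionTag": ["KSTS", "KSTS", "KSTS", "LFBD", "LFBD", "LFBD"],
    "Year": [2018, 2019, 2020, 2018, 2019, 2020],
    "AvgMonthWindSpd": [7.1, 6.4, 8.0, 5.5, 5.9, 6.2],
    "MaxMonthPrecip": [2.3, 3.1, 1.8, 4.0, 3.6, 2.9],
})

weather_cols = ["AvgMonthWindSpd", "MaxMonthPrecip"]

# Shift each weather feature by one year within each region, so every row
# also carries the previous vintage's conditions.
df = df.sort_values(["RegionTag", "Year"])
for col in weather_cols:
    df[col + "_Lag"] = df.groupby("RegionTag")[col].shift(1)

# The earliest year in each region has no predecessor; drop those rows
# before modeling.
df = df.dropna()
```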
Linear Regression
For this linear regression model, I’m going to include fixed effect terms for the years and months. The reason for doing so is twofold. First, I want to control for the fact that certain years might be better or worse than others in terms of wine quality. Second, there is no strong reason to assume that each unit increase in year has a uniform linear effect on the vintage score.
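One way to express such fixed effects is with the statsmodels formula API, where `C(Year)` and `C(Month)` expand into dummy variables. The sketch below uses synthetic data and only two weather columns as placeholders for the full predictor set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the real model uses all weather features
# plus their one-year lags.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "VintageScore": rng.normal(90, 3, n),
    "AvgMonthWindSpd": rng.normal(7, 1.5, n),
    "SumMonthSnowDepth": rng.exponential(1.0, n),
    "Year": rng.integers(2010, 2021, n),
    "Month": rng.integers(1, 13, n),
})

# C(Year) and C(Month) create dummy (fixed-effect) terms, giving each year
# and month its own intercept shift rather than forcing a linear trend.
model = smf.ols(
    "VintageScore ~ AvgMonthWindSpd + SumMonthSnowDepth + C(Year) + C(Month)",
    data=df,
).fit()
```

Each `C(Year)[T.<year>]` coefficient is then the average score shift for that year relative to the baseline year, holding the weather variables constant.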
Statistic | Value |
---|---|
Model | OLS |
Method | Least Squares |
Date | Thu, 17 Apr 2025 |
Time | 17:11:22 |
No. Observations | 7007 |
Df Residuals | 6902 |
Df Model | 104 |
Covariance Type | nonrobust |
R-squared | 0.417 |
Adj. R-squared | 0.408 |
F-statistic | 47.48 |
Prob (F-statistic) | 0.00 |
Log-Likelihood | -16347. |
AIC | 3.290e+04 |
BIC | 3.365e+04 |
Looking at the model diagnostics, we see that the R^2 is 0.417, meaning the model explains roughly 42% of the variance in VintageScore. Since the adjusted R^2 (0.408) is quite close to the R^2, the model does not appear to be overfitting by including a large number of irrelevant predictors.
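As a quick sanity check, the reported adjusted R^2 follows directly from the R^2, the number of observations, and the model degrees of freedom shown in the summary:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), using the
# n = 7007 observations and p = 104 model degrees of freedom reported above.
n, p, r2 = 7007, 104, 0.417
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.408, matching the summary
```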
Variable | coef | std err | z | P>|z| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
AvgMonthWindSpd | 0.2798 | 0.158 | 1.776 | 0.076 | -0.029 | 0.589 |
MaxMonthDewHigh | 0.2642 | 0.209 | 1.265 | 0.206 | -0.145 | 0.674 |
AvgMonthMaxPressure | 0.2483 | 0.318 | 0.782 | 0.434 | -0.374 | 0.871 |
MaxMonthPrecip | 0.0266 | 0.113 | 0.236 | 0.813 | -0.194 | 0.247 |
MinMonthMinPressure | -0.7424 | 0.983 | -0.755 | 0.450 | -2.670 | 1.185 |
SumMonthSnowDepth | 2.1032 | 1.065 | 1.975 | 0.048 | 0.016 | 4.190 |
MaxMonthTempHigh | 0.0255 | 0.179 | 0.143 | 0.886 | -0.325 | 0.376 |
AvgMonthVis | -0.2104 | 0.162 | -1.300 | 0.194 | -0.527 | 0.107 |
DaysRainMonth | -0.0028 | 0.117 | -0.024 | 0.981 | -0.233 | 0.227 |
RegionTag_KSTS | -1.9702 | 0.366 | -5.386 | 0.000 | -2.687 | -1.253 |
RegionTag_LFBD | 1.7317 | 0.331 | 5.228 | 0.000 | 1.082 | 2.381 |
ar.L1 | -0.2524 | 0.033 | -7.536 | 0.000 | -0.318 | -0.187 |
ma.L1 | -0.8040 | 0.020 | -39.620 | 0.000 | -0.844 | -0.764 |
sigma2 | 8.1224 | 0.371 | 21.874 | 0.000 | 7.395 | 8.850 |
Several weather variables showed statistically significant effects after controlling for baseline differences associated with region, grape varietal, and year (via year dummies, which were mostly significant).
- Wind: A southerly wind direction (WindDirectionT.S, p=0.014) was positively associated with vintage scores. This aligns with the expectation that, in the Northern Hemisphere, southerly winds often bring warmer air beneficial for ripening. However, AvgMonthWindSpd showed a significant negative coefficient (p=0.029), which contrasts with the positive association found in the Pinot Noir ARIMAX model. It’s possible that while moderate wind benefits specific varieties like Pinot Noir (perhaps via disease reduction), higher average wind speeds across all varieties and regions may have detrimental effects.
- Pressure: MaxMonthMaxPressure (maximum of daily highest pressure in a month) was marginally significant and positive (p=0.057), consistent with the idea that high-pressure systems could indicate stable, sunny weather conducive to good vintages. Its lagged version, MaxMonthMaxPressure_Lag (p=0.001), was also positive and significant, suggesting favorable weather conditions in the previous year may be beneficial to vine health and potential in the current year.
- Precipitation: MaxMonthPrecip (maximum inches of rain in a month) was positive and significant (p=0.034). This might suggest that having at least one substantial rainfall event (perhaps breaking a dry spell or heatwave) during a key month is beneficial. The lagged effect of rain days (DaysRainMonth_Lag, p=0.012) was positive, suggesting that more frequent rainfall in the prior year is beneficial, likely due to recharging deep soil moisture reserves critical for the current season.
- Snow: SumMonthSnowDepth was significantly negative (p=0.014), indicating that higher snow accumulation is generally associated with lower scores in this broad analysis. This contrasts with the positive coefficient seen for Pinot Noir specifically, suggesting the general impact of heavy snow (e.g., potential for damage, delayed season start) might outweigh variety-specific benefits like insulation when averaged across all types.
- Temperature: The lagged minimum lowest daily temperature in a month (MinMonthTempLow_Lag, p=0.006) was positive and significant, possibly indicating that milder conditions (less extreme cold) in the preceding winter promote better vine health and subsequent vintage quality.
- Dew Point: Both MinMonthDewLow (p=0.048), the minimum of the lowest daily dew point, and its lag (MinMonthDewLow_Lag, p=0.013) were positive and significant. This might suggest that avoiding periods of extremely low dew points (which can indicate very dry air causing vine stress, or correlate with very low nighttime temperatures that increase frost risk) is generally beneficial for vine health.
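Pulling the significant predictors out of a fitted results object takes only a few lines. The coefficient values below are placeholders (the p-values echo those discussed above); with a real statsmodels fit they would come from `results.params` and `results.pvalues`.

```python
import pandas as pd

# Placeholder (coefficient, p-value) pairs; with a real fit these would
# come from results.params and results.pvalues.
coefs = pd.Series({"MaxMonthPrecip": 0.21, "SumMonthSnowDepth": -0.35,
                   "AvgMonthVis": -0.10})
pvals = pd.Series({"MaxMonthPrecip": 0.034, "SumMonthSnowDepth": 0.014,
                   "AvgMonthVis": 0.194})

# Keep only predictors significant at the 5% level, sorted by p-value.
significant = (
    pd.DataFrame({"coef": coefs, "pval": pvals})
    .query("pval < 0.05")
    .sort_values("pval")
)
```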
Random Forest
To complement the linear regression, I also trained a Random Forest model. For consistency, I used the same training/test splits employed for the time series models. I fine-tuned the model’s hyperparameters using a Grid Search with cross-validation so that the selected parameters generalize well. The following key hyperparameters were optimized:
- n_estimators (Number of trees in the forest): 100, 200, 500
- max_depth (Maximum depth of individual trees): 5, 10, 20
- min_samples_split (Minimum number of samples required to split an internal node): 2, 5, 10
- min_samples_leaf (Minimum number of samples required to be at a leaf node): 1, 3, 5
- max_features (Number of features to consider when looking for the best split): sqrt of the total features, all features
- bootstrap (Method for sampling data points for training each tree): True, False
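The tuning loop described above can be sketched with scikit-learn's GridSearchCV. The data here is synthetic and the grid is a trimmed version of the one listed (so the sketch runs quickly); the real search would use the weather-feature matrix and the full value lists.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the weather-feature matrix.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Trimmed grid mirroring the hyperparameters listed above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
    "max_features": ["sqrt", None],  # sqrt of the features vs. all features
    "bootstrap": [True, False],
}

# Select the combination with the best cross-validated RMSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
best_rf = search.best_estimator_
```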
Following hyperparameter tuning and training, the best Random Forest model had an RMSE of 3.18 during cross-validation. On the held-out test set, it achieved an RMSE of 1.86, which suggests good predictive accuracy.
A key advantage of Random Forest models is their ability to provide estimates of feature importance, which indicate how much each predictor contributes to the model’s predictive accuracy. Two common measures exist: impurity-based importance, which averages how much each feature’s splits reduce node impurity across the trees, and permutation importance, which measures how much performance degrades when that feature’s values are randomly shuffled.
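Both flavors of importance are available in scikit-learn; a minimal sketch on synthetic data (standing in for the weather features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the weather features.
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: mean reduction in node impurity per feature
# (normalized to sum to 1).
impurity_imp = rf.feature_importances_

# Permutation importance: drop in held-out score when a feature is shuffled.
perm_imp = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                  random_state=0).importances_mean
```

Computing permutation importance on held-out data, as above, guards against the known bias of impurity-based scores toward high-cardinality features.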
Looking at the feature importance scores from the trained Random Forest model (excluding the fixed effect terms like year, month, wine type, and region tags, as these are controlled for), we can identify which weather variables the model found most influential:
Variable | Importance |
---|---|
AvgMonthMaxVis_Lag | 0.052291 |
AvgMonthVis_Lag | 0.043316 |
AvgMonthMaxVis | 0.032641 |
AvgMonthVis | 0.024829 |
MaxMonthPrecip_Lag | 0.021852 |
MaxMonthPrecip | 0.020006 |
AvgMonthMinVis_Lag | 0.016726 |
AvgMonthWindSpd_Lag | 0.016238 |
AvgMonthWindSpd | 0.016089 |
AvgMonthMinVis | 0.015948 |
AvgMonthMaxPressure | 0.012614 |
MaxMonthMaxPressure_Lag | 0.012283 |
AvgMonthMinPressure | 0.012211 |
MaxMonthTempHigh | 0.012084 |
AvgMonthPressure | 0.012026 |
AvgMonthMinPressure_Lag | 0.011674 |
AvgMonthTemp | 0.011430 |
AvgMonthPressure_Lag | 0.011351 |
MaxMonthTempHigh_Lag | 0.011220 |
AvgMonthTempHigh | 0.011173 |
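A ranking like the table above can be produced by filtering the fixed-effect dummies out of the importance scores before sorting. The scores and the dummy-naming convention (`Year_`, `RegionTag_`, etc.) below are illustrative assumptions; with the fitted model the scores would come from `zip(feature_names, rf.feature_importances_)`.

```python
import pandas as pd

# Illustrative importance scores keyed by feature name.
scores = {
    "AvgMonthVis_Lag": 0.043,
    "MaxMonthPrecip": 0.020,
    "Year_2019": 0.030,
    "RegionTag_KSTS": 0.012,
    "AvgMonthWindSpd": 0.016,
}
imp = pd.Series(scores, name="Importance")

# Exclude fixed-effect dummies (year, month, wine type, region tags) so
# that only weather variables are ranked.
fixed_prefixes = ("Year_", "Month_", "WineType_", "RegionTag_")
mask = [not name.startswith(fixed_prefixes) for name in imp.index]
ranking = imp[mask].sort_values(ascending=False)
```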
Several key observations stand out from this list:
- Visibility: Variables related to visibility (average, max, min, both current and lagged) occupy the top four spots and appear multiple times in the top 10. This strongly suggests that atmospheric visibility is the most important weather-related factor for predicting vintage scores according to the Random Forest model.
- Precipitation Importance: The next most important category appears to be precipitation, specifically the maximum monthly precipitation (MaxMonthPrecip and its lag).
- Wind Speed: Wind speed variables (AvgMonthWindSpd and its lag) also feature relatively high on the list.
Takeaways
The Random Forest model produced a ranking of weather features that contrasts with the linear regression results. In particular, visibility variables topped the importance ranking yet were not statistically significant predictors in the linear regression. This difference could stem from how the two models operate:
- Random Forests can capture complex non-linear relationships and interactions between variables automatically. Visibility’s impact might be non-linear (e.g., important up to a certain threshold) or highly dependent on interactions with other factors (like temperature or humidity), which the linear model wouldn’t easily detect without specific interaction terms being added.
- Importance vs. Significance: Feature importance in RF measures the overall contribution to predictive accuracy across potentially complex relationships. Statistical significance in linear regression tests a specific hypothesis about a linear relationship, holding other variables constant. A variable can be crucial for prediction in a non-linear or interactive way (high RF importance) even if its “linear” effect isn’t necessarily significant.
So, what weather factors appear to influence vintage scores most significantly? Comparing the findings from the various analyses conducted, this project suggests that variables related to visibility, wind speed, and maximum precipitation are the most important weather-related factors affecting grape quality and, in turn, vintage scores. If I were to grow my own grapes to produce wine, I’d probably choose a location in the Northern Hemisphere with high year-round visibility, high average southerly winds, and less frequent but heavier rainfall. Asking different AI LLM models revealed that Phoenix, AZ (Perplexity), Antalya, Turkey (Gemini), and the Negev region in southern Israel (ChatGPT) would fit these criteria!