Forecasting and Analyzing Wine Vintage Scores Part 3
Important Features
The previous time series analysis suggested that specific weather variables—AvgMonthWindSpd and SumMonthSnowDepth for Pinot Noir, and AvgMonthVis and MaxMonthTempHigh for Cabernet Sauvignon—may influence their respective vintage scores. This raises a key question: are these effects specific to those grape varieties, or are they artifacts of the particular ARIMAX/SARIMAX modeling framework used?
To investigate weather impacts more broadly across all wine types and regions in the Northern Hemisphere data, I’ll first fit a multiple linear regression model incorporating all available weather features and interpret its coefficients to assess which factors are significant in explaining vintage scores. Then, I’ll train a Random Forest model, which better captures non-linear patterns, and use its feature importance rankings to identify the weather variables that emerge as the strongest predictors of vintage scores.
What weather factors affect wine vintage scores?
For both the linear regression and the Random Forest models developed in this section, the predictor set includes all available weather features from the current vintage year, as well as those same features lagged by one year. The purpose of incorporating these one-year lagged variables is to investigate potential carry-over effects from the previous year’s conditions. This allows the models to assess whether factors like the vine’s health status entering the season, which would be influenced by the prior year’s weather, have a discernible impact on the current year’s grape quality and resulting vintage score.
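The one-year lagging step described above can be sketched with pandas. This is illustrative only: the frame, region tags, and the two weather columns shown are stand-ins for the full feature set, and the grouping key is assumed to be the region.

```python
import pandas as pd

# Stand-in frame with one row per (region, vintage year); the column names
# mirror the post's features but the values here are made up.
df = pd.DataFrame({
    "RegionTag": ["KSTS", "KSTS", "KSTS", "LFBD", "LFBD", "LFBD"],
    "Year": [2018, 2019, 2020, 2018, 2019, 2020],
    "AvgMonthWindSpd": [7.1, 6.4, 8.0, 5.5, 5.9, 6.2],
    "MaxMonthPrecip": [2.3, 3.1, 1.8, 4.0, 3.6, 2.9],
})

weather_cols = ["AvgMonthWindSpd", "MaxMonthPrecip"]

# Shift each weather feature by one year within each region, so every row
# also carries the previous vintage's conditions.
df = df.sort_values(["RegionTag", "Year"])
for col in weather_cols:
    df[col + "_Lag"] = df.groupby("RegionTag")[col].shift(1)

# The earliest year in each region has no predecessor; drop those rows
# before modeling.
df = df.dropna()
```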
Linear Regression
For this linear regression model, I’m going to include fixed effect terms for the years and months. The reason for doing so is twofold. First, I want to control for the fact that certain years might be better or worse than others in terms of wine quality. Second, there is no strong reason to assume that each unit increase in year has a uniform linear effect on the vintage score.
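One way to express such fixed effects is with the statsmodels formula API, where `C(Year)` and `C(Month)` expand into dummy variables. The sketch below uses synthetic data and only two weather columns as placeholders for the full predictor set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the real model uses all weather features
# plus their one-year lags.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "VintageScore": rng.normal(90, 3, n),
    "AvgMonthWindSpd": rng.normal(7, 1.5, n),
    "SumMonthSnowDepth": rng.exponential(1.0, n),
    "Year": rng.integers(2010, 2021, n),
    "Month": rng.integers(1, 13, n),
})

# C(Year) and C(Month) create dummy (fixed-effect) terms, giving each year
# and month its own intercept shift rather than forcing a linear trend.
model = smf.ols(
    "VintageScore ~ AvgMonthWindSpd + SumMonthSnowDepth + C(Year) + C(Month)",
    data=df,
).fit()
```

Each `C(Year)[T.<year>]` coefficient is then the average score shift for that year relative to the baseline year, holding the weather variables constant.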
Statistic | Value |
---|---|
Model | OLS |
Method | Least Squares |
Date | Thu, 17 Apr 2025 |
Time | 17:11:22 |
No. Observations | 7007 |
Df Residuals | 6902 |
Df Model | 104 |
Covariance Type | nonrobust |
R-squared | 0.417 |
Adj. R-squared | 0.408 |
F-statistic | 47.48 |
Prob (F-statistic) | 0.00 |
Log-Likelihood | -16347. |
AIC | 3.290e+04 |
BIC | 3.365e+04 |
Looking at the model diagnostics, we see that the R^2 is 0.417, meaning the model explains roughly 42% of the variance in VintageScore. Since the adjusted R^2 (0.408) is quite close to the R^2, the model does not appear to be overfitting by including a large number of irrelevant predictors.
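As a quick sanity check, the reported adjusted R^2 follows directly from the R^2, the number of observations, and the model degrees of freedom shown in the summary:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), using the
# n = 7007 observations and p = 104 model degrees of freedom reported above.
n, p, r2 = 7007, 104, 0.417
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.408, matching the summary
```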
Variable | coef | std err | z | P>|z| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
AvgMonthWindSpd | 0.2798 | 0.158 | 1.776 | 0.076 | -0.029 | 0.589 |
MaxMonthDewHigh | 0.2642 | 0.209 | 1.265 | 0.206 | -0.145 | 0.674 |
AvgMonthMaxPressure | 0.2483 | 0.318 | 0.782 | 0.434 | -0.374 | 0.871 |
MaxMonthPrecip | 0.0266 | 0.113 | 0.236 | 0.813 | -0.194 | 0.247 |
MinMonthMinPressure | -0.7424 | 0.983 | -0.755 | 0.450 | -2.670 | 1.185 |
SumMonthSnowDepth | 2.1032 | 1.065 | 1.975 | 0.048 | 0.016 | 4.190 |
MaxMonthTempHigh | 0.0255 | 0.179 | 0.143 | 0.886 | -0.325 | 0.376 |
AvgMonthVis | -0.2104 | 0.162 | -1.300 | 0.194 | -0.527 | 0.107 |
DaysRainMonth | -0.0028 | 0.117 | -0.024 | 0.981 | -0.233 | 0.227 |
RegionTag_KSTS | -1.9702 | 0.366 | -5.386 | 0.000 | -2.687 | -1.253 |
RegionTag_LFBD | 1.7317 | 0.331 | 5.228 | 0.000 | 1.082 | 2.381 |
ar.L1 | -0.2524 | 0.033 | -7.536 | 0.000 | -0.318 | -0.187 |
ma.L1 | -0.8040 | 0.020 | -39.620 | 0.000 | -0.844 | -0.764 |
sigma2 | 8.1224 | 0.371 | 21.874 | 0.000 | 7.395 | 8.850 |
Several weather variables showed statistically significant effects after controlling for baseline differences associated with region, grape varietal, and year (via year dummies, which were mostly significant).
- Wind: A southerly wind direction (WindDirectionT.S, p=0.014) was positively associated with vintage scores. This aligns with the expectation that, in the Northern Hemisphere, southerly winds often bring warmer air beneficial for ripening. However, AvgMonthWindSpd showed a significant negative coefficient (p=0.029), which contrasts with the positive association found in the Pinot Noir ARIMAX model. It’s possible that while moderate wind benefits specific varieties like Pinot Noir (perhaps via disease reduction), higher average wind speeds across all varieties and regions may have detrimental effects.
- Pressure: MaxMonthMaxPressure (maximum of daily highest pressure in a month) was marginally significant and positive (p=0.057), consistent with the idea that high-pressure systems could indicate stable, sunny weather conducive to good vintages. Its lagged version, MaxMonthMaxPressure_Lag (p=0.001), was also positive and significant, suggesting favorable weather conditions in the previous year may be beneficial to vine health and potential in the current year.
- Precipitation: MaxMonthPrecip (maximum inches of rain in a month) was positive and significant (p=0.034). This might suggest that having at least one substantial rainfall event (perhaps breaking a dry spell or heatwave) during a key month is beneficial. The lagged effect of rain days (DaysRainMonth_Lag, p=0.012) was positive, suggesting that more frequent rainfall in the prior year is beneficial, likely due to recharging deep soil moisture reserves critical for the current season.
- Snow: SumMonthSnowDepth was significantly negative (p=0.014), indicating that higher snow accumulation is generally associated with lower scores in this broad analysis. This contrasts with the positive coefficient seen for Pinot Noir specifically, suggesting the general impact of heavy snow (e.g., potential for damage, delayed season start) might outweigh variety-specific benefits like insulation when averaged across all types.
- Temperature: The lagged minimum lowest daily temperature in a month (MinMonthTempLow_Lag, p=0.006) was positive and significant, possibly indicating that milder conditions (less extreme cold) in the preceding winter promote better vine health and subsequent vintage quality.
- Dew Point: Both MinMonthDewLow (p=0.048), the minimum of the lowest daily dew point, and its lag (MinMonthDewLow_Lag, p=0.013) were positive and significant. This might suggest that avoiding periods of extremely low dew points (which can indicate very dry air causing vine stress, or correlate with very low nighttime temperatures that increase frost risk) is generally beneficial for vine health.
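Pulling the significant predictors out of a fitted results object takes only a few lines. The coefficient values below are placeholders (the p-values echo those discussed above); with a real statsmodels fit they would come from `results.params` and `results.pvalues`.

```python
import pandas as pd

# Placeholder (coefficient, p-value) pairs; with a real fit these would
# come from results.params and results.pvalues.
coefs = pd.Series({"MaxMonthPrecip": 0.21, "SumMonthSnowDepth": -0.35,
                   "AvgMonthVis": -0.10})
pvals = pd.Series({"MaxMonthPrecip": 0.034, "SumMonthSnowDepth": 0.014,
                   "AvgMonthVis": 0.194})

# Keep only predictors significant at the 5% level, sorted by p-value.
significant = (
    pd.DataFrame({"coef": coefs, "pval": pvals})
    .query("pval < 0.05")
    .sort_values("pval")
)
```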
Random Forest
To complement the linear regression, I also trained a Random Forest model. For consistency, I used the same training/test splits employed for the time series models. I fine-tuned the model’s hyperparameters using a Grid Search with cross-validation so that the selected parameters generalize well. The following key hyperparameters were optimized:
- n_estimators (Number of trees in the forest): 100, 200, 500
- max_depth (Maximum depth of individual trees): 5, 10, 20
- min_samples_split (Minimum number of samples required to split an internal node): 2, 5, 10
- min_samples_leaf (Minimum number of samples required to be at a leaf node): 1, 3, 5
- max_features (Number of features to consider when looking for the best split): sqrt of the total features, all features
- bootstrap (Method for sampling data points for training each tree): True, False
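The tuning loop described above can be sketched with scikit-learn's GridSearchCV. The data here is synthetic and the grid is a trimmed version of the one listed (so the sketch runs quickly); the real search would use the weather-feature matrix and the full value lists.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the weather-feature matrix.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Trimmed grid mirroring the hyperparameters listed above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
    "max_features": ["sqrt", None],  # sqrt of the features vs. all features
    "bootstrap": [True, False],
}

# Select the combination with the best cross-validated RMSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
best_rf = search.best_estimator_
```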
Following hyperparameter tuning and training, the best Random Forest model had an RMSE of 3.18 during cross-validation. On the held-out test set, it achieved an RMSE of 1.86, which suggests good predictive accuracy.
A key advantage of Random Forest models is their ability to provide estimates of feature importance, which indicate how much each predictor contributes to the model’s predictive accuracy. Two common measures exist: impurity-based importance, which averages how much each feature’s splits reduce node impurity across the trees, and permutation importance, which measures how much performance degrades when that feature’s values are randomly shuffled.
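Both flavors of importance are available in scikit-learn; a minimal sketch on synthetic data (standing in for the weather features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the weather features.
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: mean reduction in node impurity per feature
# (normalized to sum to 1).
impurity_imp = rf.feature_importances_

# Permutation importance: drop in held-out score when a feature is shuffled.
perm_imp = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                  random_state=0).importances_mean
```

Computing permutation importance on held-out data, as above, guards against the known bias of impurity-based scores toward high-cardinality features.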
Looking at the feature importance scores from the trained Random Forest model (excluding the fixed effect terms like year, month, wine type, and region tags, as these are controlled for), we can identify which weather variables the model found most influential:
Variable | Importance |
---|---|
AvgMonthMaxVis_Lag | 0.052291 |
AvgMonthVis_Lag | 0.043316 |
AvgMonthMaxVis | 0.032641 |
AvgMonthVis | 0.024829 |
MaxMonthPrecip_Lag | 0.021852 |
MaxMonthPrecip | 0.020006 |
AvgMonthMinVis_Lag | 0.016726 |
AvgMonthWindSpd_Lag | 0.016238 |
AvgMonthWindSpd | 0.016089 |
AvgMonthMinVis | 0.015948 |
AvgMonthMaxPressure | 0.012614 |
MaxMonthMaxPressure_Lag | 0.012283 |
AvgMonthMinPressure | 0.012211 |
MaxMonthTempHigh | 0.012084 |
AvgMonthPressure | 0.012026 |
AvgMonthMinPressure_Lag | 0.011674 |
AvgMonthTemp | 0.011430 |
AvgMonthPressure_Lag | 0.011351 |
MaxMonthTempHigh_Lag | 0.011220 |
AvgMonthTempHigh | 0.011173 |
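A ranking like the table above can be produced by filtering the fixed-effect dummies out of the importance scores before sorting. The scores and the dummy-naming convention (`Year_`, `RegionTag_`, etc.) below are illustrative assumptions; with the fitted model the scores would come from `zip(feature_names, rf.feature_importances_)`.

```python
import pandas as pd

# Illustrative importance scores keyed by feature name.
scores = {
    "AvgMonthVis_Lag": 0.043,
    "MaxMonthPrecip": 0.020,
    "Year_2019": 0.030,
    "RegionTag_KSTS": 0.012,
    "AvgMonthWindSpd": 0.016,
}
imp = pd.Series(scores, name="Importance")

# Exclude fixed-effect dummies (year, month, wine type, region tags) so
# that only weather variables are ranked.
fixed_prefixes = ("Year_", "Month_", "WineType_", "RegionTag_")
mask = [not name.startswith(fixed_prefixes) for name in imp.index]
ranking = imp[mask].sort_values(ascending=False)
```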
Several key observations stand out from this list:
- Visibility: Variables related to visibility (average, max, min, both current and lagged) occupy the top four spots and appear multiple times in the top 10. This strongly suggests that atmospheric visibility is the most important weather-related factor for predicting vintage scores according to the Random Forest model.
- Precipitation Importance: The next most important category appears to be precipitation, specifically the maximum monthly precipitation (MaxMonthPrecip and its lag).
- Wind Speed: Wind speed variables (AvgMonthWindSpd and its lag) also feature relatively high on the list.
Takeaways
The Random Forest model produced a ranking of weather features that contrasts with the linear regression results. In particular, visibility variables topped the importance ranking yet were not statistically significant predictors in the linear regression. This difference could stem from how the two models operate:
- Random Forests can capture complex non-linear relationships and interactions between variables automatically. Visibility’s impact might be non-linear (e.g., important up to a certain threshold) or highly dependent on interactions with other factors (like temperature or humidity), which the linear model wouldn’t easily detect without specific interaction terms being added.
- Importance vs. Significance: Feature importance in RF measures the overall contribution to predictive accuracy across potentially complex relationships. Statistical significance in linear regression tests a specific hypothesis about a linear relationship, holding other variables constant. A variable can be crucial for prediction in a non-linear or interactive way (high RF importance) even if its “linear” effect isn’t necessarily significant.
So, what weather factors appear to influence vintage scores most significantly? Comparing the findings from the various analyses conducted, this project suggests that variables related to visibility, wind speed, and maximum precipitation are the most important weather-related factors affecting grape quality and, in turn, vintage scores. If I were to grow my own grapes to produce wine, I’d probably choose a location in the Northern Hemisphere with high year-round visibility, high average southerly winds, and less frequent but heavier rainfall. Asking different AI LLM models revealed that Phoenix, AZ (Perplexity), Antalya, Turkey (Gemini), and the Negev region in southern Israel (ChatGPT) would fit these criteria!