Forecasting and Analyzing Wine Vintage Scores

Go to Part 2 Forecasting
Go to Part 3 Important Features


Drinking wine is one of my favorite hobbies, and I genuinely get excited each time I have a glass as I try to figure out how ‘good’ a wine tastes. It’s fun to compare my own subjective ratings with more ‘objective’ vintage scores. Different organizations hand these out, and all of them in some sense score a wine based on the year it was made.


I don’t know much about how these vintages are scored, but what I find really interesting is the question of what makes a vintage ‘good’ or ‘bad’ in the first place. I’ve heard people throw around comments like “2020 was a great year because the weather was favorable”. But what does that actually mean? Unless you’re a seasoned grape grower, it’s hard to know what contributes to a vintage score.

Does weather predict wine quality?

Outline

The idea behind this project is to connect the dots and see whether actual weather data can tell us something about why certain years get higher vintage scores. I’ll approach this in two ways. First, I’ll see whether I can build a model that predicts wine vintage scores from historical weather data. Second, I’ll analyze the historical weather data to identify which factors contribute the most to those scores.

Data

I collected wine vintage scores from Wine Enthusiast. Scores can theoretically range from 0 to 100, but in this data they range from 81 to 100. The data covers years from 1997 to 2025. I targeted popular wine-producing countries and noted the grape/wine type for each region within that country (some regions have more than one wine type). One limitation is that, in reality, many regions produce more wine types than the one to three listed for them. For instance, Napa Valley produces Syrah, which is not listed for that region on Wine Enthusiast. These unlisted wines are outside the scope of this project. In the end, the dataset contained 16 regions across the U.S., Italy, Australia, Argentina, Chile, and France, for a total of 26 unique wine-region combinations.


The table below shows some descriptive statistics of the wine vintage scores.


Statistic             Value
Mean                  91.67
Standard Deviation     3.27
Median                92.00
Minimum               81.00
Maximum              100.00


Corresponding historical weather data for the same 16 regions was sourced from WeatherSpark. The site requires a paid subscription, so to download the data, I paid for a month (sad). One major limitation is that the historical data comes from the nearest airport meteorological station. The weather data location therefore only approximates the actual vineyard areas: it reflects general weather conditions across a broader region, not the specific microclimates of individual vineyard plots. Because of this, I assume that the regional weather can serve as a proxy for the conditions affecting all listed wine types in that region.


The historical weather data contains daily observations, but for this analysis I aggregated it to the monthly level. To handle missing observations, I imputed them hierarchically: first via forward fill (using the previous month’s value), then backward fill (using the next month’s value), and finally using the overall mean, or zero where that made logical sense, for any remaining missing entries. To get a sense of the weather data, let’s look at the variables that were tracked. Since there are more than 10 variables, and not all of them were consistently recorded, the table below shows only the features that I plan to include in the modeling and analysis.
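To make the aggregation and imputation steps concrete, here’s a minimal sketch in pandas. It is not the exact code behind this project. The daily column names (`Region`, `Date`, `Temp`, `Precip`) and both function names are hypothetical, only two of the monthly features are built, and I’m assuming that the forward and backward fills run within each region and that the mean fallback is taken over the observed (pre-fill) monthly values.

```python
import numpy as np
import pandas as pd


def aggregate_monthly(daily: pd.DataFrame) -> pd.DataFrame:
    """Collapse daily rows to one row per region and month."""
    daily = daily.copy()
    daily["Month"] = daily["Date"].dt.to_period("M")
    return (
        daily.groupby(["Region", "Month"])
        .agg(
            AvgMonthTemp=("Temp", "mean"),
            # min_count=1 keeps a month with no readings as NaN instead of 0
            SumMonthPrecip=("Precip", lambda s: s.sum(min_count=1)),
        )
        .reset_index()
    )


def impute_hierarchical(monthly: pd.DataFrame,
                        zero_fill=("SumMonthPrecip",)) -> pd.DataFrame:
    """Forward fill, then backward fill, then fall back to mean or zero."""
    monthly = monthly.sort_values(["Region", "Month"]).copy()
    value_cols = [c for c in monthly.columns if c not in ("Region", "Month")]

    # Fallback means come from observed values only (an assumption).
    observed_means = monthly[value_cols].mean()

    # Steps 1 and 2: fill from the previous, then the next, month
    # within the same region (also an assumption).
    monthly[value_cols] = monthly.groupby("Region")[value_cols].ffill()
    monthly[value_cols] = monthly.groupby("Region")[value_cols].bfill()

    # Step 3: anything still missing gets zero or the overall mean.
    for col in value_cols:
        fallback = 0.0 if col in zero_fill else observed_means[col]
        monthly[col] = monthly[col].fillna(fallback)
    return monthly


# Tiny made-up example: region B has no usable readings at all.
daily = pd.DataFrame({
    "Region": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(
        ["2020-01-15", "2020-02-15", "2020-03-15", "2020-01-15"]
    ),
    "Temp": [50.0, np.nan, 60.0, np.nan],
    "Precip": [0.1, 0.2, np.nan, np.nan],
})
monthly = impute_hierarchical(aggregate_monthly(daily))
print(monthly)
```

In this toy data, region A’s February temperature is carried forward from January, while region B falls through to the last step: its temperature gets the overall mean and its precipitation gets zero.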


Variable               Mean  Median  Standard Deviation  Minimum  Maximum  Units
AvgMonthTemp          58.07   56.41               15.47    28.81   152.80  F
AvgMonthTempLow       47.45   46.05               12.06    18.53   140.73  F
AvgMonthTempHigh      68.69   66.88               20.25    34.95   173.50  F
AvgMonthDew           46.23   46.25                8.62    18.59    69.46  F
AvgMonthDewLow        41.46   41.29                8.81    10.23    64.75  F
AvgMonthDewHigh       51.00   51.16                8.62    19.40    79.79  F
AvgMonthWindSpd        7.30    7.09                1.79     0.86    23.73  mph
AvgMonthVis            7.32    6.31                3.30     1.98    25.13  mi
AvgMonthMinVis         4.56    4.40                1.83     0.39    15.56  mi
AvgMonthMaxVis        10.08    9.77                6.02     3.28    40.49  mi
AvgMonthPressure      30.03   30.01                0.17    28.95    34.78  inHg
AvgMonthMinPressure   29.95   29.94                0.14    27.30    30.42  inHg
AvgMonthMaxPressure   30.11   30.08                0.29    29.72    39.75  inHg
MaxMonthTempHigh      84.38   80.60               25.95    42.80   206.60  F
MaxMonthDewHigh       60.71   60.08               11.57    19.40   210.20  F
MaxMonthMaxWindSpd    26.21   24.17               13.04     3.45   391.26  mph
MaxMonthMaxPressure   30.73   30.35                7.46    29.98   295.27  inHg
MaxMonthSnowDepth      0.66    0.00                4.03     0.00    92.52  in
MaxMonthPrecip         0.34    0.05                0.58     0.00     6.05  in
MinMonthTempLow       36.33   35.60               11.16    -5.80    68.00  F
MinMonthDewLow        26.73   28.04               14.53  -142.60    55.40  F
MinMonthMinPressure   29.46   29.62                1.84     0.00    30.21  inHg
SumMonthPrecip         0.88    0.00                2.08     0.00    17.99  in
SumMonthSnowDepth      0.18    0.00                2.81     0.00    92.52  in
DaysRainMonth          3.72    0.00                6.18     0.00    30.00  days

Basic Descriptives

Let’s also explore the data a bit more by looking at how VintageScore varies by wine type.


WineType            Mean VintageScore
Amarone             90.62
Barolo              94.05
Bolgheri            91.42
Cabernet Sauvignon  91.86
Chablis             93.04
Chardonnay          91.18
Chenin Blanc        92.00
Chianti             91.62
Gamay               91.38
Gewurztraminer      91.27
Merlot              93.00
Pinot Noir          91.95
Semillon            92.54
Soave               90.15
Syrah               92.79
Zinfandel           90.04


It’s also interesting to look at the highest VintageScore given to each wine type.


WineType            Max VintageScore
Amarone              94
Barolo               99
Bolgheri             97
Cabernet Sauvignon  100
Chablis              96
Chardonnay           96
Chenin Blanc         96
Chianti              96
Gamay                96
Gewurztraminer       95
Merlot               98
Pinot Noir           98
Semillon             96
Soave                94
Syrah                99
Zinfandel            94


Finally, how does VintageScore vary based on the year?


Year Mean VintageScore
1998 88.91
1999 89.41
2000 88.23
2001 91.86
2002 88.62
2003 88.90
2004 91.04
2005 92.17
2006 90.13
2007 91.61
2008 90.74
2009 92.94
2010 93.09
2011 91.09
2012 92.23
2013 91.64
2014 91.91
2015 93.95
2016 93.91
2017 92.04
2018 92.78
2019 93.91
2020 92.09
2021 93.26
2022 92.87
2023 93.35
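Summary tables like the ones above are a few lines of pandas each. The sketch below shows one way to produce them, assuming the scores live in a long-format table with one row per wine type and year. The `scores` frame and its values are made up for illustration; only the column names `WineType`, `Year`, and `VintageScore` mirror the tables in this post.

```python
import pandas as pd

# Made-up scores, NOT the Wine Enthusiast data.
scores = pd.DataFrame({
    "WineType": ["Barolo", "Barolo", "Soave", "Soave"],
    "Year": [2015, 2016, 2015, 2016],
    "VintageScore": [95, 99, 90, 92],
})

# Overall descriptive statistics of the score column.
overall = scores["VintageScore"].agg(["mean", "std", "median", "min", "max"])

# Mean and highest score per wine type.
by_type = scores.groupby("WineType")["VintageScore"].agg(["mean", "max"])

# Mean score per vintage year.
by_year = scores.groupby("Year")["VintageScore"].mean().round(2)

print(overall, by_type, by_year, sep="\n\n")
```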


A plot I always like to make is a correlation heatmap of my predictor variables of interest against the main dependent variable, VintageScore.

Figure: correlation heatmap of the weather variables against VintageScore


Variables that are significantly correlated with VintageScore are marked with *, **, or ***, representing significance at the p < 0.05, p < 0.01, and p < 0.001 levels, respectively.
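For readers who want to reproduce the significance stars, here’s a rough sketch that computes each predictor’s correlation with VintageScore and maps its p-value to stars. I’m assuming Pearson correlation via `scipy.stats.pearsonr`, since the post doesn’t say which coefficient the heatmap uses. The helper names `stars` and `correlate_with_target` are hypothetical, and the numbers are invented (the `Noise` column is just filler).

```python
import pandas as pd
from scipy.stats import pearsonr


def stars(p: float) -> str:
    """Map a p-value to the significance markers used in the heatmap."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


def correlate_with_target(df: pd.DataFrame,
                          target: str = "VintageScore") -> pd.DataFrame:
    """Pearson r and p-value of every other column against the target."""
    rows = []
    for col in df.columns.drop(target):
        pair = df[[col, target]].dropna()  # use complete pairs only
        r, p = pearsonr(pair[col], pair[target])
        rows.append({"Variable": col, "r": r, "p": p, "Sig": stars(p)})
    return pd.DataFrame(rows).set_index("Variable")


# Invented numbers for illustration only.
df = pd.DataFrame({
    "VintageScore": [88, 90, 91, 93, 94, 96, 97, 99],
    "AvgMonthWindSpd": [5.0, 5.5, 6.1, 6.8, 7.0, 7.9, 8.2, 9.0],
    "DaysRainMonth": [9, 8, 8, 6, 5, 4, 2, 1],
    "Noise": [3, 1, 4, 1, 5, 9, 2, 6],
})
table = correlate_with_target(df)
print(table)
```

The `r` column is what a heatmap would color, and the `Sig` column is what would be printed as an annotation in each cell.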


Interestingly, AvgMonthWindSpd, the average monthly wind speed, is positively correlated with VintageScore: months with higher wind speeds tend to go with higher vintage scores. I know nothing about growing vines, but this seems counter-intuitive, since higher wind speeds could mean more potential damage to the grape vines. However, some quick research suggests that wind could act as a proxy for good air circulation, which benefits the vines by drying them (preventing damp conditions) and moderating temperatures.


We also see that DaysRainMonth, the number of days on which it rained in a month, is negatively associated with VintageScore: the more days it rained in a month, the lower the VintageScore tends to be. This could make sense because too much rain could mean that 1) the grape vines are more susceptible to disease (moisture helps fungal spores) and 2) there are fewer sunny days, so less photosynthesis.


Next, we’ll move on to building a time series model that can potentially predict vintage scores!