========================================================

This report explores a data set containing chemical attributes and quality scores for approximately 1600 samples of red wine.

Univariate Plots Section

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Our data set consists of 12 variables with 1,599 observations.

The ‘X’ variable is merely the row number, so it was dropped after import. Quality is the median of at least three experts’ ratings.

Here I add a Boolean factor variable called ‘badfso’ to denote whether a sample has a free sulfur dioxide value above 50 ppm, the threshold at which it becomes noticeable in taste and smell.

A histogram of this variable shows the overwhelming majority of samples below the 50 ppm threshold for a ‘bad’ free sulfur dioxide value.
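As a minimal sketch, the flag could be built roughly like this. It uses only the first ten free sulfur dioxide values from the str() output above, and the data frame name `wine` is an assumption, so the counts only illustrate the mechanics.

```r
# First ten free sulfur dioxide values, copied from the str() output above.
# The data frame name `wine` is an assumption for illustration.
wine <- data.frame(
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
)

# Flag samples above the 50 ppm threshold and store the flag as a factor.
# Both levels are listed so an empty level still appears in the counts.
wine$badfso <- factor(wine$free.sulfur.dioxide > 50, levels = c(FALSE, TRUE))

# Count samples on each side of the threshold.
fso_counts <- table(wine$badfso)
print(fso_counts)
```

In these ten rows every sample falls below the threshold, which matches the general picture in the histogram.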

Unfortunately the quality is somewhat imbalanced. Perhaps applying a log10 transform will help.

This transformed quality histogram looks much more normal.

This fixed acidity histogram looks fairly normal without a log transform. Let’s try one anyway.

Looks a bit more normal, but not significantly so.

If we omit the low counts to the right, this volatile acidity histogram looks pretty normal.

Yes, omitting those low counts seems to help the shape.

This long-tailed histogram of citric acid seems to have a large bin of samples with less than 0.01 g/L of citric acid. My guess would be that citric acid might be an additive that some winemakers choose to omit when crafting their wine.

I also see a smaller spike at ~0.5, which could potentially be a result of less precise measurements or maybe a value that some winemakers aim to have in their wine.

Transforming this citric acid histogram seems to confirm our suspicions, showing a bimodal shape with peaks at about 0.25 and 0.5.

Initially this residual sugar histogram looks somewhat long-tailed, but limiting the x axis should give us a more normal shape.

That looks a bit better.

This chlorides histogram looks like it would be normal if we remove some of these potential outliers.

This limit helps confirm a normal shape.

This free sulfur dioxide histogram appears long-tailed, so we’ll try transforming it to better understand its shape.

This scale helps us see its unimodal distribution.

Just as with free sulfur dioxide, we see a long-tailed histogram for total sulfur dioxide. Let’s try another scale_x_log10 layer.

Once again applying scale_x_log10 helps us see the normal, unimodal shape.
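A minimal sketch of a histogram with a log10 x axis, assuming ggplot2 is installed. It uses only the first ten total sulfur dioxide values from the str() output above; the data frame name `wine` and the bin count are my own choices for illustration.

```r
library(ggplot2)

# First ten total sulfur dioxide values from the str() output above.
# `wine` is an assumed name; ten rows only keep the sketch runnable.
wine <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Histogram with a log10-scaled x axis; the bin count is an arbitrary choice.
p <- ggplot(wine, aes(x = total.sulfur.dioxide)) +
  geom_histogram(bins = 5) +
  scale_x_log10()
```

Printing `p` draws the plot. The scale changes only the axis, so the tick labels stay in the original units.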

This density histogram already shows a normal shape, doesn’t seem to need a log10 scale.

‘pH’ shows a normal shape for its histogram.

Sulphates looks fairly normal here.

I think one could argue this alcohol histogram is either long-tailed or normal.

Applying a square root scale doesn’t seem to change the shape.

Looks to be a somewhat bimodal, possibly multimodal shape here.

What is the structure of your dataset?

There are 1,599 wine samples in the data set with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). Quality is an output variable of integer type. The remaining 11 variables are input variables of numeric type.

Other observations:

  • There are many more average wines than ‘poor’ or ‘excellent’ wines. The vast majority score between 5 and 7.
  • The median quality is 6, and the maximum quality score is 8.
  • All wines have a pH far below 7 (2.74 to 4.01), making them acidic.

What is/are the main feature(s) of interest in your dataset?

From this data set I am most interested in the ‘quality’ variable and how it interacts with:

  • volatile acidity

“volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”

  • citric acid

“citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines”

  • free sulfur dioxide

“…at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The ‘residual sugar’ variable might also have been of interest since it is an indicator of sweetness. Seeing whether there was a perceived quality difference between sweet and dry red wines might have been interesting. Unfortunately all the wines in the data set fell far below the 45 g/L (g/dm^3) threshold for sweet wine (the maximum is 15.5 g/L), so none of the observations counts as a sweet wine.

Did you create any new variables from existing variables in the dataset?

Apart from the Boolean ‘badfso’ flag described earlier, I would’ve liked to create some categorical variables for input variables with thresholds that define a wine characteristic (sweet/dry, acidic/basic). Unfortunately the samples all fell on one side of each threshold.
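To illustrate why such a variable would not have helped, here is a minimal sketch using the 45 g/L sweetness threshold on the first ten residual sugar values from the str() output above. The labels ‘not sweet’ and ‘sweet’ are hypothetical names of my own.

```r
# First ten residual sugar values (g/L) from the str() output above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Split at the 45 g/L sweetness threshold.
# The labels "not sweet" and "sweet" are hypothetical.
sweetness <- cut(
  residual.sugar,
  breaks = c(-Inf, 45, Inf),
  labels = c("not sweet", "sweet")
)

# Every sample lands in the same level, so the factor carries no information.
sweet_counts <- table(sweetness)
print(sweet_counts)
```

With the full data set the result would be the same, since the maximum residual sugar in the summary is 15.5 g/L.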

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The only change I made was to remove the ‘X’ field, as it was redundant with R’s own row names. The source described the data as being tidy already.

Bivariate Plots Section

Although scatter plots generally aren’t great for analyzing categorical data, with a jittered scatter plot we’re able to see a clear trend: as volatile acidity increases, quality scores go down. Additionally, let’s treat quality as a factor and use that to generate a box plot instead.

This makes it much easier to see the pattern. As the quality rating increases, both the median and the mean volatile acidity decrease. This is true for each rating with the exception of quality rating 8, for which there was very little data.
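The per-rating medians and means behind a box plot like this can be computed directly. A minimal sketch, using only the first ten rows from the str() output above (so only ratings 5, 6 and 7 appear) and an assumed data frame name `wine`:

```r
# First ten rows of two columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median and mean volatile acidity per quality score.
med_by_quality  <- tapply(wine$volatile.acidity, wine$quality, median)
mean_by_quality <- tapply(wine$volatile.acidity, wine$quality, mean)
print(round(cbind(median = med_by_quality, mean = mean_by_quality), 3))
```

Ten rows are far too few to reproduce the trend reliably; the point is only how the summary is built.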

## [1] -0.3905578

This confirms the plot findings showing a medium-strength negative correlation between quality and volatile acidity.
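A minimal sketch of the calculation, again on only the first ten rows from the str() output above, so the value will not match the full-data figure of -0.39. The Spearman variant is my own addition rather than something used in the analysis; it may suit an ordinal score like quality.

```r
# First ten values of each column, copied from the str() output above.
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation, the default method of cor().
r_pearson <- cor(volatile.acidity, quality)

# Rank-based alternative (an assumption, not part of the original analysis).
r_spearman <- cor(volatile.acidity, quality, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```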

Here we see another clear pattern. In this case, as citric acid increases, quality increases too.

## [1] 0.2263725

This seems to indicate a small positive correlation between citric acid and quality.

This graph is perhaps the most interesting so far. Free sulfur dioxide seems to rise with quality up to a score of 5, after which it trends downward. This seems consistent with the description in wineQualityInfo.txt, which states:

“…at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”

However, since the relationship doesn’t look linear, the correlation coefficient may not give us the full picture.

## [1] -0.05065606

This is a case where the correlation coefficient seems to have failed us.
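The reason is that Pearson’s coefficient only measures linear association. A small sketch with synthetic values (not taken from the wine data) shows a relationship that rises and then falls producing a coefficient of essentially zero:

```r
# Synthetic values, not taken from the wine data: y rises and then falls with x.
x <- -3:3
y <- -(x^2)

# The relationship is perfectly deterministic, yet not linear.
r_nonlinear <- cor(x, y)
print(r_nonlinear)
```

So a coefficient near zero, like the -0.05 above, does not rule out a strong non-linear pattern.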

The quality attribute takes on so few values it effectively acts like an ordered factor. With that in mind we’ll add a factor attribute based on quality below.
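A minimal sketch of that conversion. The variable name `quality.factor` is hypothetical, and the levels 3 to 8 are taken from the minimum and maximum in the summary above.

```r
# First ten quality scores from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Ordered factor; levels 3 to 8 span the range reported in the summary.
# The name `quality.factor` is hypothetical.
quality.factor <- factor(quality, levels = 3:8, ordered = TRUE)

print(table(quality.factor))
```

Listing the levels explicitly keeps empty scores in tables and plots, and the ordering lets comparisons such as `quality.factor >= "6"` work.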

Unfortunately this faceted plot doesn’t seem very useful partly because the classes(quality scores) aren’t balanced.

Below we’ll use ggpairs to generate a grid of pairwise plots. Quickly looking over our variables of interest to see if we missed anything important.

Nothing stands out, but this may be useful for reference later.
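As a compact numeric companion to the ggpairs grid, a correlation matrix gives the same pairwise coefficients in one table. This is a sketch in base R on the first ten rows from the str() output above, with an assumed data frame name `wine`, so the values differ from the full-data ones.

```r
# First ten rows of the variables of interest, from the str() output above.
wine <- data.frame(
  volatile.acidity    = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid         = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  quality             = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(wine), 2)
print(cor_matrix)
```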

Bivariate Analysis

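We saw three different types of relationships: a negative one between volatile acidity and quality, a positive one between citric acid and quality, and a non-linear one between free sulfur dioxide and quality.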

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

From glancing over the ggpairs output, it was interesting to note the strong positive correlation between citric acid and fixed acidity, in contrast to its strong negative correlation with volatile acidity.

What was the strongest relationship you found?

Among the features of interest, the strongest relationship was between volatile acidity and quality.

Across the entire data set, the strongest correlation I saw was a strong negative correlation of -0.683 between ‘fixed acidity’ and ‘pH’.

## [1] -0.6829782

Multivariate Plots Section

Presumably, citric acid would account for some of the volatile acidity. However, their effects on quality are opposite. As seen in this scatter plot, there is a cluster of high quality samples with low volatile acidity and high citric acid values.

In order to see all the points here we use position with the ‘jitter’ value. Alpha was also used to help give the points some transparency.
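A minimal sketch of such a layer, assuming ggplot2 is installed. The axis and colour mappings are my guess at the plot described above, the alpha value is arbitrary, and the data are only the first ten rows from the str() output.

```r
library(ggplot2)

# First ten rows from the str() output above; `wine` is an assumed name.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Jitter spreads overlapping points; alpha makes dense regions visible.
# The mappings and alpha = 0.5 are assumptions for illustration.
p <- ggplot(wine, aes(x = volatile.acidity, y = citric.acid,
                      colour = factor(quality))) +
  geom_point(position = "jitter", alpha = 0.5)
```

Because jitter adds random noise, the exact point positions change between runs unless a seed is set.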

This confirms what we suspected earlier. Free sulfur dioxide improves quality by preserving freshness, but if the levels become too high it affects the taste. None of the samples with free SO2 over 50 ppm achieved a quality of 8.

There seems to be greater dispersion between samples as total SO2 and free SO2 increase. At lower values they seem more tightly clustered together.

I don’t see a clear relationship between these variables. This makes sense because citric acid and free sulfur dioxide don’t seem to be related chemically.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the first multivariate plot I had expected to see citric acid go up as volatile acidity went up, but I didn’t really see this. The plot did reinforce what we saw in our bivariate box plots, with high citric acid and low volatile acidity being beneficial to quality.

In the second multivariate scatter plot we saw that none of the top quality samples had a free sulfur dioxide value over 50 ppm.

In our third multivariate scatter plot we didn’t see a clear relationship between all three variables, so that was a bit of a dead end.

The final multivariate plot was a line plot, attempting to find a relationship between citric acid, free sulfur dioxide, and quality. I was unable to identify any distinct trends.

Were there any interesting or surprising interactions between features?

I had trouble finding notable multivariate interactions between our variables of interest. Nothing surprising went against the statements made in the data set description text file.

OPTIONAL: Did you create any models with your dataset?

No models were created.


Final Plots and Summary

Ultimately, it seems the bivariate plots we created provided the most relevant insights.

Plot One

Description One

Here we see a distinct trend: as volatile acidity goes down, the quality rating goes up. Keeping the volatile acidity low in wine looks to be a key component of receiving a high quality rating.

Plot Two

Description Two

Once again we see a clear trend. A higher citric acid level in wine is associated with a higher quality rating. In this case the relationship is positive and roughly linear, though the correlation (0.23) is small.

Plot Three

Description Three

Here we see a non-linear relationship between free sulfur dioxide (FSO2) and quality. This was in my mind the most interesting observation. It seems that free sulfur dioxide acts as a ‘double-edged blade’. It helps prevent microbial growth and oxidation in wine, key factors affecting quality. The downside is that if the levels become too high it affects taste and nose (smell).

This is made very evident in this box plot figure. The lowest scores show the lowest levels of FSO2. As a result, these wines likely experienced oxidation and/or microbial growth that adversely affected their quality.

The medium quality scores (5 and 6) had the highest FSO2 concentrations. It seems most likely that in this case oxidation and microbial growth were largely prevented, helping these wines achieve higher quality scores. FSO2’s impact on taste and nose at high levels likely prevented them from achieving the highest quality scores.

The highest quality scores showed lower levels of FSO2 than their mid-tier counterparts. It is likely that at the highest quality level alternate methods were used to prevent oxidation and microbial growth, presumably a more expensive process than simply keeping FSO2 high.


Reflection

Coming from a background in Python, I did not like R at first. After working more with R throughout this course I grew to like it a lot. RStudio was an amazing environment to work in. The multi-pane configuration with notebook, console, plots, and environment data was extremely useful.
The visualization libraries are fantastic and the tidyr documentation was very helpful. I felt I learned a lot about EDA and R, but I also realized that I still have a lot to learn regarding EDA. I feel I struggled a bit with creating useful insights, but that may be partly due to my limited background knowledge of the data set.

Ultimately, this data set seemed not to be great for multivariate analysis. Had the data set contained information like grape types, wine brand, wine selling price, etc., I feel much more interesting analysis could have been done. Regardless, I did see some interesting bivariate relationships between the input variables of interest and their impact on the output variable ‘quality’.

An idea for future work regarding this data set would be to create a linear model to predict quality from applicable input variables in this data set.
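A rough sketch of what a starting point might look like, using base R’s lm(). The choice of predictors is an assumption based on the variables of interest above, and the ten rows from the str() output are only there to keep the example runnable; a real model would need the full data set and proper validation.

```r
# First ten rows from the str() output above; `wine` is an assumed name.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares fit; the predictor choice is an assumption.
fit <- lm(quality ~ volatile.acidity + citric.acid, data = wine)

# Estimated intercept and slopes.
print(coef(fit))
```

Since quality is an ordered score rather than a continuous measurement, an ordinal model might also be worth considering.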