Example .Rmd file provided by Prof Kam for ISSS608 Visual Analytics
Correlation coefficient is a popular statistic that use to measure the type and strength of the relationship between two variables. The values of a correlation coefficient ranges between -1.0 and 1.0. A correlation coefficient of 1 shows a perfect linear relationship between the two variables, while a -1.0 shows a perfect inverse relationship between the two variables. A correlation coeficient of 0.0 shows no linear relationship between the two variables.
When multivariate data are used, the correlation coeficeints of the pair comparisons are displayed in a table form known as correlation matrix or scatterplot matrix.
There are three broad reasons for computing a correlation matrix.
When the data is large, both in terms of the number of observations and the number of variables, Corrgram tend to be used to visually explore and analyse the structure and the patterns of relations among variables. It is designed based on two main schemes:
In this hands-on exercise, you will learn how to plot data visualisation for visualising correlation matrix with R. It consists of three main sections. First, you will learn how to create correlation matrix using pairs() of R Graphics. Next, you will learn how to plot corrgram using corrplot package of R Lastly, you will learn how to create an interactive correlation matrix using plotly R. .
Before you get started, you are required:
Next, you will use the code chunk below to install and launch corrplot, ggpubr, plotly and tidyverse in RStudio.
packages = c('ggpubr', 'tidyverse')
for(p in packages){library
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
In this hands-on exercise, the Wine Quality Data Set of UCI Machine Learning Repository will be used. The data set consists of 13 variables and 6497 observations. For the purpose of this exercise, we have combined the red wine and white wine data into one data file. It is called wine_quality and is in csv file format.
First, let us import the data into R by using read_csv() of readr package.
wine <- read_csv("data/wine_quality.csv")
Notice that beside quality and type, the rest of the variables are numerical and continuous data type.
In the figure below, multiple histograms are plottted to reveal the distribution of the selected variables in the wine quality data sets.
The code chunks used to create the data visualisation consists of two main parts. First, we will create the individual histograms using the code chunk below.
fa <- ggplot(data=wine, aes(x= `fixed acidity`)) +
geom_histogram(bins=20, color="black", fill="light blue")
va <- ggplot(data=wine, aes(x= `volatile acidity`)) +
geom_histogram(bins=20, color="black", fill="light blue")
ca <- ggplot(data=wine, aes(x= `citric acid`)) +
geom_histogram(bins=20, color="black", fill="light blue")
rs <- ggplot(data=wine, aes(x= `residual sugar`)) +
geom_histogram(bins=20, color="black", fill="light blue")
ch <- ggplot(data=wine, aes(x= `chlorides`)) +
geom_histogram(bins=20, color="black", fill="light blue")
fSO2 <- ggplot(data=wine, aes(x= `free sulfur dioxide`)) +
geom_histogram(bins=20, color="black", fill="light blue")
tSO2 <- ggplot(data=wine, aes(x= `total sulfur dioxide`)) +
geom_histogram(bins=20, color="black", fill="light blue")
density <- ggplot(data=wine, aes(x= density)) +
geom_histogram(bins=20, color="black", fill="light blue")
pH <- ggplot(data=wine, aes(x= pH)) +
geom_histogram(bins=20, color="black", fill="light blue")
sulphates <- ggplot(data=wine, aes(x= sulphates)) +
geom_histogram(bins=20, color="black", fill="light blue")
alcohol <- ggplot(data=wine, aes(x= alcohol)) +
geom_histogram(bins=20, color="black", fill="light blue")
Next, the ggarrange function of ggpur package is used to group these histograms together.
ggarrange(fa, va, ca, rs, ch, fSO2, tSO2, density, pH, sulphates, alcohol,
ncol = 4, nrow = 3)