AGRON INFO TECH

A simple way to create a scatter plot showing correlation and significance in R

What is a scatterplot?

A scatterplot is a type of graph used to display the relationship between two continuous variables. It is called a scatterplot because it displays the individual data points as scattered dots on the graph.

Each dot in a scatterplot represents a single observation. One variable is plotted on the horizontal axis and the other on the vertical axis, so each dot's position shows the values of both variables for that observation. In this blog post we shall create an elegantly styled scatter plot in R, annotated with the correlation coefficient and its significance and finished with a trend line.

Loading iris data

The iris dataset is a commonly used dataset in machine learning and data analysis. It contains measurements of the sepal length, sepal width, petal length, and petal width for three different species of iris flowers (setosa, versicolor, and virginica), with 50 samples of each species.

In R, the iris dataset is included in the base installation, so we can load it directly without any additional packages. Here's how to load and explore it. We use the head() function to print the first six rows of the dataset.

data("iris")
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
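Beyond head(), a quick structural check can confirm the description above. The following is a minimal sketch using only base R; the variable names species_counts and numeric_summary are my own, chosen for illustration.

```r
data("iris")

# Show column types and the overall dimensions of the data frame
str(iris)

# Count the observations per species
species_counts <- table(iris$Species)
species_counts

# Summarise the two variables used for the scatter plot later on
numeric_summary <- summary(iris[, c("Sepal.Length", "Petal.Length")])
numeric_summary
```
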

Correlation analysis

The cor.test() function from the stats package tests for correlation between two variables. It returns the correlation coefficient (r), the p-value, and the confidence interval for the correlation. We shall use it to test the correlation between sepal length and petal length in the iris dataset.

res <- cor.test(iris$Sepal.Length, 
                iris$Petal.Length, 
                method = 'pearson')
res
# 
#   Pearson's product-moment correlation
# 
# data:  iris$Sepal.Length and iris$Petal.Length
# t = 21.646, df = 148, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.8270363 0.9055080
# sample estimates:
#       cor 
# 0.8717538
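The object returned by cor.test() is a list, so the individual values can be pulled out by name and reused later. The sketch below also recomputes the t statistic by hand from r and the sample size, using the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2), to show where the reported value comes from. The variable names (r, p, ci, t_manual) are illustrative only.

```r
res <- cor.test(iris$Sepal.Length,
                iris$Petal.Length,
                method = 'pearson')

r  <- unname(res$estimate)   # Pearson correlation coefficient
p  <- res$p.value            # two-sided p-value
ci <- res$conf.int           # 95 percent confidence interval
n  <- nrow(iris)             # number of observations

# Recompute the t statistic from r and n (degrees of freedom = n - 2)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Compare the manual value with the one reported by cor.test()
all.equal(t_manual, unname(res$statistic))
```
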

Creating scatterplot

smplot2 is an R package for statistical data visualisation that works alongside ggplot2. This package is an example of what I wish I had when I first started learning R. It seeks to simplify each phase of data visualisation. We shall build the plot with ggplot2 and enhance its style with smplot2.

Creating ggplot object

The ggplot() function is the main function of the ggplot2 package in R, used to create a new ggplot object. The ggplot object is a blank canvas onto which we can layer different graphical elements to create a visualization. Printing it at this stage shows only the axes and background, because no geom layer has been added yet.

library(ggplot2)
library(smplot2)

plot <- ggplot(data = iris, 
               mapping = aes(x = Sepal.Length, 
                             y = Petal.Length)) 
plot

Adding points to scatter plot

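geom_point() is a function in the ggplot2 package that is used to create a scatter plot. It adds a layer of individual points to a plot created using ggplot(). Shape 21 is a circle with a separate fill and outline, so fill sets the interior colour and color sets the border.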

plot <- plot +
          geom_point(shape = 21, 
                     fill = '#0f993d', 
                     color = 'white', 
                     size = 3) 
plot

Annotating correlation and p value

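annotate() is a function in the ggplot2 package that is used to add annotations to a plot created using ggplot(). Annotations include text, labels, arrows, and other graphical elements that provide additional information about the plot. Here we place a text label at x = 5, y = 6 showing the correlation coefficient from res, rounded to two decimals, together with the p-value.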

plot <- plot + 
          annotate('text', x = 5, y = 6, 
                   label = paste0('R = ', round(res$estimate,2), ', p < 0.001'))
plot
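The label above hard-codes 'p < 0.001', which is correct for these data but would be wrong for a weaker correlation. One possible way to derive the text from the test result is sketched below. format_corr_label() is a hypothetical helper written for this post, not a function from ggplot2 or smplot2, and the 0.001 threshold is an assumed convention.

```r
res <- cor.test(iris$Sepal.Length,
                iris$Petal.Length,
                method = 'pearson')

# Hypothetical helper: build the annotation text from a cor.test() result
format_corr_label <- function(test, digits = 2, p_floor = 0.001) {
  r_text <- formatC(unname(test$estimate), format = 'f', digits = digits)
  p_text <- if (test$p.value < p_floor) {
    # Very small p-values are reported as being below the threshold
    paste0('p < ', p_floor)
  } else {
    paste0('p = ', formatC(test$p.value, format = 'f', digits = 3))
  }
  paste0('R = ', r_text, ', ', p_text)
}

corr_label <- format_corr_label(res)
corr_label
```

The resulting string can be passed to annotate('text', ...) through its label argument in place of the paste0() call.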

Adding trend line

sm_statCorr() is a function in the smplot2 package (not the separate sm package) that adds a linear trend line to a scatter plot and, by default, prints the correlation coefficient and p-value on the plot. We set show_text = FALSE because the correlation and p-value were already annotated in the previous step, and use fit.params to style the fitted line.

plot +
          sm_statCorr(show_text = FALSE,
                      fit.params = list(color = 'black', 
                                        linetype = 'solid'))
