What is a scatterplot?
A scatterplot is a type of graph used to display the relationship between two continuous variables. It is called a scatterplot because it displays the individual data points as scattered dots on the graph.
Each dot in a scatterplot represents a single observation. One variable is plotted on the horizontal axis and the other on the vertical axis, so each dot's position shows the values of both variables for that observation. In this blog post we shall create an elegantly styled scatterplot, annotated with the correlation coefficient and a regression line, in the R programming language.
Loading iris data
The iris dataset is a commonly used dataset in machine learning and data analysis. It contains measurements of the sepal length, sepal width, petal length, and petal width for three different species of iris flowers (setosa, versicolor, and virginica), with 50 samples of each species.
In R, the iris dataset is included in the base installation, so we can load it directly without any additional packages. Here's how to load and explore the iris dataset in R. We use the head() function to print the first six rows of the dataset.
data("iris")
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
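Beyond head(), a few other base R helpers give a quick overview of the dataset before plotting. The following is a minimal sketch using only functions that ship with base R; the choice of summaries is an illustration and not part of the original walkthrough.

```r
data("iris")

# Number of rows (observations) and columns (variables)
dim(iris)

# Column types and a preview of the values
str(iris)

# Number of samples per species
species_counts <- table(iris$Species)
species_counts

# Summary statistics for the two variables plotted later in this post
summary(iris[, c("Sepal.Length", "Petal.Length")])
```

This confirms the structure described above: four numeric measurements plus a Species factor with 50 samples in each of its three levels.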
Correlation analysis
The cor.test() function from the stats package tests for correlation between two variables. It reports the correlation coefficient (r), the p-value, and the confidence interval for the correlation. We shall use it to test the correlation between sepal length and petal length in the iris dataset.
res <- cor.test(iris$Sepal.Length,
                iris$Petal.Length,
                method = 'pearson')
res
#
#  Pearson's product-moment correlation
#
# data:  iris$Sepal.Length and iris$Petal.Length
# t = 21.646, df = 148, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.8270363 0.9055080
# sample estimates:
#       cor
# 0.8717538
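The object returned by cor.test() is a list, so the individual values shown in the printed output can also be extracted by name. Here is a short sketch; the variable names r_value, p_value and ci are my own and only serve the illustration.

```r
data("iris")
res <- cor.test(iris$Sepal.Length, iris$Petal.Length, method = "pearson")

# Correlation coefficient (unname() drops the "cor" label)
r_value <- unname(res$estimate)

# p-value of the test
p_value <- res$p.value

# Lower and upper bounds of the 95 percent confidence interval
ci <- as.numeric(res$conf.int)

round(r_value, 2)
p_value < 0.001
round(ci, 3)
```

Extracting the values this way is what makes it possible to reuse them later, for example in a plot annotation.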
Creating scatterplot
In addition to ggplot2, we use smplot2, an R package for statistical data visualisation. This package is an example of what I wish I had when I first started learning R. It seeks to simplify each phase of data visualisation. We shall use smplot2 alongside ggplot2 to enhance the visual style of the plot.
Creating ggplot object
The ggplot() function is the main function of the ggplot2 package in R, used to create a new ggplot object. The ggplot object is a blank canvas that we can layer different graphical elements onto to create a visualization.
library(ggplot2)
library(smplot2)
plot <- ggplot(data = iris,
               mapping = aes(x = Sepal.Length,
                             y = Petal.Length))
plot
Adding points to scatter plot
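geom_point() is a function in the ggplot2 package that adds a layer of individual points to a plot created using ggplot(), which turns it into a scatterplot. Shape 21 is a circle with a separate interior and border, so fill sets the colour inside each point and color sets its outline.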
plot <- plot +
  geom_point(shape = 21,
             fill = '#0f993d',
             color = 'white',
             size = 3)
plot
Annotating correlation and p value
annotate() is a function in the ggplot2 package that adds annotations to a plot created using ggplot(). Annotations include text, labels, arrows, and other graphical elements that provide additional information about the plot. Here we place the correlation coefficient and the p-value as text at x = 5, y = 6.
plot <- plot +
  annotate('text', x = 5, y = 6,
           label = paste0('R = ', round(res$estimate, 2), ', p < 0.001'))
plot
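The label above hard-codes 'p < 0.001', which is correct for this dataset but would be wrong for a weaker correlation. As a sketch, the label can instead be built from the test result. Note that make_corr_label() is a hypothetical helper written for this post, not a function from ggplot2 or smplot2, and the 0.001 threshold is an assumption.

```r
data("iris")
res <- cor.test(iris$Sepal.Length, iris$Petal.Length, method = "pearson")

# Hypothetical helper: builds the annotation text from a cor.test() result
make_corr_label <- function(test, threshold = 0.001) {
  r_text <- sprintf("R = %.2f", unname(test$estimate))
  p_text <- if (test$p.value < threshold) {
    # Report very small p-values as "below the threshold"
    paste0("p < ", format(threshold, scientific = FALSE))
  } else {
    sprintf("p = %.3f", test$p.value)
  }
  paste(r_text, p_text, sep = ", ")
}

corr_label <- make_corr_label(res)
corr_label
```

The resulting string can be passed to annotate() through its label argument in place of the paste0() call.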
Adding trend line
sm_statCorr() is a function in the smplot2 package that adds a linear regression line to a scatterplot and, by default, prints the correlation coefficient and p-value on the plot. Here we set show_text = FALSE because we have already added that annotation ourselves, and we use fit.params to style the trend line.
plot +
  sm_statCorr(show_text = FALSE,
              fit.params = list(color = 'black',
                                linetype = 'solid'))
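To see the equation behind a trend line like this one, the regression can be fitted directly with lm() from base R. This is a minimal sketch, assuming the line is an ordinary least-squares fit of Petal.Length on Sepal.Length; the name eq_label and the label format are my own choices.

```r
data("iris")

# Ordinary least-squares fit of petal length on sepal length
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["Sepal.Length"])

# Proportion of variance explained; for a single predictor this equals r^2
r_squared <- summary(fit)$r.squared

# Format the equation as text, handling the sign of the intercept
eq_label <- sprintf("y = %.2fx %s %.2f",
                    slope,
                    ifelse(intercept < 0, "-", "+"),
                    abs(intercept))
eq_label
```

Like the correlation label, eq_label could be added to the plot with annotate() if the regression equation should appear on the figure.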