AGRON INFO TECH

Creating hexagon plot in R | hexbin and ggplot2 packages

Today, we’re going to create a hexagon plot and a scatter plot to explore the relationship between age and net income. Hexagon plots, also known as `hexbin` plots or hexagonal binning plots, are a type of data visualization used in statistics and data analysis. They are particularly useful when dealing with large datasets or continuous variables where individual data points may overlap, making it challenging to interpret patterns.

We will generate the hexagon plot twice: once with the hexbin package and once with the ggplot2 package.

So, let’s get started!

Cleaning workspace

First, we’re going to clean up our workspace by closing any open graphics windows and clearing all variables. This ensures we’re starting with a clean slate.
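
The commands for this step are not shown in the text. A minimal sketch, assuming the usual base R calls for this kind of cleanup, might look like this:

```r
# Close all open graphics devices (does nothing if none are open)
graphics.off()

# Remove every object from the global environment
rm(list = ls())
```

Note that this only clears objects and graphics devices. It does not unload packages or reset options, so restarting the R session is the more thorough way to get a clean slate.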

Load the necessary packages

Next, we’re going to load up some necessary packages. These include `haven` for data import, `hexbin` and `ggplot2` for plotting, and `dplyr` for data manipulation.

library(haven)
library(hexbin)
library(ggplot2)
library(dplyr)

Downloading data set

Our dataset originates from the German General Social Survey (GGSS), obtainable through the following steps:

  1. Visit the provided URL (https://www.gesis.org/en/allbus).
  2. Scroll down and locate the ALLBUS Cumulation link. Click on the link to open a new window.
  3. In the search box of the new window, type ZA3762 and press Enter.
  4. Click on the Research Data button.
  5. On the right side of the page, navigate to the Downloads section.
  6. Click on Datasets to access the available formats.
  7. Select the desired format, such as STATA data with a .dta file extension.
  8. Ensure that you are logged in to your account.

The dataset file will then begin to download shortly.

Importing and exploring the data set

Now, let’s read in our data. The dataset is stored in a Stata .dta file, so we use the read_dta() function from the haven package to import it.

data <- read_dta("ZA3762_v2-0-0.dta/ZA3762_v2-0-0.dta")

First, we’re checking how many rows and columns we have in our dataset using the dim() function. The dataset has 2946 rows and 899 columns.

# Check the dimensions (rows, columns) of the imported data
dim(data)
## [1] 2946  899

Then, we list the names of all the columns in our dataset using the names() function. The columns are named generically, from “v1” to “v899”, so the names alone do not tell us what each variable measures. Here we print only the first 50 names.

names(data)[1:50]
##  [1] "v1"  "v2"  "v3"  "v4"  "v5"  "v6"  "v7"  "v8"  "v9"  "v10" "v11" "v12"
## [13] "v13" "v14" "v15" "v16" "v17" "v18" "v19" "v20" "v21" "v22" "v23" "v24"
## [25] "v25" "v26" "v27" "v28" "v29" "v30" "v31" "v32" "v33" "v34" "v35" "v36"
## [37] "v37" "v38" "v39" "v40" "v41" "v42" "v43" "v44" "v45" "v46" "v47" "v48"
## [49] "v49" "v50"

Finally, we’re opening up a new window to view the entire dataset. This helps us get a comprehensive understanding of the structure and content of our data. We can clearly see the information for each variable under each variable name.

View(data)
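
Because the column names are generic, it helps to look at the variable labels that haven keeps as a "label" attribute on each column. The following is only a sketch on a toy tibble: the column names and labels in `toy` are invented for illustration, and `var_labels` is a hypothetical helper name. With the real file, you would run the same code on `data`.

```r
library(haven)

# Toy stand-in for the imported data; names and labels are invented
toy <- tibble::tibble(
  v1 = labelled(c(41, 49, 74), label = "age of respondent"),
  v2 = labelled(c(300, 800, 750), label = "net income")
)

# Collect the "label" attribute of every column into a named character vector
var_labels <- vapply(
  toy,
  function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  },
  character(1)
)
print(var_labels)

# Find the columns whose label mentions "income"
names(var_labels)[grepl("income", var_labels, ignore.case = TRUE)]
```

The labels in this dataset are in German, so a search term such as "einkommen" is more likely to find the income variables than "income".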

Observing Age and Net Income variables

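We’re particularly interested in two variables: age (column v686) and net income (column v473). Let’s take a closer look at these. Calling unique() on the age variable returns all distinct values, such as 41, 49, 74, and so on, spanning a range of different ages.

Similarly, calling unique() on the net income variable returns all distinct income values together with the value labels. Note the labelled codes: 0 means no income (kein einkommen), while 99997 (verweigert, refused) and 99999 (keine angabe, no answer) are non-response codes rather than real incomes.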

# Age
unique(data$v686)
##  [1] 41 49 74 57 65 40 60 51 56 64 67 63 62 59 68 61 66 48 70 50 52 30 44 71 69
## [26] 54 42 58 78 76 45 55 72 36 38 47 46 83 75 53 35 73 27 34 79 43 37 29
# Net Income
unique(data$v473)
## <labelled<double>[331]>: befr.: nettoeinkommen, offene abfrage
##   [1]   300   800   750  1100  1300   360  1400 99997   350   700   530   420
##  [13]  4000   630   853     0  1500  2300   450  5000  3000  6000  1600   680
##  [25]  1000   600   314   520  2100   171  1700   240   150  1800   500   330
##  [37]  1020   430 99999   990  1200   780   506   550   960   650   400  3300
##  [49]   100   200  1350  2500  4500   900  2200   740   299  1050   610  2600
##  [61]   102 10000   503   250   980   880   720   160  3800   525  5500  1950
##  [73]   820   325   850   975   971  1850   760   713   920   993  2000  1250
##  [85]   840  2250  3700  1550   486   950  3500   230   170  1060   192  1760
##  [97]   440   702   378   307   570   372  2700   614  1248    60   842   584
## [109]   371  3200   130   810   870  1112   226   220   485   320  3100  1021
## [121]  1074  1450   860  1750   790  1900   362   510   257   723   286   890
## [133]   225   410  1650   460   620  1218   590  1295   538   755   770   203
## [145]   725  1041  1080  1375   685   341  1334   370   390   568   995    70
## [157]   748   580   480   904   690  1240   306  1421   756  2400   540   712
## [169]   425   710   270   435   505  1028  1140   682   216   825  2800   560
## [181]   947  1150   915  2550   259   973   625  4100   343   342    40   141
## [193]   624   732   445   340   930   528  1903  1535   382  1064 11000   730
## [205]  1170  2900  1280   389  1480  2085   385  7500  1622   190  1065    98
## [217]  2350   670  1610   803  3600   490   898   280   326    90   923   478
## [229]  1540  1580  1640   812   377   317   585   205   473  1460  1210   660
## [241]   728  1888   439  1930  1340  1220   983   470  1057   186  1840  1390
## [253]   426   242   260  1372   632   775   721   583  1470  1030   405   112
## [265]   573   348   266   815  1727  1526   832   608   438  1510  1025   623
## [277]   739   332   875  4400  7000  1017   601   970   180   101   631  1360
## [289]  1486  3450  1120  1980  8000   646   336  1145  1003  2540   423  2750
## [301]   935   545  4600  1034   125   762  1270   612  1130  1625  1860   135
## [313]   213   245  1232   311   436  2573  1490   665   472  1469   392   310
## [325]   265   562   633  1780   132  2007   602
## 
## Labels:
##  value          label
##      0 kein einkommen
##  99997     verweigert
##  99999   keine angabe
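
One way to handle the non-response codes is to recode them as NA. The sketch below does this on a toy labelled vector that reuses a few of the values and labels printed above. It is an alternative to a threshold filter, not the approach this tutorial follows, and `income` and `income_num` are illustrative names.

```r
library(haven)

# Toy vector mimicking v473, built from a handful of the values shown above
income <- labelled(
  c(300, 800, 0, 99997, 1500, 99999, 10000),
  labels = c("kein einkommen" = 0, "verweigert" = 99997, "keine angabe" = 99999)
)

# Drop the label metadata to get a plain numeric vector
income_num <- zap_labels(income)

# Recode the two non-response codes as missing values
income_num[income_num %in% c(99997, 99999)] <- NA

range(income_num, na.rm = TRUE)
sum(is.na(income_num))
```

The value 0 (no income) is a legitimate observation and is kept. Recoding by code rather than by threshold also keeps genuinely high incomes such as 10000.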

Creating a new data frame

Next, we create a new data frame called new_data containing only the “Age” and “Net Income” variables from the data set we imported earlier. We select columns v473 and v686 from the original data frame, group the data by these two variables, and then keep only the rows where “Net Income” is less than 10000. This removes the non-response codes 99997 and 99999 from the analysis; note that it also drops the few reported incomes of 10000 or more. Finally, we rename the two columns to netincome and ages.

# Create a new data frame with only Age and Net Income variables
new_data <- data %>% 
  select(v473, v686) %>%
  group_by(v473, v686) %>%
  # exclude invalid observations
  filter(v473 < 10000)

colnames(new_data) <- c("netincome", "ages")
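
The grouping step is not needed for plotting, and select() can rename columns while selecting them. As a sketch of a more compact variant, shown on a toy tibble with invented values and the hypothetical names `toy` and `toy_new`:

```r
library(dplyr)

# Toy stand-in for the imported data (values invented for illustration)
toy <- tibble(
  v473 = c(300, 800, 99997, 1100, 99999),
  v686 = c(41, 49, 57, 49, 74)
)

toy_new <- toy %>%
  # select() keeps only the listed columns and renames them in one step
  select(ages = v686, netincome = v473) %>%
  # keep rows below the threshold, which drops the 99997 and 99999 codes
  filter(netincome < 10000)

toy_new
```

The result is an ungrouped tibble, so its printed output does not carry a "Groups:" line.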

Now that we have our new data frame, let’s take a peek at the first few rows to make sure everything looks good. For this we shall use the head() function to print the first six rows of the data set.

head(new_data)
## # A tibble: 6 × 2
## # Groups:   netincome, ages [6]
##   netincome  ages
##   <dbl+lbl> <dbl>
## 1  300         41
## 2  800         49
## 3  750         49
## 4 1100         49
## 5 1300         74
## 6  360         74

Mask the components of the data

We will attach the data frame using the attach() function. This adds new_data to R’s search path, so its columns can be referred to by name (ages, netincome) without specifying the data frame each time.

attach(new_data)
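
An attached data frame stays on the search path until it is detached, which can lead to confusing name clashes later in a session. As a sketch of two common alternatives, using a toy data frame with invented values:

```r
toy <- data.frame(ages = c(41, 49, 74), netincome = c(300, 800, 1300))

# with() evaluates an expression using the columns of `toy`
# without changing the search path
mean_income <- with(toy, mean(netincome))
mean_income

# If attach() was used, detach() undoes it once the work is done
attach(toy)
detach(toy)
```

The same with() pattern works for plotting calls, for example with(toy, plot(ages, netincome)).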

Scatter plot between Age and Net Income

Now let’s create a scatter plot. This will give us a visual representation of the relationship between age and net income. We will utilize the plot() function, setting age on the x-axis and net income on the y-axis. The title is defined using the main argument, while the x and y-axis labels are specified using the xlab and ylab arguments, respectively. The point size is set using the cex argument, here 1 (the default). Setting xaxt = "n" suppresses the default x-axis so that axis() can draw tick marks at every ten years from 20 to 90.

plot(x = ages, y = netincome, 
     xaxt = "n", 
     main = "Scatter plot between Age and Net Income", 
     xlab = "Age", 
     ylab = "Net Income", 
     cex = 1)
axis(1, at = c(20, 30, 40, 50, 60, 70, 80, 90))

Hexagon plot Between Age and Net Income using hexbin package

Next, we’re going to create a hexagon plot. This type of plot is great for visualizing the density of observations: the plotting area is divided into hexagonal bins, and each hexagon is shaded by the number of observations that fall inside it, with darker hexagons indicating more observations. We use the hexbinplot() function from the hexbin package, with net income on the y-axis, age on the x-axis, and the colorscale style. Additionally, the plot includes borders, has an aspect ratio of 1, and applies a square root transformation to the bin counts (trans = sqrt, with inv supplying the inverse function), which keeps a few very dense bins from dominating the color scale.

hexbinplot(netincome ~ ages,  
           style = "colorscale",  
           xlab = "Age", 
           ylab = "Net Income",
           main = "Hexagon plot between Age and Net Income \n(hexbin package)",
           border = TRUE, 
           aspect = 1, 
           # square-root transformation of the bin counts
           trans = sqrt, 
           # inverse of the transformation above
           inv = function(x) x ^ 2)

[Figure: hexagon plot created with the hexbin package]
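
To see what the binning does under the hood, the hexbin() function can be called directly and its counts inspected. The sketch below uses simulated data, not the survey data, and the distribution parameters are arbitrary choices for illustration.

```r
library(hexbin)

# Simulated stand-ins for age and net income (arbitrary parameters)
set.seed(1)
x <- rnorm(500, mean = 55, sd = 12)
y <- rexp(500, rate = 1 / 1000)

# Bin the points into hexagons, with 20 bins along the x-axis
hb <- hexbin(x, y, xbins = 20)

hb@ncells            # number of non-empty hexagons
sum(hb@count)        # every point falls in exactly one hexagon, so this is 500
summary(hb@count)    # distribution of raw counts per hexagon
summary(sqrt(hb@count))  # the same counts after a square-root transformation
```

Comparing the two summaries shows why a square root transformation helps: it compresses the largest counts, so sparsely filled hexagons remain distinguishable on the color scale.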

Create a hexagon plot using ggplot2 package

Finally, we’re going to create another hexagon plot, but this time we’re using the `ggplot2` package.

  • We start by specifying our data frame (new_data) and mapping the x-axis to ages and the y-axis to net income using the aes() function.
  • Next, we add hexagon-shaped bins to the plot using the geom_hex() function, with bins = 20 (the number of bins in both the horizontal and vertical directions) and a black outline color.
  • To add color to the plot, we use scale_fill_gradient() to create a gradient ranging from light green to dark green based on the number of data points in each hexagon.
  • For the x-axis, we apply a square root transformation using scale_x_continuous() to better display the data distribution.
  • We then set the labels for the x and y axes, as well as the title and subtitle of the plot using the labs() function.
  • Finally, we apply a black and white theme to the plot using theme_bw() for a clean and simple visual presentation.

ggplot(new_data, 
       aes(x = ages, 
           y = netincome)) +
  geom_hex(bins = 20, 
           color = "black") +
  scale_fill_gradient(low = "lightgreen", 
                      high = "darkgreen") +
  scale_x_continuous(trans = "sqrt") +  # Square root transformation for x-axis
  labs(x = "Age", 
       y = "Net Income", 
       title = "Hexagon Plot between Age and Net Income", 
       subtitle = "ggplot2 package") +
  theme_bw()

[Figure: hexagon plot created with the ggplot2 package]
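
In ggplot2, a transformation can also be applied to the fill scale, so that the bin counts rather than the x-axis are square-root transformed. The following is a sketch on simulated data with arbitrary parameters, not the author's plot. It keeps the `trans` argument used above; recent ggplot2 releases rename it to `transform` and may show a deprecation warning.

```r
library(ggplot2)

# Simulated stand-ins for age and net income (arbitrary parameters)
set.seed(1)
toy <- data.frame(
  ages = runif(500, min = 27, max = 83),
  netincome = rexp(500, rate = 1 / 1000)
)

p <- ggplot(toy, aes(x = ages, y = netincome)) +
  geom_hex(bins = 20, color = "black") +
  # square-root transformation of the counts mapped to the fill color
  scale_fill_gradient(low = "lightgreen", high = "darkgreen", trans = "sqrt") +
  labs(x = "Age", y = "Net Income", fill = "Count") +
  theme_bw()

p
```

The axes stay on their original scale here, and only the color legend is compressed.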

Looking at these plots, one can see that almost all individuals have a net monthly income of less than 2,000 EUR. Only a few individuals earn more than 6,000 EUR per month. Interestingly, the highest net income in the filtered data is 8,000 EUR, reported by an individual who is 62 years old.
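
Readings like these can be checked numerically rather than by eye. The sketch below uses a toy data frame whose rows are invented for illustration. With the real data, `new_data` would take the place of `toy`.

```r
# Toy rows for illustration only; not taken from the survey file
toy <- data.frame(
  ages = c(41, 62, 74, 49),
  netincome = c(300, 8000, 1300, 800)
)

# Row with the highest net income
top <- toy[which.max(toy$netincome), ]
top

# Share of individuals earning less than 2,000
share_below_2000 <- mean(toy$netincome < 2000)
share_below_2000
```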

Conclusion

Hexagon plots and scatter plots gave us two complementary views of the relationship between age and net income. The scatter plot shows every observation, while the hexagon plots show where observations are concentrated, which makes distributions, outliers and trends easier to see when points overlap.

As we wrap up this tutorial, remember that data visualization is not just about charts and graphs; it’s about storytelling. Each plot narrates a story waiting to be deciphered.


