AGRON INFO TECH

How to perform Structural Equation Modeling (SEM) in R

Introduction

In the field of statistical analysis, Structural Equation Modeling (SEM) is a powerful tool used to analyze complex relationships between variables. It allows researchers to explore both direct and indirect effects among multiple variables simultaneously. In this blog post, we will dive into the implementation of SEM using R programming language and discuss the step-by-step process involved.

Water stress is one of the major abiotic factors that limit crop production and quality. Seed priming is a pre-sowing treatment that enhances seed performance under stress conditions. The aim of this study was to investigate the effects of water stress and seed priming on grain yield (gy) and its components.The dataset comprises 2 factor variables and 9 response or observed variables, alongside a variable denoting the repetitions of each experimental unit. The initial three observed variables (aba, apx, and pod) provide insights into the activity of phytohormones. Additionally, the variable “ph” represents the height of the plants. Lastly, the remaining variables (til, pl, grp, tgw) fall under the category of yield-contributing variables, contributing to the overall understanding of crop productivity.

I hypothesized that water stress and priming would have direct and indirect effects on “gy” through “EA” and “YC”. To test this hypothesis, I performed a SEM analysis using R.

Clear R Environment

Before we begin our analysis, it’s always a good practice to clear the R environment to avoid any conflicts or data contamination. The following R code accomplishes this task:

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")

Importing Data

The data was stored in an Excel file named data_path.xlsx. I used the readxl package to import the data into R. I also converted some variables into factors and attached the data frame for convenience.

library(readxl)
data <- read_excel(
          path = 'data_path.xlsx',
          col_names = TRUE
)

head(data)
# # A tibble: 6 x 12
#     rep water priming   aba   apx   pod    ph   til    pl   grp   tgw    gy
#   <dbl> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1 FLOOD NP        4.6  0.9   3.78  90.6  11    13.7  75    17.3  20.9
# 2     2 FLOOD NP        5.7  0.77  4.4   81.6  11.0  16.9  84.5  13.6  24.0
# 3     3 FLOOD NP        4.1  0.81  5.33  86.5  10.4  14.4  70.2  15.6  19.1
# 4     1 FLOOD HP        7.1  0.97  6.73  94.2  15.4  18.4  94.1  20.9  28.8
# 5     2 FLOOD HP        5.6  0.91  6.31  86.9  13.9  18.9  88.8  19.7  25.9
# 6     3 FLOOD HP        6.3  0.87  5.4   82.0  12.7  15.4  81.1  17.4  30.3
data$rep <- as.factor(data$rep)
data$water <- as.factor(data$water)
data$priming <- as.factor(data$priming)

attach(data)

Performing SEM

Model Specification

The first step in SEM is to specify the model. I used the lavaan package to write the model syntax. The model consisted of two latent variables (EA and YC) and one observed variable (gy). The latent variables were defined by their indicators using the =~ operator. The observed variable was regressed on the latent variables using the ~ operator.

mod.id <- '
EA =~ aba + apx + pod
YC =~ til + pl + grp + tgw
gy ~ EA + YC
'

Model Estimation and Identification

After specifying the model, we can proceed with estimating the parameters and identifying the model’s fit. The lavaan package is used for this purpose. The following code snippet performs model estimation:

library(lavaan)
# This is lavaan 0.6-15
# lavaan is FREE software! Please report any bugs.
mod.est <- sem(
          model = mod.id,
          data = data
)
# Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
# WARNING: some observed variances are (at least) a factor 1000 times larger than
# others; use varTable(fit) to investigate
summary(mod.est)
# lavaan 0.6.15 ended normally after 85 iterations
# 
#   Estimator                                         ML
#   Optimization method                           NLMINB
#   Number of model parameters                        18
# 
#   Number of observations                            18
# 
# Model Test User Model:
#                                                       
#   Test statistic                                23.193
#   Degrees of freedom                                18
#   P-value (Chi-square)                           0.183
# 
# Parameter Estimates:
# 
#   Standard errors                             Standard
#   Information                                 Expected
#   Information saturated (h1) model          Structured
# 
# Latent Variables:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA =~                                               
#     aba               1.000                           
#     apx               0.315    0.033    9.621    0.000
#     pod               1.102    0.117    9.385    0.000
#   YC =~                                               
#     til               1.000                           
#     pl                0.968    0.137    7.058    0.000
#     grp               6.103    0.742    8.230    0.000
#     tgw               1.060    0.151    7.006    0.000
# 
# Regressions:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   gy ~                                                
#     EA                3.644    1.843    1.977    0.048
#     YC                0.082    1.034    0.080    0.937
# 
# Covariances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA ~~                                               
#     YC                6.341    2.359    2.689    0.007
# 
# Variances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#    .aba               0.594    0.217    2.739    0.006
#    .apx               0.012    0.007    1.655    0.098
#    .pod               0.195    0.100    1.948    0.051
#    .til               2.748    0.974    2.822    0.005
#    .pl                1.287    0.492    2.619    0.009
#    .grp               9.552    8.392    1.138    0.255
#    .tgw               1.614    0.612    2.637    0.008
#    .gy               11.480    4.110    2.793    0.005
#     EA                3.767    1.443    2.611    0.009
#     YC               11.584    4.700    2.465    0.014

The sem() function estimates the parameters of the specified model using the provided data. The summary() function then provides a summary of the estimated model, including the estimated coefficients, standard errors, p-values, and fit indices.

Additionally, you can obtain fit measures by modifying the code snippet as shown below. This will provide additional fit measures such as the Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and others.

summary(mod.est, fit.measures = TRUE)
# lavaan 0.6.15 ended normally after 85 iterations
# 
#   Estimator                                         ML
#   Optimization method                           NLMINB
#   Number of model parameters                        18
# 
#   Number of observations                            18
# 
# Model Test User Model:
#                                                       
#   Test statistic                                23.193
#   Degrees of freedom                                18
#   P-value (Chi-square)                           0.183
# 
# Model Test Baseline Model:
# 
#   Test statistic                               278.306
#   Degrees of freedom                                28
#   P-value                                        0.000
# 
# User Model versus Baseline Model:
# 
#   Comparative Fit Index (CFI)                    0.979
#   Tucker-Lewis Index (TLI)                       0.968
# 
# Loglikelihood and Information Criteria:
# 
#   Loglikelihood user model (H0)               -258.507
#   Loglikelihood unrestricted model (H1)       -246.911
#                                                       
#   Akaike (AIC)                                 553.015
#   Bayesian (BIC)                               569.042
#   Sample-size adjusted Bayesian (SABIC)        513.733
# 
# Root Mean Square Error of Approximation:
# 
#   RMSEA                                          0.127
#   90 Percent confidence interval - lower         0.000
#   90 Percent confidence interval - upper         0.259
#   P-value H_0: RMSEA <= 0.050                    0.223
#   P-value H_0: RMSEA >= 0.080                    0.711
# 
# Standardized Root Mean Square Residual:
# 
#   SRMR                                           0.024
# 
# Parameter Estimates:
# 
#   Standard errors                             Standard
#   Information                                 Expected
#   Information saturated (h1) model          Structured
# 
# Latent Variables:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA =~                                               
#     aba               1.000                           
#     apx               0.315    0.033    9.621    0.000
#     pod               1.102    0.117    9.385    0.000
#   YC =~                                               
#     til               1.000                           
#     pl                0.968    0.137    7.058    0.000
#     grp               6.103    0.742    8.230    0.000
#     tgw               1.060    0.151    7.006    0.000
# 
# Regressions:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   gy ~                                                
#     EA                3.644    1.843    1.977    0.048
#     YC                0.082    1.034    0.080    0.937
# 
# Covariances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA ~~                                               
#     YC                6.341    2.359    2.689    0.007
# 
# Variances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#    .aba               0.594    0.217    2.739    0.006
#    .apx               0.012    0.007    1.655    0.098
#    .pod               0.195    0.100    1.948    0.051
#    .til               2.748    0.974    2.822    0.005
#    .pl                1.287    0.492    2.619    0.009
#    .grp               9.552    8.392    1.138    0.255
#    .tgw               1.614    0.612    2.637    0.008
#    .gy               11.480    4.110    2.793    0.005
#     EA                3.767    1.443    2.611    0.009
#     YC               11.584    4.700    2.465    0.014

Interpreting model results

Model Test:

The user model’s chi-square test statistic is 23.193 with 18 degrees of freedom, resulting in a p-value of 0.183. This indicates that the user model fits the data reasonably well, as a higher p-value suggests a better fit. The baseline model’s chi-square test statistic is 278.306 with 28 degrees of freedom and a p-value of 0.000. The significant difference between the user model and the baseline model suggests that the user model provides a better fit.

Fit Indices:

The Comparative Fit Index (CFI) is 0.979, indicating a good fit of the user model. A CFI value close to 1 indicates a better fit. The Tucker-Lewis Index (TLI) is 0.968, also indicating a good fit. Similar to CFI, TLI values close to 1 indicate a better fit.

Loglikelihood and Information Criteria:

The loglikelihood of the user model is -258.507, while the loglikelihood of the unrestricted model is -246.911. A higher loglikelihood indicates a better fit. Information criteria such as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Sample-size adjusted Bayesian Information Criterion (SABIC) are provided as well. Lower values of these criteria indicate a better fit.

Root Mean Square Error of Approximation (RMSEA):

The RMSEA is 0.127, which suggests a reasonable fit of the model. The 90% confidence interval for RMSEA ranges from 0.000 to 0.259. A smaller RMSEA value and a narrower confidence interval indicate a better fit. However, the p-values for testing the null hypotheses regarding RMSEA are 0.223 and 0.711, indicating that the model does not significantly deviate from a perfect fit.

Standardized Root Mean Square Residual (SRMR):

The SRMR is 0.024, which is below the commonly accepted threshold of 0.08. This suggests a good fit of the model, as smaller values indicate better fit.

Parameter Estimates:

The parameter estimates provide information about the relationships between latent variables and observed variables. For example, the estimated regression coefficient of gy on EA is 3.644 with a standard error of 1.843 and a p-value of 0.048. Similarly, the regression coefficient of gy on YC is 0.082 with a standard error of 1.034 and a p-value of 0.937.

The estimated covariation between EA and YC is 6.341 with a standard error of 2.359 and a p-value of 0.007. The variances of the latent variables and the observed variables are also provided. For example, the estimated variance of EA is 3.767 with a standard error of 1.443 and a p-value of 0.009.

In summary, the SEM analysis results indicate that the user model fits the data reasonably well. The fit indices, loglikelihood, RMSEA, and SRMR suggest a good fit. The parameter estimates provide insights into the relationships between latent and observed variables in the model.

Path Diagram

The final step was to visualize the model using a path diagram. I used the semPlot package to draw the path diagram with different options.

library(semPlot)
semPaths(
          object = mod.est,
          what = "path",
          whatLabels = "par"
)

This code snippet creates a basic path diagram displaying the paths between latent and observed variables. You can adjust the appearance of the path diagram by modifying various parameters like style, layout, rotation, size, and color.

For example, the following code generates a path diagram in a tree layout:

P <- semPaths(
          object = mod.est,
          what = "path",
          whatLabels = "par",
          style = "ram",
          layout = "tree",
          rotation = 2,
          sizeMan = 7,
          sizeLat = 7,
          color = "lightgray",
          edge.label.cex = 1.2,
          label.cex = 1.3
)

Interpretation of the Path Diagram

  • The path diagram provides a graphical representation of the SEM model. It distinguishes observed variables (enclosed in squares or rectangles) from latent variables (represented by circles or ellipses).
  • The single-headed arrows, known as paths, indicate the causal relationships between variables. The variable at the tail of the arrow influences or causes the variable at the head. These paths help us understand how variables interact within the model.
  • The double-headed arrows represent the covariance between variables. When two variables are connected by a double-headed arrow, it indicates that they share a common underlying factor or are directly related to each other. The double-headed arrows pointing to the same variable reflect the variances associated with that variable.

However, in some cases, these variances may be omitted if the variables are standardized to have a unit variance. The path diagram visually represents the structural relationships, including causality, covariance, and variances, between the observed and latent variables in the SEM model.

Conclusion

This blog post provided a detailed explanation of implementing Structural Equation Modeling (SEM) in R. We covered the steps involved, including clearing the R environment, importing data, specifying the model, estimating parameters, and creating a path diagram. The path diagram serves as a valuable tool for visualizing and interpreting the relationships between variables in the SEM model. By leveraging SEM in R, researchers can gain deeper insights into complex relationships and make informed decisions based on their findings.


Download dataset — Click_here

Download R program — Click_here

Download R studio — Click_here