Moderated multiple regression (MMR) is frequently used to test moderation hypotheses in the behavioral and social sciences. In MMR with a categorical moderator, between-groups heteroscedasticity is not uncommon and can inflate Type I error rates or reduce statistical power. Compared with research on remedial procedures that can mitigate the effects of this violated assumption, less research attention has focused on statistical procedures that can be used to detect between-groups heteroscedasticity. In the current article, we briefly review such procedures. Then, using Monte Carlo methods, we compare the performance of various procedures that can be used to detect between-groups heteroscedasticity in MMR with a categorical moderator, including a heuristic method and a variant of a procedure suggested by O’Brien. Of the various procedures, the heuristic method had the greatest statistical power at the expense of inflated Type I error rates. Otherwise, assuming that the normality assumption has not been violated, Bartlett’s test generally had the greatest statistical power when direct pairing occurs (i.e., when the group with the largest sample size has the largest error variance). In contrast, O’Brien’s procedure tended to have the greatest power when there was indirect pairing (i.e., when the group with the largest sample size has the smallest error variance). We conclude with recommendations for researchers and practitioners in the behavioral and social sciences.
- moderated multiple regression
- heterogeneity of variance
- statistical assumptions
Testing for the equality of regression slopes is frequently conducted in the behavioral and social sciences. Evidence of this can be found in research on differential prediction (Aguinis, Culpepper, & Pierce, 2010; American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; Saad & Sackett, 2002) and analysis of covariance (Fox, 2008; Huitema, 1980; Rutherford, 1992). Testing for the equality of regression slopes is equivalent to testing whether the relationship between a continuous outcome and a continuous predictor differs depending on a third variable—a moderator (Saunders, 1956; Stone-Romero & Liakhovitski, 2002).
The study of moderator variables, in general, is important for theory development and knowledge cumulation in education, management, industrial-organizational psychology, and related disciplines. Consistent with this, Hall and Rosenthal (1991) noted,
If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field. (p. 447)
Although a variety of procedures exist for detecting the effects of continuous and categorical moderators (Stone-Romero & Liakhovitski, 2002; Zedeck, 1971), researchers have noted that moderated multiple regression (MMR) has become the major procedure for testing hypotheses involving categorical moderators (Aguinis, 2004; Overton, 2001; Sackett & Wilk, 1994; Shieh, 2009).
Regrettably, in MMR with a categorical moderator, it is not uncommon to violate the homoscedasticity assumption (see Aguinis & Pierce, 1998; DeShon & Alexander, 1996; Overton, 2001), which can lead to inflated Type I errors or reduced statistical power (DeShon & Alexander, 1996; Ng & Wilcox, 2010; Overton, 2001). More specifically, in MMR, the form of heteroscedasticity that can manifest is one in which the error variance differs across the levels of a categorical moderator (e.g., gender; for a review, see Aguinis, 2004; DeShon & Alexander, 1996; Ng & Wilcox, 2010; Rosopa, Schaffer, & Schroeder, 2013; Wilcox, 1997), or stated another way, between-groups heteroscedasticity exists (Ng & Wilcox, 2010).
Based on a review of three journals (Academy of Management Journal, Journal of Applied Psychology, and Personnel Psychology) from 1987 to 1999, Aguinis, Peterson, and Pierce (1999) identified 87 articles that reported at least one test for the equality of regression slopes. Out of 117 tests, Aguinis and his colleagues found that at least 39% of these violated the assumption. The implication of this finding is that researchers might have wrongly concluded that an interaction exists in the population when it does not (Type I error) or that an interaction does not exist in the population when it does (Type II error). In either case, “substantive research conclusions can be erroneous, theory development can be hindered, and incorrect decisions can be made . . .” (p. 319).
Although there exist a number of remedial procedures (Rosopa et al., 2013) that can be used to mitigate the effects of between-groups heteroscedasticity in MMR, including the use of statistical approximations (Alexander & Govern, 1994; DeShon & Alexander, 1994; Shieh, 2009), robust methods (Cribari-Neto, 2004; Long & Ervin, 2000; Wilcox, 2005), and weighted least squares regression (Overton, 2001; Rosopa, 2006), less research attention has focused on statistical procedures that can be used to detect between-groups heteroscedasticity in MMR. Currently, there is no empirical research that systematically compares the various approaches that can be used to detect between-groups heteroscedasticity. Thus, consistent with recommendations by Rosopa et al. (2013), one major purpose of the present article is to compare the performance of various procedures that can be used to detect between-groups heteroscedasticity in MMR with a categorical moderator.
Although researchers across diverse disciplines (e.g., econometrics, psychology, and statistics) have suggested different approaches for detecting heteroscedasticity in general (Rosopa et al., 2013), some procedures are sensitive to non-normality. A robust approach by O’Brien (1979, 1981), however, has been recommended for use in ANOVA. Thus, another purpose of the present article is to suggest a variation of O’Brien’s procedure that can be used for instances in which a researcher is interested in testing for the equality of regression slopes.
Our article is divided into four major sections. First, we formally define the model used in MMR with a categorical moderator. Second, we describe between-groups heteroscedasticity and its biasing effects. Third, we review various procedures that can be used to detect between-groups heteroscedasticity, including O’Brien’s (1979, 1981) procedure. Fourth, we describe the results of a Monte Carlo simulation designed to assess the relative performance of various procedures that can be used to detect between-groups heteroscedasticity.
MMR With a Continuous Predictor and a Categorical Moderator
When testing for the equality of k regression slopes using MMR, a continuous outcome (y) is modeled as a function of a continuous predictor (x), a categorical moderator (z) (indexed by k − 1 regressors, that is, z1, z2, . . ., zk−1), and the two-way interaction between x and z (indexed by k − 1 product terms between x and the regressors). Population parameters are denoted by Greek letters such as $\beta$ and $\sigma^2$, for example, to differentiate them from sample estimates using a circumflex, that is, $\hat{\beta}$ and $\hat{\sigma}^2$, respectively.
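When k = 2, the full linear model for the ith observed response in the jth group can be expressed as
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 z_{1ij} + \beta_3 x_{ij} z_{1ij} + e_{ij} \qquad (1)$$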
for i = 1, 2, . . ., nj, j = 1, 2, . . ., k, where nj is the jth sample size; β0, β1, β2, and β3 are unstandardized regression coefficients; and eij is the ith residual in the jth group (an estimate of εij, an unknown population error). Although Equation 1 is a fixed effects regression model, when the predictors are treated as random variables, the estimates of E(yij) can be viewed as conditional on the specific values of the predictors, that is, E(yij | xij, z1ij) (Bauer & Curran, 2005; Rencher, 2000).
More generally, for k ≥ 2, the full linear model for the observations (with the number of terms p = 2k − 1) can be compactly expressed in matrix form. See the appendix for the general form, including the assumptions that specify that the model is linear, all relevant terms are included in the model, and the εijs have constant variance and are uncorrelated.
Note that normally distributed εijs are neither required nor assumed for the linear model to be valid (Rencher, 2000). However, when the normality assumption is invoked for statistical inferences (e.g., confidence intervals, hypothesis tests), this implies that the yijs (and εijs) are independent. When k = 2, the test for the equality of independent slopes is distributed as t with df = N − 4. When k > 2, the test statistic is distributed as an F random variable (see the appendix).
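As an illustration of the k = 2 test, the following is a minimal sketch rather than the authors' code. It relies on the fact that, with a dichotomous moderator, fitting the full interaction model by ordinary least squares is equivalent to fitting a separate simple regression in each group and pooling the error variance with df = N − 4. The function names and the data are hypothetical, and the sign of t depends on how z1 is coded.

```python
import math

def simple_ols(x, y):
    """Return (intercept, slope, residuals, Sxx) for a simple regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, resid, sxx

def slope_difference_t(x1, y1, x2, y2):
    """t statistic and df for H0: equal slopes in two groups (pooled error variance)."""
    _, b1, e1, sxx1 = simple_ols(x1, y1)
    _, b2, e2, sxx2 = simple_ols(x2, y2)
    df = len(x1) + len(x2) - 4          # N minus the four coefficients in Equation 1
    sse = sum(e ** 2 for e in e1) + sum(e ** 2 for e in e2)
    mse = sse / df                      # pooled error variance estimate
    se = math.sqrt(mse * (1.0 / sxx1 + 1.0 / sxx2))
    return (b2 - b1) / se, df

# Hypothetical data for two groups of five observations each
x1, y1 = [1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.1]
x2, y2 = [1, 2, 3, 4, 5], [2.0, 2.4, 3.1, 3.4, 4.1]
t_stat, df = slope_difference_t(x1, y1, x2, y2)
```

The returned t would be referred to a t distribution with the returned df; the pooled error variance in the denominator is exactly where the homoscedasticity assumption enters.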
In Equation 1, the N residuals (eij) are assumed to have a diagonal N × N covariance matrix given by
$$\operatorname{Cov}(\mathbf{e}) = \sigma^2 \mathbf{I}_N = \operatorname{diag}(\sigma^2, \sigma^2, \ldots, \sigma^2) \qquad (2)$$
(Fox, 2008; Rencher, 2000). Note the common variance on the main diagonal in Equation 2. Heteroscedasticity, in contrast, is said to exist when these variances are no longer equal. This can be denoted by
$$\operatorname{Cov}(\mathbf{e}) = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2) \qquad (3)$$
where $\sigma_m^2 \neq \sigma_{m'}^2$ for some m and m′ such that m ≠ m′ = 1, 2, . . ., N. As an example of heteroscedasticity, assume that k = 2 where the first n1 observations that are from Group 1 have a variance 2 times larger (i.e., $2\sigma^2$) than the remaining n2 observations that are from Group 2; heteroscedasticity exists such that the N diagonal elements of Equation 3 would assume two different values, $2\sigma^2$ or $\sigma^2$. This is an example of between-groups heteroscedasticity (Ng & Wilcox, 2010).1
Between-Groups Heteroscedasticity and Its Biasing Effects
Extant research has found that between-groups heteroscedasticity can affect statistical inferences (e.g., increased Type I or Type II error rates) and these effects are nontrivial (DeShon & Alexander, 1996; Dretzke, Levin, & Serlin, 1982; Ng & Wilcox, 2010; Overton, 2001).
The error variance in the jth group ($\sigma^2_{e_j}$) can be expressed as
$$\sigma^2_{e_j} = \sigma^2_{y_j}\left(1 - \rho^2_{xy_j}\right) \qquad (4)$$
where $\sigma^2_{y_j}$ and $\rho_{xy_j}$, respectively, are the variance of y in the jth group and the correlation coefficient between y and x in the jth group (DeShon & Alexander, 1996; Dretzke et al., 1982). When between-groups homoscedasticity exists, $\sigma^2_{e_1} = \ldots = \sigma^2_{e_k}$.
Inspection of Equation 4 shows that if $\sigma^2_{y_j}$ (or $\rho_{xy_j}$) is constant across the k groups, then any difference in $\rho_{xy_j}$ (or $\sigma^2_{y_j}$) across the k groups will result in between-groups heteroscedasticity, unless values for $\sigma^2_{y_j}$ and $\rho_{xy_j}$ are such that they “balance out” so as to satisfy the homoscedasticity assumption. Moreover, when population regression slopes actually differ, the assumption is likely to be violated (Overton, 2001). This is evident in the following expression, after substituting $\rho_{xy_j} = \beta_{yx_j}\sigma_{x_j}/\sigma_{y_j}$ into Equation 4:
$$\sigma^2_{e_j} = \sigma^2_{y_j} - \beta^2_{yx_j}\sigma^2_{x_j} \qquad (5)$$
where $\beta_{yx_j}$ and $\sigma^2_{x_j}$, respectively, are the slope based on the regression of y on x in the jth group and the variance of x in the jth group. Thus, when slopes are unequal, $\sigma^2_{y_j}$ and $\sigma^2_{x_j}$ in each group must have values that offset one another so as to allow $\sigma^2_{e_1} = \ldots = \sigma^2_{e_k}$. As research suggests, when testing for the equality of regression slopes, violating the between-groups homoscedasticity assumption is not uncommon (Aguinis & Pierce, 1998; DeShon & Alexander, 1996; Overton, 2001; Wilcox, 1997).
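A small numeric sketch may help make the relations in Equations 4 and 5 concrete. It assumes the standard forms of those relations (error variance equals the variance of y times one minus the squared correlation, or equivalently the variance of y minus the squared slope times the variance of x). The function names and the numeric values are hypothetical.

```python
import math

def error_variance_from_rho(var_y, rho_xy):
    """Equation 4 (assumed form): error variance from var(y) and the x-y correlation."""
    return var_y * (1.0 - rho_xy ** 2)

def error_variance_from_slope(var_y, slope, var_x):
    """Equation 5 (assumed form): error variance from var(y), the slope, and var(x)."""
    return var_y - slope ** 2 * var_x

def rho_from_slope(slope, var_x, var_y):
    """Correlation implied by a slope: rho = slope * sd(x) / sd(y)."""
    return slope * math.sqrt(var_x) / math.sqrt(var_y)

# Two hypothetical groups with equal var(x) and var(y) but unequal slopes
var_x, var_y = 1.0, 1.0
err_1 = error_variance_from_slope(var_y, 0.5, var_x)   # 1 - 0.25
err_2 = error_variance_from_slope(var_y, 0.3, var_x)   # 1 - 0.09
variance_ratio = err_2 / err_1
```

With equal variances of x and y in both groups, the unequal slopes alone yield unequal error variances, which is the point made in the text: slope differences tend to bring between-groups heteroscedasticity with them unless the other variances offset it.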
This violated assumption has biasing effects on the Type I error rates and the statistical power of MMR whether subgroup sample size (nj) is equal or unequal across the categorical moderator. Although with equal njs and equal $\sigma^2_{x_j}$ across groups, some argue that Type I error rates perform “acceptably well” (Dretzke et al., 1982, p. 376), when equal $\sigma^2_{x_j}$ across groups is untenable, Type I error rates become conservative, which can reduce the power of MMR (DeShon & Alexander, 1996). However, the power of MMR does not suffer greatly when njs are equal and error variances do not differ considerably (Alexander & DeShon, 1994).
With between-groups heteroscedasticity and unequal njs, however, the effects are much more severe. Type I error rates and statistical power “can be either gross underestimates or severe overestimates depending on the pattern of sample sizes relative to the pattern of error variances” (DeShon & Alexander, 1996, p. 270). More precisely, when the larger $\sigma^2_{e_j}$ is paired with the larger nj (direct pairing), statistical tests based on MMR become conservative. This results in actual Type I error rates less than the nominal level and, ceteris paribus, power is decreased. Conversely, when the larger $\sigma^2_{e_j}$ is paired with the smaller nj (indirect pairing), statistical tests based on MMR become liberal. This results in actual Type I error rates greater than the nominal level and, ceteris paribus, power is increased (albeit illegitimately; see, for example, DeShon & Alexander, 1996; Overton, 2001).2
To exacerbate matters, unequal njs are quite common in the behavioral and social sciences for a number of reasons. One reason is that attrition may result in unbalanced data (Shadish, Cook, & Campbell, 2002), such as in randomized experiments where participants in some conditions fail to complete an outcome measure. Another reason is that the population from which a researcher purposively samples could be disproportionate across subpopulations of the characteristic of interest (e.g., race; Shadish et al., 2002). This commonly occurs in the validation of personnel selection instruments (see, for example, Hattrup & Schmitt, 1990; Hunter, Schmidt, & Hunter, 1979). In addition, in longitudinal studies or in the analysis of archival data, missing values can lead to unequal njs across variables of interest (Schafer & Graham, 2002).
Overall, the biasing effects of between-groups heteroscedasticity on Type I error rates and statistical power can have implications on both theory development and practice in the behavioral and social sciences (Aguinis & Pierce, 1998; Oswald, Saad, & Sackett, 2000; Rosopa et al., 2013). For example, assume that sample sizes are unequal between two independent groups (e.g., male vs. female) and between-groups heteroscedasticity exists such that the larger error variance is paired with the group with the larger sample size (i.e., direct pairing). Furthermore, assume that the researcher/practitioner failed to detect a hypothesized slope difference between groups (i.e., between males and females) that actually exists in the population. Stated differently, the failure to detect a hypothesized moderating effect that exists in the population might be due to the influence of between-groups heteroscedasticity. As detailed in a review by Aguinis and Pierce (1998), inflated Type I error rates could lead to the publication of specious results. This seems plausible considering that, for decades, researchers have noted the problem of failing to detect hypothesized moderators using MMR (Aguinis & Stone-Romero, 1997; McClelland & Judd, 1993; Zedeck, 1971).
As noted above, researchers have identified a number of alternatives to MMR when between-groups heteroscedasticity exists. For example, DeShon and Alexander (1996) conducted a comprehensive Monte Carlo study evaluating the relative performance of various statistical approximations, with two statistical approximations (viz., A and J approximations) having the most promise. With a dichotomous moderator, Overton (2001) suggested a weighted least squares approach for MMR. In addition, some researchers have recommended certain robust estimators regardless of the form of heteroscedasticity (Cribari-Neto, 2004; Long & Ervin, 2000).
Because violation of the between-groups homoscedasticity assumption can affect the Type I error rates and power of MMR, it would be useful to assess whether this assumption has been violated. The following section considers this issue.
A Review of Procedures for Detecting Between-Groups Heteroscedasticity
An issue seldom raised by researchers or practitioners in the context of MMR is how to detect violations of the between-groups homoscedasticity assumption. Although Aguinis (2004) explained that there are two methods (to be noted below) for evaluating whether the assumption has been violated, any procedure that can be used to test the equality of k independent variances could potentially be used to detect between-groups heteroscedasticity in MMR with a categorical moderator. Some procedures involve the variances of the residuals, whereas others may use some other measure of dispersion (Boos & Brownie, 2004; Conover, Johnson, & Johnson, 1981). Some procedures are used specifically in the context of ANOVA (e.g., Brown & Forsythe, 1974). Another procedure is used primarily in regression models in econometrics (e.g., Breusch & Pagan, 1979). In addition, as noted below, a rule-of-thumb has also been recommended in MMR with a categorical moderator (see DeShon & Alexander, 1996). However, the relative performance of these and other procedures described below has not been examined.
In addition, although a number of studies have compared various tests for homogeneity of variances specifically in ANOVA (see, for example, Conover et al., 1981; Martin & Games, 1977), we could not find any studies involving MMR with between-groups heteroscedasticity and the effects of direct and indirect pairing. For example, a simulation conducted by Conover et al. (1981) involved a one-way ANOVA with four independent groups, and they included only direct pairing conditions when N = 80. Thus, because neither sample size nor type of pairing was manipulated, the effect of these factors could not be examined. Games, Winkler, and Probert (1972) empirically investigated the robustness of various tests for homogeneity of variances to violations of the normality assumption in the context of ANOVA with three independent groups. However, because sample sizes were always equal across groups, pairing of error variances was not considered. Boos and Brownie (2004) reviewed the two-sample case and a one-way ANOVA, but did not report simulation results. In a simulation by Sarkar, Kim, and Basu (1999), they included a one-way ANOVA with three independent groups and considered both direct and indirect pairing for Ns as large as 120.
As noted above, Aguinis (2004) mentioned two methods for detecting between-groups heteroscedasticity. One was a heuristic method suggested by DeShon and Alexander (1996). The second was a statistical test by Bartlett (1937). In the sections that follow, we describe these and other procedures that could be used to detect between-groups heteroscedasticity.
DeShon and Alexander (1996) described a heuristic method to signal whether the between-groups homoscedasticity assumption has been violated to such a degree as to unduly influence the results of MMR analyses. Specifically, when a researcher calculates the variance of the residuals separately within each of the k groups, the ratio of the largest estimated variance to the smallest estimated variance should not exceed 1.5. This ratio is computationally simple and does not require specialized software.
Note that the heuristic method is not a statistical test, but rather a rule-of-thumb and its statistical performance, in terms of Type I error or power, has not been examined. As a rule-of-thumb, the heuristic method may not possess the desirable property of being robust at any Type I error rate (α). That is, regardless of α (e.g., .01 or .05), a researcher would conclude that heteroscedasticity exists if the ratio (based on sample estimates of two variances) exceeds 1.5. However, the heuristic method was included in the present simulation to assess its performance relative to other procedures.
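The rule-of-thumb can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the residuals are hypothetical, the function names are invented for the example, and the divisor n − 2 for each group's residual variance is an assumption (the text does not state which divisor is used).

```python
def residual_variance(residuals, n_params=2):
    """Variance of residuals about zero, using df = n - n_params (assumed divisor)."""
    return sum(e ** 2 for e in residuals) / (len(residuals) - n_params)

def heuristic_check(residuals_by_group, threshold=1.5):
    """Return (flag, ratio): flag is True when largest/smallest variance exceeds threshold."""
    variances = [residual_variance(r) for r in residuals_by_group]
    ratio = max(variances) / min(variances)
    return ratio > threshold, ratio

# Hypothetical residuals from the full model, split by group
group_1 = [1.0, -1.0, 1.0, -1.0]
group_2 = [2.0, -2.0, 2.0, -2.0]
flag, ratio = heuristic_check([group_1, group_2])
```

Because the decision depends only on the fixed cutoff of 1.5, nothing in the function refers to α or to sample size, which is why the rule cannot hold a nominal Type I error rate.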
Bartlett (1937) developed a procedure that can be used to test for homogeneity of variances by conducting a transformation of the variances. In the present context, the test involves transforming the variances of the residuals across the levels of z. To test the null hypothesis that $\sigma^2_{e_1} = \ldots = \sigma^2_{e_k}$, we calculate
$$u = \left(\sum_{j=1}^{k} v_j\right)\ln\left(\frac{\sum_{j=1}^{k} v_j s^2_{e_j}}{\sum_{j=1}^{k} v_j}\right) - \sum_{j=1}^{k} v_j \ln s^2_{e_j}$$
and
$$c = 1 + \frac{1}{3(k-1)}\left(\sum_{j=1}^{k}\frac{1}{v_j} - \frac{1}{\sum_{j=1}^{k} v_j}\right)$$
where $s^2_{e_j}$ is an estimate of the variance of the eijs in the jth group with degrees of freedom equal to $v_j$. The test statistic, u/c, is approximately distributed as chi-square with degrees of freedom equal to (k − 1). The null hypothesis is rejected if u/c > $\chi^2_{1-\alpha;\,k-1}$. As suggested by DeShon and Alexander (1996), this procedure can be used to test whether between-groups heteroscedasticity exists. Although Games et al. (1972) found that this test had greater power than a number of other tests, it has been noted that this procedure is sensitive to departures from normality (Box, 1953; Levene, 1960).
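The computation of u/c can be sketched as follows, assuming the standard form of Bartlett's statistic. The function name, the input variances, and the choice of degrees of freedom (n − 2 per group, for groups of 30) are assumptions made for illustration; the critical value 3.841 is the upper .05 point of chi-square with 1 df.

```python
import math

def bartlett_statistic(variances, dfs):
    """Bartlett's u/c from group variance estimates and their degrees of freedom."""
    k = len(variances)
    v_total = sum(dfs)
    pooled = sum(v * s2 for v, s2 in zip(dfs, variances)) / v_total
    u = v_total * math.log(pooled) - sum(v * math.log(s2) for v, s2 in zip(dfs, variances))
    c = 1.0 + (sum(1.0 / v for v in dfs) - 1.0 / v_total) / (3.0 * (k - 1))
    return u / c

CHI2_CRIT_05_DF1 = 3.841  # upper .05 critical value, chi-square with 1 df (k = 2)

# Hypothetical residual variances for two groups of 30, with assumed df = n - 2
stat = bartlett_statistic([1.0, 4.0], [28, 28])
reject = stat > CHI2_CRIT_05_DF1
```

For k > 2, the same function applies, but the critical value would have to be taken from a chi-square distribution with k − 1 degrees of freedom.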
Brown and Forsythe
To detect heteroscedasticity in the context of ANOVA, Brown and Forsythe (1974) suggested conducting a one-way ANOVA on the absolute value of the residuals around the group median instead of the mean (cf. Levene, 1960). Based on simulations conducted by Conover et al. (1981), tests for homogeneity of variances based on the median tend to control Type I error rates better than tests based on the mean. Brown and Forsythe’s procedure is relatively straightforward and appears to be less affected by skewed data in unbalanced designs than other procedures, while still providing adequate statistical power (Lix, 1996). In addition, because of its computational ease, it may be a very practical procedure for researchers and practitioners (Boos & Brownie, 2004; Conover et al., 1981).
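A minimal sketch of the Brown and Forsythe idea follows: take absolute deviations from each group's median, then compute a one-way ANOVA F on those deviations. Applying it to residuals grouped by the moderator is an assumption about how the procedure would be used in MMR. The function name and data are hypothetical, and the sketch returns only F and its degrees of freedom (no p value).

```python
import statistics

def brown_forsythe_f(groups):
    """One-way ANOVA F on absolute deviations from each group's median."""
    devs = [[abs(v - statistics.median(g)) for v in g] for g in groups]
    k = len(devs)
    n_total = sum(len(d) for d in devs)
    grand_mean = sum(sum(d) for d in devs) / n_total
    means = [sum(d) / len(d) for d in devs]
    ss_between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(devs, means))
    ss_within = sum((v - m) ** 2 for d, m in zip(devs, means) for v in d)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical residuals for two groups
group_1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
group_2 = [-3.0, -1.5, 0.0, 1.5, 3.0]
f_stat, df1, df2 = brown_forsythe_f([group_1, group_2])
```

The resulting F would be compared with an F distribution on (k − 1, N − k) degrees of freedom.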
The score test, developed independently in the econometrics (Breusch & Pagan, 1979) and statistics (Cook & Weisberg, 1983) literature, can be used to detect various forms of heteroscedasticity. For example, the score test can be used to test whether error variances differ as a function of continuous predictors, categorical predictors, or predicted values. This procedure requires two regression analyses. In the first analysis, the sum of squares error (SSE) from the regression equation of interest is required (see the numerator of Equation A4 in the appendix). Then, in a second regression analysis, the squared residuals from the first analysis are regressed on the variables believed to be the cause of the heteroscedasticity (e.g., the categorical moderator), and the sum of squares regression (SSR) is calculated. The test statistic for the score test, $(\mathrm{SSR}/2)\,/\,(\mathrm{SSE}/N)^2$, is asymptotically distributed as chi-square with degrees of freedom equal to the number of variables used to predict the squared residuals.
Although the score test is not frequently used in the behavioral sciences, this procedure was included in the present study because of its flexibility. In addition, because the components needed for the statistical test are based on two regression equations (i.e., customized syntax or a stand-alone program is not required), this procedure would be generally accessible for a wide variety of users.
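The two-regression recipe can be sketched for a categorical moderator as follows. The sketch takes the residuals from the first regression as given (hypothetical values here) and uses the fact that regressing squared residuals on group dummy variables yields the group means of the squared residuals as fitted values, so SSR can be computed directly. The function name is invented for the example.

```python
def score_test_statistic(residuals_by_group):
    """Score test statistic (SSR/2) / (SSE/N)^2 with group membership as the predictor."""
    all_res = [e for g in residuals_by_group for e in g]
    n_total = len(all_res)
    sse = sum(e ** 2 for e in all_res)        # SSE from the first regression
    mean_sq = sse / n_total                   # grand mean of the squared residuals
    # Second regression: fitted values are the group means of the squared residuals
    ssr = sum(
        len(g) * (sum(e ** 2 for e in g) / len(g) - mean_sq) ** 2
        for g in residuals_by_group
    )
    stat = (ssr / 2.0) / (sse / n_total) ** 2
    df = len(residuals_by_group) - 1          # number of dummy-coded predictors
    return stat, df

# Hypothetical residuals from the full model, split by group
stat, df = score_test_statistic([[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]])
```

The statistic would be referred to a chi-square distribution with the returned degrees of freedom.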
Analogous to testing for the main and interactive effects in ANOVA, O’Brien (1979, 1981) developed a procedure that could be used to test for the main and interactive effects of the variances in the cells of one-way and factorial designs. This robust procedure has been recommended even when the normality assumption is violated (Maxwell & Delaney, 2000; O’Brien, 1979, 1981). An especially lucid description of the procedure can be found in Maxwell and Delaney (2000).
Because O’Brien’s (1979, 1981) procedure is limited to designs that have only categorical predictors (e.g., one-way and factorial ANOVAs), it would be useful to generalize this method to designs that include categorical and continuous predictors. Below, we describe how O’Brien’s (1979, 1981) method can be used where hypotheses involving the equality of regression slopes are being tested. Here, we focus on a dichotomous moderator.3
The modified procedure requires three steps. The first step is to calculate the residuals (eij) from the full model (see Equation 1). Then, for each group, we calculate
The second step involves a transformation of each of the individual residuals. This calculation is achieved using the following equation:
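To check the calculations, the average of the transformed residuals from Equation 7, within each group, should equal the corresponding value from Equation 6. Specifically, the Group 1 average should equal the Group 1 value from Equation 6, and the Group 2 average should equal the Group 2 value from Equation 6.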
The third step is to conduct a two-independent-samples t test on the transformed residuals from Equation 7, using the categorical moderator (z) (e.g., female vs. male, treatment group vs. control group) as the grouping variable. If the results of this test are statistically significant at some predetermined α, then there is evidence to suggest that between-groups heteroscedasticity exists.4
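The three steps can be sketched as follows, with an important caveat: this sketch assumes O'Brien's original transformation (with weight 0.5) applied directly to the residuals, and a group variance computed with divisor n − 1. The authors' modified formulas may differ in detail, so this should be read as one plausible instantiation, not as their procedure. The function names and data are hypothetical.

```python
import math

def obrien_transform(residuals):
    """Return (transformed values, group variance) using O'Brien's r with weight 0.5."""
    n = len(residuals)
    mean_e = sum(residuals) / n
    s2 = sum((e - mean_e) ** 2 for e in residuals) / (n - 1)   # assumed divisor n - 1
    r = [
        ((n - 1.5) * n * (e - mean_e) ** 2 - 0.5 * s2 * (n - 1)) / ((n - 1) * (n - 2))
        for e in residuals
    ]
    return r, s2

def pooled_t(a, b):
    """Two-independent-samples t statistic (pooled variance) and its df."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    sp2 = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb)), na + nb - 2

# Hypothetical residuals from the full model, split by the dichotomous moderator
r1, s2_1 = obrien_transform([-1.0, -0.5, 0.0, 0.5, 1.0])
r2, s2_2 = obrien_transform([-3.0, -1.5, 0.0, 1.5, 3.0])
t_stat, df = pooled_t(r1, r2)
```

The accuracy check described in the text carries over: within each group, the mean of the transformed values equals that group's variance, so the t test on the transformed values is in effect a test on the group variances.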
In the following sections, we describe the design and results of a Monte Carlo study used to compare the performance (viz., Type I error and statistical power) of the five procedures described above: the heuristic method, Bartlett’s (1937) test, Brown and Forsythe’s (1974) test, the score test, and the modified O’Brien’s (1979, 1981) test.
We used Monte Carlo methods (Robert & Casella, 2004) to evaluate the performance of five procedures that can be used to detect between-groups heteroscedasticity in MMR with a dichotomous moderator. Note that the nominal α for all tests was .05. The manipulated parameters of our 5 × 3 × 8 × 2 × 5 research design resulted in 1,200 conditions. Each of the manipulated parameters is described next.
Total sample size
Five levels of N were used in the present study. These levels were 60, 120, 180, 240, and 300. The Ns for the present study overlap with those used in previous research on MMR (e.g., Aguinis & Stone-Romero, 1997; DeShon & Alexander, 1996) and bracket the Ns typically encountered in validation studies (Salgado, 1998).
Subgroup sample size
Sample size within groups was systematically manipulated using the following three ratios (n1:n2): (a) 1:1, (b) 1:2, and (c) 1:3. For example, when N = 120, the subgroup sample sizes, based on the three ratios, were (a) n1 = n2 = 60, (b) n1 = 40 and n2 = 80, and (c) n1 = 30 and n2 = 90.
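The split of N into subgroup sizes can be expressed as a small helper. The function name is hypothetical, and the sketch assumes, as holds for the Ns and ratios used here, that N divides evenly by the sum of the ratio parts.

```python
def subgroup_sizes(n_total, ratio):
    """Split a total sample size into (n1, n2) according to a ratio such as (1, 2)."""
    a, b = ratio
    parts = a + b
    if n_total % parts != 0:
        raise ValueError("N is not divisible by the sum of the ratio parts")
    unit = n_total // parts
    return a * unit, b * unit

# The three ratios from the text, applied to N = 120
sizes_120 = [subgroup_sizes(120, r) for r in [(1, 1), (1, 2), (1, 3)]]
```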
Between-groups heteroscedasticity assumed eight levels, which involved the ratios of the population error variance in each group ($\sigma^2_{e_1}:\sigma^2_{e_2}$). Specifically, the ratios of the population error variances were (a) 1:1, (b) 1:1.25, (c) 1:1.5, (d) 1:1.75, (e) 1:2, (f) 1:2.5, (g) 1:3, and (h) 1:4. Note that the ratio 1:1 represents homoscedasticity because the population error variances are the same between groups, and the remaining ratios represent increasing levels of between-groups heteroscedasticity. These levels bracket the heuristic approach suggested by DeShon and Alexander (1996), and we did not manipulate the ratio of the error variances beyond these eight levels (e.g., 1:6) because such ratios would result in very high rejection rates, making it difficult to distinguish differences in performance among the procedures.
Type of pairing
Depending on whether the larger error variance ($\sigma^2_{e_j}$) is paired with a small versus a large nj, tests of a hypothesized moderator in MMR can have inflated (or conservative) Type I error rates or reduced statistical power (DeShon & Alexander, 1996; Ng & Wilcox, 2010; Overton, 2001; Shieh, 2009). Although our focus is not on the power of MMR analyses but on the performance of the five procedures to detect between-groups heteroscedasticity, we felt that it would be useful to assess whether the performance of these five procedures differed depending on the type of pairing. We considered direct pairing (i.e., largest $\sigma^2_{e_j}$ paired with the largest nj) and indirect pairing (i.e., largest $\sigma^2_{e_j}$ paired with the smallest nj; see DeShon & Alexander, 1996; Overton, 2001).
In the present study, the MMR effect size ($f^2$) was also manipulated. Although varying the size of the moderating effect was not the focus of our study, we felt that it was useful to determine whether the moderator effect size influenced the performance of the various procedures to detect between-groups heteroscedasticity. We used the modified effect size by Aguinis, Beaty, Boik, and Pierce (2005). Based on their 30-year review of research involving MMR with a categorical moderator in applied psychology and allied fields, the median effect size was .002.
Thus, in the present study, the levels of the manipulated effect size were .001, .002, .005, .01, and .02. These levels included the median effect size reported by Aguinis et al. (2005). Although Cohen (1988) labeled $f^2$ = .02 as a small effect, Aguinis et al. found in their review that this was the effect size at which studies in applied psychology and management had an average power level of .84 to detect such an effect. Beyond this effect size, the power of the usual MMR test for equality of regression slopes would exceed typical recommended levels for power.
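The full factorial design can be enumerated to confirm the number of conditions. The variable names below are hypothetical; the factor levels are those stated in the text, with each error variance ratio stored as the second term of the 1:x ratio.

```python
from itertools import product

TOTAL_N = [60, 120, 180, 240, 300]
SIZE_RATIOS = [(1, 1), (1, 2), (1, 3)]                           # n1:n2
VARIANCE_RATIOS = [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0]     # 1:x error variances
PAIRINGS = ["direct", "indirect"]
EFFECT_SIZES = [0.001, 0.002, 0.005, 0.01, 0.02]                 # f^2

# Cross all five factors: 5 x 3 x 8 x 2 x 5
conditions = list(product(TOTAL_N, SIZE_RATIOS, VARIANCE_RATIOS, PAIRINGS, EFFECT_SIZES))
n_conditions = len(conditions)
```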
For each condition, data generation and statistical analyses were conducted in R, a free, open-source statistical software package (Culpepper & Aguinis, 2011; R Development Core Team, 2011). For the jth group, nj observations of bivariate normal data with population means of 0 were generated using the mvrnorm function in the MASS library in R. The population variances for x ($\sigma^2_{x_j}$) were set to 1. $\sigma^2_{e_j}$ assumed values as described above. The slope in one group ($\beta_{yx_1}$) = 0.5 whereas the slope in the other group ($\beta_{yx_2}$) was allowed to differ so as to equal one of the specified values for $f^2$. Equation 5 was used to solve for $\sigma^2_{y_j}$. Then, Equation 4 was used to solve for $\rho_{xy_j}$. Note that to compute $\beta_{yx_2}$, the Solver function in MS Excel 2010 was used. This function can minimize or maximize a formula by changing user-specified cells. Alternatively, it can be used to set a formula to a specified value (e.g., $f^2$ = .002) by changing user-specified cells. More precisely, given the just-noted parameters and equalities, Solver was used to find the value for $\beta_{yx_2}$ (referred to as the cell to be changed in MS Excel) that would result in a specific $f^2$ (referred to as the target cell in MS Excel). The target cell contained the formula by Aguinis et al. (2005). Note that all default options were used in the function. However, the precision option was set to 1 × 10−17. Although the Solver function could not find a solution for $\beta_{yx_2}$ that results in an $f^2$ exactly equal to the manipulated value, the difference was minuscule and, therefore, the solution was retained. For example, for one condition with parameters as specified above, where $f^2$ should equal .002, the Solver function found the value for $\beta_{yx_2}$ that results in an $f^2$ = .00199995713664686.5
We also conducted a series of accuracy checks to ensure that the data we generated conformed to the various parameters that we manipulated. In addition, we checked our data generation algorithm against similar conditions considered by DeShon and Alexander (1996), Dretzke et al. (1982), and Overton (2001).
On each simulated data set, the five procedures (viz., heuristic method; Bartlett’s, 1937, test; Brown and Forsythe’s, 1974, test; the score test; and the modified O’Brien’s, 1979, 1981, test) were used to test whether between-groups heteroscedasticity existed. For each condition, there were 5,000 replications. The proportion of times that the null hypothesis was rejected within a condition was recorded for each procedure.
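The replication loop can be sketched for a single condition and a single procedure. This is a simplified stand-in for the authors' R code, under several stated assumptions: data are generated with the Python standard library as y = slope × x + error (which yields bivariate normal data with means of 0) rather than with mvrnorm; both groups share a slope of 0.5, so no moderating effect is built in; residual variances use the divisor n − 2; only the heuristic method is applied; and 2,000 replications are used instead of 5,000. All function names are hypothetical.

```python
import random

def simulate_group(n, slope, error_sd, rng):
    """Generate n (x, y) pairs with x ~ N(0, 1) and y = slope * x + N(0, error_sd^2)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, error_sd) for xi in x]
    return x, y

def residual_variance(x, y):
    """Residual variance from a simple regression of y on x (assumed divisor n - 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return sse / (n - 2)

def heuristic_rejection_rate(n1, n2, sd1, sd2, reps, seed=1):
    """Proportion of replications in which the variance ratio exceeds 1.5."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        v1 = residual_variance(*simulate_group(n1, 0.5, sd1, rng))
        v2 = residual_variance(*simulate_group(n2, 0.5, sd2, rng))
        if max(v1, v2) / min(v1, v2) > 1.5:
            rejections += 1
    return rejections / reps

# Homoscedastic condition with N = 60 and equal subgroup sizes
rate = heuristic_rejection_rate(30, 30, 1.0, 1.0, reps=2000)
```

Because the population error variances are equal in this condition, the returned proportion is an empirical Type I error rate; setting sd1 and sd2 to unequal values would instead give an empirical power estimate.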
The performance of the five procedures is compared below in terms of Type I error rate and power. Due to space limitations, we do not present the results of all 1,200 conditions. Because the pattern of results was the same regardless of the size of the moderating effect, we present results when $f^2$ = .002, the median effect size based on the 30-year review by Aguinis et al. (2005). Note that the complete set of results and R code can be obtained from the first author.
Type I Error
For the conditions in which homoscedasticity existed (i.e., $\sigma^2_{e_1}:\sigma^2_{e_2}$ = 1:1), the average Type I error rates for Bartlett’s (1937) test, Brown and Forsythe’s (1974) test, the score test, and the modified O’Brien’s (1979, 1981) test were similar to the nominal alpha (i.e., .05; see Table 1). The procedure that appeared to perform poorly was the heuristic method. Although the other four procedures appear to control Type I error at the nominal level, the heuristic method was not robust (Serlin, 2000), with empirical rejection rates typically much greater than .05. For example, when N = 60 and subgroup sample sizes were equal, the empirical Type I error rate for the heuristic method was .2864 whereas the other four procedures had empirical Type I error rates near .05.
Although the heuristic method is not a formal statistical test, but simply a rule-of-thumb, and given that sampling error will affect the estimate of the residual variance in Group 1 and the estimate of the residual variance in Group 2, when N = 60 and subgroup sample sizes are equal, it appears that due to chance alone, 28.64% of the time the heuristic method would signal that between-groups heteroscedasticity exists when it does not. Perhaps not surprisingly, this inflated Type I error rate becomes increasingly worse as the sample size of the subgroups becomes more disproportionate. That is, the effect of sampling error on the estimates of the subgroup residual variance is exacerbated. For example, when N = 60 and n1:n2 = 1:2, the empirical Type I error rate for the heuristic method was .3362 and when n1:n2 = 1:3, the empirical Type I error rate was further inflated to .3714. Notably, the empirical Type I error rates for the other four procedures remained near .05.
Another interesting result regarding the heuristic method is that as N increases its empirical Type I error rate decreases. For example, when N = 120 and subgroup sample sizes were equal, the empirical Type I error rate for the heuristic method was .1184 and when N = 240, the empirical Type I error rate decreased to .0302. This is due to the fact that as N increases, the sampling error associated with estimating the residual variance in each group also decreases. Thus, as N increases, the two estimated variances are much more precise estimates of the population ratio ( ) equal to 1. Again, for the other four procedures, the empirical Type I error rates remained near .05 regardless of N and subgroup sample sizes.
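The dependence of the heuristic method's false-alarm rate on N can be illustrated with a short simulation. This Python sketch rests on two assumptions that are not stated in this section: that the rule of thumb flags heteroscedasticity when the larger residual variance exceeds 1.5 times the smaller one, and that each residual variance has nj − 2 degrees of freedom. Under normality and homoscedasticity, each residual variance is then distributed as a scaled chi-square variable, so the regressions themselves need not be simulated. The function name is hypothetical.

```python
import random

def heuristic_type1_rate(n1, n2, threshold=1.5, reps=20000, seed=1):
    """Share of homoscedastic samples in which the variance ratio exceeds threshold."""
    rng = random.Random(seed)
    df1, df2 = n1 - 2, n2 - 2
    flagged = 0
    for _ in range(reps):
        # gammavariate(df / 2, 2) is a chi-square draw with df degrees of freedom;
        # dividing by df gives a residual variance estimate when sigma^2 = 1.
        s2_1 = rng.gammavariate(df1 / 2.0, 2.0) / df1
        s2_2 = rng.gammavariate(df2 / 2.0, 2.0) / df2
        ratio = max(s2_1, s2_2) / min(s2_1, s2_2)
        flagged += ratio > threshold
    return flagged / reps

for n in (30, 60, 120):  # equal subgroup sizes, so N = 60, 120, 240
    print(2 * n, heuristic_type1_rate(n, n))
```

Under these assumptions the simulated rates should fall as N grows, in line with the pattern reported for Table 1, because a fixed cutoff on the ratio ignores how much the ratio varies by chance at a given sample size.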
Power
In this section, empirical rejection rates when the homoscedasticity assumption is violated (i.e., heteroscedasticity exists) are presented (see Tables 2-4). Table 2 presents results when subgroup sample sizes are equal (i.e., n1:n2 = 1:1). Table 3 and Table 4 present results when subgroup sample sizes are unequal, n1:n2 = 1:2 and n1:n2 = 1:3, respectively. There were notable differences in the performance of the five procedures in terms of power, which we describe for equal subgroup sample sizes and for both the direct and indirect pairing conditions.
Equal subgroup sample sizes
Recall that when subgroup sample sizes are equal, the distinction between direct and indirect pairing does not apply, because neither group is larger than the other. In Table 2, the four procedures other than the heuristic method had power that increased monotonically as N increased and as the degree of between-groups heteroscedasticity increased. Although Bartlett’s (1937) test tended to be the most powerful and Brown and Forsythe’s (1974) test tended to be the least powerful, there is relatively little difference in the power of these four procedures when subgroup sample sizes are equal.
Although the heuristic method had the greatest power of all five procedures, recall that it also had very inflated Type I error rates (see Table 1); its increased power comes at that cost. Its power increased monotonically as the degree of between-groups heteroscedasticity increased. For example, in Table 2, with N = 120, the heuristic method had power of .2480 when the ratio of error variances was 1:1.25, .4928 when the ratio was 1:1.5, and .8630 when the ratio was 1:2.
Interestingly, when the degree of between-groups heteroscedasticity is fixed (e.g., 1:1.25), the power of the heuristic method did not increase monotonically as N increased. Recall that when subgroup sample sizes were equal, N = 60, and homoscedasticity was satisfied, the empirical Type I error rate was .2864 (see Table 1). For a fixed N (e.g., 60), power increased from that baseline as the degree of between-groups heteroscedasticity increased (see Table 2). Because the empirical Type I error rate of the heuristic method decreased as N increased (cf. Table 1), power at larger Ns still increases with heteroscedasticity but starts from a much lower baseline.
Direct pairing
In Table 3, when there was direct pairing, the heuristic method was generally the most powerful of the five procedures when N ≤ 180. Otherwise, the most powerful procedure was Bartlett’s (1937) test, followed by the score test, Brown and Forsythe’s (1974) test, and the modified O’Brien’s (1979, 1981) test. For these four procedures, power increased monotonically as N increased and as the degree of between-groups heteroscedasticity increased. Consistent with Table 2, at milder levels of heteroscedasticity under direct pairing (a variance ratio of 1:1.25), the power of the heuristic method decreased as N increased. Recall that as N increased, the empirical Type I error rate for the heuristic method decreased. Thus, at smaller Ns, the heuristic method started with a power advantage because of its inflated Type I error rate, and this advantage tended to shrink as N increased and the Type I error rate fell. This is unlike the other four procedures, which are statistical tests whose power has its minimum value (i.e., lower asymptote) at alpha regardless of N or subgroup sample sizes.
In Table 4, with direct pairing, the trends were similar to those in Table 3. Power was generally lower than in Table 3 because of the more disproportionate subgroup sample sizes, but the rank order of the procedures remained the same. Excluding the heuristic method, which was the most powerful due to its inflated Type I error rate, Bartlett’s (1937) test continued to be the most powerful and the modified O’Brien’s (1979, 1981) test was still the least powerful.
Indirect pairing
For all five procedures, power was lower when there was indirect pairing than when there was direct pairing. Notably, among the four procedures that controlled the Type I error rate at the nominal level (viz., Bartlett’s, 1937, test; Brown and Forsythe’s, 1974, test; the score test; and the modified O’Brien’s, 1979, 1981, test), the rank order changed under indirect pairing. When there was indirect pairing (see Table 3 and Table 4), the modified O’Brien’s (1979, 1981) test tended to be the most powerful, followed by the score test, Bartlett’s (1937) test, and Brown and Forsythe’s (1974) test.
It deserves noting that the heuristic method had the greatest statistical power of all five procedures because of its inflated Type I error rate. However, as its Type I error rate decreased with increasing N, its power became similar to that of the other four procedures (see Table 4).
Because between-groups heteroscedasticity is a problem in MMR analyses with categorical moderators, the present study compared the performance of various procedures that could be used to detect this statistical violation. As noted above, research has focused primarily on remedial procedures that can be used when between-groups heteroscedasticity exists. However, we felt that it was also important to compare different ways of detecting between-groups heteroscedasticity that have not been previously examined empirically in MMR with a dichotomous moderator (viz., heuristic method; Bartlett’s, 1937, test; Brown and Forsythe’s, 1974, test; score test; and modified O’Brien’s, 1979, 1981, test). By comparing various procedures, we hoped to offer some initial recommendations for researchers and practitioners in the behavioral and social sciences.
A number of key findings can be gleaned from our study. In general, Bartlett’s (1937) test is the most powerful in detecting between-groups heteroscedasticity when sample sizes are equal or when direct pairing occurs, thus providing empirical support for the recommendation offered by DeShon and Alexander (1996). It is noteworthy, however, that when there is indirect pairing (i.e., the largest error variance paired with the smallest nj), the modified O’Brien’s (1979, 1981) test appears to be the most powerful procedure. This suggests that multiple statistical procedures may be necessary when diagnosing between-groups heteroscedasticity, such that one procedure should be used for direct pairing and a different procedure for indirect pairing.
The score test performed well across conditions, typically with the second highest power levels. Perhaps due to its origins in econometrics and statistics, it does not appear to be well known in the psychology literature and related fields. However, the score test may still be a very attractive alternative for researchers because of its flexibility to detect heteroscedasticity of various forms, including between-groups heteroscedasticity.
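For readers unfamiliar with the score test, a minimal sketch follows. It assumes the test takes the textbook Breusch-Pagan (Cook-Weisberg) form with a single group dummy as the variance predictor; the exact variant used in the study is not given in this section and may differ. The function name is hypothetical, and the inputs are the residuals of the full model, split by group.

```python
import math

def score_test_two_groups(resid_1, resid_2):
    """Score statistic and p-value for a variance difference between two groups."""
    n1, n2 = len(resid_1), len(resid_2)
    n = n1 + n2
    ss1 = sum(e * e for e in resid_1)
    ss2 = sum(e * e for e in resid_2)
    sigma2_ml = (ss1 + ss2) / n  # maximum likelihood variance estimate, SSE / N
    # u_i = e_i^2 / sigma2_ml has a sample mean of exactly 1. Regressing u on a
    # group dummy gives fitted values equal to the two group means of u.
    u1_bar = ss1 / (n1 * sigma2_ml)
    u2_bar = ss2 / (n2 * sigma2_ml)
    explained = n1 * (u1_bar - 1.0) ** 2 + n2 * (u2_bar - 1.0) ** 2
    stat = explained / 2.0
    # One variance predictor, so the reference distribution is chi-square
    # with 1 df; its upper tail probability equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Toy residuals: the second group is three times as variable in sd terms.
stat, p = score_test_two_groups([1.0, -1.0] * 10, [3.0, -3.0] * 10)
print(stat, p)
```

The same construction extends to other forms of heteroscedasticity by replacing the group dummy with whatever variables are thought to drive the error variance, which is the flexibility noted above.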
Brown and Forsythe’s (1974) test was the least powerful across conditions. It deserves noting, however, that this procedure was developed to be robust against violations of normality. Because normality was not manipulated in the present study, it is plausible that under conditions of non-normality, in which research has suggested that Bartlett’s (1937) test performs poorly (Box, 1953; Levene, 1960), Brown and Forsythe’s test could potentially outperform Bartlett’s test. Similarly, because O’Brien’s (1979, 1981) method has been found to be robust when the normality assumption is violated, it is possible that, under conditions of non-normality, the modified O’Brien could outperform Bartlett’s test even in the direct pairing conditions.
As N and the degree of between-groups heteroscedasticity increase, the differences in power among the five procedures become small. For the conditions considered in the present study, it appears that for Ns ≥ 240, it generally makes little difference which procedure is used, especially if there is a high degree of between-groups heteroscedasticity.
The present study demonstrated that O’Brien’s (1979, 1981) procedure can be extended to designs beyond one-way and factorial ANOVA to include continuous predictors. The modified procedure controlled Type I error at the nominal level and had power levels comparable with, and in some cases greater than, other procedures.
The heuristic method had very poor properties. Admittedly, it is not a statistical test. Thus, it may not be reasonable to expect the heuristic method to be robust. Note that the empirical rejection rates (i.e., Type I error and power) for the heuristic method are unaffected by whether α = .01, .05, or .10. Thus, at any alpha, for the 1,200 conditions considered in the present study, the heuristic method would have the same rejection rates. To counteract its inflated Type I error rate, and interpolating from Table 1, the heuristic method may be recommended for use when k = 2 and N > 200.
Recommendations for Research and Practice
A few recommendations for research and practice can be identified. First, when testing for the equality of regression slopes, it is important that researchers and practitioners evaluate whether the homoscedasticity assumption has been satisfied. Consistent with Rosopa et al. (2013), the residuals from Equation 1 (for the two-group case, specifically) or Equation A1 in the appendix (for two or more groups, more generally) should be calculated. Then, the sample-based variance of these residuals can be calculated separately for each group. Assuming that N > 200, a simple ratio of the largest to the smallest residual variance can be calculated. In addition, direct pairing exists if the largest group has the largest residual variance; alternatively, indirect pairing exists if the largest group has the smallest residual variance. As subgroup sample sizes become increasingly disproportionate, it becomes increasingly important to know whether direct pairing or indirect pairing exists.
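The first recommendation can be sketched as follows. This Python code is an illustration, not the authors' implementation: the function names are hypothetical, the residual variance uses a divisor of nj − 2 (an assumption; a divisor of nj − 1 would also be defensible), and each group is fitted separately, which yields the same residuals as a full model with a separate intercept and slope per group.

```python
def residual_variance(xs, ys):
    """Variance of the residuals from regressing y on x within one group."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return (syy - sxy ** 2 / sxx) / (n - 2)

def describe_pairing(groups):
    """groups maps a label to (xs, ys); returns (variance ratio, pairing type)."""
    stats = {g: (len(xs), residual_variance(xs, ys)) for g, (xs, ys) in groups.items()}
    variances = [v for _, v in stats.values()]
    ratio = max(variances) / min(variances)
    sizes = {n for n, _ in stats.values()}
    largest_group = max(stats, key=lambda g: stats[g][0])
    if len(sizes) == 1:
        pairing = "not applicable (equal n)"
    elif stats[largest_group][1] == max(variances):
        pairing = "direct"    # largest group has the largest residual variance
    elif stats[largest_group][1] == min(variances):
        pairing = "indirect"  # largest group has the smallest residual variance
    else:
        pairing = "mixed"
    return ratio, pairing

# Toy data: the larger group (b) scatters more widely around its line.
toy = {
    "a": ([0, 1, 2, 3], [0.1, 0.9, 2.1, 2.9]),
    "b": ([0, 1, 2, 3, 4, 5], [1, 0, 3, 2, 5, 4]),
}
print(describe_pairing(toy))
```

Knowing the pairing type matters for the choice of test in the second recommendation, so it is convenient to compute the ratio and the pairing together.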
Second, based on the results of the present study, Bartlett’s (1937) test is the most powerful for detecting between-groups heteroscedasticity when the normality assumption is satisfied and direct pairing exists. However, when there is indirect pairing, the modified O’Brien’s (1979, 1981) test should be used. Notably, if subgroup sample sizes are approximately equal, it makes little difference which statistical procedure is used because the differences in statistical power are generally small.
Third, if between-groups heteroscedasticity is detected, an alternative procedure should be used instead of ordinary least squares regression. To mitigate the biasing effects of between-groups heteroscedasticity, Rosopa et al. (2013) discussed a number of procedures including weighted least squares regression and heteroscedasticity-consistent covariance matrices.
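As one illustration of such a remedy, the sketch below shows a simple form of weighted least squares for a dichotomous moderator, with weights equal to the reciprocal of each group's residual variance. It is an assumption-laden example rather than the specific estimator discussed by Rosopa et al. (2013), and the function names are hypothetical. Because the weights are constant within each group and the full model has a separate intercept and slope per group, the weighted coefficients equal the per-group OLS coefficients; what changes is the standard error of the slope difference, which uses each group's own residual variance instead of a pooled one.

```python
import math

def group_fit(xs, ys):
    """Slope, sum of squares of x, and residual variance for one group."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    s2 = (syy - slope * sxy) / (n - 2)
    return slope, sxx, s2

def wls_slope_difference(group_1, group_2):
    """Slope difference, its WLS standard error, and their ratio."""
    b1, sxx1, s2_1 = group_fit(*group_1)
    b2, sxx2, s2_2 = group_fit(*group_2)
    # With weights 1 / s2_j, var(b_j) is s2_j / sxx_j, so the two groups
    # contribute separately to the variance of the difference.
    se = math.sqrt(s2_1 / sxx1 + s2_2 / sxx2)
    return b1 - b2, se, (b1 - b2) / se

diff, se, stat = wls_slope_difference(
    ([0, 1, 2, 3], [0.1, 0.9, 2.1, 2.9]),
    ([0, 1, 2, 3, 4, 5], [1, 0, 3, 2, 5, 4]),
)
print(diff, se, stat)
```

The sketch stops at the test statistic; the appropriate reference distribution and degrees of freedom depend on the specific procedure adopted.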
The present study adds incrementally to the extant literature on statistical procedures that can be used to detect between-groups heteroscedasticity in MMR with categorical moderators. It appears that different procedures may be needed to optimally detect between-groups heteroscedasticity when there is direct pairing (viz., Bartlett’s, 1937, test) versus indirect pairing (viz., modified O’Brien’s, 1979, 1981, test). This is a finding unique to this study. Moreover, because the heuristic method has never been empirically examined, the present simulation results are the first to note that this method has very inflated Type I error rates and it may be best to use this method when N > 200 to counteract the inflated Type I error rates. In addition to comparing the performance of various procedures, we proffered a modification to O’Brien’s (1979, 1981) method, which can be added to the statistical tools used by researchers and practitioners in the behavioral and social sciences.
Appendix
The full linear model for the observations (with the number of terms p = 2k − 1) can be compactly expressed in matrix form as
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \tag{A1}$$
where y is an N × 1 response vector, X is an N × (p + 1) model matrix, β is a (p + 1) × 1 vector of unstandardized regression coefficients, and e is an N × 1 residual vector. In addition, it is assumed that the first-order and second-order moments of e are E(e) = 0 and cov(e) = σ²I_N, respectively (where 0 = a null vector, σ² = the common variance, and I_N = an identity matrix of order N; Schott, 2005).
The best linear unbiased estimator of the parameters in Equation A1 is
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{A2}$$
Although X, in Equation A2, can be partitioned differently, for convenience:
$$\mathbf{X} = [\,\mathbf{j} \;\; \mathbf{x} \;\; \mathbf{D}_z \;\; \mathbf{D}_{xz}\,], \tag{A3}$$
where j is an N × 1 vector of 1s, x is an N × 1 vector for the continuous predictor, Dz is an N × (k − 1) matrix of regressors, and Dxz is an N × (k − 1) matrix of product terms between x and the regressors in Dz. Based on Equation A2 and the constant variance assumption (i.e., homoscedasticity), an unbiased estimator of σ² can be expressed as
$$\hat{\sigma}^2 = \frac{SSE}{N - p - 1} = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{N - p - 1}, \tag{A4}$$
where SSE = sum of squared errors. Moreover, when e is normally distributed, maximum likelihood estimators of β and σ², respectively, are given by Equation A2 and SSE / N (Rencher, 2000).
Although X in Equation A3 represents the full model matrix, a full-and-reduced linear model approach can be used to construct the test of whether the k population regression slopes are equal (Rencher, 2000). The reduced model matrix (XReduced) excludes Dxz. Then, β becomes a (k + 1) × 1 vector of regression coefficients.
Assuming that e is normally distributed, the test for the equality of regression slopes is conducted using an F ratio. It assesses whether the decrease in the SSE from a reduced (SSEReduced) to a full (SSEFull) model is statistically significant. The F random variable can be expressed as
$$F = \frac{(SSE_{Reduced} - SSE_{Full}) / df_1}{SSE_{Full} / df_2}, \tag{A5}$$
where df1 = the number of terms omitted from the full model and df2 = the error degrees of freedom for the full model. It is worth noting that an equivalent general linear hypothesis test can be conducted using the full model in Equation A3 (see Equation 8.27 in Rencher, 2000). If F > F(1 − α, df1, df2) (where α = Type I error rate), then the null hypothesis of equal regression slopes is rejected; stated differently, z moderates the relation between y and x. Otherwise, the null hypothesis of equal regression slopes cannot be rejected. These procedures are described in greater detail in numerous texts (see Cohen, Cohen, West, & Aiken, 2003; Fox, 2008; Maxwell & Delaney, 2000; Neter, Kutner, Nachtsheim, & Wasserman, 1996). Note that when k = 2, the test of the moderating effect based on the F ratio in Equation A5 is equivalent to a two-tailed t test with df2 = N − 4.
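The full-and-reduced model comparison can be sketched without matrix algebra, because both models have closed-form sums of squared errors when there is a single continuous predictor. The Python function below is a hypothetical illustration, not the authors' code: the full model has a separate intercept and slope per group, and the reduced model has separate intercepts and one common slope. It returns the F ratio and its degrees of freedom and leaves the comparison with the critical value to the reader.

```python
def slope_equality_f(groups):
    """F ratio, df1, and df2 for equal slopes across groups.

    groups is a list of (xs, ys) pairs, one per level of the moderator.
    """
    k = len(groups)
    n_total = 0
    sse_full = 0.0
    sxx_sum = sxy_sum = syy_sum = 0.0
    for xs, ys in groups:
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        syy = sum((y - my) ** 2 for y in ys)
        sse_full += syy - sxy ** 2 / sxx  # separate intercept and slope
        sxx_sum, sxy_sum, syy_sum = sxx_sum + sxx, sxy_sum + sxy, syy_sum + syy
        n_total += n
    # Reduced model: separate intercepts, one pooled within-group slope.
    sse_reduced = syy_sum - sxy_sum ** 2 / sxx_sum
    df1 = k - 1            # product terms omitted from the full model
    df2 = n_total - 2 * k  # N - p - 1 with p = 2k - 1
    f_ratio = ((sse_reduced - sse_full) / df1) / (sse_full / df2)
    return f_ratio, df1, df2

print(slope_equality_f([
    ([0, 1, 2, 3], [0.1, 0.9, 2.1, 2.9]),
    ([0, 1, 2, 3, 4, 5], [1, 0, 3, 2, 5, 4]),
]))
```

With k = 2 the degrees of freedom reduce to df1 = 1 and df2 = N − 4, and the F ratio equals the square of the t statistic for the product term.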
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research and/or authorship of this article.
Authors’ Note
Portions of this article were presented at the 75th annual conference of the Psychometric Society in Athens, Georgia, and the 26th annual conference of the Society for Industrial and Organizational Psychology in Chicago, Illinois.
1. In the literature, between-groups heteroscedasticity has also been labeled Type II heteroscedasticity (Wilcox, 1997) and heterogeneity of subgroup error variance (Aguinis & Pierce, 1998). Because heteroscedasticity can take many forms (Fox, 2008; Long & Ervin, 2000; Rosopa, Schaffer, & Schroeder, 2013), we adopt the term “between-groups heteroscedasticity” (Ng & Wilcox, 2010) because we believe that it concisely describes the specific form that can manifest in moderated multiple regression (MMR) with a categorical moderator.
2. Historically, in the univariate two-sample case involving independent means, heteroscedasticity has been referred to as the Behrens–Fisher problem (Miller, 1997; Rencher, 1998) after the researchers who proposed approximate solutions in the 1920s and 1930s. In addition, numerous approximate solutions have been proposed for the multivariate version of the Behrens–Fisher problem (Kim, 1992; Rencher, 1998).
3. Note that by modifying O’Brien’s (1979, 1981) procedure, it can also be used to detect between-groups heteroscedasticity when the categorical moderator has more than two levels (e.g., Treatment A, Treatment B, vs. Control; varying levels of race or ethnicity). These formulas are available from the first author. However, for simplicity in our exposition, we focus on a dichotomous moderator (see Equation 1).
5. The values for various parameters in one of the experimental conditions were as follows: N = 120, n1 = 40, n2 = 80, , , , , , , , , and f² = .00199995713664686. Note that, given the values of the manipulated parameters (i.e., N, n1:n2, , and type of pairing), f² was not precisely equal to .002. We refer the reader to DeShon and Alexander (1996) and Dretzke et al. (1982) for a similar overview of the data generation process.
- © The Author(s) 2016
This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Patrick J. Rosopa is an associate professor of industrial-organizational psychology in the Department of Psychology at Clemson University. He has co-authored a book titled Statistical Reasoning in the Behavioral Sciences (6th ed., 2010, Wiley). His research has been published in such outlets as Psychological Methods, Organizational Research Methods, Human Resource Management Review, Personality and Individual Differences, and Scandinavian Journal of Psychology.
Amber N. Schroeder is an assistant professor in the Department of Psychological Sciences at Western Kentucky University. Her research interests focus primarily on (a) the use of social media in employment settings, (b) the examination of negative employee and organizational behavior, and (c) the assessment of employee personality and culture and their impact on work outcomes. Her research has been published in such outlets as Psychological Bulletin, Psychological Methods, Journal of Occupational Health Psychology, and Journal of Managerial Psychology.
Jessica L. Doll is an assistant professor in the Department of Management and Decision Sciences at Coastal Carolina University. Her research interests include workplace romances, impression management and political skill, and cross-cultural and gender differences in selection and engagement. She has presented at national and international refereed conferences and her research has been published in Journal of Managerial Psychology and Journal of Organizational Behavior Management.