1 Introduction
In longitudinal clinical trials the number and the collection times of the repeated measurements are oftentimes the same for all participants. Consequently, there is usually enough replications at each measurement occasion and within each treatment group, in the case of moderately sized or large randomized trials, to estimate the true underlying covariance matrix. Mixed-model repeated measures (MMRM) – a likelihood-based regression model for modeling the mean with time included categorically – has been generally recommended for the primary analysis of continuous outcomes with longitudinal trial designs with the use of unstructured (UN) modeling of within-patient covariances [1]. With the focus on modeling the mean response with time included categorically, the corresponding generalized estimating equations (GEE) method can provide an alternative (i.e., marginal models with an identity link). When the longitudinal measurements are multivariate Gaussian, the solution to the estimating equations is the generalized least squares estimator of the regression coefficients from MMRM [2]. Like MMRM, GEE can handle missing data easily, albeit provided that the longitudinal outcomes satisfy the missing completely at random (MCAR) assumption [3].
A compelling argument can sometimes be made to support the MCAR assumption in well-conducted trials in rare diseases. Patient centeredness is central to trial design. Involving patients, patient advocates, and family members in trial design can increase retention and ensure adequate assessment of trial outcomes [4, 5]. Compared with non-rare disease trials, trials in rare diseases tend to enroll fewer patients, address a high unmet medical need, and use patient-centered or surrogate endpoints to assess efficacy [6, 7]. Such design characteristics can mitigate dropouts and incomplete data. The use of external controls, shorter placebo-phase duration, and designs that assure participants that they will receive the experimental treatment favor patient participation and retention and further reduce the risk of dropout [8, 9]. However, if MCAR is not clinically defensible and dropout is assumed to be missing at random (MAR), the estimating equations need to be weighted by the propensity to dropout [10]. Whereas it is generally appreciated that MMRM provides a valid framework under the MAR assumption, it is not well recognized that this assumption does not hold if the covariance pattern has not been correctly specified [11]; and, therefore, may lead to a false sense of security with its use.
Little priori information about the covariance pattern is available in rare diseases. Furthermore, there are no reliable ways to identify the best covariance model in small samples [12]. If the data allow it, an UN covariance pattern should be considered; it minimizes the risk of a misspecified covariance structure. If convergence problems in numerical optimization are encountered, strategies to achieve convergence should be considered before adopting a parsimonious covariance pattern [1, 13, 14]. In some controlled or single-arm trials in rare diseases, however, there simply are not enough participants relative to the number of time points to estimate the covariance terms with an UN covariance pattern. In either case, and because we regard the correlation structure as a nuisance, the compound symmetry (CS) covariance pattern could pragmatically be considered in trials in rare diseases, which recognizes the dependence at the cost of only one extra covariance parameter.
Modeling time categorically imposes no structure on the mean response over time. It is unlikely, therefore, that the model for the mean response is misspecified; MMRM and GEE estimates of the regression coefficients will be valid. However, because the CS covariance pattern will likely be misspecified, the model-based standard errors (SEs) for MMRM and GEE estimates of the regression coefficients will be invalid [11, 15]. This is true in small and large samples [16]. With both methods, valid SEs of the regression coefficients can be based on the sandwich variance estimator, which are robust to any misspecification of the covariance [15]. Of course, the sandwich variance estimator is based on large-sample properties and its performance deteriorates for small samples [17–19]. Small-sample corrections for these SEs are needed to correct for bias.
Researchers have developed and compared a variety of bias-corrected SEs across a range of simulated scenarios, albeit separately for MMRM and GEE [18–27]. Furthermore, as opposed to relying on large-sampling properties of the regression coefficients when building confidence intervals (CIs) and testing hypotheses for the regression coefficients, $t-$ and F-distributions using degrees of freedom (DOF) approximations to account for uncertainty in estimation of the regression coefficients can be used. Because it is not always clear what defines a small sample, a suitable bias-corrected strategy should be considered for trials conducted in rare diseases along with the choice to use either MMRM or GEE.
While there are numerous statistical publications on using these methods spanning decades, there is a dearth of literature in the rare disease community that uses real data to collectively report on both methods applied to continuous outcomes collected longitudinally from trials in rare diseases that is succinctly presented in an accessible manner. Analysts in rare disease research may be confused on which method to apply in practice and how to use the chosen method correctly. Coupled with the need to specify the primary analysis method in a protocol before the trial begins [28], a case studies approach to prespecifying a modeling strategy in rare diseases was taken. Herein, we considered three unique, albeit representative longitudinal trial designs with continuous efficacy outcomes conducted in rare diseases. We report the degree of missing data, evaluate the missing data pattern, and assess common characteristics of the empirical covariance pattern of the repeated measurements. We then simulated data directly from these trials to share with other analysts in the rare disease community a concise review of the operating characteristics of these methods based on real data. We provide a succinct summary and practical advice to complement the statistical literature on when and how to apply these methods in practice.
2 Methods
2.1 Efficacy Outcomes
The primary (or important secondary) efficacy outcome from three completed, longitudinal, phase 3 FDA registration trials conducted in rare diseases were selected (Table 1): changes in the HHD upper extremity composite score, measured at baseline and every 8 weeks through study week 48 in UX001-CL301, a phase 3 randomized study evaluating sialic acid extended-release for GNE myopathy (N = 44 placebo; N = 45 treatment) [29]; changes in rickets severity, assessed by the Radiographic Global Impression of Change (RGI-C) global score and measured at weeks 40 and 64 in UX023-CL301, a randomized phase 3 trial (N = 32 control; N = 29 treatment) comparing the efficacy and safety of burosumab to conventional therapy for XLH [30]; and changes in alkaline phosphatase (ALP), measured at baseline and at weeks 4, 12, 24, 36, and 48 in UX023-CL304, a phase 3 single arm study in N = 14 participants investigating the effects of burosumab on osteomalacia in adults with X-linked hypophosphatemia (XLH) [31]. For each outcome, we report the degree and pattern of missing data, evaluated the MCAR assumption using Little’s MCAR Test [32], and assessed the empirical covariance pattern of the changes from baseline.
Table 1
The three longitudinal, phase 3 clinical trials conducted in rare diseases, which formed the basis for the simulation study.
| Clinical Trial | No. Post-BL Occasions | Trial Duration | Sample Size | Primary (or Important Secondary) Efficacy Endpoint | Prospectively Defined Primary Analysis | ClinTrials.Gov | |
| Randomized or Enrolled | Primary Analysis Set | ||||||
| UX001-CL301 | 6 | 48 Weeks | 89 | 88 | Primary endpoint: Change from baseline at Week 48 in the HHD Upper Extremity (UE) Composite Score | GEE, Change from Baseline, compound symmetry, adjusting for sex, region, and baseline value | NCT02377921 |
| UX023-CL301 (Pediatrics) | 2 | 64 Weeks | 61 | 61 | Primary endpoint: Rickets severity, assessed by the Radiographic Global Impression of Change (RGI-C) global score and measured at Week 40 | GEE, Change from Baseline, compound symmetry, adjusting for age and the baseline total rickets severity score | NCT02915705 |
| UX023-CL304 (Adults) | 5 | 48 Weeks | 14 | 14 | Important secondary endpoint: Change from baseline at Week 48 in alkaline phosphatase (ALP) | GEE, Change from Baseline, compound symmetry, adjusting for the baseline value | NCT02537431 |
BL = Baseline; GEE = generalized estimating equations.
Trials UX001-CL301 and UX023-CL301 were randomized controlled trials, while UX023-CL304 was a single arm trial.
The prespecified Primary Analysis Set was defined as all randomized (or enrolled) participants with a baseline measurement and at least one post-baseline measurement.
Standard GEE was applied such that estimation was based on method-of-moments assuming asymptotic normality using the SAS GENMOD procedure.
The prespecified GEE method, which was the primary analysis method and specified in the protocol prior to seeing the data, modeled the longitudinal changes from baseline assuming a CS covariance pattern adjusting for the corresponding baseline values (baseline total rickets severity score in UX023-CL301) and other prespecified potential confounding factors; the sandwich variance estimator of the regression coefficients without a small-sample correction was used and also prespecified. The SAS procedure GENMOD was used such that estimation was based on the method-of-moments and CIs and tests assumed asymptotic normality [so-called standard GEE; Liang and Zeger (1986)] [3]. In each trial, the prespecified GEE method was applied to the primary analysis set, defined in the protocol as all randomized (or enrolled) participants who had a baseline measurement and ≥1 post-baseline measurement and was used to assess a treatment effect at a single time point. The number of participants in each primary analysis set was 88 (43 placebo; 45 treatment), 61 (32 control; 29 treatment), and 14 in UX001-CL301, UX023-CL301, and UX023-CL304, respectively (Table 1).
Table 2
Data generating model for UX001-CL301, UX023-CL301, and UX023-CL304.
| Clinical Trial | MMRM |
| Data Generating Model | |
| UX001-CL301 | $\begin{aligned}{}E\left({Y_{i}}\mid \hspace{2.5pt}{X_{i}}\right)={\beta _{0}}+{\beta _{1}}I\left(\mathrm{G}=1\right)& +{\beta _{2}}I\left(\mathrm{W}=16\right)+{\beta _{3}}I\left(\mathrm{W}=24\right)+{\beta _{4}}I\left(\mathrm{W}=32\right)+{\beta _{5}}I\left(\mathrm{W}=40\right)+{\beta _{6}}I\left(\mathrm{W}=48\right)+{\beta _{7}}I\left(\mathrm{G}=1\right)\times I\left(\mathrm{W}=16\right)\\ {} & +\hspace{2.5pt}{\beta _{8}}I\left(\mathrm{G}=1\right)\times I\left(\mathrm{W}=24\right)+{\beta _{9}}I\left(\mathrm{G}=1\right)\times I\left(\mathrm{W}=32\right)+{\beta _{10}}I\left(\mathrm{G}=1\right)\times I\left(\mathrm{W}=40\right)\\ {} & +{\beta _{11}}I\left(\mathrm{G}=1\right)\times I\left(\mathrm{W}=48\right)+{\gamma _{1}}\mathrm{Base}+{\gamma _{2}}I\left(\mathrm{Sex}=\mathrm{Male}\right)+{\gamma _{3}}I\left(\mathrm{Region}=\mathrm{NonUS}\right)\end{aligned}$ |
| Repeated measurements taken on $i=1,\hspace{2.5pt}\dots ,\hspace{2.5pt}88$ participants (45 treatment; 43 placebo) at the same set of 6 post-baseline occasions. ${Y_{i}}$ = Change from baseline. I represents an indicator variable. G = Group; W = Week; Base = Baseline value; Reference categories for variables G, W, Sex, and Region were placebo, week 8, female, and US, respectively. Unstructured (UN) covariance structure was assumed. The estimates of the regression coefficients and variance / covariance matrix from fitting the MMRM model to the trial data were used as the true values. The covariates Base, Sex, and Region were generated from a distribution reasonably consistent with the trial data: a truncated N(56.14, 27.954), 55% males, and 66% Non-US. 100,000 data sets generated; on average, in any given simulated data set, 1.4% of the data were missing completely at random. The regression coefficients of primary interest are ${\beta _{1}}$ and ${\beta _{11}}$; specifically, the null hypothesis ${H_{0}}:\hspace{2.5pt}{\beta _{1}}+{\beta _{11}}=0$ corresponds to no treatment difference in the mean change from baseline at week 48 adjusting for the other covariates; this is similar to testing ${H_{0}}:\hspace{2.5pt}\mathrm{Treament}\hspace{2.5pt}\mathrm{Difference}\hspace{2.5pt}\text{in}\hspace{2.5pt}\mathrm{LS}\hspace{2.5pt}\mathrm{Means}\hspace{2.5pt}\mathrm{at}\hspace{2.5pt}\mathrm{Week}\hspace{2.5pt}48=0$ (i.e., the marginal mean change from baseline is the same in each arm). | |
| UX023-CL301 (Pediatrics) | $\begin{aligned}{}E\left({Y_{i}}\mid \hspace{2.5pt}{X_{i}}\right)={\beta _{0}}+{\beta _{1}}I\left(\mathrm{G}=2\right)& +{\beta _{2}}I\left(\mathrm{W}=64\right)+{\beta _{3}}I\left(\mathrm{G}=2\right)\times I\left(\mathrm{W}=64\right)\\ {} & +{\gamma _{1}}\mathrm{TOTSCORE}+{\gamma _{2}}I\left(\mathrm{Age}=2\right)\end{aligned}$ |
| Repeated measurements taken on $i=1,\hspace{2.5pt}\dots ,61$ participants (32 conventional therapy; 29 burosumab) at the same set of 2 post-baseline occasions. ${Y_{i}}$ = Change from baseline. I represents an indicator variable. G = Group; W = Week; TOTSCORE = Baseline Total Rickets Severity Score; Reference categories for variables G, W, and Age were burosumab, week 40, and < 5 years, respectively. Unstructured (UN) covariance structure was assumed. The estimates of the regression coefficients and variance / covariance matrix from fitting the MMRM model to the trial data were used as the true values. The covariates TOTSCORE and Age were generated from a distribution reasonably consistent with the trial data: a truncated N(3.18, 1.057) and 57% ≥ 5 years old. 100,000 data sets were generated; on average, in any given simulated data set, 0.5% of the data were missing completely at random. The regression coefficient of primary interest, ${\beta _{1}}$, corresponds to the treatment difference in the mean change from baseline at week 40 adjusting for the other covariates; this is similar to testing ${H_{0}}:\hspace{2.5pt}\mathrm{Treament}\hspace{2.5pt}\mathrm{Difference}\hspace{2.5pt}\text{in}\hspace{2.5pt}\mathrm{LS}\hspace{2.5pt}\mathrm{Means}\hspace{2.5pt}\mathrm{at}\hspace{2.5pt}\mathrm{Week}\hspace{2.5pt}40=0$ (i.e., the marginal mean change from baseline is the same in each arm). | |
| UX023-CL304 (Adults) | $\begin{aligned}{}E\left({Y_{i}}\mid \hspace{2.5pt}{X_{i}}\right)={\beta _{0}}+{\beta _{1}}I\left(\mathrm{W}=12\right)& +{\beta _{2}}I\left(\mathrm{W}=24\right)+{\beta _{3}}I\left(\mathrm{W}=36\right)+{\beta _{4}}I\left(\mathrm{W}=48\right)\\ {} & +{\gamma _{1}}\mathrm{Base}\end{aligned}$ |
| Repeated measurements taken on $i=1,\hspace{2.5pt}\dots ,14$ participants at the same set of 5 post-baseline occasions. ${Y_{i}}$ = Change from baseline in alkaline phosphatase. I represents an indicator variable. W = Week; Base = Baseline value; Reference category for the variable W was study week 4. Unstructured (UN) covariance structure was assumed. The estimates of the regression coefficients and variance / covariance matrix from fitting the MMRM model to the trial data were used as the true values. The covariate Base was generated from a distribution reasonably consistent with the trial data: a truncated N(113.14, 51.600). 500,000 data sets were generated; on average, in any given simulated data set, 1.2% of the data were missing completely at random. The regression coefficients of primary interest are ${\beta _{0}}$ and ${\beta _{4}}$; specifically, the null hypothesis ${H_{0}}:\hspace{2.5pt}{\beta _{0}}+{\beta _{4}}=0$ corresponds to no difference in the mean change from baseline at week 48 adjusting for the baseline value. This is similar to testing ${H_{0}}:\hspace{2.5pt}\mathrm{LS}\hspace{2.5pt}\mathrm{Mean}\hspace{2.5pt}\mathrm{at}\hspace{2.5pt}\mathrm{Week}\hspace{2.5pt}48=0$ (i.e., the marginal mean change from baseline equals zero). |
2.2 Simulation Study
The aim was to understand the operating characteristics of the prespecified GEE method to estimation and MMRM applied to each trial, and to recommend when to use the sandwich variance estimator with or without a small-sample correction with both analytic approaches. Multivariate Gaussian longitudinal outcomes of changes from baseline were generated with a fixed covariance. For a linear model with a given covariance matrix, MMRM and GEE (marginal model with an identify link function) will give identical results of the regression coefficients; however, the methods differ in how they estimate the covariance [14]. Because GEE does not use likelihood methods, the estimated “model” is incomplete and not suitable for simulation. To study the operating characteristics of both methods, data were generated from an estimated MMRM fit to each dataset’s outcome (Appendix Table 1). For the two-arm trials (UX001-CL301 and UX023-CL301), the parameterization of the mean part of the model included an intercept, indicators for time, a group indicator, the two-way interaction between the indicators for time and the group indicator, the baseline value and a set of protocol-specified covariates. For the single arm trial (UX023-CL304), the mean part of the model comprised an intercept, indicators for time, and the baseline value. The baseline values and the protocol-specified covariates were generated from a distribution consistent with what was seen in the data. The models and primary statistical hypotheses expressed in mathematical terms are shown in Table 2. Based on re-analysis of the trials’ outcomes, we successfully estimated an UN covariance pattern. Compared with CS, the Akaike Information Criterion (AIC) [33] expressed a strong preference [34] (difference in AIC>10) for an UN covariance pattern in UX001-CL301 and UX023-CL304. The lack of a preference (difference in AIC<2) for an UN covariance pattern in UX023-CL301 notwithstanding, a fixed UN covariance pattern was used in data generation for the longitudinal outcomes from UX001-CL301, UX023-CL304, and UX023-CL301. To generate data that were MCAR, we removed each datum independently of the other data with probability 1.4%, 0.5%, and 1.2% in datasets simulated from UX001-CL301, UX023-CL301, and UX023-CL304, respectively. These missing data percentages were based on the real trial data; more dialogue is provided on this subject in the results section.
Table 3
For each dataset generated in the simulation study, the following models were applied.
| Clinical Trial | Model | Method | Covariance Pattern | Standard Error Adjustment | Small-Sample Correction to Standard Errors | Degree of Freedom (DOF) in t Tests and t-based Confidence Limits |
| UX001-CL301 & UX023-CL301 (Pediatrics) & UX023-CL304 (Adults) | M0 | MMRM | UN | Kenward-Roger | No | Kenward-Roger approximation for the DOF based on the SE adjustment |
| M1 | MMRM | CS | None, Model-based | No | Between-Within method DOF by Schluchter and Elashoff (1990) | |
| M2 | MMRM | CS | Sandwich Estimator | No | ||
| M3 | MMRM | CS | Sandwich Estimator | Mancl and DeRouen (2001) | ||
| M4 | GEE | CS | None, Model-based | No | Asymptotic normal distribution (standard GEE) with z-based inference | |
| M5* | GEE | CS | Sandwich Estimator | No | ||
| M6 | GEE | CS | Sandwich Estimator | Mancl and DeRouen (2001) | ||
| M7 | GEE | CS | None, Model-based | No | Between-Within method DOF by Schluchter and Elashoff (1990) | |
| M8 | GEE | CS | Sandwich Estimator | No | ||
| M9 | GEE | CS | Sandwich Estimator | Mancl and DeRouen (2001) |
GEE = generalized estimating equations; MMRM = mixed model repeated measures; SE = standard error; DOF = degrees of freedom; UN = Unstructured; CS = Compound symmetry.
* Prespecified GEE method, which was the primary analysis method defined in the protocol for each trial.
-
Note 1. Trials UX001-CL301 (N = 88) and UX023-CL301 (N = 61) were randomized controlled trials, while UX023-CL304 (N = 14) was a single arm trial.
-
Note 2. The SAS procedure PROC GLIMMIX with METHOD = RSPL was used to fit the MMRM models. The SAS Keyword EMPIRICAL = FIRORES was used to obtain the bias-corrected SEs by Mancl and DeRouen (2001) made to the empirical/robust standard errors. The SAS Keyword DDFM = BETWITHIN was used to obtain the between-within DOF approximation by Schluchter and Elashoff (1990): The DOF approximation used for the between and within-subject effects were ${N_{1}}-{p_{1}}$ and ${N_{2}}-({N_{1}}+{p_{2}})$, respectively; here, ${N_{1}}=$ number of analyzable subjects; ${N_{2}}=$ number of nonmissing observations; ${p_{1}}=$ number of between-subject effects + intercept; ${p_{2}}=$ number of within-subject effects. For hypothesis tests that involved within-subject effects, ${N_{2}}-({N_{1}}+{p_{2}})$ DOF was used.
-
Note 3. The standard GEE method of moments approach was used that mirrors the SAS procedure PROC GENMOD, albeit with the use of the GEE WITH DIAGNOSTICS (Version 1.06) SAS/IML macro by John Preisser, University of North Carolina-Chapel Hill; the SAS/IML macro calculates the model-based, empirical, and bias-corrected SEs (Mancl and DeRouen, 2001).
-
Note 4. All models adjusted for the prespecified covariates.
Table 3 lists the modeling scenarios studied. MMRM was estimated using restricted maximum likelihood with the SAS procedure GLIMMIX. MMRM with unstructured covariance (Model M0) served as the benchmark. Here, the Kenward-Roger method [35] was considered, which combines a SE adjustment with a Satterthwaite-type DOF approximation. Models M1, M2, and M3 represented an MMRM with CS covariance pattern that, respectively, relied on model-based, sandwich-based, and bias-corrected sandwich-based standard errors by Mancl and DeRouen (2001) [19] by specifying the EMPIRICAL = FIRORES option in the GLIMMIX statement. For models M1-M3, inferences were based on the t-distribution using the between-within method [36] (DDFM = BETWITHIN) for computing DOF. The DOF approximation used for the between- and within-subject effects in the model were ${N_{1}}-{p_{1}}$ and ${N_{2}}-({N_{1}}+{p_{2}})$, respectively, where ${N_{1}}=$ number of analyzable subjects; ${N_{2}}=$ number of nonmissing observations; ${p_{1}}=$ number of between-subject regression coefficients + intercept; ${p_{2}}=$ number of within-subject regression coefficients. For hypothesis tests that involved within-subject effects, ${N_{2}}-({N_{1}}+{p_{2}})$ DOF was used. For GEE, estimation was based on method-of-moments assuming a CS covariance pattern (Models M4-M9). Because built-in bias-corrected SEs are not available with GENMOD, we used the SAS/IML GEE macro (Version-1.06) [37], which mirrors the estimation method in GENMOD, but also calculates bias-corrected SEs by Mancl and DeRouen (2001). Inference was based on asymptotic normality (M4-M6) or the t-distribution using the between-within DOF-approximation obtained with MMRM (M7-M9).
We compared the bias and SEs associated with the regression coefficients of primary interest, the coverage of their corresponding 95% CIs, and the SEs and power of the test expressed in terms of the least square means (estimated marginal means) testing the null hypothesis that the treatment difference of the marginal mean change from baseline at weeks 48 and 40 equaled zero in the randomized trials UX001-CL301 and UX023-CL301, respectively, and the null hypothesis that the marginal mean change from baseline at week 48 equaled zero in the single arm trial UX023-CL304. Testing the primary hypotheses in terms of the estimated marginal means was prespecified in each study.
3 Results
Table 4
Missing data summary and empirical covariance pattern of the changes from baseline for each outcome according to the phase 3 clinical trial.
| Phase 3 Clinical Trial | No. Randomized or Enrolled | No. Post-BL Occasions | No. Patients Discontinuing Early | Percent of BL and Post-BL Measurement Occasions Missing Data | Little’s MCAR Testd | Empirical Covariance Matrixe Changes from Baseline | ||||||
| UX001-CL301 | 89a | 6 | 2b | 1.4% (9 / 623) | $P=0.54$ | WK8 | WK16 | WK24 | WK32 | WK40 | WK48 | |
| WK8 | 16.03 | 8.14 | 9.68 | 14.96 | 12.91 | 10.95 | ||||||
| WK16 | 19.34 | 13.45 | 16.39 | 16.81 | 16.90 | |||||||
| WK24 | 27.72 | 23.29 | 18.94 | 17.14 | ||||||||
| WK32 | 41.50 | 29.96 | 26.17 | |||||||||
| WK40 | 34.40 | 24.82 | ||||||||||
| WK48 | 32.43 | |||||||||||
| UX023-CL301 (Pediatrics) | 61 | 2 | 0 | 0.5% (1 / 183) | $P=0.19$ | WK40 | WK64 | |||||
| WK40 | 1.26 | 1.14 | ||||||||||
| WK64 | 1.35 | |||||||||||
| UX023-CL304 (Adults) | 14 | 5 | 1c | 1.2% (1 / 84) | $P=0.42$ | WK4 | WK12 | WK24 | WK36 | WK48 | ||
| WK4 | 594.55 | 414.12 | 72.74 | −70.23 | −53.46 | |||||||
| WK12 | 591.14 | 217.44 | 68.62 | 203.40 | ||||||||
| WK24 | 239.92 | 217.23 | 336.04 | |||||||||
| WK36 | 301.54 | 427.17 | ||||||||||
| WK48 | 659.03 | |||||||||||
MCAR = Missing completely at random; WK = Week
Trials UX001-CL301 and UX023-CL301 were randomized controlled trials, while UX023-CL304 was a single arm trial.
-
a Of the 89 randomized participants, 88 were included in the prespecified Primary Analysis Set, which was defined in the protocol as all randomized participants with a baseline measurement and at least one post-baseline measurement; a single participant did not have any of the 6 post-baseline measurements.
-
b One participant submitted outcome data for the first five post-BL measurement occasions, while the other participant did not submit any post-BL measurements; reason for study discontinuation was subject non-compliance for both participants.
-
c The single participant, who withdrew consent, submitted outcome data for the first four post-BL measurement occasions.
-
d Data are missing completely at random (MCAR) when the pattern of missing values does not depend on the data values. The null hypothesis for Little’s MCAR Test is that the data are MCAR. A P > 0.05 indicates weak evidence against the null hypothesis.
-
e The empirical covariance matrix was estimated from the participants included in the prespecified Primary Analysis Set, namely, the N = 88, N = 61, and N = 14 participants in trials UX001-CL301, UX023-CL301, and UX023-CL304, respectively, who had a baseline measurement and at least one post-baseline measurement.
Within each trial, Table 4 shows the proportion of missing data, results from applying Little’s MCAR Test, and the empirical covariance pattern of the repeated measurements. Two participants in UX001-CL301 and one participant in trial UX023-CL304 discontinued the trial prematurely, while none of the participants in UX023-CL301 discontinued early. The reasons for study discontinuation were subject non-compliance (both participants in UX001-CL301) and withdrawal of consent (single participant in UX023-CL304). In total, 1.4%, 0.5%, and 1.2% of the measurement occasions in UX001-CL301, UX023-CL301, and UX023-CL304, respectively, were missing outcome data. In each trial, no missing outcome data pattern occurred in >1 participant such that the missing outcome data resembled a random sample of all the outcome data. Coupled with the clinical plausibility of MCAR and the results from applying Little’s MCAR test (all P>0.19), the longitudinal outcome data within each trial reasonably satisfied the MCAR assumption. Based on the empirical covariances for each outcome, the variances and the pairwise covariances were not constant over time.
Figure 1
UX001-CL301 (N = 88 Two-Arm Study) – Simulation Results for models M0-M9: Model M0 represents the mixed-model repeated measures (MMRM) model with correctly specified unstructured (UN) covariance and use of the Kenward-Roger method, which combined a standard error adjustment with a Satterthwaite-type degree of freedom (DOF) approximation. M1-M3 represent an MMRM with compound symmetry (CS) covariance pattern that, respectively, used model-based, sandwich-based, and bias-corrected sandwich-based standard errors by Mancl and DeRouen (2001); t-based inference was used with the Between-Within method for determining DOF. M4-M9 represent a generalized estimating equations (GEE) model (marginal model with identify link) with inference based on asymptotic normality (M4-M6) or the t-distribution using the Between-Within DOF-approximation (M7-M9). Plots of the coverage of the 95% confidence intervals for the primary regression coefficients of interest (A) ${\beta _{1}}$ and (B) ${\beta _{11}}$, with the average standard errors of the respective regression coefficients listed at the right of each plot; the actual standard error estimated from the trial data for ${\beta _{1}}$ and ${\beta _{11}}$ is shown in parentheses for comparison. The red dashed line on both plots represents the targeted 95% confidence coefficient while the shaded pink region represents coverage between 94% to 96% as a frame of reference. (C) Plot of the average standard errors of the treatment difference in the marginal mean change from baseline at week 48 with respect to the actual standard error estimated from the trial data (green dashed line); the shaded green region represents ± 5% of the standard error estimated from the trial data as a frame of reference. The power used to test the primary hypothesis, ${H_{0}}:\hspace{2.5pt}\mathrm{Treatment}\hspace{2.5pt}\mathrm{Difference}\hspace{2.5pt}\text{in}\hspace{2.5pt}LS\hspace{2.5pt}\mathrm{Means}\hspace{2.5pt}\mathrm{at}\hspace{2.5pt}\mathrm{Week}\hspace{2.5pt}48=0$, is listed at the right of the plot.
Appendix Table 2A shows the operating characteristics associated with each method, with and without the sandwich variance estimator and small-sample correction, based on 100,000 simulated datasets each from UX001-CL301 and UX023-CL301 and 500,000 from UX023-CL304. The prespecified GEE method assumed a CS covariance pattern with the use of the sandwich variance estimator without a small-sample correction and inference based on asymptotic normality, while the data were generated from MMRM with UN covariance pattern. On average, in any given dataset simulated from UX001-CL301, UX023-CL301, and UX023-CL304, 1.4%, 0.5%, and 1.2% of the longitudinal measurements were MCAR.
Figure 2
UX023-CL301 (N = 61 Two-Arm Study) – Simulation Results for models M0-M9: Model M0 represents the mixed-model repeated measures (MMRM) model with correctly specified unstructured (UN) covariance and use of the Kenward-Roger method, which combined a standard error adjustment with a Satterthwaite-type degree of freedom (DOF) approximation. M1-M3 represent an MMRM with compound symmetry (CS) covariance pattern that, respectively, used model-based, sandwich-based, and bias-corrected sandwich-based standard errors by Mancl and DeRouen (2001); t-based inference was used with the Between-Within method for determining DOF. M4-M9 represent a generalized estimating equations (GEE) model (marginal model with identify link) with inference based on asymptotic normality (M4-M6) or the t-distribution using the Between-Within DOF-approximation (M7-M9). Plots of the coverage of the 95% confidence intervals for the primary regression coefficient of interest (A) ${\beta _{1}}$, with the average standard errors of the respective regression coefficient listed at the right of the plot; the actual standard error estimated from the trial data for ${\beta _{1}}$ is shown in parentheses for comparison. The red dashed line on the plot represents the targeted 95% confidence coefficient while the shaded pink region represents coverage between 94% to 96% as a frame of reference. (B) Plot of the average standard errors of the treatment difference in the marginal mean change from baseline at week 40 with respect to the actual standard error estimated from the trial data (green dashed line); the shaded green region represents ± 5% of the actual standard error estimated from the trial data as a frame of reference. The power used to test the primary hypothesis, ${H_{0}}:\hspace{2.5pt}\mathrm{Treatment}\hspace{2.5pt}\mathrm{Difference}\hspace{2.5pt}\text{in}\hspace{2.5pt}LS\hspace{2.5pt}\mathrm{Means}\hspace{2.5pt}\mathrm{at}\hspace{2.5pt}\mathrm{Week}\hspace{2.5pt}40=0$, is listed at the right of the plot.
UX001-CL301
In the largest (N = 88) study with 6 post-baseline measurement occasions and little missing data, incorrect SEs resulted when the sandwich variance estimator was not used with MMRM and a misspecified CS covariance pattern leading to 95% CIs that were too wide or narrow for ${\beta _{1}}$ and ${\beta _{11}}$, respectively, and power that was too large (relative to the MMRM model with correctly specified covariance) for the test of the primary hypothesis (Figure 1). Specifically, the coverage probability for ${\beta _{1}}$(between-subject effect) and ${\beta _{11}}$ (within-subject effect) was 0.9903 and 0.9274, respectively, and the power testing the primary hypothesis was 0.1238 compared with 0.0977 achieved with MMRM that used a correctly specified covariance pattern and the Kenward-Roger adjustment. Findings were similar when the model-based SEs were used with GEE. Although the bias-corrected SEs for ${\beta _{1}}$, ${\beta _{11}}$, and for the primary hypothesis test with GEE and MMRM were larger, relative to the SEs seen in the data, t-based inference yielded coverage probabilities ≥95% for both ${\beta _{1}}$ and ${\beta _{11}}$, and power that was commensurate with the power achieved with MMRM that used a correctly specified covariance pattern and the Kenward-Roger adjustment.
UX023-CL301
In the modestly large study (N = 61) with 2 post-baseline measurement occasions and little missing data, any differences between the methods were subtle but no less instructive (Figure 2). A CS covariance pattern was a close approximation to the true underlying covariance pattern that was used to simulate the data. The elements of the 2x2 covariance matrix used in generating the data were ${\sigma _{1}^{2}}=0.5992$, ${\sigma _{2}^{2}}=0.5795$, and ${\sigma _{12}}={\sigma _{12}}=0.4437$. Consequently, both methods with model-based SEs and t-based inference achieved 95% coverage for ${\beta _{1}}$ (between-subject effect). MMRM with the Kenward-Roger adjustment also achieved the correct coverage. The use of the sandwich variance estimator without and with the bias-correction resulted in SEs that were modestly smaller and larger, respectively, relative to the SEs of ${\beta _{1}}$ and in the test of the primary hypothesis that was seen in the data.
UX023-CL304
In the small, single arm study (N = 14) with 5 post-baseline measurement occasions and for which there were also few missing data points, the CS covariance pattern was markedly misspecified for all models studied. Using bias-corrected sandwich variance estimators with both methods and with t-based inference was needed, consistently achieving coverage probabilities closest to 95% (${\beta _{0}}$ [between-subject effect] and ${\beta _{4}}$[within-subject effect]: 0.9541 and 0.9417 with both MMRM and GEE using t-based inference). Additionally, the standard error and power for the test of the primary hypothesis was similar to that achieved by MMRM with correctly specified covariance and the Kenward-Roger adjustment. Other models that assumed a CS covariance pattern consistently performed considerably worse (Figure 3). Interestingly, for MMRM with correctly specified covariance, the Kenward-Roger adjustment to the standard error for the intercept, ${\beta _{0}}$, was consistent with the data, however, the Satterthwaite-type DOF approximation based on that adjusted standard error resulted in 92% coverage. For all other regression coefficients (${\beta _{1}},\hspace{2.5pt}{\beta _{2}},\hspace{2.5pt}{\beta _{3}},\hspace{2.5pt}\text{and}\hspace{2.5pt}{\beta _{4}}$), which were fixed effects that changed within subjects, the correct coverage was achieved (Appendix Table 2).
Figure 3
UX023-CL304 (N = 14 Single-Arm Study) – Simulation Results for models M0-M9: Model M0 represents the mixed-model repeated measures (MMRM) model with correctly specified unstructured (UN) covariance and use of the Kenward-Roger method, which combined a standard error adjustment with a Satterthwaite-type degree of freedom (DOF) approximation. M1-M3 represent an MMRM with compound symmetry (CS) covariance pattern that, respectively, used model-based, sandwich-based, and bias-corrected sandwich-based standard errors by Mancl and DeRouen (2001); t-based inference was used with the Between-Within method for determining DOF. M4-M9 represent a generalized estimating equations (GEE) model (marginal model with identify link) with inference based on asymptotic normality (M4-M6) or the t-distribution using the Between-Within DOF-approximation (M7-M9). Plots of the coverage of the 95% confidence intervals for the primary regression coefficients of interest (A) ${\beta _{0}}$, and (B) ${\beta _{4}}$ with the average standard errors of the respective regression coefficients listed at the right of each plot; the actual standard error estimated from the trial data for each regression coefficient is shown in parentheses for comparison. The red dashed line represents the targeted 95% confidence coefficient while the shaded pink region represents coverage between 94% to 96% as a frame of reference. (C) Plot of the average standard errors of the marginal mean change from baseline at week 48 with respect to the actual standard error estimated from the trial data (green dashed line); the shaded green region represents ± 5% of the actual standard error estimated from the trial data as a frame of reference. The power used to test the primary hypothesis, ${H_{0}}:\hspace{2.5pt}LS\hspace{2.5pt}\mathrm{Means}\hspace{2.5pt}\mathrm{at}\hspace{2.5pt}\mathrm{Week}\hspace{2.5pt}48=0$, is listed at the right of the plot.
3.1 Summary of Results
In each trial, <1.5% of the measurement occasions were missing outcome data and the MCAR assumption was reasonably satisfied. Box 1 succinctly summarizes the main findings. Notably, the empirical variances and pairwise covariances were not constant over time in UX001-CL301 and UX023-CL304; assuming a CS covariance pattern, therefore, resulted in considerable model misspecification of the covariances. Consequently, MMRM and GEE assuming a CS covariance pattern without the use of the sandwich variance estimator performed badly. Bias-corrected SEs with t-based inference was needed in the smallest trial, UX023-CL304, with both methods to consistently achieve at least 94% coverage probability for the regression coefficients; furthermore, the power testing the primary hypothesis was commensurate with the power achieved with MMRM that used a correctly specified covariance pattern and the Kenward-Roger standard error adjustment. The largest trial, UX001-CL301, also benefited from bias-corrected SEs with t-based inference that yielded coverage probabilities ≥95% and power identical to the power achieved when the covariance pattern was correctly specified. In the modestly large study, UX023-CL301, with only two post-baseline measurement occasions, adopting the sandwich variance estimator was not helpful. Given that the CS covariance pattern was a close approximation to the truth, the larger sampling variability associated with the sandwich variance estimator [17] resulted in coverage probabilities that were less than the nominal level; choosing model-based SEs with t-based inference with either method or MMRM with the Kenward-Roger adjustment achieved the correct coverage probability.
4 Discussion
MMRM, a response profile analysis, arguably dates back to Greenhouse and Geisser (1959) [38], and has been widely adopted as the primary analysis for analyzing repeatedly measured continuous outcomes in randomized clinical trials (RCTs). For an example of a recently completed phase 3 trial (N = 65; NCT03399786) in a rare disease that applied MMRM as the primary analysis to assess a treatment effect at a single time point, albeit without the sandwich variance estimator and small-sample correction, see Raal et al (2020) [39]. Since the two initial companion papers on GEE for longitudinal data [3, 40] coupled with the connections drawn between GEE and likelihood-based methods [41] – and the widely available software notwithstanding – GEE has infrequently been used as the primary analysis for continuous outcomes in RCTs. Based on a recent comprehensive evaluation of statistical methods applied as the primary analysis to repeatedly measured continuous outcomes from RCTs published in 2019, Ren et al (2022) [42] estimated 4% (4/96) adopted GEE to assess a treatment effect at a single time point compared with 36% (35/96) that used MMRM; the remaining 59% (57/96) oversimplified the longitudinal design and applied conventional methods such as t-tests and analysis of covariance to assess a treatment effect at a single time point. One rare disease trial (N = 53; NCT02208687) that adopted in primary analysis GEE with the sandwich variance estimator was designed to investigate the effectiveness at a single time point of a self-management group program to improve social participation and endurance in patients with neuromuscular disease with chronic fatigue [43]. Either method can be considered for modeling longitudinal changes in the mean response in trials conducted in rare diseases with a parsimonious CS covariance structure; however, care needs to be exercised with both methods to ensure its correct use.
Inferences about longitudinal changes in mean response and its relationship to treatment arm are sensitive to the chosen covariance model. This is true for both MMRM and GEE when modeling the mean with time included categorically, and despite the covariances being treated as nuisance parameters with GEE. In rare disease trials, it is common, however, to pragmatically prespecify a parsimonious CS covariance pattern due to the smaller sample sizes or to prespecify a decision to adopt a CS covariance pattern if an UN covariance pattern was inestimable. Based on our case studies, the variances and covariances of the longitudinal outcomes were not constant over time. When the CS covariance pattern does not closely approximate the true but unknown covariance pattern, incorrect SEs of the regression coefficients will result. As seen with our case studies, to obtain valid SEs, the sandwich variance estimator coupled with a small-sample correction to correct for bias, must be considered with both methods.
With GEE, the sandwich variance estimator – as opposed to model-based SEs – is the default setting in most statistical packages, however, the opposite is true when fitting MMRM. Consider, for example, SAS and R. In SAS, the user needs to specify the EMPIRICAL option in the PROC GLIMMIX (or MIXED) statement to apply the sandwich variance estimator when fitting MMRM, while the sandwich variance estimator is the default when using GENMOD to fit the corresponding GEE. When fitting MMRM in R using the gls function in the nlme package [44], the analyst would inconveniently appeal to the clubSandwich package [45] and the function vcovCR, which returns a sandwich estimate of the variance-covariance matrix of the regression coefficients. The recently developed mmrm package [13] currently does not have an option to apply sandwich variance estimators. To fit the corresponding GEE in R, the user can appeal to the $\mathit{gee}$ [46] or $\mathit{geepack}$ [47] packages, and in both packages the sandwich variance estimator is the default. Defaulting to the sandwich variance estimator with GEE is not surprising. The properties of the sandwich variance estimator has been well studied with GEE [19, 48–51]. If there is misspecification of the covariance, however, not using the sandwich variance estimator can also result in misleading inferences for the regression coefficients in MMRM as seen in our case studies.
Some small-sample corrections to the sandwich variance estimators are available in SAS and R to correct for bias along with some DOF approximations to the t distribution. Pustejovsky and Tipton (2017) [24] and Wang and coauthors (2016) [25] provide a comprehensive review of small-sample corrections available in R and guidance on which approach to use for the methods considered in this article. Gosho et al (2021) [22] provide a practical review of the small-sample corrections for MMRM that can be obtained using the GLIMMIX procedure directly or programmatically using IML.
Longitudinal responses will invariably be incomplete. Unlike MCAR, when the missing response data mechanism is MAR, subjects with missing response data will no longer be a random subset of the sample. It is well known that MMRM provides a valid framework under MAR, a more defensible assumption than MCAR in any given trial, albeit provided that the covariance pattern has been correctly specified [1]. If estimating an UN covariance pattern is prohibitive due to the small sample, resulting in misspecification, inferences about the mean response over time will be invalid when dropout is MAR. This point further underscores the importance of considering the sandwich variance estimator with MMRM (± a small-sample correction). When dropout is MAR as opposed to MCAR, the validity of the analysis with the standard GEE method based on available observations will be compromised [14]. Inverse probability weighted (IPW) [10] GEE should be adopted to adjust the analysis for the propensity to dropout; otherwise, GEE will yield biased estimates of the coefficients in the model part of the mean response. The sandwich variance estimator when using IPW methods can be constructed to adjust for estimation of the weights. Because most statistical procedures allow for the inclusion of sampling weights, it is straightforward to apply IPW-GEE. While unlikely in a well-controlled longitudinal clinical trial [1], if the longitudinal responses are missing not at random (MNAR), none of the methods considered in this article will be valid [11].
In our case studies, the MCAR assumption was clinically defensible. The number of measurement occasions including baseline was ≤7, few patients discontinued prematurely, no missing outcome data pattern occurred in >1 participant, and the degree of missing outcome data was <1.5%. While the statistical test of the MCAR assumption likely had low power [52], its use combined with the empirical evidence provided justification to use GEE. Due to the hierarchy of dropout mechanisms [53], MMRM was also justified.
There were research limitations. First, we only considered GLIMMIX to estimate MMRM and the SAS/IML GEE macro, which reproduces the method-of-moments approach to estimation with GENMOD. SAS is widely used software and the primary analysis in each trial used GENMOD. Second, there are numerous bias-corrections to sandwich variance estimators in the literature but only a few are currently available with GLIMMIX and none with GENMOD. The choice of which bias-correction to use would depend on several factors [19, 24, 25] and should be prespecified. In this article, we only considered the bias-corrected SEs recommended by Mancl and DeRouen (2001) [19]. Third, for t-based inference, we only considered the between-within method DOF approximation for small-sample inference recommended by Schluchter and Elashoff (1990) [36]. When empirical or bias-corrected SEs are used with GLIMMIX, DOF approximation methods are limited to between-within, containment, and residual, where the latter two are not small-sample approximations; furthermore, GENMOD only performs inference assuming asymptotic normality. The determination of which DOF approximation to use is an active area of research, with many methods not accessible in standard SAS and R routines. Lastly, the research was limited to three case studies and, in each, <1.5% of the measurement occasions were missing outcome data. Although these case studies represented typical longitudinal designs in rare diseases, it was not an exhaustive representation and the conclusions may not necessarily extend to other design variations. In particular, dropout other than MCAR was not considered in the simulation study. However, to understand whether the findings held in the presence of moderate missingness defined as 10% of the measurement occasions being MCAR, the simulation study was repeated. The simulation-based conclusions when using a CS covariance pattern remained; however, for the smallest trial (UX023-CL304; N = 14), estimation was unstable or not always possible when assuming an unstructured covariance pattern. In any case, the intent of the current research was not to pressure test these methods; rather to elucidate the issues and the steps to address the issues using real datasets in rare diseases to aid an analyst working in a rare disease in selecting an appropriate modeling strategy.
5 Conclusion
Based on a review of three case studies recently conducted in rare diseases, the MCAR assumption was plausible and the missingness low. When modeling the mean response with time as a categorical variable, MMRM with UN covariance coupled with the Kenward-Roger standard error adjustment should be used if the data allow it. If not, then a parsimonious CS covariance structure can be considered. In the two case studies that exhibited nonconstant variance/covariances over time, the sandwich variance estimator and small-sample correction with t-based inference using the between-within DOF should be considered with both MMRM and GEE. If the CS pattern was a good approximation, as seen in the other case study, then model-based standard errors with t-based inference using the between-within DOF performed well with both methods.
Declaration of Any Potential Conflicts of Interest
At the time the research was conducted, David Zahrieh, Yi Wang, and Tony Koutsoukos were employees at Ultragenyx Pharmaceutical Inc. This research reflects the views of the authors and should not be construed to represent Ultragenyx’s views or policies.
Data Availability Statement
Data sharing is not applicable to this article as no new data were created or analyzed in this research.