ANOVA for Linear models with {statim}

What inference am I declaring?

When you fit a linear model with statim, the model is a class_lm_object stored inside a cld_exec. Because class_lm_object inherits from anova_able, every fitted model knows how to participate in an ANOVA comparison: it carries the residual degrees of freedom, the residual sum of squares, the model matrix, and the error family. anova() reads those slots directly, so no separate conversion step is needed.

statim supports two distinct ANOVA questions:

Single model (Type I): how much does each predictor reduce residual variance, in the order the terms were added? One row per term, plus a Residuals row.
Multiple models (incremental F or LRT): is the extra complexity of a larger model justified over a simpler nested model? One row per model.

Both questions are answered by the same anova() generic, dispatched on whatever object you hand it.

Single model: Type I ANOVA

The code below follows the standard declarative grammar. Here the cars dataset is used, which ships with base R: speed predicts stopping dist.

Define and fit

cars_lr = cars |>
    define_model(dist ~ speed) |>
    prepare_model(LINEAR_REG) |>
    conclude()

define_model() binds the formula to the data and processes it. prepare_model(LINEAR_REG) attaches the linear regression specification lazily; nothing is computed yet. conclude() resolves that specification and returns a cld_exec holding a class_lm_object.

Run the ANOVA

anova(cars_lr)


== ANOVA · Type I ============================================================== 

-- ANOVA Table -----------------------------------------------------------------

─────────────────────────────────────────────────────────
    term     df     ss         ms      f_value  p_value  
─────────────────────────────────────────────────────────
    speed    1   21185.459  21185.459  89.567   <0.001   
  Residuals  48  11353.521   236.532                     
─────────────────────────────────────────────────────────

The returned cld_anova prints the full Type I table. The final row is always Residuals, with f_value and p_value set to NA. A small p-value for speed says that adding speed to the intercept-only model significantly reduced the residual variance.

Tidy the coefficients

tidy() takes the cld_exec returned by conclude() and extracts a tibble of primary results. For a LINEAR_REG model, that tibble is the coefficient table: one row per term, with columns term, estimate, std_error, statistic, and p_value.

cars_lr |> tidy()

# A tibble: 2 × 5
  term        estimate std_error statistic  p_value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
2 speed           3.93     0.416      9.46 1.49e-12

This is the programmatic form of the Coefficients section that print() displays. Using tidy() is useful when you need the estimates downstream, for example to filter significant terms, pass them to another function, or bind rows across multiple models.

From a `model_lazy` directly

anova() also dispatches on model_lazy, so conclude() is optional:

cars |>
    define_model(dist ~ speed) |>
    prepare_model(LINEAR_REG) |>
    anova()


== ANOVA · Type I ============================================================== 

-- ANOVA Table -----------------------------------------------------------------

─────────────────────────────────────────────────────────
    term     df     ss         ms      f_value  p_value  
─────────────────────────────────────────────────────────
    speed    1   21185.459  21185.459  89.567   <0.001   
  Residuals  48  11353.521   236.532                     
─────────────────────────────────────────────────────────

The out is identical as above.

Multiple models: incremental F-test

Comparing a sequence of nested models tells you whether each additional predictor improves fit beyond what the simpler model already explains. This is the main use case for write_models().

Using `write_models()`

write_models() accepts named model expressions and evaluates them sequentially against the same data frame. Names appear as row labels in the ANOVA output:

LifeCycleSavings |>
    write_models(
        null = sr ~ 1,
        main = sr ~ pop15,
        extend = sr ~ pop15 + pop75,
        full = sr ~ pop15 + pop75 + dpi + ddpi
    ) |>
    prepare_model(LINEAR_REG) |>
    anova()


== ANOVA · F =================================================================== 

-- ANOVA Table -----------------------------------------------------------------

────────────────────────────────────────────────────────────
  model   res_df  deviance  df  dev_diff  f_value  p_value  
────────────────────────────────────────────────────────────
   null     49    983.628                                   
   main     48    779.511   1   204.118   14.116   <0.001   
  extend    47    726.168   1    53.343    3.689    0.061   
   full     45    650.713   2    75.455    2.609    0.085   
────────────────────────────────────────────────────────────

The first row always has NA test statistics because there is no simpler model to compare it against. Each subsequent row answers: “given the model above, does adding these extra predictors significantly help?”

The test argument accepts "F" (default for Gaussian models), "LRT", or "Chisq". For non-Gaussian families such as binomial GLMs, "F" is automatically switched to "LRT" with an informational message.

Passing `model_lazy` objects directly

When the models come from separate pipeline branches, pass them directly to anova():

mod1 = LifeCycleSavings |> define_model(sr ~ 1) |> prepare_model(LINEAR_REG)
mod2 = LifeCycleSavings |> define_model(sr ~ pop15) |> prepare_model(LINEAR_REG)
mod3 = LifeCycleSavings |> define_model(sr ~ pop15 + pop75) |> prepare_model(LINEAR_REG)

anova(mod1, mod2, mod3)


== ANOVA · F =================================================================== 

-- ANOVA Table -----------------------------------------------------------------

───────────────────────────────────────────────────────────
  model  res_df  deviance  df  dev_diff  f_value  p_value  
───────────────────────────────────────────────────────────
    1      49    983.628                                   
    2      48    779.511   1   204.118   13.211   <0.001   
    3      47    726.168   1    53.343    3.453    0.069   
───────────────────────────────────────────────────────────

Passing `cld_exec` objects

Or, after calling conclude() on each model:

exec1 = LifeCycleSavings |> define_model(sr ~ 1) |> prepare_model(LINEAR_REG) |> conclude()
exec2 = LifeCycleSavings |> define_model(sr ~ pop15) |> prepare_model(LINEAR_REG) |> conclude()

anova(exec1, exec2, test = "LRT")


== ANOVA · LRT ================================================================= 

-- ANOVA Table -----------------------------------------------------------------

───────────────────────────────────────────────────────────────
  model  res_df  deviance  df  dev_diff  chisq_value  p_value  
───────────────────────────────────────────────────────────────
    1      49    983.628                                       
    2      48    779.511   1   204.118     12.569     <0.001   
───────────────────────────────────────────────────────────────

Multiple models: `conclude()` and `tidy()`

When the goal is to fit all models and inspect each one, conclude() on a multi_lazy returns a multi_exec collecting all results:

all_fits = LifeCycleSavings |>
    write_models(
        null = sr ~ 1,
        main = sr ~ pop15 + pop75
    ) |>
    prepare_model(LINEAR_REG) |>
    conclude()

all_fits


── 2 models · Linear Regression ──────────────────────────────────────────────── 

null : <cld_exec>
main : <cld_exec>

Use display() to inspect individual results.

tidy() on a multi_exec returns a nested tibble with one row per model:

tidy(all_fits)

# A tibble: 2 × 2
  model outs            
  <chr> <named list>    
1 null  <tibble [1 × 5]>
2 main  <tibble [3 × 5]>

Use display() to inspect individual model outputs when there are many:

LifeCycleSavings |>
    write_models(
        f1 = sr ~ 1,
        f2 = sr ~ pop15,
        f3 = sr ~ pop15 + pop75
    ) |>
    prepare_model(LINEAR_REG) |>
    conclude() |>
    display(2)


1. f1

== Model ======================================================================= 

Variable Mapper : formula 
Args : sr ~ 1 
    left_var : 1 
    right_var : 0 

== Linear Regression =========================================================== 

-- Coefficients ----------------------------------------------------------------

──────────────┬───────────────────────────────────────────
  term        │  estimate  std_error  statistic  p_value  
──────────────┼───────────────────────────────────────────
  (Intercept) │   9.671      0.634     15.263    <0.001   
──────────────┴───────────────────────────────────────────


-- Model Fit -------------------------------------------------------------------

Warning in system("tput cols", intern = TRUE): running command 'tput cols' had
status 2

-----------------------------------------------------
  R Squared      :    0.00    F-statistic :     NaN
  Adj. R Squared :    0.00    df1         :       0
  Sigma          :    4.48    df2         :      49
  n              :      50    p-value     :     NaN
  df (residual)  :      49                :        
-----------------------------------------------------



2. f2

== Model ======================================================================= 

Variable Mapper : formula 
Args : sr ~ pop15 
    left_var : 1 
    right_var : 1 

== Linear Regression =========================================================== 

-- Coefficients ----------------------------------------------------------------

──────────────┬───────────────────────────────────────────
  term        │  estimate  std_error  statistic  p_value  
──────────────┼───────────────────────────────────────────
  (Intercept) │   17.497     2.280      7.675    <0.001   
  pop15       │   -0.223     0.063     -3.545    <0.001   
──────────────┴───────────────────────────────────────────


-- Model Fit -------------------------------------------------------------------

Warning in system("tput cols", intern = TRUE): running command 'tput cols' had
status 2

------------------------------------------------------
  R Squared      :    0.21    F-statistic :    12.57
  Adj. R Squared :    0.19    df1         :        1
  Sigma          :    4.03    df2         :       48
  n              :      50    p-value     :   <0.001
  df (residual)  :      48                :         
------------------------------------------------------

GLMs: likelihood ratio test

For non-Gaussian families, anova() performs a likelihood ratio test (LRT) and reports a chi-squared statistic instead of an F-statistic. Pass test = "LRT" explicitly; "F" is not meaningful for GLMs.

Supply the family directly to prepare_model(). It is forwarded to every model in the pipeline:

mtcars |>
    write_models(
        null = am ~ 1,
        main = am ~ wt,
        full = am ~ wt + hp
    ) |>
    prepare_model(GLM, family = binomial()) |>
    anova(test = "LRT")


== ANOVA · LRT ================================================================= 

-- ANOVA Table -----------------------------------------------------------------

───────────────────────────────────────────────────────────────
  model  res_df  deviance  df  dev_diff  chisq_value  p_value  
───────────────────────────────────────────────────────────────
  null     31     7.719                                        
  main     30     4.017    1    3.702      31.584     <0.001   
  full     29     3.399    1    0.619       5.278      0.022   
───────────────────────────────────────────────────────────────

The output table has a chisq_value column in place of f_value. The first row has NA test statistics because there is no baseline to compare it against.

Definitive guide