Rationale
Why write new <var_id> objects instead of relying on the
existing <formula> (~) interface? Because not all types of
inference in R can be fully described by <formula>
objects. ggplot2, for instance, does not rely on
<formula> objects alone to describe the shape of the model
you want to visualize within its pipelines. The
<var_id> objects from statim play a similar role: they
shape the model to be analyzed in statistical inference pipelines,
acting as mappers, just like ggplot2::aes().
<formula> objects in R are often used
to describe the relationship between variables. Depending on the
implementation, y ~ x usually means
"describe the relationship of x to y". Note that
~ is also a function: it captures its operands as unevaluated
code (an abstract syntax tree, AST) instead of evaluating them, and
leaves the interpretation to whatever consumes the result. As a
consequence, the same <formula> can be
interpreted differently by different functions.
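The capture behaviour described above can be checked with base R alone. In this minimal sketch, extra and group are deliberately not defined anywhere, and building the formula still succeeds because ~ stores the code rather than evaluating it.

``` r
# Neither `extra` nor `group` exists here, yet this does not error:
# `~` stores its operands as unevaluated code.
f = extra ~ group

class(f)        # the object is a formula
all.vars(f)     # names of the variables mentioned in the captured code
f[[2]]          # left-hand side, still an unevaluated symbol
f[[3]]          # right-hand side, likewise
environment(f)  # environment recorded at creation, used later for lookup
```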
<var_id> objects share this same lazy-capture
nature — bare variable names like extra or
group aren’t evaluated at the moment
x_by(extra, group) is called, just as ~
doesn’t evaluate extra or group in
extra ~ group. The difference is in how that captured
expression is resolved: a <formula> is
resolved on demand by whatever consumes it (stats::terms(),
model.frame(), and so on), while a
<var_id> is always resolved the same way, through
model_processor() — a single generic dispatched by
define_model() that turns the captured expressions into a
list of data structures (usually data frames), ready for the rest of the
statim pipeline.
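The "resolved on demand" side of this comparison can be seen with base R and the built-in sleep dataset. The sketch below only illustrates how two different consumers treat the same formula; it does not show statim internals.

``` r
f = extra ~ group

# stats::terms() only inspects the captured code; no data is needed.
tt = stats::terms(f)
attr(tt, "term.labels")
#> [1] "group"

# stats::model.frame() evaluates the same captured code against a data frame.
mf = stats::model.frame(f, data = sleep)
dim(mf)
#> [1] 20  2
```

A <var_id>, by contrast, has a single resolution path through model_processor(), so every consumer in the pipeline sees the same resolved result.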
What are Variable Mappers
<var_id> objects, or the variable mappers, are
built on top of S7, and serve as “mappers”, similar to how
ggplot2::aes() (aesthetic mappings) works. Where
aes() maps variables to plot aesthetics (x,
y, colour, …), a <var_id>
maps variables to the roles a statistical model needs (e.g. a response
and a grouping variable for x_by(), or a predictor and a
response for rel()). define_model() then takes
that mapping, resolves it via model_processor(), and
bundles the result into a def_model object that the rest of
the inference pipeline operates on.
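For readers less familiar with the aes() side of the analogy, this small sketch (assuming ggplot2 and rlang are installed) shows that an aesthetic mapping is just a named set of captured expressions, one per role. Nothing here is specific to statim.

``` r
# ggplot2::aes() records which variable plays which aesthetic role,
# without evaluating `extra` or `group`.
m = ggplot2::aes(x = extra, colour = group)

names(m)              # the roles that were mapped
rlang::as_label(m$x)  # the variable mapped to the x role, as text
```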
Existing objects
statim has built-in <var_id>
objects you can use to describe the shape of the model you want to
analyze during statistical inference.
- x_by(x, group): compare x by group, e.g. x_by(extra, group).
- rel(x, resp): describe the relationship of x to resp, e.g. rel(speed, dist).
- pairwise(...): generate all unique pairwise combinations of a set of variables.
- prop(x, n): describe a proportion test from a count x out of n trials.
- on(...): independently test the selected variables.
Each of these is a var_id subclass, and each is paired
with a model_processor() method that resolves the variables
into a list of data structures (usually data frames), ready to be picked
up by define_model().
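As a rough sketch of how these mappers are used, the calls below pass two of the built-in mappers to define_model() with the sleep and cars datasets that ship with R. This assumes statim is installed and that define_model() takes the mapper first and the data second, as in the other examples in this article. The exact contents of processed depend on each mapper's model_processor() method, so no output is shown.

``` r
library(statim)

# Compare `extra` by `group` in the built-in `sleep` data.
by_group = define_model(x_by(extra, group), sleep)

# Describe the relationship of `speed` to `dist` in the built-in `cars` data.
speed_dist = define_model(rel(speed, dist), cars)

# The resolved data structures are stored in the `processed` property.
str(by_group@processed)
str(speed_dist@processed)
```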
Writing new Variable Mappers
Writing a new <var_id> object is easy but strict: it
always requires a model_processor() method,
because you need to retrieve the data and store it in a data structure
(by default, a list). There is also a var_id_info() generic,
but a method for it is optional. var_id is an abstract S7
class used as the parent class for <var_id> objects.
The following functions and objects are required: S7::new_class(),
S7::method(), statim::var_id, and
statim::model_processor(). For example, suppose you want to
describe another model that involves 3 variables at once during the
analysis, and you name this <var_id> object
xyz.
Two simple steps are required:
- To write a new <var_id> mapper, use the S7::new_class() constructor and pass statim::var_id as the parent argument. Supplying the constructor argument is optional; use it only when you need custom construction logic.
``` r
xyz = S7::new_class(
  "xyz",
  parent = statim::var_id,
  properties = list(
    x = S7::class_numeric,
    y = S7::class_numeric,
    z = S7::class_numeric
  )
)
```
- Extract the information you need by registering a model_processor() method for the newly created xyz class. The method takes x, an optional data argument, and the unused ellipsis (...).
``` r
S7::method(model_processor, xyz) = function(x, ...) {
  list(
    x = x@x,
    y = x@y,
    z = x@z
  )
}
#> Warning: model_processor(<xyz>) doesn't have argument `data`
```
At this point, xyz is a fully functional
<var_id>. It can be passed straight into
define_model():
``` r
m = xyz(x = 1, y = 2, z = 3)
def = define_model(m)
def@processed
#> $x
#> [1] 1
#>
#> $y
#> [1] 2
#>
#> $z
#> [1] 3
```
A bit more complicated: capturing unevaluated expressions
The example above is simple — x, y, and
z are plain numeric properties, evaluated eagerly at
construction, so xyz(1, 2, 3) works but
xyz(extra, group, ID) would fail (extra isn’t
a value in globalenv()).
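The eager evaluation can be reproduced without statim. The sketch below uses a stand-in S7 class (eager_demo is a hypothetical name, and it does not inherit from statim::var_id) with one numeric property and the default constructor.

``` r
# A stand-in class (hypothetical name) with one eagerly evaluated property.
eager_demo = S7::new_class(
  "eager_demo",
  properties = list(x = S7::class_numeric)
)

# A literal value is fine.
ok = eager_demo(x = 1)
ok@x

# A bare name is looked up immediately, so an undefined one is an error.
res = tryCatch(
  eager_demo(x = some_undefined_column),
  error = function(e) conditionMessage(e)
)
res
```

Here res holds the error message instead of an object, which is the same failure xyz(extra, group, ID) runs into.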
The built-in <var_id> objects
(x_by(), rel(), pairwise())
instead capture unevaluated expressions with
rlang::enquo(), so that bare variable names can be resolved
later against a data frame supplied to
define_model(), or against the calling environment if
data is omitted.
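The two resolution targets mentioned above (a data frame, or the calling environment) can be sketched with rlang alone. The capture() helper is hypothetical and only stands in for a constructor that calls rlang::enquo(); it is not how statim resolves its mappers internally.

``` r
# Hypothetical capture helper: returns its argument as a quosure.
capture = function(x) rlang::enquo(x)

q = capture(extra)   # `extra` is not evaluated here
rlang::as_label(q)   # the captured expression, as text

# Resolve against a data frame: `extra` is found as a column of `sleep`.
from_data = rlang::eval_tidy(q, data = sleep)

# Resolve without data: `extra` is looked up in the environment
# where the quosure was created.
extra = c(10, 20, 30)
from_env = rlang::eval_tidy(q)
```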
If your <var_id> needs this “lazy” behavior —
e.g. accepting bare column names like x_by(extra, group)
does — capture each argument with rlang::enquo() in a
custom constructor, and store the resulting quosure as the
property value:
``` r
xyz = S7::new_class(
  "xyz",
  parent = statim::var_id,
  properties = list(
    x = S7::class_any,
    y = S7::class_any,
    z = S7::class_any
  ),
  constructor = function(x, y, z) {
    S7::new_object(
      S7::S7_object(),
      x = rlang::enquo(x),
      y = rlang::enquo(y),
      z = rlang::enquo(z)
    )
  }
)
```
Then, in model_processor(), resolve each quosure against
data (or the calling environment when data is
NULL). The internal helper resolve_quo() does
exactly this for the built-in <var_id> objects,
handling bare names, c() selections, I()
inline expressions, and inlines(). But for a first custom
<var_id>, a simpler rlang::eval_tidy()
is enough:
``` r
S7::method(model_processor, xyz) = function(x, data = NULL, ...) {
  resolve = function(quo) {
    if (is.null(data)) {
      # No data supplied: evaluate in the environment the quosure captured.
      rlang::eval_tidy(quo)
    } else {
      # Data supplied: treat the captured bare name as a column of `data`.
      data[[rlang::as_label(rlang::quo_get_expr(quo))]]
    }
  }
  list(
    x = resolve(x@x),
    y = resolve(x@y),
    z = resolve(x@z)
  )
}
#> Overwriting method model_processor(<xyz>)
```
To register a friendlier summary, with custom args,
extra metadata in other_info, and variable previews in
vars, write a var_id_info() method for your class. Return a
<class_var_inform> object via
class_var_inform(), and set
registered = TRUE:
``` r
S7::method(var_id_info, xyz) = function(.var_id, processed = NULL, ...) {
  other_info = list()
  vars = list()
  if (!is.null(processed) && length(processed)) {
    other_info = list(n_vars = length(processed))
    vars = lapply(names(processed), function(nm) {
      val = processed[[nm]]
      list(
        name = nm,
        preview = paste0("<", pillar::type_sum(val), " [", length(val), "]>")
      )
    })
  }
  class_var_inform(
    var_id = .var_id,
    args = paste0(
      "x = ", rlang::as_label(.var_id@x),
      ", y = ", rlang::as_label(.var_id@y),
      ", z = ", rlang::as_label(.var_id@z)
    ),
    other_info = other_info,
    vars = vars,
    registered = TRUE
  )
}
```
Now define_model() prints a fuller summary:
``` r
define_model(xyz(extra, group, ID), sleep)
#>
#> -- Model Definition ------------------------------------------------------------
#>
#> Variable Mapper : xyz
#> Args : x = extra, y = group, z = ID
#> Other info:
#> n_vars : 3
#> Variables :
#> x : <dbl [20]>
#> y : <fct [20]>
#> z : <fct [20]>
```
Summary
To write a new <var_id> object:
- Define an S7 class with S7::new_class(), with parent = statim::var_id.
- Register a model_processor() method for your class. This is the only method you are required to write, and it is what define_model() dispatches on to populate processed.
- If the <var_id> should capture bare variable names lazily (recommended for anything that will be paired with a data frame), use a custom constructor that wraps each argument with rlang::enquo(), and resolve the quosures inside model_processor().
- Optionally, register a var_id_info() method for a friendlier print() summary, returning a class_var_inform object with registered = TRUE.
With these in place, your new <var_id> works
seamlessly with define_model() and the rest of the
statim pipeline — exactly like x_by(),
rel(), pairwise(), and
prop().