Compute estimates of, and confidence intervals for, nonparametric variable importance based on the difference in predictiveness obtained with and without the feature of interest. Designed for use with time-to-event outcomes subject to right censoring that may be informed by measured covariates.
Usage
vim(
type,
time,
event,
X,
landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)),
restriction_time = max(time[event == 1]),
approx_times = NULL,
large_feature_vector,
small_feature_vector,
conditional_surv_generator = NULL,
conditional_surv_generator_control = NULL,
large_oracle_generator = NULL,
large_oracle_generator_control = NULL,
small_oracle_generator = NULL,
small_oracle_generator_control = NULL,
conditional_surv_preds = NULL,
large_oracle_preds = NULL,
small_oracle_preds = NULL,
cf_folds = NULL,
cf_fold_num = 5,
sample_split = TRUE,
ss_folds = NULL,
robust = TRUE,
scale_est = FALSE,
alpha = 0.05,
verbose = FALSE
)
Arguments
- type
Type of VIM to compute. Options include "accuracy", "AUC", "Brier", "R-squared", "C-index", and "survival_time_MSE".
- time
n x 1 numeric vector of observed follow-up times. If there is censoring, these are the minimum of the event and censoring times.
- event
n x 1 numeric vector of status indicators of whether an event was observed.
- X
n x p data.frame of observed covariate values.
- landmark_times
Numeric vector of length J1 giving landmark times at which to estimate VIM ("accuracy", "AUC", "Brier", "R-squared").
- restriction_time
Maximum follow-up time for calculation of "C-index" and "survival_time_MSE". Essentially, this time should be chosen such that the conditional survival function is identified at this time for all covariate values X present in the data. Choosing the restriction time such that roughly 10% of individuals remain at risk at that time has been shown to work reasonably well in simulations.
- approx_times
Numeric vector of length J2 giving times at which to approximate integrals. Defaults to a grid of 100 timepoints, evenly spaced on the quantile scale of the distribution of observed event times.
- large_feature_vector
Numeric vector giving indices of features to include in the 'large' prediction model.
- small_feature_vector
Numeric vector giving indices of features to include in the 'small' prediction model. Must be a subset of large_feature_vector.
- conditional_surv_generator
A function to estimate the conditional survival functions of the event and censoring variables. Must take arguments time, event, and X (for training purposes) and X_holdout and newtimes (covariate values and times at which to generate predictions). Defaults to generate_nuisance_predictions_stackG, a pre-built generator function based on the stackG function. Alternatively, the user can provide their own function for this argument, or provide pre-computed estimates to conditional_surv_preds in lieu of this argument.
- conditional_surv_generator_control
A list of arguments to pass to conditional_surv_generator.
- large_oracle_generator
A function to estimate the oracle prediction function using large_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except "C-index", defaults to generate_oracle_predictions_DR, a pre-built generator function using doubly-robust pseudo-outcome regression. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to large_oracle_preds in lieu of this argument.
- large_oracle_generator_control
A list of arguments to pass to large_oracle_generator.
- small_oracle_generator
A function to estimate the oracle prediction function using small_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except "C-index", defaults to generate_oracle_predictions_SL, a pre-built generator function based on regressing the large oracle predictions on the small feature vector. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to small_oracle_preds in lieu of this argument.
- small_oracle_generator_control
A list of arguments to pass to small_oracle_generator.
- conditional_surv_preds
User-provided estimates of the conditional survival functions of the event and censoring variables given the full covariate vector (if not using the conditional_surv_generator functionality to compute these nuisance estimates). Must be a named list of lists with elements S_hat, S_hat_train, G_hat, and G_hat_train. If using sample splitting, each of these is itself a list of length 2K, where K is the number of cross-fitting folds (if not using sample splitting, each is a list of length K). Each element of these lists is a matrix with J2 columns and number of rows equal to either the number of samples in the kth fold (for S_hat and G_hat) or the number of samples used to compute the nuisance estimates for the kth fold (for S_hat_train and G_hat_train).
- large_oracle_preds
User-provided estimates of the oracle prediction function using large_feature_vector (if not using the large_oracle_generator functionality to compute these nuisance estimates). Must be a named list of lists with elements f0_hat and f0_hat_train. If using sample splitting, each of these is itself a list of length 2K (if not using sample splitting, each is a list of length K). Each element of these lists is a matrix with J1 columns (for landmark time VIMs) or 1 column (for "C-index" and "survival_time_MSE") and number of rows equal to either the number of samples in the kth fold (for f0_hat) or the number of samples used to compute the nuisance estimates for the kth fold (for f0_hat_train).
- small_oracle_preds
User-provided estimates of the oracle prediction function using small_feature_vector (if not using the small_oracle_generator functionality to compute these nuisance estimates). Must be a named list of lists with elements f0_hat and f0_hat_train. If using sample splitting, each of these is itself a list of length 2K (if not using sample splitting, each is a list of length K). Each element of these lists is a matrix with J1 columns (for landmark time VIMs) or 1 column (for "C-index" and "survival_time_MSE") and number of rows equal to either the number of samples in the kth fold (for f0_hat) or the number of samples used to compute the nuisance estimates for the kth fold (for f0_hat_train).
- cf_folds
Numeric vector of length n giving cross-fitting folds, if specifying the folds explicitly. This is required if you are providing pre-computed nuisance estimates; if providing a nuisance generator function, vim() will assign folds.
- cf_fold_num
The number of cross-fitting folds, if not providing cf_folds. Note that with sample splitting, the data will be split into 2 x cf_fold_num folds (i.e., there will be cf_fold_num folds within each half of the data).
- sample_split
Logical indicating whether or not to sample split. Sample splitting is required for valid hypothesis testing of null importance and is generally recommended. Defaults to TRUE.
- ss_folds
Numeric vector of length n giving sample-splitting folds, if specifying the folds explicitly. This is required if you are providing pre-computed nuisance estimates; if providing a nuisance generator function, vim() will assign folds.
- robust
Logical, whether or not to use the doubly-robust debiasing approach. This option is meant for illustration purposes only; it should be left as TRUE.
- scale_est
Logical, whether or not to force the VIM estimate to be nonnegative.
- alpha
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05.
- verbose
Whether to print progress messages.
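The guidance for restriction_time and the description of the approx_times default can be sketched in base R. This is a minimal illustration on simulated data, not the package's internal code: it takes "at risk" to mean an observed follow-up time of at least t, and the exact quantile probabilities used for the grid are an assumption. The names event_time, cens_time, and prop_at_risk are hypothetical.

```r
# Simulated right-censored data (illustrative only)
set.seed(1)
n <- 200
event_time <- rexp(n, rate = 0.2)
cens_time <- pmin(rexp(n, rate = 0.1), 15)
time <- pmin(event_time, cens_time)
event <- as.numeric(event_time <= cens_time)

# Restriction time leaving roughly 10% of individuals at risk:
# the 90th percentile of the observed follow-up times.
restriction_time <- unname(stats::quantile(time, probs = 0.9))
prop_at_risk <- mean(time >= restriction_time)

# Grid of (at most) 100 times, evenly spaced on the quantile scale
# of the observed event times. The probs sequence is an assumption.
probs <- seq(0.01, 1, by = 0.01)
approx_times <- sort(unique(unname(
  stats::quantile(time[event == 1], probs = probs)
)))

print(c(restriction_time = restriction_time, prop_at_risk = prop_at_risk))
print(length(approx_times))
```

Both objects can then be passed to vim() through the restriction_time and approx_times arguments.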
Value
Named list with the following elements:
- result
Data frame giving results. See the documentation of the individual vim_* functions for details.
- folds
A named list giving the cross-fitting fold IDs (cf_folds) and sample-splitting fold IDs (ss_folds).
- approx_times
A vector of times used to approximate integrals appearing in the form of the VIM estimator.
- conditional_surv_preds
A named list containing the estimated conditional event and censoring survival functions.
- large_oracle_preds
A named list containing the estimated large oracle prediction function.
- small_oracle_preds
A named list containing the estimated small oracle prediction function.
Details
For nuisance estimation, it is generally advisable to use the pre-built nuisance generator functions provided by survML. See the "Variable importance in survival analysis" vignette or the package website for an illustration.
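Because vim() returns the fold IDs and nuisance estimates it used, and also accepts them as arguments, one plausible workflow is to estimate the nuisances once and reuse them when assessing a different feature. The sketch below assumes the returned conditional_surv_preds, large_oracle_preds, and folds elements can be passed back unchanged, which this page implies but does not state outright. The names event_time, cens_time, sl_control, common_args, first, and second are hypothetical, and the survML calls only run if the package is installed.

```r
# Simulated data, similar in spirit to the Examples section
set.seed(123)
n <- 100
X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5))
event_time <- rexp(n, rate = exp(-2 + X[, 1] - X[, 2] + 0.5 * X[, 1] * X[, 2]))
cens_time <- pmin(rexp(n, rate = exp(-2 - 0.5 * X[, 1] - 0.25 * X[, 2])), 15)
time <- pmin(event_time, cens_time)
event <- as.numeric(event_time <= cens_time)

sl_control <- list(SL.library = c("SL.mean", "SL.glm"))

# Arguments shared by both calls
common_args <- list(type = "AUC", time = time, event = event, X = X,
                    landmark_times = 3, large_feature_vector = 1:2,
                    small_oracle_generator_control = sl_control,
                    cf_fold_num = 2, sample_split = FALSE, scale_est = TRUE)

if (requireNamespace("survML", quietly = TRUE)) {
  library(survML)

  # First call: importance of X1 (small model uses X2 only);
  # all nuisance functions are estimated here.
  first <- do.call(vim, c(common_args, list(
    small_feature_vector = 2,
    conditional_surv_generator_control = sl_control,
    large_oracle_generator_control = sl_control
  )))

  # Second call: importance of X2, reusing the survival and large-oracle
  # estimates together with the folds they were computed on (assumed valid).
  second <- do.call(vim, c(common_args, list(
    small_feature_vector = 1,
    conditional_surv_preds = first$conditional_surv_preds,
    large_oracle_preds = first$large_oracle_preds,
    cf_folds = first$folds$cf_folds,
    ss_folds = first$folds$ss_folds
  )))
  print(second$result)
}
```

Only the small oracle prediction function is re-estimated in the second call, which is where most of the savings would come from when many features are assessed against the same large model.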
References
Wolock C.J., Gilbert P.B., Simon N., and Carone, M. (2025). "Assessing variable importance in survival analysis using machine learning."
Examples
# This is a small simulation example
set.seed(123)
n <- 100
X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5))
T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2]))
C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2]))
C[C > 15] <- 15
time <- pmin(T, C)
event <- as.numeric(T <= C)
# landmark times for AUC
landmark_times <- c(3)
output <- vim(type = "AUC",
              time = time,
              event = event,
              X = X,
              landmark_times = landmark_times,
              large_feature_vector = 1:2,
              small_feature_vector = 2,
              conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              cf_fold_num = 2,
              sample_split = FALSE,
              scale_est = TRUE)
print(output$result)
#> landmark_time est var_est cil ciu cil_1sided p
#> 1 3 0.2823303 1.407984 0.04976388 0.5148967 0.08715441 NA
#> large_predictiveness small_predictiveness vim large_feature_vector
#> 1 0.8209323 0.538602 AUC 1,2
#> small_feature_vector
#> 1 2
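The printed values are numerically consistent with est being the difference between large_predictiveness and small_predictiveness, and with Wald-type intervals of the form est ± z × sqrt(var_est / n). The sketch below checks that arithmetic using the numbers from the output above. The formula is inferred from those numbers, not taken from the package source, and the variable names are hypothetical.

```r
# Values copied from the printed result above
est <- 0.2823303
var_est <- 1.407984
large_predictiveness <- 0.8209323
small_predictiveness <- 0.538602
n <- 100       # sample size used in the example
alpha <- 0.05  # default level

# Point estimate as a difference in predictiveness
diff_est <- large_predictiveness - small_predictiveness

# Standard error implied by the variance estimate (assumed scaling by n)
se <- sqrt(var_est / n)

# Two-sided and one-sided Wald-type limits
ci_two_sided <- est + c(-1, 1) * stats::qnorm(1 - alpha / 2) * se
cil_one_sided <- est - stats::qnorm(1 - alpha) * se

print(c(diff_est = diff_est,
        cil = ci_two_sided[1],
        ciu = ci_two_sided[2],
        cil_1sided = cil_one_sided))
```

The p column is NA in this run, which used sample_split = FALSE; the sample_split argument description notes that sample splitting is required for valid hypothesis testing.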