
Compute estimates of and confidence intervals for nonparametric variable importance based on the difference in predictiveness obtained with and without the feature of interest. Designed for use with time-to-event outcomes subject to right censoring that may be informed by measured covariates.

Usage

vim(
  type,
  time,
  event,
  X,
  landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)),
  restriction_time = max(time[event == 1]),
  approx_times = NULL,
  large_feature_vector,
  small_feature_vector,
  conditional_surv_generator = NULL,
  conditional_surv_generator_control = NULL,
  large_oracle_generator = NULL,
  large_oracle_generator_control = NULL,
  small_oracle_generator = NULL,
  small_oracle_generator_control = NULL,
  conditional_surv_preds = NULL,
  large_oracle_preds = NULL,
  small_oracle_preds = NULL,
  cf_folds = NULL,
  cf_fold_num = 5,
  sample_split = TRUE,
  ss_folds = NULL,
  robust = TRUE,
  scale_est = FALSE,
  alpha = 0.05,
  verbose = FALSE
)

Arguments

type

Type of VIM to compute. Options include "accuracy", "AUC", "Brier", "R-squared", "C-index", and "survival_time_MSE".

time

n x 1 numeric vector of observed follow-up times. If there is censoring, these are the minimum of the event and censoring times.

event

n x 1 numeric vector of status indicators of whether an event was observed (1 if the event was observed, 0 if the observation was censored).

X

n x p data.frame of observed covariate values.

landmark_times

Numeric vector of length J1 giving landmark times at which to estimate VIM ("accuracy", "AUC", "Brier", "R-squared").

restriction_time

Maximum follow-up time for calculation of "C-index" and "survival_time_MSE". Essentially, this time should be chosen such that the conditional survival function is identified at this time for all covariate values X present in the data. Choosing the restriction time such that roughly 10% of individuals remain at-risk at that time has been shown to work reasonably well in simulations.
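As a rough illustration of the two choices described above, the sketch below computes the default restriction time (the largest observed event time) and a heuristic alternative at which about 10% of individuals remain at risk. The simulated data and the variable names default_restriction and heuristic_restriction are illustrative assumptions, and using the 90th percentile of follow-up time is one simple reading of the heuristic, not necessarily how the cited simulations chose it.

```r
# Simulated right-censored data (illustrative only)
set.seed(1)
n <- 500
event_time <- rexp(n, rate = 0.2)
cens_time <- rexp(n, rate = 0.1)
time <- pmin(event_time, cens_time)
event <- as.numeric(event_time <= cens_time)

# Default in the usage signature: the largest observed event time
default_restriction <- max(time[event == 1])

# Heuristic: the time at which roughly 10% of individuals are still at risk,
# taken here as the 90th percentile of observed follow-up times (assumption)
heuristic_restriction <- unname(stats::quantile(time, probs = 0.9))

# Proportion of individuals still at risk at the heuristic choice
prop_at_risk <- mean(time >= heuristic_restriction)
prop_at_risk
```

The heuristic value could then be passed as restriction_time when type is "C-index" or "survival_time_MSE".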

approx_times

Numeric vector of length J2 giving times at which to approximate integrals. Defaults to a grid of 100 timepoints, evenly spaced on the quantile scale of the distribution of observed event times.
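A grid of the kind described above can be sketched with stats::quantile. This is only an approximation of the documented default: details such as whether the endpoints or time zero are included are assumptions here, and the package's internal construction may differ.

```r
# Simulated right-censored data (illustrative only)
set.seed(1)
n <- 500
event_time <- rexp(n, rate = 0.2)
cens_time <- rexp(n, rate = 0.1)
time <- pmin(event_time, cens_time)
event <- as.numeric(event_time <= cens_time)

# 100 timepoints, evenly spaced on the quantile scale of observed event times
# (assumption: probabilities run from 0 to 1 inclusive)
approx_times <- unname(
  stats::quantile(time[event == 1], probs = seq(0, 1, length.out = 100))
)

length(approx_times)
range(approx_times)
```

Spacing the grid on the quantile scale places more points where events are dense, which is where the integrands change most quickly.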

large_feature_vector

Numeric vector giving indices of features to include in the 'large' prediction model.

small_feature_vector

Numeric vector giving indices of features to include in the 'small' prediction model. Must be a subset of large_feature_vector.
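The short sketch below shows how the two index vectors relate: the features whose importance is assessed are those in the large vector but not in the small one. The three-column data frame and the name features_of_interest are hypothetical and used only for illustration.

```r
# Hypothetical covariate data with three features
set.seed(1)
X <- data.frame(X1 = rnorm(5), X2 = rbinom(5, size = 1, prob = 0.5), X3 = runif(5))

large_feature_vector <- 1:3      # all features
small_feature_vector <- c(2, 3)  # everything except X1

# The small vector must be a subset of the large vector
stopifnot(all(small_feature_vector %in% large_feature_vector))

# Features whose importance is being assessed
features_of_interest <- names(X)[setdiff(large_feature_vector, small_feature_vector)]
features_of_interest
```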

conditional_surv_generator

A function to estimate the conditional survival functions of the event and censoring variables. Must take arguments time, event, and X (for training purposes) and X_holdout and newtimes (covariate values and times at which to generate predictions). Defaults to generate_nuisance_predictions_stackG, a pre-built generator function based on the stackG function. Alternatively, the user can provide their own function for this argument, or provide pre-computed estimates to conditional_surv_preds in lieu of this argument.

conditional_surv_generator_control

A list of arguments to pass to conditional_surv_generator.

large_oracle_generator

A function to estimate the oracle prediction function using large_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except for "C-index", defaults to generate_oracle_predictions_DR, a pre-built generator function using doubly-robust pseudo-outcome regression. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to large_oracle_preds in lieu of this argument.

large_oracle_generator_control

A list of arguments to pass to large_oracle_generator.

small_oracle_generator

A function to estimate the oracle prediction function using small_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except for "C-index", defaults to generate_oracle_predictions_SL, a pre-built generator function based on regressing the large oracle predictions on the small feature vector. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to small_oracle_preds in lieu of this argument.

small_oracle_generator_control

A list of arguments to pass to small_oracle_generator.

conditional_surv_preds

User-provided estimates of the conditional survival functions of the event and censoring variables given the full covariate vector (if not using the conditional_surv_generator functionality to compute these nuisance estimates). Must be a named list of lists with elements S_hat, S_hat_train, G_hat, and G_hat_train. If using sample splitting, each of these is itself a list of length 2K, where K is the number of cross-fitting folds (if not using sample splitting, each is a list of length K). Each element of these lists is a matrix with J2 columns and number of rows equal to either the number of samples in the kth fold (for S_hat and G_hat) or the number of samples used to compute the nuisance estimates for the kth fold (for S_hat_train and G_hat_train).
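The nesting described above can be easier to see as a skeleton. The sketch below builds a correctly shaped object for the case without sample splitting, filled with NA placeholders rather than real survival estimates. It assumes that the training set for fold k is every observation outside fold k; the helper make_fold_list is hypothetical.

```r
set.seed(1)
n <- 60    # sample size
K <- 3     # number of cross-fitting folds
J2 <- 10   # number of approximation times

# Balanced random fold assignment
cf_folds <- sample(rep(seq_len(K), length.out = n))

# Hypothetical helper: one placeholder matrix per fold, with J2 columns
make_fold_list <- function(n_rows_per_fold, n_cols) {
  lapply(n_rows_per_fold, function(m) matrix(NA_real_, nrow = m, ncol = n_cols))
}

n_holdout <- as.vector(table(cf_folds))  # rows for S_hat and G_hat
n_train <- n - n_holdout                 # rows for *_train (assumption)

conditional_surv_preds <- list(
  S_hat = make_fold_list(n_holdout, J2),
  S_hat_train = make_fold_list(n_train, J2),
  G_hat = make_fold_list(n_holdout, J2),
  G_hat_train = make_fold_list(n_train, J2)
)

str(conditional_surv_preds, max.level = 1)
```

In real use, each placeholder matrix would hold estimated survival probabilities, with one row per individual and one column per element of approx_times.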

large_oracle_preds

User-provided estimates of the oracle prediction function using large_feature_vector (if not using the large_oracle_generator functionality to compute these nuisance estimates). Must be a named list of lists with elements f0_hat and f0_hat_train. If using sample splitting, each of these is itself a list of length 2K (if not using sample splitting, each is a list of length K). Each element of these lists is a matrix with J1 columns (for landmark time VIMs) or 1 column (for "C-index" and "survival_time_MSE") and number of rows equal to either the number of samples in the kth fold (for f0_hat) or the number of samples used to compute the nuisance estimates for the kth fold (for f0_hat_train).

small_oracle_preds

User-provided estimates of the oracle prediction function using small_feature_vector (if not using the small_oracle_generator functionality to compute these nuisance estimates). Must be a named list of lists with elements f0_hat and f0_hat_train. If using sample splitting, each of these is itself a list of length 2K (if not using sample splitting, each is a list of length K). Each element of these lists is a matrix with J1 columns (for landmark time VIMs) or 1 column (for "C-index" and "survival_time_MSE") and number of rows equal to either the number of samples in the kth fold (for f0_hat) or the number of samples used to compute the nuisance estimates for the kth fold (for f0_hat_train).

cf_folds

Numeric vector of length n giving cross-fitting folds, if specifying the folds explicitly. This is required if you are providing pre-computed nuisance estimates; if providing a nuisance generator function, vim() will assign folds.
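One simple way to build an explicit fold vector, for the case without sample splitting, is a balanced random assignment. Labelling the folds 1 through K is an assumption of this sketch rather than a documented requirement.

```r
set.seed(1)
n <- 100
K <- 5  # number of cross-fitting folds

# Balanced random assignment of each observation to one of K folds
cf_folds <- sample(rep(seq_len(K), length.out = n))

table(cf_folds)
```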

cf_fold_num

The number of cross-fitting folds, if not providing cf_folds. Note that with sample splitting, the data will be split into 2 x cf_fold_num folds (i.e., there will be cf_fold_num folds within each half of the data).

sample_split

Logical indicating whether or not to sample split. Sample-splitting is required for valid hypothesis testing of null importance and is generally recommended. Defaults to TRUE.

ss_folds

Numeric vector of length n giving sample-splitting folds, if specifying the folds explicitly. This is required if you are providing pre-computed nuisance estimates; if providing a nuisance generator function, vim() will assign folds.
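With sample splitting, the cross-fitting folds are nested within the two halves of the data. The sketch below shows one possible encoding, in which cross-fitting folds are labelled 1 through 2 x cf_fold_num and the two halves are labelled 1 and 2. Both labelling conventions are assumptions made for illustration; check the fold IDs returned in the folds element of the vim() output for the encoding the package actually uses.

```r
set.seed(1)
n <- 100
cf_fold_num <- 2  # cross-fitting folds within each half

# 2 x cf_fold_num cross-fitting folds in total
cf_folds <- sample(rep(seq_len(2 * cf_fold_num), length.out = n))

# ASSUMPTION: folds 1..cf_fold_num form half 1, the remaining folds form half 2
ss_folds <- ifelse(cf_folds <= cf_fold_num, 1, 2)

# Each cross-fitting fold should fall entirely within one half
table(ss_folds, cf_folds)
```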

robust

Logical, whether or not to use the doubly-robust debiasing approach. This option is meant for illustration purposes only — it should be left as TRUE.

scale_est

Logical, whether or not to force the VIM estimate to be nonnegative.

alpha

The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05.

verbose

Whether to print progress messages.

Value

Named list with the following elements:

result

Data frame giving results. See the documentation of the individual vim_* functions for details.

folds

A named list giving the cross-fitting fold IDs (cf_folds) and sample-splitting fold IDs (ss_folds).

approx_times

A vector of times used to approximate integrals appearing in the form of the VIM estimator.

conditional_surv_preds

A named list containing the estimated conditional event and censoring survival functions.

large_oracle_preds

A named list containing the estimated large oracle prediction function.

small_oracle_preds

A named list containing the estimated small oracle prediction function.

Details

For nuisance estimation, it is generally advisable to use the pre-built nuisance generator functions provided by survML. See the "Variable importance in survival analysis" vignette or the package website for an illustration.

References

Wolock C.J., Gilbert P.B., Simon N., and Carone M. (2025). "Assessing variable importance in survival analysis using machine learning."

Examples

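# This is a small simulation example
set.seed(123)
n <- 100
X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5))

# Event times: exponential with a covariate-dependent rate
T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 *  X[,1] * X[,2]))

# Censoring times: also covariate-dependent, capped at 15
C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2]))
C[C > 15] <- 15

# Observed follow-up time and event indicator (1 = event, 0 = censored)
time <- pmin(T, C)
event <- as.numeric(T <= C)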

# landmark times for AUC
landmark_times <- c(3)

output <- vim(type = "AUC",
              time = time,
              event = event,
              X = X,
              landmark_times = landmark_times,
              large_feature_vector = 1:2,
              small_feature_vector = 2,
              conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              cf_fold_num = 2,
              sample_split = FALSE,
              scale_est = TRUE)

print(output$result)
#>   landmark_time       est  var_est        cil       ciu cil_1sided  p
#> 1             3 0.2823303 1.407984 0.04976388 0.5148967 0.08715441 NA
#>   large_predictiveness small_predictiveness vim large_feature_vector
#> 1            0.8209323             0.538602 AUC                  1,2
#>   small_feature_vector
#> 1                    2
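The printed columns can be cross-checked by hand. The sketch below copies the values from the output above and recomputes the estimate as the difference in predictiveness, along with Wald-type intervals using sqrt(var_est / n) as the standard error. That form of the standard error is inferred from the printed numbers, not taken from the package source, so treat it as an assumption.

```r
# Values copied from the printed output above
n <- 100
alpha <- 0.05
est <- 0.2823303
var_est <- 1.407984
large_predictiveness <- 0.8209323
small_predictiveness <- 0.538602

# The VIM estimate is the difference in predictiveness
diff_predictiveness <- large_predictiveness - small_predictiveness

# ASSUMPTION: var_est is on the scale of a single observation
se <- sqrt(var_est / n)

# Two-sided interval (compare with cil and ciu)
ci_two_sided <- est + c(-1, 1) * stats::qnorm(1 - alpha / 2) * se

# One-sided lower bound (compare with cil_1sided)
ci_lower_one_sided <- est - stats::qnorm(1 - alpha) * se

diff_predictiveness
ci_two_sided
ci_lower_one_sided
```

The p column is NA in this run, which is presumably because sample_split = FALSE; as noted in the argument description, sample splitting is required for valid hypothesis testing of null importance.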