Estimate variable importance with multiple seeds

Repeat the VIM estimation procedure multiple times and aggregate the results, mitigating the additional randomness introduced by sample-splitting and cross-fitting.

Usage

multiseed_vim(
  n_seed,
  agg_method = "compound_bg",
  ci_grid,
  type,
  time,
  event,
  X,
  landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)),
  restriction_time = max(time[event == 1]),
  approx_times = NULL,
  large_feature_vector,
  small_feature_vector,
  conditional_surv_generator = NULL,
  conditional_surv_generator_control = NULL,
  large_oracle_generator = NULL,
  large_oracle_generator_control = NULL,
  small_oracle_generator = NULL,
  small_oracle_generator_control = NULL,
  cf_fold_num = 5,
  sample_split = TRUE,
  scale_est = FALSE,
  alpha = 0.05,
  verbose = FALSE
)

Arguments

n_seed: Number of times (seeds) to repeat the VIM estimation procedure. The per-seed results are aggregated into a single result.
agg_method: P-value aggregation method used to combine results from different seeds. Current options are "bonferroni" (Bonferroni's method), "hommel" (Hommel's method), "arithmetic" (arithmetic mean), "geometric" (geometric mean), "harmonic" (harmonic mean), "compound_bg" (compound Bonferroni and geometric mean), and "compound_ba" (compound Bonferroni and arithmetic mean). These approaches are discussed at length in Vovk and Wang (2020). Defaults to "compound_bg", which has been shown to work well in many settings.
ci_grid: Grid of VIM values over which to construct a confidence interval by inverting a hypothesis test. The aggregation works by constructing hypothesis tests (at level alpha) of the null corresponding to each value in ci_grid, and then inverting these tests to yield a 1 - alpha confidence interval. For example, for "AUC" importance, the VIM takes values in (0,1), so a grid of values between 0 and 1 would be a reasonable choice.
type: Type of VIM to compute. Options include "accuracy", "AUC", "Brier", "R-squared", "C-index", and "survival_time_MSE".
time: n x 1 numeric vector of observed follow-up times. If there is censoring, these are the minimum of the event and censoring times.
event: n x 1 numeric vector of status indicators of whether an event was observed.
X: n x p data.frame of observed covariate values.
landmark_times: Numeric vector of length J1 giving landmark times at which to estimate VIM ("accuracy", "AUC", "Brier", "R-squared").
restriction_time: Maximum follow-up time for calculation of "C-index" and "survival_time_MSE". Essentially, this time should be chosen such that the conditional survival function is identified at this time for all covariate values X present in the data. Choosing the restriction time such that roughly 10% of individuals remain at-risk at that time has been shown to work reasonably well in simulations.
approx_times: Numeric vector of length J2 giving times at which to approximate integrals. Defaults to a grid of 100 timepoints, evenly spaced on the quantile scale of the distribution of observed event times.
large_feature_vector: Numeric vector giving indices of features to include in the 'large' prediction model.
small_feature_vector: Numeric vector giving indices of features to include in the 'small' prediction model. Must be a subset of large_feature_vector.
conditional_surv_generator: A function to estimate the conditional survival functions of the event and censoring variables. Must take arguments time, event, and X (used for training) and X_holdout and newtimes (the covariate values and times at which to generate predictions). Defaults to generate_nuisance_predictions_stackG, a pre-built generator function based on the stackG function. Alternatively, the user can provide their own function for this argument, or provide pre-computed estimates to conditional_surv_preds in lieu of this argument.
conditional_surv_generator_control: A list of arguments to pass to conditional_surv_generator.
large_oracle_generator: A function to estimate the oracle prediction function using large_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except for "C-index", defaults to generate_oracle_predictions_DR, a pre-built generator function using doubly-robust pseudo-outcome regression. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to large_oracle_preds in lieu of this argument.
large_oracle_generator_control: A list of arguments to pass to large_oracle_generator.
small_oracle_generator: A function to estimate the oracle prediction function using small_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except for "C-index", defaults to generate_oracle_predictions_SL, a pre-built generator function based on regressing the large oracle predictions on the small feature vector. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to small_oracle_preds in lieu of this argument.
small_oracle_generator_control: A list of arguments to pass to small_oracle_generator.
cf_fold_num: The number of cross-fitting folds, if not providing cf_folds. Note that with sample-splitting, the data will be split into 2 x cf_fold_num folds (i.e., there will be cf_fold_num folds within each half of the data).
sample_split: Logical indicating whether or not to sample split. Sample-splitting is required for valid hypothesis testing of null importance and is generally recommended. Defaults to TRUE.
scale_est: Logical, whether or not to force the VIM estimate to be nonnegative.
alpha: The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05.
verbose: Whether to print progress messages.
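
The time-related arguments (landmark_times, restriction_time, approx_times) can be computed by hand before calling multiseed_vim. The following is a minimal sketch on simulated data. The 90th-percentile rule is only one way to leave roughly 10% of individuals at risk, and the quantile grid illustrates "evenly spaced on the quantile scale"; neither is claimed to match the package's internal code.

```r
# Simulated right-censored data (illustrative values only)
set.seed(1)
n <- 200
event_time <- rexp(n, rate = 0.2)
cens_time <- rexp(n, rate = 0.1)
time <- pmin(event_time, cens_time)
event <- as.numeric(event_time <= cens_time)

# Default landmark times, as written in the Usage section:
# quartiles of the observed event times
landmark_times <- stats::quantile(time[event == 1],
                                  probs = c(0.25, 0.5, 0.75))

# Assumed rule of thumb: the 90th percentile of follow-up times,
# so that roughly 10% of individuals are still at risk
restriction_time <- unname(stats::quantile(time, probs = 0.9))
prop_at_risk <- mean(time >= restriction_time)

# A grid of 100 times, evenly spaced on the quantile scale
# of the observed event times
approx_times <- unname(stats::quantile(time[event == 1],
                                       probs = seq(0, 1, length.out = 100)))
```

These objects can then be passed to the arguments of the same names.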

Value

Named list with the following elements:

agg_result: Data frame giving results aggregated over seeds.
agg_method: P-value aggregation method used.
n_seed: Number of iterations (seeds) used to perform the VIM estimation procedure.
vim_objects: A list of vim return objects, each corresponding to a different seed.

Details

Using a larger value of n_seed will result in more stable results, at a greater computational cost.
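
To give a sense of what aggregating across seeds involves, the sketch below applies several of the merging rules from Vovk and Wang (2020) to a vector of made-up per-seed p-values. The helper merge_pvalues is hypothetical and is not part of the package. The last line shows one simple way to compound two merged p-values (twice their minimum, which is valid by a union bound); whether this matches the package's "compound_bg" option exactly is an assumption.

```r
# Hypothetical helper: merge K p-values without assuming independence
merge_pvalues <- function(p, method = c("bonferroni", "hommel",
                                        "arithmetic", "geometric")) {
  method <- match.arg(method)
  K <- length(p)
  merged <- switch(method,
    # K times the smallest p-value
    bonferroni = K * min(p),
    # Hommel: K * (1 + 1/2 + ... + 1/K) * min_i p_(i) / i
    hommel = K * sum(1 / seq_len(K)) * min(sort(p) / seq_len(K)),
    # twice the arithmetic mean
    arithmetic = 2 * mean(p),
    # e times the geometric mean
    geometric = exp(1) * exp(mean(log(p)))
  )
  min(1, merged)  # a p-value cannot exceed 1
}

# Made-up p-values from four seeds
p_seeds <- c(0.01, 0.04, 0.02, 0.03)

p_bonf <- merge_pvalues(p_seeds, "bonferroni")
p_hommel <- merge_pvalues(p_seeds, "hommel")
p_arith <- merge_pvalues(p_seeds, "arithmetic")
p_geom <- merge_pvalues(p_seeds, "geometric")

# Assumed compound rule: twice the minimum of two merged p-values
p_compound <- min(1, 2 * min(p_bonf, p_geom))
```

Each rule pays a price for allowing arbitrary dependence between seeds, so the merged p-value is typically larger than the smallest per-seed p-value.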

References

Vovk V. and Wang R. (2020). "Combining p-values via averaging."

Wolock C.J., Gilbert P.B., Simon N., and Carone M. (2025). "Assessing variable importance in survival analysis using machine learning."

Examples

# This is a small simulation example
set.seed(123)
n <- 100
X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5))

T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 *  X[,1] * X[,2]))

C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2]))
C[C > 15] <- 15

time <- pmin(T, C)
event <- as.numeric(T <= C)

# landmark times for AUC
landmark_times <- c(3)

output <- multiseed_vim(n_seed = 2,
              agg_method = "compound_bg",
              ci_grid = seq(0, 1, by = 0.01),
              type = "AUC",
              time = time,
              event = event,
              X = X,
              landmark_times = landmark_times,
              large_feature_vector = 1:2,
              small_feature_vector = 2,
              conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
              cf_fold_num = 2,
              sample_split = TRUE,
              scale_est = TRUE)

# The aggregated results are stored in the agg_result element (see Value)
print(output$agg_result)
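
To make the role of ci_grid concrete, the following sketch inverts merged hypothesis tests over a grid, in the spirit of the description of that argument. Everything here is an assumption made for illustration: the per-seed estimates and standard errors are made up, a two-sided Wald test stands in for whatever test the package uses, and the merging rule is e times the geometric mean.

```r
# Made-up per-seed VIM estimates and standard errors
est <- c(0.12, 0.15, 0.10)
se <- c(0.04, 0.05, 0.04)
alpha <- 0.05
ci_grid <- seq(0, 1, by = 0.01)

# Assumed test: two-sided Wald p-value of H0: VIM = v, one per seed
wald_p <- function(v) 2 * stats::pnorm(-abs(est - v) / se)

# Merge per-seed p-values: e times the geometric mean, capped at 1
merge_geometric <- function(p) min(1, exp(1) * exp(mean(log(p))))

# Merged p-value at every candidate VIM value in the grid
merged_p <- vapply(ci_grid,
                   function(v) merge_geometric(wald_p(v)),
                   numeric(1))

# Grid values that are not rejected at level alpha form the interval
accepted <- ci_grid[merged_p > alpha]
ci <- range(accepted)
```

The resolution of the resulting interval is limited by the spacing of ci_grid, so a finer grid gives sharper endpoints at a higher computational cost.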