Repeat the VIM estimation procedure multiple times and aggregate the results, mitigating the additional randomness introduced by sample-splitting and cross-fitting.
Usage
multiseed_vim(
  n_seed,
  agg_method = "compound_bg",
  ci_grid,
  type,
  time,
  event,
  X,
  landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)),
  restriction_time = max(time[event == 1]),
  approx_times = NULL,
  large_feature_vector,
  small_feature_vector,
  conditional_surv_generator = NULL,
  conditional_surv_generator_control = NULL,
  large_oracle_generator = NULL,
  large_oracle_generator_control = NULL,
  small_oracle_generator = NULL,
  small_oracle_generator_control = NULL,
  cf_fold_num = 5,
  sample_split = TRUE,
  scale_est = FALSE,
  alpha = 0.05,
  verbose = FALSE
)
Arguments
- n_seed
Number of iterations (seeds) of the VIM estimation procedure to perform. The results are aggregated into a single result.
- agg_method
P-value aggregation method used to combine results from different seeds. Current options are "bonferroni" (Bonferroni's method), "hommel" (Hommel's method), "arithmetic" (arithmetic mean), "geometric" (geometric mean), "harmonic" (harmonic mean), "compound_bg" (compound Bonferroni and geometric mean), and "compound_ba" (compound Bonferroni and arithmetic mean). These approaches are discussed at length in Vovk and Wang (2020). Defaults to "compound_bg", which has been shown to work well in many settings.
- ci_grid
Grid of VIM values over which to construct a confidence interval by inverting a hypothesis test. The aggregation works by constructing hypothesis tests (at level alpha) of the null corresponding to each value in ci_grid, and then inverting these tests to yield a 1 - alpha confidence interval. For example, for "AUC" importance, the VIM takes values in (0,1), so a grid of values between 0 and 1 would be a reasonable choice.
- type
Type of VIM to compute. Options include "accuracy", "AUC", "Brier", "R-squared", "C-index", and "survival_time_MSE".
- time
n x 1 numeric vector of observed follow-up times. If there is censoring, these are the minimum of the event and censoring times.
- event
n x 1 numeric vector of status indicators of whether an event was observed.
- X
n x p data.frame of observed covariate values.
- landmark_times
Numeric vector of length J1 giving landmark times at which to estimate VIM ("accuracy", "AUC", "Brier", "R-squared").
- restriction_time
Maximum follow-up time for calculation of "C-index" and "survival_time_MSE". Essentially, this time should be chosen such that the conditional survival function is identified at this time for all covariate values X present in the data. Choosing the restriction time such that roughly 10% of individuals remain at-risk at that time has been shown to work reasonably well in simulations.
- approx_times
Numeric vector of length J2 giving times at which to approximate integrals. Defaults to a grid of 100 timepoints, evenly spaced on the quantile scale of the distribution of observed event times.
- large_feature_vector
Numeric vector giving indices of features to include in the 'large' prediction model.
- small_feature_vector
Numeric vector giving indices of features to include in the 'small' prediction model. Must be a subset of large_feature_vector.
- conditional_surv_generator
A function to estimate the conditional survival functions of the event and censoring variables. Must take arguments time, event, and X (used for training) as well as X_holdout and newtimes (the covariate values and times at which to generate predictions). Defaults to generate_nuisance_predictions_stackG, a pre-built generator function based on the stackG function. Alternatively, the user can provide their own function for this argument, or provide pre-computed estimates to conditional_surv_preds in lieu of this argument.
- conditional_surv_generator_control
A list of arguments to pass to conditional_surv_generator.
- large_oracle_generator
A function to estimate the oracle prediction function using large_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except for "C-index", defaults to generate_oracle_predictions_DR, a pre-built generator function using doubly-robust pseudo-outcome regression. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to large_oracle_preds in lieu of this argument.
- large_oracle_generator_control
A list of arguments to pass to large_oracle_generator.
- small_oracle_generator
A function to estimate the oracle prediction function using small_feature_vector. Must take arguments time, event, X, X_holdout, and nuisance_preds. For all VIM types except for "C-index", defaults to generate_oracle_predictions_SL, a pre-built generator function based on regressing the large oracle predictions on the small feature vector. For "C-index", defaults to generate_oracle_predictions_boost, a pre-built generator function using doubly-robust gradient boosting. Alternatively, the user can provide their own function, or provide pre-computed estimates to small_oracle_preds in lieu of this argument.
- small_oracle_generator_control
A list of arguments to pass to small_oracle_generator.
- cf_fold_num
The number of cross-fitting folds, if not providing cf_folds. Note that with sample-splitting, the data will be split into 2 x cf_fold_num folds (i.e., there will be cf_fold_num folds within each half of the data).
- sample_split
Logical indicating whether or not to sample split. Sample-splitting is required for valid hypothesis testing of null importance and is generally recommended. Defaults to TRUE.
- scale_est
Logical, whether or not to force the VIM estimate to be nonnegative.
- alpha
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05.
- verbose
Whether to print progress messages.
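To make the agg_method options more concrete, the following is a minimal base-R sketch of a few of the merging rules discussed in Vovk and Wang (2020). It is an illustration only, not the package's internal code: the helper name aggregate_pvals is hypothetical, the "hommel" and "harmonic" options are omitted, and the compound rules are written as twice the smaller of the two component merged p-values, which is valid by a union bound.

```r
# Hypothetical helper (not part of the package): merge K per-seed p-values
# that all test the same null hypothesis into a single p-value.
aggregate_pvals <- function(p, method = c("compound_bg", "compound_ba",
                                          "bonferroni", "arithmetic",
                                          "geometric")) {
  method <- match.arg(method)
  stopifnot(is.numeric(p), length(p) >= 1, all(p > 0 & p <= 1))
  K <- length(p)

  bonferroni <- K * min(p)                  # K times the smallest p-value
  arithmetic <- 2 * mean(p)                 # twice the arithmetic mean
  geometric <- exp(1) * exp(mean(log(p)))   # e times the geometric mean

  merged <- switch(method,
    bonferroni = bonferroni,
    arithmetic = arithmetic,
    geometric = geometric,
    compound_bg = 2 * min(bonferroni, geometric),
    compound_ba = 2 * min(bonferroni, arithmetic)
  )
  min(1, merged)  # a p-value cannot exceed 1
}

# Made-up p-values from three seeds
p_by_seed <- c(0.01, 0.04, 0.03)
aggregate_pvals(p_by_seed, method = "compound_bg")
```

Each rule pays a price (the multiplicative constant) for being valid under arbitrary dependence between the per-seed p-values, which is the relevant setting here because every seed reuses the same data.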
Value
Named list with the following elements:
- agg_result
Data frame giving results aggregated over seeds.
- agg_method
P-value aggregation method used.
- n_seed
Number of iterations (seeds) used to perform the VIM estimation procedure.
- vim_objects
A list of vim return objects, each corresponding to a different seed.
Details
Using a larger value of n_seed will result in more stable results, at a greater computational cost.
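The test-inversion idea behind ci_grid can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the package's procedure: the per-seed estimates and standard errors are made-up numbers, the per-seed test is assumed to be a two-sided Wald test, and the arithmetic-mean rule (twice the mean p-value) stands in for whichever agg_method is chosen.

```r
# Made-up per-seed VIM estimates and standard errors (three seeds)
est <- c(0.12, 0.15, 0.10)
se <- c(0.04, 0.05, 0.04)

ci_grid <- seq(0, 1, by = 0.01)
alpha <- 0.05

# Assumed test: two-sided Wald p-value for H0: VIM = v, one per seed
pval_at <- function(v) 2 * stats::pnorm(-abs(est - v) / se)

# Merge the per-seed p-values at each grid value (twice the mean, capped at 1)
merged_p <- vapply(ci_grid,
                   function(v) min(1, 2 * mean(pval_at(v))),
                   numeric(1))

# Invert the tests: keep every grid value that is not rejected at level alpha
accepted <- ci_grid[merged_p > alpha]
ci <- range(accepted)
ci
```

The resolution of the resulting interval is limited by the spacing of ci_grid, and the accepted set need not be contiguous in general; range() reports its outer limits.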
References
Vovk V. and Wang R. (2020). "Combining p-values via averaging."
Wolock C.J., Gilbert P.B., Simon N., and Carone M. (2025). "Assessing variable importance in survival analysis using machine learning."
Examples
# This is a small simulation example
set.seed(123)
n <- 100
X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5))
T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2]))
C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2]))
C[C > 15] <- 15
time <- pmin(T, C)
event <- as.numeric(T <= C)
# landmark times for AUC
landmark_times <- c(3)
output <- multiseed_vim(n_seed = 2,
                        agg_method = "compound_bg",
                        ci_grid = seq(0, 1, by = 0.01),
                        type = "AUC",
                        time = time,
                        event = event,
                        X = X,
                        landmark_times = landmark_times,
                        large_feature_vector = 1:2,
                        small_feature_vector = 2,
                        conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
                        large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
                        small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")),
                        cf_fold_num = 2,
                        sample_split = TRUE,
                        scale_est = TRUE)
# the aggregated results are stored in the agg_result element (see Value)
print(output$agg_result)
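As a companion to the landmark_times and restriction_time guidance under Arguments, the base-R sketch below computes the default values shown in Usage and one possible reading of the "roughly 10% at risk" heuristic, namely the 90th percentile of the observed follow-up times. The simulated data and the variable names are illustrative assumptions, not package output.

```r
# Simulated right-censored data (illustrative only)
set.seed(1)
n <- 200
event_time <- rexp(n, rate = 0.2)
cens_time <- pmin(rexp(n, rate = 0.1), 15)
time <- pmin(event_time, cens_time)
event <- as.numeric(event_time <= cens_time)

# Defaults from the Usage section
landmark_default <- stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75))
restriction_default <- max(time[event == 1])

# Heuristic: a time at which about 10% of individuals are still at risk,
# taken here as the 90th percentile of the observed follow-up times
restriction_10pct <- unname(stats::quantile(time, probs = 0.9))
frac_at_risk <- mean(time >= restriction_10pct)

landmark_default
restriction_default
restriction_10pct
frac_at_risk
```

The default restriction time is the largest observed event time, where very few individuals may remain at risk; the percentile-based choice is typically earlier and can be passed explicitly through restriction_time.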