Data Generation: Mechanisms and Examples
DataGen.Rmd
Purpose
This vignette describes how the simulator builds datasets from a recipe: covariates → treatment → event time → censoring. It gives exact parameterizations and runnable examples for each mechanism. You can pass a list recipe directly, or serialize to YAML for reuse.
Output: a data.frame
with columns
time
, status
, arm
(if treatment
is present), and your covariates. Attributes include
attr(dat, "tau")
(restriction time) and
attr(dat, "achieved_censoring")
(realized censoring).
Recipe structure
list(
n = 300,
covariates = list(defs = list(/* see Covariates */)),
treatment = list(/* see Treatment */),
event_time = list(/* see Event-time engines */),
censoring = list(/* see Censoring */),
seed = 42
)
You may write/read this recipe as YAML:
# write
tmp <- tempfile(fileext = ".yml")
write_recipe_yaml(rec, tmp)
# read via path
dat <- simulate_from_recipe(tmp, seed = 123)
Validate and fill defaults:
rec <- validate_recipe(rec)
Covariates
Each definition has name
, type
("continuous"
or "categorical"
), a
dist
with params
, and optional
transform
steps applied after generation
("center(a)"
, "scale(b)"
).
Continuous: normal(mean, sd)
,
lognormal(meanlog, sdlog)
,
gamma(shape, scale)
, weibull(shape, scale)
,
uniform(min, max)
, beta(shape1, shape2)
,
t(df)
.
Categorical: bernoulli(p)
(0/1
numeric), categorical(prob, labels)
(factor),
ordinal(prob, labels)
(ordered factor).
Example:
covs <- list(
list(name="age", type="continuous", dist="normal", params=list(mean=62, sd=10),
transform=c("center(60)","scale(10)")),
list(name="sex", type="categorical", dist="bernoulli", params=list(p=0.45)),
list(name="stage", type="categorical", dist="ordinal",
params=list(prob=c(0.3,0.5,0.2), labels=c("I","II","III")))
)
# gen_covariates() is internal; for the vignette we simulate a tiny dataset
# with a supported model and just show the covariate columns.
tmp <- list(
n = 5,
covariates = list(defs = covs),
treatment = NULL,
event_time = list(
model = "aft_lognormal",
baseline = list(mu = 3, sigma = 0.4),
effects = list(intercept = 0),
tau = 1
),
censoring = list(mode = "explicit", administrative = list(time = 1e6))
)
d <- simulate_from_recipe(validate_recipe(tmp), seed = 101)
cov_cols <- setdiff(names(d), c("time", "status", "arm"))
if (length(cov_cols)) {
d[, cov_cols, drop = FALSE]
} else {
head(d)
}
age sex stage
1 -0.1260365 1 I
2 0.7524619 1 III
3 -0.4749438 1 II
4 0.4143595 1 II
5 0.5107692 0 II
Treatment
Choose treatment$assignment
:
-
"randomization"
withallocation = "a:b"
;p1 = a/(a+b)
. -
"stratified"
: same allocation within strata instratify_by
(categorical covariates). -
"logistic_ps"
: treatment probability islogit^{-1}(η)
from a user formula.
Randomization:
tr <- list(assignment="randomization", allocation="1:1")
Stratified:
Logistic propensity (use explicit beta to avoid parsing issues in formulas):
Event-time engines
Let Z
be treatment (0/1), X
covariates, and
η
the linear predictor (below). Supported engines:
AFT Lognormal (
model="aft_lognormal"
)
Baselinebaseline=list(mu, sigma)
; simulateT = exp(μ + η + σ ε)
.AFT Weibull (
model="aft_weibull"
)
Baseline survival . AFT shift:
, .
Baselinebaseline=list(shape=k, scale=lambda)
.Cox Exponential (
model="cox_exp"
)
. Baselinebaseline=list(rate = λ0)
; sample .Cox Piecewise Exponential (
model="cox_pwexp"
)
Providebaseline=list(rates=c(r1,r2,...), cuts=c(c1,c2,...))
; in segment , wait time is exponential with rate . Carry over elapsed time to the next segment at each cut.
Effects and linear predictor
Specify effects on the appropriate scale in
event_time$effects
:
event_time = list(
model = "aft_weibull",
baseline = list(shape = 1.3, scale = 12),
effects = list(
intercept = 0, # defaults to 0
treatment = -0.25, # AFT: log-time; Cox: log-hazard
covariates = list(age = 0.01, sex = -0.2) # NOTE: named LIST (not a vector)
# or: formula="~ age + sex", beta=c(0.01, -0.2)
),
tau = 24
)
Important:
effects$covariates
must be a named list of numerics in this package, e.g.,list(age=0.01, sex=-0.2)
. Using a numeric vector likec(age=0.01, sex=-0.2)
will triggervalidate_recipe()
to error.
Censoring
Two modes:
-
mode="target_overall"
: choosetarget
andadmin_time
. The solver finds an exponential random-censoring rate to hit the target, subject to feasibility. -
mode="explicit"
: provide any ofadministrative
,random
(e.g., exponential), and an optionaldependent
component driven by covariates.
Administrative floor: with only admin at time
,
censoring fraction is
.
If target
is below this floor, random
censoring is unnecessary; the simulator returns the floor.
Worked examples (tailored defaults)
Defaults used below unless noted: tau = 24
,
admin_time = 36
, allocation "1:1"
,
seed
set per example.
Example 1: AFT Lognormal
covs1 <- list(
list(name="age", type="continuous", dist="normal", params=list(mean=62, sd=10),
transform=c("center(60)","scale(10)")),
list(name="sex", type="categorical", dist="bernoulli", params=list(p=0.45))
)
rec1 <- list(
n = 300,
covariates = list(defs = covs1),
treatment = list(assignment="randomization", allocation="1:1"),
event_time = list(model="aft_lognormal",
baseline=list(mu=3.0, sigma=0.6),
effects=list(intercept=0, treatment=-0.25,
covariates=list(age=0.01, sex=-0.2)),
tau=24),
censoring = list(mode="target_overall", target=0.25, admin_time=36),
seed = 11
)
rec1 <- validate_recipe(rec1)
dat1 <- simulate_from_recipe(rec1)
c(tau = attr(dat1, "tau"), achieved_censoring = attr(dat1, "achieved_censoring"))
tau achieved_censoring
24.0000000 0.2633333
head(dat1)
time status arm age sex
1 24.393468 1 0 -0.3910311 0
2 10.090388 1 0 0.2265944 1
3 19.042554 1 1 -1.3165531 0
4 11.484246 1 0 -1.1626533 0
5 5.334633 1 0 1.3784892 0
6 22.840156 1 0 -0.7341513 0
Example 2: AFT Weibull
rec2 <- rec1
rec2$event_time <- list(model="aft_weibull",
baseline=list(shape=1.3, scale=12),
effects=list(intercept=0, treatment=-0.20,
covariates=list(age=0.008)),
tau=24)
dat2 <- simulate_from_recipe(validate_recipe(rec2), seed=12)
summary(dat2$time)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1522 2.7773 6.4097 8.1138 11.4299 36.0000
Example 3: Cox Exponential (PH)
rec3 <- list(
n = 400,
covariates = list(defs = covs1),
treatment = list(assignment="randomization", allocation="1:1"),
event_time = list(model="cox_pwexp",
baseline=list(rates=c(0.05), cuts=numeric(0)), # single segment = exponential
effects=list(intercept=0, treatment=-0.3,
covariates=list(age=0.01)),
tau=24),
censoring = list(mode="target_overall", target=0.20, admin_time=30),
seed = 13
)
dat3 <- simulate_from_recipe(validate_recipe(rec3))
mean(dat3$status==0)
[1] NA
Example 4: Cox Piecewise Exponential
rec4 <- list(
n = 500,
covariates = list(defs = list(list(name="age", type="continuous", dist="normal", params=list(mean=60, sd=8)))),
treatment = list(assignment="randomization", allocation="1:1"),
event_time = list(model="cox_pwexp",
baseline=list(rates=c(0.10, 0.06, 0.03), cuts=c(6, 18)),
effects=list(intercept=0, treatment=-0.4,
covariates=list(age=0.01)),
tau=24),
censoring = list(mode="target_overall", target=0.25, admin_time=30)
)
dat4 <- simulate_from_recipe(validate_recipe(rec4), seed=123)
c(n=nrow(dat4), events=sum(dat4$status==1), cens_rate=mean(dat4$status==0))
n events cens_rate
500 NA NA
Example 5: Stratified randomization
rec5 <- rec4
rec5$covariates$defs <- c(rec5$covariates$defs,
list(list(name="stage", type="categorical", dist="categorical",
params=list(prob=c(0.3,0.5,0.2), labels=c("I","II","III")))))
rec5$treatment <- list(assignment="stratified", allocation="1:1", stratify_by=c("stage"))
d5 <- simulate_from_recipe(validate_recipe(rec5), seed=9)
prop.table(table(d5$stage, d5$arm), 1)
0 1
I 0.4915254 0.5084746
II 0.5774059 0.4225941
III 0.4642857 0.5357143
Example 6: Logistic propensity (explicit beta)
rec6 <- list(
n = 800,
covariates = list(defs = list(
list(name="x", type="continuous", dist="normal", params=list(mean=0, sd=1)),
list(name="sex", type="categorical", dist="bernoulli", params=list(p=0.5))
)),
treatment = list(assignment="logistic_ps",
ps_model=list(formula="~ 1 + x + sex",
beta=c(-0.3, 1.2, -0.6))),
event_time = list(model="cox_pwexp", baseline=list(rates=c(0.05), cuts=numeric(0)),
effects=list(intercept=0, treatment=0.0,
covariates=list(x=0.0, sex=0.0)), tau=12),
censoring = list(mode="target_overall", target=0.15, admin_time=50)
)
d6 <- simulate_from_recipe(validate_recipe(rec6), seed=7)
cor(d6$x, d6$arm)
[1] 0.4718621
Diagnostics
-
Positivity:
all(time > 0)
should hold. If not, lighten hazard rates or check parameters. -
Censoring target: compare
attr(dat, "achieved_censoring")
vstarget
. If far off, estimate the administrative floor (admin-only censoring) and adjustadmin_time
/target. - Balance: randomization ≈ allocation; stratified balance within each stratum.
-
Factors: supply
labels
to stabilize factor levels across replicates.
Admin-only floor demo:
Reproducibility
Set seeds in the recipe (seed
) or per call
(seed=
). For grids, use deterministic schemes like
seed_base + scenario_id*1000 + rep
.
Quick reference (schema)
list(
n = <int>,
covariates = list(defs = list( list(name, type, dist, params = list(...), transform = c(...)), ... )),
treatment = list(
assignment = "randomization" | "stratified" | "logistic_ps",
allocation = "a:b", stratify_by = c("..."),
ps_model = list(formula = "~ ...", beta = c(...)) # optional
),
event_time = list(
model = "aft_lognormal" | "aft_weibull" | "cox_exp" | "cox_pwexp",
baseline = list(...),
effects = list(intercept = 0, treatment = 0,
covariates = list(name = coef, ...)
# or: formula="~ ...", beta=c(...)
),
tau = 24
),
censoring = list(
mode = "target_overall" | "explicit",
target = 0.25, admin_time = 36,
administrative = list(time = ...),
random = list(dist="exponential", params=list(rate=...)),
dependent = list(formula="~ ...", base=..., beta=c(...))
),
seed = 42
)
Saving & reusing datasets (compatible with RMSTSS)
You’ll typically want one of three patterns when you want to reuse or ship simulated data.
1) Single dataset you’ll reuse later (RDS preserves attributes)
simulate_from_recipe()
returns a data frame with
attributes: - attr(dat, "tau")
-
attr(dat, "achieved_censoring")
When you save as RDS, those attributes are preserved and restored on load.
# Generate once
rec <- validate_recipe(list(
n = 300,
covariates = list(defs = list(
list(name="age", type="continuous", dist="normal", params=list(mean=62, sd=10)),
list(name="sex", type="categorical", dist="bernoulli", params=list(p=0.45))
)),
treatment = list(assignment="randomization", allocation="1:1"),
event_time = list(model="aft_weibull",
baseline=list(shape=1.3, scale=12),
effects=list(intercept=0, treatment=-0.25,
covariates=list(age=0.01, sex=-0.2)),
tau=24),
censoring = list(mode="target_overall", target=0.25, admin_time=36),
seed = 2025
))
dat <- simulate_from_recipe(rec)
# Save and reload with attributes intact
saveRDS(dat, "mydata.rds") # eval=FALSE during vignette build
dat2 <- readRDS("mydata.rds") # eval=FALSE during vignette build
attr(dat2, "tau")
attr(dat2, "achieved_censoring")
If you must save as CSV/TXT, attributes are lost; either set them back manually (utility below) or use the manifest workflow.
2) Many scenarios/replicates + cross-format exports (manifest helpers)
For simulation studies, let the package write a manifest so attributes and seeds are tracked automatically. This also reattaches attributes when you reload CSV/TXT/RData.
base <- rec # from above (or any validated recipe)
# Sweep scenarios and write multiple formats + manifest.rds
man <- generate_recipe_sets(
base_recipe = base,
vary = list(n = c(200, 400),
"event_time.effects.treatment" = c(-0.15, -0.25)),
out_dir = "sim-sets",
formats = c("rds","csv","txt","rdata"),
n_reps = 2,
seed_base = 2025
) # eval=FALSE
# Load back as a named list
sets <- load_recipe_sets("sim-sets/manifest.rds") # eval=FALSE
# Access a dataset:
k <- names(sets)[1]
res <- sets[[k]] # res is a list with $data and $meta
d <- res$data # data.frame with attributes restored
str(res$meta) # tau, achieved_censoring, n, file, scenario params, etc.
attr(d, "tau") # available for downstream functions
Why this is useful: - Attributes (tau
,
achieved_censoring
) are restored for all
formats. - The manifest records scenario parameters and seeds for exact
reproducibility. - Factor levels for labeled categoricals are stabilized
to your declared labels.
3) Shipping datasets inside your package (CRAN-style)
Put a script in data-raw/
that builds objects, then save
to data/*.rda
using usethis::use_data()
. This
is the canonical, CRAN-friendly way; attributes are preserved inside
.rda
.
# data-raw/make_datasets.R
if (requireNamespace("devtools", quietly = TRUE)) devtools::load_all(".")
rec_small <- rec
rec_small$n <- 100
dat_small <- simulate_from_recipe(rec_small, seed = 7)
# Save to package data/ and document with Rd (see dataset roxygen)
usethis::use_data(dat_small, overwrite = TRUE, compress = "xz")
# Optional CSV/TXT mirrors for non-R users
utils::write.csv(dat_small, "inst/extdata/dat_small.csv", row.names = FALSE)
utils::write.table(dat_small, "inst/extdata/dat_small.txt", sep = "\t", quote = FALSE, row.names = FALSE)
Minimal dataset documentation (place in
R/data-docs.R
):
#' Simulated trial: small example (n = 100)
#'
#' @format A data frame with 100 rows and columns:
#' \describe{
#' \item{time}{numeric, observed time}
#' \item{status}{integer, 1 = event, 0 = censored}
#' \item{arm}{integer, 0/1 treatment (if present)}
#' \item{age}{numeric}
#' \item{sex}{integer, 0/1}
#' }
#' @usage data(dat_small)
#' @keywords datasets
#' @name dat_small
#' @docType data
NULL
Then:
Utility: re-attaching attributes when you only have CSV/TXT
If you saved to plain text without a manifest, restore key attributes yourself:
set_dataset_attrs <- function(df, tau, achieved_censoring = NULL) {
if (is.null(achieved_censoring)) {
stopifnot(all(c("time","status") %in% names(df)))
achieved_censoring <- mean(df$status == 0, na.rm = TRUE)
}
attr(df, "tau") <- tau
attr(df, "achieved_censoring") <- achieved_censoring
df
}
dat_csv <- utils::read.csv("path/to/data.csv") # eval=FALSE
dat_csv <- set_dataset_attrs(dat_csv, tau = 24) # eval=FALSE
Reproducing the exact same dataset later
Two options:
-
Store the recipe + seed:
Save the list recipe (or usewrite_recipe_yaml(rec, "recipe.yml")
) and calldat_again <- simulate_from_recipe("recipe.yml", seed = 2025)
Using the same seed gives the same dataset.
Use the manifest:
generate_recipe_sets()
records scenario IDs and seeds. Recreate any dataset by matchingscenario_id
andrep
(the loader already does this).
Compatibility checklist
- Columns:
time
(numeric > 0),status
(integer in {0,1}), and, if present,arm
(integer in {0,1}); plus any covariates. - Attributes:
attr(dat, "tau")
used by RMST workflows;attr(dat, "achieved_censoring")
is informational. - Factors: supply
labels
in the recipe so factor levels stay consistent across replicates. - Units: keep event-time units consistent with
tau
andcensoring$admin_time
.