Data Generation: Mechanisms and Examples • RMSTSS

What this covers

We show how to construct a recipe (covariates → treatment → event time → censoring) and simulate datasets from it with simulate_from_recipe(). We also cover batch generation with generate_recipe_sets() and reading the compact metadata back with load_recipe_sets(). Everything here is plain R and needs no platform-specific tricks.

The simulator takes a plain list as the recipe. No YAML is required.

Recipe skeleton

list(
  n = 300,
  covariates = list(defs = list()),  # see Covariates
  treatment  = list(),               # see Treatment
  event_time = list(),               # see Event-time engines
  censoring  = list(),               # see Censoring
  seed = 42
)

Covariates

Each covariate has a name, type ("continuous" or "categorical"), a dist with params, and optional transform steps applied after generation ("center(a)", "scale(b)").

Available distributions

Type	`dist` (string)	Parameters
continuous	`normal`	`mean`, `sd`
continuous	`lognormal`	`meanlog`, `sdlog`
continuous	`gamma`	`shape`, `scale`
continuous	`weibull`	`shape`, `scale`
continuous	`uniform`	`min`, `max`
continuous	`beta`	`shape1`, `shape2`
continuous	`t`	`df`
categorical	`bernoulli`	`p` (probability of 1)
categorical	`categorical`	`prob = c(...)`, `labels = c(...)` (optional)
categorical	`ordinal`	`prob = c(...)`, `labels = c(...)` (optional, ordered)

Example: define covariates

covs <- list(
  list(name="age",   type="continuous",  dist="normal",     params=list(mean=62, sd=10),
       transform=c("center(60)","scale(10)")),
  list(name="sex",   type="categorical", dist="bernoulli",  params=list(p=0.45)),
  list(name="stage", type="categorical", dist="ordinal",
       params=list(prob=c(0.3,0.5,0.2), labels=c("I","II","III"))),
  list(name="x",     type="continuous",  dist="lognormal",  params=list(meanlog=0, sdlog=0.6))
)
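
To make the transform steps concrete, here is a base-R sketch of what "center(60)" followed by "scale(10)" would do to a simulated age. It assumes center(a) subtracts a and scale(b) divides by b, applied in the order listed. That reading is an assumption, and apply_transform() is a hypothetical helper, not a function from the package.

```r
# Hypothetical helper: parse "center(a)" / "scale(b)" and apply it to x.
# Assumption: center(a) means x - a and scale(b) means x / b.
apply_transform <- function(x, step) {
  m <- regmatches(step, regexec("^(center|scale)\\(([-0-9.eE]+)\\)$", step))[[1]]
  if (length(m) != 3) stop("unrecognized transform: ", step)
  value <- as.numeric(m[3])
  if (m[2] == "center") x - value else x / value
}

set.seed(1)
age_raw <- rnorm(1000, mean = 62, sd = 10)

# Apply the steps in the order they are listed in the covariate definition.
age <- age_raw
for (step in c("center(60)", "scale(10)")) age <- apply_transform(age, step)

# Under the stated assumption, the result equals (age_raw - 60) / 10.
all.equal(age, (age_raw - 60) / 10)
```

Under this reading, the transformed age is measured in decades relative to age 60, which keeps covariate effects on a convenient scale.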

Treatment

Choose one assignment:

Assignment	Key fields	Meaning
`"randomization"`	`allocation = "a:b"`	Bernoulli with probability $p_1 = a/(a+b)$.
`"stratified"`	`allocation`, `stratify_by = c("...")`	Same allocation within each stratum defined by listed categorical covariates.
`"logistic_ps"`	`ps_model = list(formula = "~ ...", beta = c(...))`	Treatment probability is $\mathrm{logit}^{-1}(\eta)$ from user model. Provide explicit `beta` to avoid parsing edge-cases.

Examples

Randomization:

tr_rand <- list(assignment="randomization", allocation="1:1")
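
As a rough sketch of what an allocation string implies, the code below converts "a:b" into the treatment probability a/(a+b) and draws independent Bernoulli assignments. The helper allocation_prob() is hypothetical and only illustrates the arithmetic; the package's own parsing may differ.

```r
# Hypothetical helper: turn an allocation string "a:b" into p1 = a / (a + b).
allocation_prob <- function(allocation) {
  parts <- as.numeric(strsplit(allocation, ":", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 2, all(parts > 0))
  parts[1] / sum(parts)
}

p_equal   <- allocation_prob("1:1")  # 1 / (1 + 1)
p_two_one <- allocation_prob("2:1")  # 2 / (2 + 1)

set.seed(42)
arm <- rbinom(300, size = 1, prob = p_equal)  # independent Bernoulli draws
table(arm)
```

Because the draws are independent, the realized arm sizes vary around the nominal ratio rather than matching it exactly.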

Stratified by "stage":

tr_strat <- list(assignment="stratified", allocation="2:1", stratify_by=c("stage"))

Logistic propensity:

tr_ps <- list(
  assignment = "logistic_ps",
  ps_model  = list(
    formula = "~ 1 + x + sex",
    beta    = c(-0.3, 1.2, -0.6)  # (Intercept), x, sex
  )
)
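
The logistic propensity mechanism can be sketched in base R as follows. The covariates are simulated here only for illustration, and sex is treated as a numeric 0/1 column, which is an assumption about how the design matrix is built. The point is that the order of beta must match the columns of the design matrix.

```r
set.seed(7)
n <- 500
covars <- data.frame(
  x   = rlnorm(n, meanlog = 0, sdlog = 0.6),
  sex = rbinom(n, size = 1, prob = 0.45)   # assumption: numeric 0/1 coding
)

# Design matrix columns: (Intercept), x, sex, in the same order as beta.
X    <- model.matrix(~ 1 + x + sex, data = covars)
beta <- c(-0.3, 1.2, -0.6)

eta <- drop(X %*% beta)                 # linear predictor
ps  <- plogis(eta)                      # inverse logit gives P(treatment = 1)
arm <- rbinom(n, size = 1, prob = ps)   # one Bernoulli draw per subject

colnames(X)
mean(arm)
```

Checking colnames(X) against the intended order of beta is a cheap guard against silently misaligned coefficients.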

Event-time engines

Let $Z$ be treatment (0/1), $X$ be covariates, and $\eta$ be the linear predictor (defined in effects, below). Supported engines and baseline parameterizations:

Model (user-facing)	`model` value	Baseline parameters	Notes
AFT Lognormal	`"aft_lognormal"`	`mu`, `sigma`	$\log T = \mu + \eta + \sigma \varepsilon$, $\varepsilon \sim \mathcal{N}(0,1)$.
AFT Weibull	`"aft_weibull"`	`shape`, `scale`	$S_0(t) = \exp(-(t/\lambda)^k)$; AFT shift via $\eta$.
AFT Log-Logistic	`"aft_loglogistic"`	`shape`, `scale`	$T = \lambda \exp(\eta) (U/(1-U))^{1/k}$.
PH Exponential	`"ph_pwexp"`	`rates = c(λ)`, `cuts = numeric(0)`	Piecewise-Exp with a single segment is exponential.
PH Weibull	`"ph_weibull"`	`shape`, `scale`	Proportional hazards with Weibull baseline.
PH Gompertz	`"ph_gompertz"`	`rate`, `gamma`	Hazard $h(t) = a \exp(bt)$.
PH Piecewise Exponential	`"ph_pwexp"`	`rates = c(r1,r2,...)`, `cuts = c(c1,c2,...)`	Rate in segment $s$ is $r_s \exp(\eta)$.
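
Two of the formulas in the table translate directly into base R. The sketch below draws AFT lognormal and AFT log-logistic times for a linear predictor that contains only a treatment effect. We read $k$ as `shape` and $\lambda$ as `scale`, which is an assumption; the log-logistic parameter values are illustrative and not taken from the package.

```r
set.seed(11)
n   <- 1000
z   <- rbinom(n, 1, 0.5)   # treatment indicator Z
eta <- -0.25 * z           # linear predictor with a treatment effect only

# AFT lognormal: log T = mu + eta + sigma * eps, eps ~ N(0, 1)
mu <- 3.0
sigma <- 0.6
t_lognormal <- exp(mu + eta + sigma * rnorm(n))

# AFT log-logistic: T = lambda * exp(eta) * (U / (1 - U))^(1 / k)
k <- 1.5        # illustrative shape
lambda <- 12    # illustrative scale
u <- runif(n)
t_loglogistic <- lambda * exp(eta) * (u / (1 - u))^(1 / k)

# On the AFT scale a negative eta shortens event times.
tapply(t_lognormal, z, median)
```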

Effects and linear predictor

Specify effects on the appropriate scale (AFT: log-time; PH: log-hazard):

effects = list(
  intercept  = 0,                      # default is 0
  treatment  = -0.25,
  covariates = list(age = 0.01, sex = -0.2)  # NOTE: named LIST
  # or: formula="~ age + sex", beta=c(0.01, -0.2)
)

effects$covariates must be a named list of numerics (e.g., list(age=0.01)), not a named vector created with c().
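
If the coefficients already live in a named vector, as.list() converts them to the required shape. The second half of the sketch shows one plausible way the effects combine into $\eta$ (intercept + treatment × Z + covariate terms); that combination is our assumption for illustration, not code from the package.

```r
beta_vec  <- c(age = 0.01, sex = -0.2)  # named numeric vector: not accepted
beta_list <- as.list(beta_vec)          # named list of numerics: accepted

is.list(beta_vec)
is.list(beta_list)

# Assumed form: eta = intercept + treatment * Z + sum_j beta_j * X_j
effects <- list(intercept = 0, treatment = -0.25, covariates = beta_list)
newdata <- data.frame(arm = c(0, 1), age = c(0.2, -1.0), sex = c(1, 0))

eta <- effects$intercept + effects$treatment * newdata$arm
for (nm in names(effects$covariates)) {
  eta <- eta + effects$covariates[[nm]] * newdata[[nm]]
}
eta
```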

Censoring

Two modes are supported:

Mode	Fields	Semantics
`"target_overall"`	`target`, `admin_time`	Solver finds an exponential random-censoring rate $\lambda_c$ so that the overall censoring fraction is $\approx$ `target`; administrative censoring at `admin_time` is applied as well and sets a floor on the achievable fraction.
`"explicit"`	Any of: `administrative = list(time=...)`; `random = list(dist="exponential", params=list(rate=...))`; `dependent = list(formula="~ ...", base=..., beta=c(...))`	Compose administrative, random, and covariate-dependent censoring directly.

Examples

Target overall censoring:

cz_target <- list(mode="target_overall", target=0.25, admin_time=36)
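
To give a feel for what a target-censoring solver does, here is a minimal base-R sketch that searches for an exponential censoring rate with uniroot(). It is not the package's algorithm: the latent event times are a Weibull stand-in, and reusing the same unit-exponential draws across candidate rates is our own choice to keep the objective monotone.

```r
set.seed(21)
n <- 2000
event_time <- rweibull(n, shape = 1.3, scale = 12)  # stand-in latent event times
admin_time <- 36
target     <- 0.25

e <- rexp(n)  # unit-exponential draws, reused so the objective is monotone in rate

# Fraction censored when random censoring has the given rate.
censor_frac <- function(rate) {
  cens_time <- pmin(e / rate, admin_time)  # random censoring capped at admin_time
  mean(cens_time < event_time)
}

# Root of (achieved fraction - target) over a wide bracket of rates.
sol <- uniroot(function(r) censor_frac(r) - target, interval = c(1e-6, 10))
lambda_c <- sol$root
c(rate = lambda_c, achieved = censor_frac(lambda_c))
```

With a finite sample the achieved fraction only approximates the target, which is why the worked examples report the realized value separately.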

Explicit mix (admin + random):

cz_explicit <- list(
  mode = "explicit",
  administrative = list(time = 36),
  random = list(dist = "exponential", params = list(rate = 0.02))
)
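
The explicit mix above amounts to taking the earlier of the random and administrative censoring times. A short base-R sketch, using exponential stand-in event times and assuming status = 1 marks an observed event:

```r
set.seed(5)
n <- 1000
event_time <- rexp(n, rate = 0.05)             # stand-in latent event times

cens_random <- rexp(n, rate = 0.02)            # random: exponential(rate = 0.02)
cens_time   <- pmin(cens_random, 36)           # administrative cutoff at t = 36

time   <- pmin(event_time, cens_time)          # observed follow-up time
status <- as.integer(event_time <= cens_time)  # assumed coding: 1 = event, 0 = censored

c(max_time = max(time), censoring = mean(status == 0))
```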

Worked examples

We now build full recipes and call simulate_from_recipe(). We report realized censoring via attr(dat, "achieved_censoring").

Example 1 — AFT Lognormal

covs1 <- list(
  list(name="age",   type="continuous",  dist="normal",     params=list(mean=62, sd=10),
       transform=c("center(60)","scale(10)")),
  list(name="sex",   type="categorical", dist="bernoulli",  params=list(p=0.45)),
  list(name="stage", type="categorical", dist="ordinal",
       params=list(prob=c(0.3,0.5,0.2), labels=c("I","II","III"))),
  list(name="x",     type="continuous",  dist="lognormal",  params=list(meanlog=0, sdlog=0.6))
)

rec1 <- list(
  n = 300,
  covariates = list(defs = covs1),
  treatment  = list(assignment="randomization", allocation="1:1"),
  event_time = list(model="aft_lognormal",
                    baseline=list(mu=3.0, sigma=0.6),
                    effects=list(intercept=0, treatment=-0.25,
                                 covariates=list(age=0.01, sex=-0.2, x=0.05))),
  censoring  = list(mode="target_overall", target=0.25, admin_time=36),
  seed = 11
)

dat1 <- simulate_from_recipe(validate_recipe(rec1))
head(dat1)
       time status arm        age sex stage         x
1 17.256803      1   0 -0.3910311   0    II 1.2192376
2 19.531621      1   1  0.2265944   1    II 0.6122083
3 11.902048      1   0 -1.3165531   0   III 1.2334828
4 18.770511      1   0 -1.1626533   0    II 0.5784534
5 16.584294      1   1  1.3784892   0    II 0.2619596
6  9.377759      1   1 -0.7341513   0    II 1.1455235
attr(dat1, "achieved_censoring")
[1] 0.23

Example 2 — AFT Weibull

rec2 <- rec1
rec2$event_time <- list(model="aft_weibull",
                        baseline=list(shape=1.3, scale=12),
                        effects=list(intercept=0, treatment=-0.20,
                                     covariates=list(age=0.008, x=0.04)))
dat2 <- simulate_from_recipe(validate_recipe(rec2), seed=12)
summary(dat2$time)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2221  3.4534  7.2184  8.8069 12.1700 36.0000 
attr(dat2, "achieved_censoring")
[1] 0.2466667

Example 3 — PH piecewise exponential (single segment)

rec3 <- list(
  n = 400,
  covariates = list(defs = covs1),
  treatment  = list(assignment="randomization", allocation="1:1"),
  event_time = list(model="ph_pwexp",
                    baseline=list(rates=c(0.05), cuts=numeric(0)),
                    effects=list(intercept=0, treatment=-0.3,
                                 covariates=list(age=0.01, x=0.03))),
  censoring  = list(mode="target_overall", target=0.20, admin_time=30),
  seed = 13
)
dat3 <- simulate_from_recipe(validate_recipe(rec3))

Example 3 — Summary
Metric	Value
n	400.000
Events	297.000
Censoring rate	0.258
Mean time	16.440
Median time	15.130

Example 4 — PH piecewise exponential (multi-segment)

rec4 <- list(
  n = 500,
  covariates = list(defs = list(
    list(name="age", type="continuous",  dist="normal",    params=list(mean=60, sd=8)),
    list(name="sex", type="categorical", dist="bernoulli", params=list(p=0.5)),
    list(name="x",   type="continuous",  dist="lognormal", params=list(meanlog=0, sdlog=0.5))
  )),
  treatment  = list(assignment="randomization", allocation="1:1"),
  event_time = list(model="ph_pwexp",
                    baseline=list(rates=c(0.10, 0.06, 0.03), cuts=c(6, 18)),
                    effects=list(intercept=0, treatment=-0.4,
                                 covariates=list(age=0.01, x=0.03))),
  censoring  = list(mode="target_overall", target=0.25, admin_time=30)
)
dat4 <- simulate_from_recipe(validate_recipe(rec4), seed=123)

Example 4 — Summary
Metric	Value
n	500.000
Events	367.000
Censoring rate	0.266
Mean time	5.640
Median time	3.560
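
For intuition about the multi-segment engine, the sketch below samples from a proportional-hazards piecewise exponential model by inverting the baseline cumulative hazard. rpwexp_ph() is a hypothetical helper written for this illustration and is not the package's sampler; it uses the same rates and cuts as Example 4.

```r
# Hypothetical helper: draw times whose hazard is rates[s] * exp(eta) on segment s.
rpwexp_ph <- function(eta, rates, cuts) {
  starts <- c(0, cuts)               # left edge of each segment
  widths <- diff(c(starts, Inf))     # the last segment is open-ended
  # Baseline cumulative hazard H0 at each segment start
  H_start <- c(0, cumsum(rates[-length(rates)] * widths[-length(widths)]))
  target  <- rexp(length(eta)) / exp(eta)   # solve H0(T) * exp(eta) = E, E ~ Exp(1)
  seg     <- findInterval(target, H_start)  # segment that contains the solution
  starts[seg] + (target - H_start[seg]) / rates[seg]
}

set.seed(123)
eta <- -0.4 * rbinom(5000, 1, 0.5)   # treatment effect only
t4  <- rpwexp_ph(eta, rates = c(0.10, 0.06, 0.03), cuts = c(6, 18))
summary(t4)

# With a single segment the same helper reduces to an exponential draw.
t_single <- rpwexp_ph(rep(0, 10000), rates = 0.05, cuts = numeric(0))
mean(t_single)
```

These are latent event times before censoring, so their summary is not directly comparable with the censored summary table above.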

Batch generation with metadata

For simulation studies, write multiple scenarios and formats together. The writer creates a manifest.rds with a list-column meta describing each dataset. The loader reattaches attributes when reading back.

base <- validate_recipe(rec2)

out_dir <- file.path(tempdir(), "rmstss-manifest-demo")
unlink(out_dir, recursive = TRUE, force = TRUE)

man <- generate_recipe_sets(
  base_recipe = base,
  vary = list(n = c(200, 400),
              "event_time.effects.treatment" = c(-0.15, -0.25)),
  out_dir  = out_dir,
  formats  = c("rds","csv"),
  n_reps   = 1,
  seed_base = 2025
)

# Inspect the first row's compact metadata (fields only; no file paths)
m <- readRDS(file.path(out_dir, "manifest.rds"))
names(m)
 [1] "scenario_id"                     "rep"                            
 [3] "seed"                            "achieved_censoring"             
 [5] "n"                               "file_txt"                       
 [7] "file_csv"                        "file_rds"                       
 [9] "file_rdata"                      "p__n"                           
[11] "p__event_time.effects.treatment" "meta"                           
if ("meta" %in% names(m) && length(m$meta[[1]]) > 0) {
  list(model = m$meta[[1]]$model,
       baseline = m$meta[[1]]$baseline,
       effects = m$meta[[1]]$effects,
       achieved_censoring = m$meta[[1]]$achieved_censoring,
       n = m$meta[[1]]$n)
} else {
  "Manifest is minimal (older run); use rebuild_manifest() to enrich."
}
$model
[1] "aft_weibull"

$baseline
$baseline$shape
[1] 1.3

$baseline$scale
[1] 12


$effects
$effects$intercept
[1] 0

$effects$treatment
[1] -0.15

$effects$covariates
$effects$covariates$age
[1] 0.008

$effects$covariates$x
[1] 0.04



$achieved_censoring
[1] 0.28

$n
[1] 200

# Load datasets back
sets <- load_recipe_sets(file.path(out_dir, "manifest.rds"))
attr(sets[[1]]$data, "achieved_censoring")
[1] 0.28
str(sets[[1]]$meta)
List of 19
 $ dataset_id        : chr "sc001_r01"
 $ scenario_id       : int 1
 $ rep               : int 1
 $ seed_used         : int 3026
 $ n                 : int 200
 $ n_treat           : int 110
 $ n_control         : int 90
 $ event_rate        : num 0.72
 $ achieved_censoring: num 0.28
 $ model             : chr "aft_weibull"
 $ baseline          :List of 2
  ..$ shape: num 1.3
  ..$ scale: num 12
 $ effects           :List of 3
  ..$ intercept : num 0
  ..$ treatment : num -0.15
  ..$ covariates:List of 2
  .. ..$ age: num 0.008
  .. ..$ x  : num 0.04
 $ treatment         :List of 2
  ..$ assignment: chr "randomization"
  ..$ allocation: chr "1:1"
 $ censoring         :List of 3
  ..$ mode      : chr "target_overall"
  ..$ target    : num 0.25
  ..$ admin_time: num 36
 $ covariates        :List of 4
  ..$ :List of 4
  .. ..$ name  : chr "age"
  .. ..$ type  : chr "continuous"
  .. ..$ dist  : chr "normal"
  .. ..$ params:List of 2
  .. .. ..$ mean: num 62
  .. .. ..$ sd  : num 10
  ..$ :List of 4
  .. ..$ name  : chr "sex"
  .. ..$ type  : chr "categorical"
  .. ..$ dist  : chr "bernoulli"
  .. ..$ params:List of 1
  .. .. ..$ p: num 0.45
  ..$ :List of 4
  .. ..$ name  : chr "stage"
  .. ..$ type  : chr "categorical"
  .. ..$ dist  : chr "ordinal"
  .. ..$ params:List of 2
  .. .. ..$ prob  : num [1:3] 0.3 0.5 0.2
  .. .. ..$ labels: chr [1:3] "I" "II" "III"
  ..$ :List of 4
  .. ..$ name  : chr "x"
  .. ..$ type  : chr "continuous"
  .. ..$ dist  : chr "lognormal"
  .. ..$ params:List of 2
  .. .. ..$ meanlog: num 0
  .. .. ..$ sdlog  : num 0.6
 $ allocation        : chr "1:1"
 $ params            :List of 2
  ..$ n                           : num 200
  ..$ event_time.effects.treatment: num -0.15
 $ files             :List of 4
  ..$ txt  : chr NA
  ..$ csv  : chr "/tmp/Rtmp2Ikrna/rmstss-manifest-demo/sc1_r1.csv"
  ..$ rds  : chr "/tmp/Rtmp2Ikrna/rmstss-manifest-demo/sc1_r1.rds"
  ..$ rdata: chr NA
 $ created_at        : chr "2025-09-06 20:33:01.183117"
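
The vary argument above addresses nested recipe fields with dotted keys such as "event_time.effects.treatment". The sketch below shows one way such a key could be resolved against a nested list. set_by_path() and set_by_key() are hypothetical helpers for illustration; the package's internal mechanism may differ.

```r
# Hypothetical helper: walk down the nested list and replace the final element.
set_by_path <- function(x, path, value) {
  if (length(path) == 1) {
    x[[path]] <- value
    return(x)
  }
  x[[path[1]]] <- set_by_path(x[[path[1]]], path[-1], value)
  x
}

# Hypothetical helper: split a dotted key into its path components first.
set_by_key <- function(recipe, key, value) {
  set_by_path(recipe, strsplit(key, ".", fixed = TRUE)[[1]], value)
}

recipe <- list(
  n = 300,
  event_time = list(effects = list(intercept = 0, treatment = -0.20))
)

recipe <- set_by_key(recipe, "n", 200)
recipe <- set_by_key(recipe, "event_time.effects.treatment", -0.15)
str(recipe)
```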

Reproducibility tips

Set seed in the recipe or pass seed= to simulate_from_recipe().
For grids, fix a deterministic scheme like seed_base + scenario_id*1000 + rep (this is what generate_recipe_sets() does).
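
The seed scheme can be reproduced by hand. The sketch below expands the same two-factor grid used in the batch example and applies seed_base + scenario_id*1000 + rep. The scenario ordering produced by expand.grid() is an assumption and may not match the package's numbering beyond the first scenario.

```r
grid <- expand.grid(
  n = c(200, 400),
  treatment_effect = c(-0.15, -0.25)
)
grid$scenario_id <- seq_len(nrow(grid))  # assumed numbering: row order of the grid

n_reps    <- 1
seed_base <- 2025

# Cross join scenarios with replicates (merge() with no shared columns).
runs <- merge(grid, data.frame(rep = seq_len(n_reps)))
runs$seed <- seed_base + runs$scenario_id * 1000 + runs$rep
runs[order(runs$scenario_id, runs$rep), ]
```

For scenario 1, replicate 1 this gives 2025 + 1000 + 1 = 3026, which agrees with the seed_used value in the metadata printed above.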

That’s it: you now have the moving parts to define covariates, choose an event-time engine, specify censoring, simulate data, and (optionally) batch-create scenarios with compact metadata for downstream analysis.