SyntheticDataGenerator

SyntheticDataGenerator(
    n_obs=10000,
    n_cont_outcomes=1,
    n_binary_outcomes=0,
    n_cont_treatments=0,
    n_binary_treatments=1,
    n_discrete_treatments=0,
    n_cont_confounders=0,
    n_binary_confounders=0,
    n_discrete_confounders=0,
    n_cont_modifiers=0,
    n_binary_modifiers=0,
    n_discrete_modifiers=0,
    n_confounding_modifiers=0,
    stddev_outcome_noise=1.0,
    stddev_treatment_noise=1.0,
    causal_model_functional_form='linear',
    n_nonlinear_transformations=None,
    seed=None,
)

Generate highly flexible synthetic data for use in causal inference and CaML testing.

SyntheticDataGenerator is experimental and may change significantly in future versions.

The general form of the data generating process is:

\[ \mathbf{Y_i} = \tau (\mathbf{X_i}) \mathbf{T_i} + g(\mathbf{W_i}, \mathbf{X_i}) + \mathbf{\epsilon_i} \]

\[ \mathbf{T_i} = f(\mathbf{W_i}, \mathbf{X_{i,\mathcal{S}}}) + \mathbf{\eta_i} \]

where \(\mathbf{Y_i}\) are the outcome(s), \(\mathbf{T_i}\) are the treatment(s), \(\mathbf{X_i}\) are the effect modifiers (which drive treatment effect heterogeneity), with an optional random subset \(\mathcal{S}\) of them also acting as confounders, and \(\mathbf{W_i}\) are the confounders. The error terms \(\mathbf{\epsilon_i}\) and \(\mathbf{\eta_i}\) are drawn from normal distributions whose standard deviations can optionally be specified. \(\tau\) is the CATE function, \(g\) is the linearly separable (nuisance) component of the outcome function, and \(f\) is the treatment function. With no modifier variables, the model reduces to a purely partially linear model in which \(\tau\) is a constant.

For a linear data generating process, \(f\) and \(g\) consist of strictly linear terms in untransformed variables, and \(\tau(\mathbf{X_i})\) consists of linear interaction terms.
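A minimal NumPy sketch of a linear process of this shape may help fix ideas. It is not the library's implementation: the coefficients, the choice of two confounders and two modifiers, and the continuous treatment are all assumptions made for illustration, and no modifiers act as confounders here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical inputs: two confounders W and two effect modifiers X.
W = rng.normal(size=(n, 2))
X = rng.normal(size=(n, 2))

# Treatment function f: strictly linear in W, plus normal noise eta.
eta = rng.normal(scale=1.0, size=n)
T = 0.5 + W @ np.array([0.3, -0.2]) + eta

# CATE function tau(X): a constant plus linear terms in X, which show up
# in the outcome as T:X interaction terms.
tau = 1.0 + X @ np.array([0.8, -0.5])

# Nuisance component g: strictly linear in W and X.
g = -1.0 + W @ np.array([1.1, -0.7]) + X @ np.array([0.4, 0.2])

# Outcome: Y = tau(X) * T + g(W, X) + epsilon.
eps = rng.normal(scale=1.0, size=n)
Y = tau * T + g + eps

# In this sketch the true CATE is tau itself and the ATE is its sample mean.
cate = tau
ate = cate.mean()
```

Setting the coefficients on `X` inside `tau` to zero leaves a constant effect, which is the purely partially linear case described above.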

For a nonlinear data generating process, \(f\) and \(g\) are generated via Generalized Additive Models (GAMs) with randomly selected nonlinear transformations. \(\tau(\mathbf{X_i})\) contains interaction terms with \(\mathbf{X}\) and with nonlinear transformations of \(\mathbf{X}\).
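The sketch below illustrates one way an additive (GAM-style) nonlinear process could look. The pool of transformations, the helper `random_additive`, and all coefficients are hypothetical. The transformations the library actually draws from are not listed here, so this should be read as an illustration of the idea only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

W = rng.normal(size=(n, 2))       # confounders (hypothetical)
X = rng.normal(size=(n, 2))       # effect modifiers (hypothetical)
T = rng.binomial(1, 0.5, size=n)  # randomized binary treatment, for brevity

# Hypothetical pool of nonlinear transformations to pick from.
transforms = [np.sin, np.tanh, np.square, lambda v: np.log1p(np.abs(v))]

def random_additive(Z, rng):
    """Additive model: sum_j beta_j * h_j(Z[:, j]) with a randomly chosen h_j."""
    out = np.zeros(Z.shape[0])
    for j in range(Z.shape[1]):
        h = transforms[rng.integers(len(transforms))]
        out += rng.normal() * h(Z[:, j])
    return out

# Nuisance component g(W, X): additive in transformed W and X.
g = random_additive(np.hstack([W, X]), rng)

# tau(X): constant + linear terms in X + terms in transformed X.
tau = rng.normal() + X @ rng.normal(size=2) + random_additive(X, rng)

# Outcome keeps the same separable structure as the linear case.
Y = tau * T + g + rng.normal(size=n)
```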

For binary or discrete outcomes or treatments, sigmoid and softmax functions are used to transform log odds to probabilities.
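As a rough illustration of that transformation, the following sketch maps raw scores to probabilities with a sigmoid (binary case) and a softmax (discrete case). How the library then draws observed values from these probabilities is not stated here. Bernoulli and categorical sampling are assumed, and the scores themselves are random placeholders.

```python
import numpy as np

def sigmoid(z):
    """Map log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map each row of scores to probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Binary variable: one score per observation.
log_odds = rng.normal(size=5)
p = sigmoid(log_odds)
t_binary = rng.binomial(1, p)  # assumed sampling step

# Discrete variable with 3 levels: one score per observation and level.
scores = rng.normal(size=(5, 3))
probs = softmax(scores)
t_discrete = np.array([rng.choice(3, p=row) for row in probs])  # assumed
```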

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    X((X))-->Y((Y));
    W((W))-->Y((Y));
    W((W))-->T((T));
    X((X))-->|"S"|T((T));
    T((T))-->|"τ(X)"|Y((Y));

    linkStyle 0,1,2,3,4 stroke:black,stroke-width:2px

For a more detailed working example, see SyntheticDataGenerator Example.

Parameters

Name Type Description Default
n_obs int Number of observations. 10000
n_cont_outcomes int Number of continuous outcomes (\(Y\)). 1
n_binary_outcomes int Number of binary outcomes (\(Y\)). 0
n_cont_treatments int Number of continuous treatments (\(T\)). 0
n_binary_treatments int Number of binary treatments (\(T\)). 1
n_discrete_treatments int Number of discrete treatments (\(T\)). 0
n_cont_confounders int Number of continuous confounders (\(W\)). 0
n_binary_confounders int Number of binary confounders (\(W\)). 0
n_discrete_confounders int Number of discrete confounders (\(W\)). 0
n_cont_modifiers int Number of continuous treatment effect modifiers (\(X\)). 0
n_binary_modifiers int Number of binary treatment effect modifiers (\(X\)). 0
n_discrete_modifiers int Number of discrete treatment effect modifiers (\(X\)). 0
n_confounding_modifiers int Number of confounding treatment effect modifiers (\(X_{\mathcal{S}}\)). 0
stddev_outcome_noise float Standard deviation of the outcome noise (\(\epsilon\)). 1.0
stddev_treatment_noise float Standard deviation of the treatment noise (\(\eta\)). 1.0
causal_model_functional_form str Functional form of the causal model, either 'linear' or 'nonlinear'. 'linear'
n_nonlinear_transformations int | None Number of nonlinear transformations. Only applies if causal_model_functional_form='nonlinear'. None
seed int | None Random seed to use for generating the data. None

Attributes

Name Type Description
df pd.DataFrame The data generated by the data generation process.
cates pd.DataFrame The true conditional average treatment effects (CATEs) of the data.
ates pd.DataFrame The true average treatment effects (ATEs) of the data.
dgp dict The true data generating processes of the treatments and outcomes. Contains the design matrix formula, parameters, noise, raw_scores, and function used to generate the data.

Examples

from caml.extensions.synthetic_data import SyntheticDataGenerator

data_generator = SyntheticDataGenerator(n_cont_outcomes=1,
                                        n_binary_treatments=1,
                                        n_cont_confounders=2,
                                        n_cont_modifiers=2,
                                        seed=10)
data_generator.df
W1_continuous W2_continuous X1_continuous X2_continuous T1_binary Y1_continuous
0 0.354380 -3.252276 2.715662 -3.578800 1 -10.372900
1 0.568499 2.484069 -6.402235 -2.611815 0 13.437245
2 0.162715 8.842902 1.288770 -3.788545 0 -51.695014
3 0.362944 -0.959538 1.080988 -3.542550 1 -10.163549
4 0.612101 1.417536 4.143630 -4.112453 0 -33.613222
... ... ... ... ... ... ...
9995 0.340436 0.241095 -6.524222 -3.188783 1 28.300943
9996 0.019523 1.338152 -2.555492 -3.643733 1 -0.252336
9997 0.325401 1.258659 -3.340546 -4.255203 1 5.992318
9998 0.586715 1.263264 -2.826709 -4.149383 1 1.543645
9999 0.003002 6.723381 1.260782 -3.660600 1 -44.114285

10000 rows × 6 columns

data_generator.cates
CATE_of_T1_binary_on_Y1_continuous
0 -2.446437
1 6.601527
2 -1.289209
3 -0.922547
4 -4.137320
... ...
9995 6.299893
9996 2.337202
9997 2.618441
9998 2.223415
9999 -1.171886

10000 rows × 1 columns

data_generator.ates
Treatment ATE
0 T1_binary_on_Y1_continuous 0.678957

# Each entry of dgp is a dict describing how one variable was generated
for name, spec in data_generator.dgp.items():
    print(f"\nDGP for {name}:")
    print(spec)

DGP for T1_binary:
{'formula': '1 + W1_continuous + W2_continuous', 'params': array([ 0.4609703 ,  0.2566887 , -0.03896251]), 'noise': array([ 0.14476544, -0.51949108, -1.88624383, ..., -0.59020672,
        0.87157749,  0.0697439 ]), 'raw_scores': array([0.69496136, 0.49765524, 0.15083742, ..., 0.47633017, 0.80751307,
       0.5669763 ]), 'function': <function SyntheticDataGenerator._create_dgp_function.<locals>.f_binary at 0x7f5c121b6f80>}

DGP for Y1_continuous:
{'formula': '1 + W1_continuous + W2_continuous + X1_continuous + X2_continuous + T1_binary + T1_binary*X1_continuous + T1_binary*X2_continuous', 'params': array([-1.80242342,  1.11129512, -4.1263484 , -4.82709212,  1.87319625,
        2.60635605, -0.91633948,  0.71653213]), 'noise': array([-0.12533653, -1.15370094, -0.26681987, ...,  1.85405702,
       -0.18887322, -0.45736583]), 'raw_scores': array([-10.3729005 ,  13.43724492, -51.69501383, ...,   5.99231789,
         1.54364473, -44.11428464]), 'function': <function SyntheticDataGenerator._create_dgp_function.<locals>.f_cont at 0x7f5c121b7490>}
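The printed values above can be cross-checked by hand. The sketch below copies numbers from that output rather than calling the library. It assumes that the CATE for a binary treatment is the coefficient on `T1_binary` plus the `T1_binary:X` interaction terms, and that the binary treatment's `raw_scores` are the sigmoid of the linear predictor plus noise. Both assumptions are consistent with the printed numbers but are not stated explicitly in this reference.

```python
import numpy as np

# Modifier values for rows 0 and 1, copied from data_generator.df above.
X = np.array([[2.715662, -3.578800],
              [-6.402235, -2.611815]])

# Last three printed Y1_continuous params:
# T1_binary, T1_binary:X1_continuous, T1_binary:X2_continuous.
beta_t = 2.60635605
beta_tx = np.array([-0.91633948, 0.71653213])

# Assumed CATE: main treatment effect plus the interaction terms.
cate = beta_t + X @ beta_tx

# Row 0 of the treatment: formula '1 + W1_continuous + W2_continuous'.
w1, w2 = 0.354380, -3.252276
linear_part = 0.4609703 + 0.2566887 * w1 + (-0.03896251) * w2
noise_0 = 0.14476544

# Assumed: raw score = sigmoid(linear predictor + noise).
score_0 = 1.0 / (1.0 + np.exp(-(linear_part + noise_0)))

print(cate)     # compare with the first two rows of data_generator.cates
print(score_0)  # compare with the first entry of raw_scores for T1_binary
```

Up to rounding of the displayed inputs, `cate` agrees with the first two rows of `data_generator.cates` and `score_0` agrees with the first `raw_scores` entry for `T1_binary`.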

Methods

Name Description
create_design_matrix Create a design matrix from a formula and data.

create_design_matrix

extensions.synthetic_data.SyntheticDataGenerator.create_design_matrix(
    df,
    formula,
    return_type='dataframe',
    **kwargs,
)

Create a design matrix from a formula and data.

This method can be used to reconstruct the design matrices used to generate the treatment and outcome variables. Combined with the dgp attribute, the returned design matrix can then be used to regenerate the original treatment and outcome variables. See the example below.

Parameters

Name Type Description Default
df pd.DataFrame The input data. required
formula str The formula to be used with patsy. required
return_type str The type of the returned design matrix, either 'dataframe' or 'matrix'. 'dataframe'
**kwargs Additional keyword arguments to be passed to patsy.dmatrix. {}

Returns

Name Type Description
pd.DataFrame | np.ndarray The design matrix.

Examples

import numpy as np
df = data_generator.df
dgp = data_generator.dgp['Y1_continuous']

design_matrix = data_generator.create_design_matrix(df,formula=dgp['formula'])

print(design_matrix.columns)

# Recreate Y1_continuous
params = dgp['params']
noise = dgp['noise']
f = dgp['function']

# Apply the stored generating function once and compare with the original column
y_rebuilt = f(design_matrix, params, noise)

assert np.allclose(y_rebuilt, df['Y1_continuous'])

Index(['Intercept', 'W1_continuous', 'W2_continuous', 'X1_continuous',
       'X2_continuous', 'T1_binary', 'T1_binary:X1_continuous',
       'T1_binary:X2_continuous'],
      dtype='object')
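To make the column order concrete, the following sketch rebuilds the first design-matrix row by hand and recomputes the first outcome. It uses only values printed earlier on this page and does not call the library. It assumes that the continuous outcome function is the linear predictor plus noise, which the printed numbers are consistent with.

```python
import numpy as np

# Row 0 of data_generator.df, copied from the printed table.
w1, w2, x1, x2, t1 = 0.354380, -3.252276, 2.715662, -3.578800, 1.0

# One design-matrix row in the printed column order:
# Intercept, W1, W2, X1, X2, T1, T1:X1, T1:X2.
row = np.array([1.0, w1, w2, x1, x2, t1, t1 * x1, t1 * x2])

# Printed params and first noise draw for Y1_continuous.
params = np.array([-1.80242342, 1.11129512, -4.1263484, -4.82709212,
                   1.87319625, 2.60635605, -0.91633948, 0.71653213])
noise_0 = -0.12533653

# Assumed form of the continuous outcome: linear predictor plus noise.
y0 = row @ params + noise_0
print(y0)  # compare with Y1_continuous in row 0 of data_generator.df
```

Up to rounding of the displayed inputs, `y0` matches the first `Y1_continuous` value of about -10.3729.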