SyntheticDataGenerator

SyntheticDataGenerator(
    self,
    n_obs=10000,
    n_cont_outcomes=1,
    n_binary_outcomes=0,
    n_cont_treatments=0,
    n_binary_treatments=1,
    n_discrete_treatments=0,
    n_cont_confounders=0,
    n_binary_confounders=0,
    n_discrete_confounders=0,
    n_cont_modifiers=0,
    n_binary_modifiers=0,
    n_discrete_modifiers=0,
    n_confounding_modifiers=0,
    stddev_outcome_noise=1.0,
    stddev_treatment_noise=1.0,
    causal_model_functional_form='linear',
    n_nonlinear_transformations=None,
    seed=None,
)

Generate highly flexible synthetic data for use in causal inference and CaML testing.

SyntheticDataGenerator is experimental and may change significantly in future versions.

The general form of the data generating process is:

\[ \mathbf{Y_i} = \tau (\mathbf{X_i}) \mathbf{T_i} + g(\mathbf{W_i}, \mathbf{X_i}) + \mathbf{\epsilon_i} \]

\[ \mathbf{T_i} = f(\mathbf{W_i}, \mathbf{X_{i,\mathcal{S}}}) + \mathbf{\eta_i} \]

where \(\mathbf{Y_i}\) are the outcome(s), \(\mathbf{T_i}\) are the treatment(s), \(\mathbf{X_i}\) are the effect modifiers (which drive treatment effect heterogeneity), of which an optional random subset \(\mathcal{S}\) also acts as confounders, \(\mathbf{W_i}\) are the confounders, and \(\mathbf{\epsilon_i}\) and \(\mathbf{\eta_i}\) are error terms drawn from normal distributions whose standard deviations can optionally be specified. \(\tau\) is the CATE function, \(g\) is the linearly separable (nuisance) component of the outcome function, and \(f\) is the treatment function. With no modifier variables, \(\tau\) is a constant and the model reduces to a purely partially linear model.

For a linear data generating process, \(f\) and \(g\) consist of strictly linear terms in untransformed variables, and \(\tau(\mathbf{X_i})\) consists of linear interaction terms.

For a nonlinear data generating process, \(f\) and \(g\) are generated via Generalized Additive Models (GAMs) with randomly selected nonlinear transformations, and \(\tau(\mathbf{X_i})\) contains interaction terms with \(\mathbf{X}\) and with nonlinear transformations of \(\mathbf{X}\).
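
As a rough illustration of the additive structure described above, the sketch below builds a function as a sum of per-variable terms, each passed through a randomly chosen transformation. The pool of transformations, the coefficient range, and the helper name `make_additive_function` are all hypothetical; the set of transformations the library actually draws from is not documented here.

```python
import math
import random

rng = random.Random(1)

# Hypothetical pool of transformations (an assumption for illustration only).
TRANSFORMS = {
    "identity": lambda v: v,
    "square": lambda v: v ** 2,
    "sin": math.sin,
    "abs": abs,
}


def make_additive_function(n_features, rng):
    """Return an additive function with one random transformation per feature."""
    names = [rng.choice(sorted(TRANSFORMS)) for _ in range(n_features)]
    coefs = [rng.uniform(-2.0, 2.0) for _ in range(n_features)]

    def g(values):
        # Additive form: sum_j coef_j * h_j(value_j), with no cross terms.
        return sum(c * TRANSFORMS[name](v) for c, name, v in zip(coefs, names, values))

    return g, names, coefs


g, names, coefs = make_additive_function(3, rng)
print(names)
print(g([0.5, -1.2, 2.0]))
```

Because each variable enters through its own term, the contribution of any single variable can be inspected in isolation, which is the defining property of an additive model.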

Note that for binary or discrete outcomes and treatments, sigmoid and softmax functions, respectively, are used to transform log odds into probabilities.
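
The two equations above can be sketched in plain Python for the simplest case: one continuous outcome, one binary treatment, one confounder, and one modifier. All coefficients are made up for illustration. Drawing the binary treatment as a Bernoulli variable from the sigmoid-transformed score, and taking the ATE as the sample mean of the CATEs, are assumptions of this sketch and not a description of the library's internals.

```python
import math
import random

rng = random.Random(0)
n_obs = 1000


def sigmoid(z):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def tau(x):
    """Hypothetical linear CATE function tau(X)."""
    return 1.5 + 0.5 * x


rows = []
for _ in range(n_obs):
    w = rng.gauss(0, 1)    # confounder W
    x = rng.gauss(0, 1)    # effect modifier X
    eta = rng.gauss(0, 1)  # treatment noise
    eps = rng.gauss(0, 1)  # outcome noise

    # T = f(W) + eta on the log-odds scale, squashed to a probability.
    p_treat = sigmoid(0.3 + 0.8 * w + eta)
    t = 1 if rng.random() < p_treat else 0

    # Y = tau(X) * T + g(W, X) + eps, with a linear nuisance component g.
    g = -1.0 + 2.0 * w + 0.7 * x
    y = tau(x) * t + g + eps
    rows.append({"W": w, "X": x, "T": t, "Y": y, "CATE": tau(x)})

ate = sum(r["CATE"] for r in rows) / n_obs
print(f"ATE (mean of CATEs): {ate:.3f}")
```

Since W enters both the treatment and the outcome, a naive difference in mean outcomes between treated and untreated rows would be biased here, which is the situation the generator is meant to reproduce.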

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    X((X))-->Y((Y));
    W((W))-->Y((Y));
    W((W))-->T((T));
    X((X))-->|"S"|T((T));
    T((T))-->|"τ(X)"|Y((Y));

    linkStyle 0,1,2,3,4 stroke:black,stroke-width:2px

For a more detailed working example, see SyntheticDataGenerator Example.

Parameters

Name	Type	Description	Default
n_obs	int	Number of observations.	`10000`
n_cont_outcomes	int	Number of continuous outcomes (\(Y\)).	`1`
n_binary_outcomes	int	Number of binary outcomes (\(Y\)).	`0`
n_cont_treatments	int	Number of continuous treatments (\(T\)).	`0`
n_binary_treatments	int	Number of binary treatments (\(T\)).	`1`
n_discrete_treatments	int	Number of discrete treatments (\(T\)).	`0`
n_cont_confounders	int	Number of continuous confounders (\(W\)).	`0`
n_binary_confounders	int	Number of binary confounders (\(W\)).	`0`
n_discrete_confounders	int	Number of discrete confounders (\(W\)).	`0`
n_cont_modifiers	int	Number of continuous treatment effect modifiers (\(X\)).	`0`
n_binary_modifiers	int	Number of binary treatment effect modifiers (\(X\)).	`0`
n_discrete_modifiers	int	Number of discrete treatment effect modifiers (\(X\)).	`0`
n_confounding_modifiers	int	Number of confounding treatment effect modifiers (\(X_{\mathcal{S}}\)).	`0`
stddev_outcome_noise	float	Standard deviation of the outcome noise (\(\epsilon\)).	`1.0`
stddev_treatment_noise	float	Standard deviation of the treatment noise (\(\eta\)).	`1.0`
causal_model_functional_form	str	Functional form of the causal model. Can be `'linear'` or `'nonlinear'`.	`'linear'`
n_nonlinear_transformations	int \| None	Number of nonlinear transformations. Only applies if `causal_model_functional_form='nonlinear'`.	`None`
seed	int \| None	Random seed to use for generating the data.	`None`

Attributes

Name	Type	Description
df	pd.DataFrame	The data generated by the data generation process.
cates	pd.DataFrame	The true conditional average treatment effects (CATEs) of the data.
ates	pd.DataFrame	The true average treatment effects (ATEs) of the data.
dgp	dict	The true data generating processes of the treatments and outcomes. Contains the design matrix formula, parameters, noise, raw_scores, and function used to generate the data.

Examples

from caml.extensions.synthetic_data import SyntheticDataGenerator

data_generator = SyntheticDataGenerator(n_cont_outcomes=1,
                                        n_binary_treatments=1,
                                        n_cont_confounders=2,
                                        n_cont_modifiers=2,
                                        seed=10)
data_generator.df

	W1_continuous	W2_continuous	X1_continuous	X2_continuous	T1_binary	Y1_continuous
0	0.354380	-3.252276	2.715662	-3.578800	1	-10.372900
1	0.568499	2.484069	-6.402235	-2.611815	0	13.437245
2	0.162715	8.842902	1.288770	-3.788545	0	-51.695014
3	0.362944	-0.959538	1.080988	-3.542550	1	-10.163549
4	0.612101	1.417536	4.143630	-4.112453	0	-33.613222
...	...	...	...	...	...	...
9995	0.340436	0.241095	-6.524222	-3.188783	1	28.300943
9996	0.019523	1.338152	-2.555492	-3.643733	1	-0.252336
9997	0.325401	1.258659	-3.340546	-4.255203	1	5.992318
9998	0.586715	1.263264	-2.826709	-4.149383	1	1.543645
9999	0.003002	6.723381	1.260782	-3.660600	1	-44.114285

10000 rows × 6 columns

data_generator.cates

	CATE_of_T1_binary_on_Y1_continuous
0	-2.446437
1	6.601527
2	-1.289209
3	-0.922547
4	-4.137320
...	...
9995	6.299893
9996	2.337202
9997	2.618441
9998	2.223415
9999	-1.171886

10000 rows × 1 columns

data_generator.ates

	Treatment	ATE
0	T1_binary_on_Y1_continuous	0.678957

for t, df in data_generator.dgp.items():
    print(f"\nDGP for {t}:")
    print(df)


DGP for T1_binary:
{'formula': '1 + W1_continuous + W2_continuous', 'params': array([ 0.4609703 ,  0.2566887 , -0.03896251]), 'noise': array([ 0.14476544, -0.51949108, -1.88624383, ..., -0.59020672,
        0.87157749,  0.0697439 ]), 'raw_scores': array([0.69496136, 0.49765524, 0.15083742, ..., 0.47633017, 0.80751307,
       0.5669763 ]), 'function': <function SyntheticDataGenerator._create_dgp_function.<locals>.f_binary at 0x7efca67dd3f0>}

DGP for Y1_continuous:
{'formula': '1 + W1_continuous + W2_continuous + X1_continuous + X2_continuous + T1_binary + T1_binary*X1_continuous + T1_binary*X2_continuous', 'params': array([-1.80242342,  1.11129512, -4.1263484 , -4.82709212,  1.87319625,
        2.60635605, -0.91633948,  0.71653213]), 'noise': array([-0.12533653, -1.15370094, -0.26681987, ...,  1.85405702,
       -0.18887322, -0.45736583]), 'raw_scores': array([-10.3729005 ,  13.43724492, -51.69501383, ...,   5.99231789,
         1.54364473, -44.11428464]), 'function': <function SyntheticDataGenerator._create_dgp_function.<locals>.f_cont at 0x7efca67dd480>}
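
The printed parameters can be tied back to the `cates` output above. Assuming the parameters are ordered like the expanded formula columns (intercept, W1, W2, X1, X2, T1, T1:X1, T1:X2), the CATE of the binary treatment in this linear model is the T1 coefficient plus each interaction coefficient times its modifier. The check below uses only numbers copied from the output in this example; the helper name `cate` is illustrative.

```python
# Coefficients copied from the printed Y1_continuous DGP above.
beta_t = 2.60635605      # T1_binary
beta_t_x1 = -0.91633948  # T1_binary:X1_continuous
beta_t_x2 = 0.71653213   # T1_binary:X2_continuous


def cate(x1, x2):
    """Effect of switching T1 from 0 to 1 at modifier values (x1, x2)."""
    return beta_t + beta_t_x1 * x1 + beta_t_x2 * x2


# (X1_continuous, X2_continuous) for rows 0 and 1 of data_generator.df.
modifier_rows = [(2.715662, -3.578800), (-6.402235, -2.611815)]
cates = [cate(x1, x2) for x1, x2 in modifier_rows]

for value in cates:
    print(round(value, 4))
```

The two values agree with the first two rows of `data_generator.cates` (-2.446437 and 6.601527), up to the rounding of the displayed inputs.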

Methods

Name	Description
create_design_matrix	Create a design matrix from a formula and data.

create_design_matrix

extensions.synthetic_data.SyntheticDataGenerator.create_design_matrix(
    df,
    formula,
    return_type='dataframe',
    **kwargs,
)

Create a design matrix from a formula and data.

This method can be used to reconstruct the design matrices used to generate the treatment and outcome variables. Combined with the parameters, noise, and function stored in the dgp attribute, the returned design matrix can then be used to regenerate the original treatment and outcome variables. See the example below.

Parameters

Name	Type	Description	Default
df	pd.DataFrame	The input data.	required
formula	str	The formula to be used with patsy.	required
return_type	str	The type of the returned design matrix. Can be either `'dataframe'` or `'matrix'`.	`'dataframe'`
**kwargs		Additional keyword arguments to be passed to patsy.dmatrix.	`{}`

Returns

Name	Type	Description
	pd.DataFrame \| np.ndarray	The design matrix.

Examples

import numpy as np

df = data_generator.df
dgp = data_generator.dgp['Y1_continuous']

# Rebuild the design matrix from the stored formula
design_matrix = data_generator.create_design_matrix(df, formula=dgp['formula'])

print(design_matrix.columns)

# Recreate Y1_continuous from the stored parameters, noise, and function
params = dgp['params']
noise = dgp['noise']
f = dgp['function']

y_recreated = f(design_matrix, params, noise)

assert np.allclose(y_recreated, df['Y1_continuous'])

Index(['Intercept', 'W1_continuous', 'W2_continuous', 'X1_continuous',
       'X2_continuous', 'T1_binary', 'T1_binary:X1_continuous',
       'T1_binary:X2_continuous'],
      dtype='object')
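
For intuition about what the stored functions compute, the first row can be reproduced by hand from the printed values. The sketch assumes that the continuous-outcome function returns the dot product of a design-matrix row with params plus noise, and that the binary-treatment raw_scores are the sigmoid of the same quantity. Both assumptions are inferred from the printed output in the class example, not from the library source.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


# Row 0 of data_generator.df, copied from the class example.
w1, w2, x1, x2, t1 = 0.354380, -3.252276, 2.715662, -3.578800, 1

# Outcome: columns ordered as Intercept, W1, W2, X1, X2, T1, T1:X1, T1:X2.
y_row = [1.0, w1, w2, x1, x2, t1, t1 * x1, t1 * x2]
y_params = [-1.80242342, 1.11129512, -4.1263484, -4.82709212,
            1.87319625, 2.60635605, -0.91633948, 0.71653213]
y_noise = -0.12533653
y_hat = dot(y_row, y_params) + y_noise

# Treatment: columns ordered as Intercept, W1, W2; score passed through a sigmoid.
t_row = [1.0, w1, w2]
t_params = [0.4609703, 0.2566887, -0.03896251]
t_noise = 0.14476544
t_score = 1.0 / (1.0 + math.exp(-(dot(t_row, t_params) + t_noise)))

print(round(y_hat, 3), round(t_score, 3))
```

The results match the first entries of the printed raw_scores (-10.3729005 for Y1_continuous and 0.69496136 for T1_binary) up to the rounding of the displayed inputs.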