flowchart TD; X((X))-->Y((Y)); W((W))-->Y((Y)); W((W))-->T((T)); X((X))-->|"S"|T((T)); T((T))-->|"τ(X)"|Y((Y)); linkStyle 0,1,2,3,4 stroke:black,stroke-width:2px
SyntheticDataGenerator
SyntheticDataGenerator(
    self,
    n_obs=10000,
    n_cont_outcomes=1,
    n_binary_outcomes=0,
    n_cont_treatments=0,
    n_binary_treatments=1,
    n_discrete_treatments=0,
    n_cont_confounders=0,
    n_binary_confounders=0,
    n_discrete_confounders=0,
    n_cont_modifiers=0,
    n_binary_modifiers=0,
    n_discrete_modifiers=0,
    n_confounding_modifiers=0,
    stddev_outcome_noise=1.0,
    stddev_treatment_noise=1.0,
    causal_model_functional_form='linear',
    n_nonlinear_transformations=None,
    seed=None,
)
Generate highly flexible synthetic data for use in causal inference and CaML testing.
SyntheticDataGenerator is experimental and may change significantly in future versions.
The general form of the data generating process is:
\[ \mathbf{Y_i} = \tau (\mathbf{X_i}) \mathbf{T_i} + g(\mathbf{W_i}, \mathbf{X_i}) + \mathbf{\epsilon_i} \] \[ \mathbf{T}_i=f(\mathbf{W}_i, \mathbf{X_{i,\mathcal{S}}})+\mathbf{\eta_i} \]
where \(\mathbf{Y_i}\) are the outcome(s), \(\mathbf{T_i}\) are the treatment(s), \(\mathbf{X_i}\) are the effect modifiers (which drive treatment effect heterogeneity), with an optional random subset \(\mathcal{S}\) of them also acting as confounders, \(\mathbf{W_i}\) are the confounders, \(\mathbf{\epsilon_i}\) and \(\mathbf{\eta_i}\) are error terms drawn from normal distributions with optionally specified standard deviations, \(\tau\) is the CATE function, \(g\) is the linearly separable (nuisance) component of the outcome function, and \(f\) is the treatment function. Note that with no modifier variables, the model reduces to a purely partially linear model with constant \(\tau\).
For a linear data generating process, \(f\) and \(g\) consist of strictly linear terms in untransformed variables, and \(\tau(\mathbf{X_i})\) consists of linear interaction terms.
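To make the linear case concrete, here is a minimal NumPy sketch of a data generating process with the same shape as the equations above. It does not use CaML; the coefficients, the single confounder `W`, and the single modifier `X` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical inputs: one confounder W and one effect modifier X.
W = rng.normal(size=n)
X = rng.normal(size=n)

# Treatment: f(W) is strictly linear, plus noise eta.
eta = rng.normal(scale=1.0, size=n)
T = 0.5 + 0.8 * W + eta

# CATE: tau(X) is linear in X, so it enters Y through T and T*X terms.
tau = 2.0 - 1.0 * X

# Outcome: tau(X) * T + g(W, X) + epsilon, with g strictly linear.
epsilon = rng.normal(scale=1.0, size=n)
Y = tau * T + (1.0 + 0.3 * W - 0.7 * X) + epsilon

# The model is linear in its parameters, so OLS on the matching design
# matrix recovers the (made-up) coefficients up to sampling error.
design = np.column_stack([np.ones(n), W, X, T, T * X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
true_coef = np.array([1.0, 0.3, -0.7, 2.0, -1.0])
print(np.round(coef, 2))
```

The design matrix columns mirror the structure of a formula such as `1 + W + X + T + T*X`, which is the form of formula the `dgp` attribute reports in the examples further down.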
For a nonlinear data generating process, \(f\) and \(g\) are generated via Generalized Additive Models (GAMs) with randomly selected nonlinear transformations, and \(\tau(\mathbf{X_i})\) contains interaction terms with \(\mathbf{X}\) and with nonlinear transformations of \(\mathbf{X}\).
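The additive structure described above can be sketched as follows. This is only an illustration of the general idea of a GAM-style nuisance function: the pool of transformations, the variable names, and the way transformations are drawn are all assumptions, not CaML's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
W = rng.normal(size=(n, 3))  # three hypothetical continuous confounders

# Hypothetical pool of transformations (the library's actual pool may differ).
transforms = [
    ("identity", lambda x: x),
    ("square", np.square),
    ("sin", np.sin),
    ("abs", np.abs),
]

# Randomly pick one transformation and one coefficient per variable.
picks = rng.integers(len(transforms), size=W.shape[1])
coefs = rng.normal(size=W.shape[1])

# Additive model: g(W) = sum_j beta_j * h_j(W_j), with no cross terms.
g = np.zeros(n)
for j, (pick, beta) in enumerate(zip(picks, coefs)):
    name, h = transforms[pick]
    g += beta * h(W[:, j])
    print(f"W{j + 1}: {beta:+.3f} * {name}(W{j + 1})")
```

Each variable contributes through its own univariate function, which is what makes the model additive.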
Note that for binary or discrete outcomes or treatments, sigmoid and softmax functions are used to transform log-odds into probabilities.
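As a reminder of what these two transformations do, the sketch below maps log-odds to probabilities with a sigmoid (binary case) and a softmax (discrete case). The final sampling step is an assumption added for illustration; the text above only states that log-odds are converted to probabilities.

```python
import numpy as np


def sigmoid(log_odds):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-log_odds))


def softmax(scores):
    """Map one row of scores per observation to class probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Binary case: one log-odds value per observation.
binary_log_odds = np.array([-2.0, 0.0, 2.0])
p_binary = sigmoid(binary_log_odds)
T_binary = rng.binomial(1, p_binary)  # assumed sampling step

# Discrete case: one score per class and observation (3 classes here).
class_scores = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, -2.0]])
p_discrete = softmax(class_scores)
T_discrete = np.array([rng.choice(3, p=p) for p in p_discrete])  # assumed

print(p_binary)
print(p_discrete)
```

A log-odds of zero maps to a probability of 0.5, and equal class scores map to equal class probabilities.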
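As a DAG, the data generating process can be roughly represented by the flowchart at the top of this page, with edges \(X \to Y\), \(W \to Y\), \(W \to T\), \(X \to T\) (through the confounding subset \(\mathcal{S}\)), and \(T \to Y\) (with effect \(\tau(X)\)).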
For a more detailed working example, see SyntheticDataGenerator Example.
Parameters
Name | Type | Description | Default |
---|---|---|---|
n_obs | int | Number of observations. | 10000 |
n_cont_outcomes | int | Number of continuous outcomes (\(Y\)). | 1 |
n_binary_outcomes | int | Number of binary outcomes (\(Y\)). | 0 |
n_cont_treatments | int | Number of continuous treatments (\(T\)). | 0 |
n_binary_treatments | int | Number of binary treatments (\(T\)). | 1 |
n_discrete_treatments | int | Number of discrete treatments (\(T\)). | 0 |
n_cont_confounders | int | Number of continuous confounders (\(W\)). | 0 |
n_binary_confounders | int | Number of binary confounders (\(W\)). | 0 |
n_discrete_confounders | int | Number of discrete confounders (\(W\)). | 0 |
n_cont_modifiers | int | Number of continuous treatment effect modifiers (\(X\)). | 0 |
n_binary_modifiers | int | Number of binary treatment effect modifiers (\(X\)). | 0 |
n_discrete_modifiers | int | Number of discrete treatment effect modifiers (\(X\)). | 0 |
n_confounding_modifiers | int | Number of confounding treatment effect modifiers (\(X_{\mathcal{S}}\)). | 0 |
stddev_outcome_noise | float | Standard deviation of the outcome noise (\(\epsilon\)). | 1.0 |
stddev_treatment_noise | float | Standard deviation of the treatment noise (\(\eta\)). | 1.0 |
causal_model_functional_form | str | Functional form of the causal model, can be “linear” or “nonlinear”. | 'linear' |
n_nonlinear_transformations | int \| None | Number of nonlinear transformations, only applies if causal_model_functional_form=“nonlinear”. | None |
seed | int \| None | Random seed to use for generating the data. | None |
Attributes
Name | Type | Description |
---|---|---|
df | pd.DataFrame | The data generated by the data generation process. |
cates | pd.DataFrame | The true conditional average treatment effects (CATEs) of the data. |
ates | pd.DataFrame | The true average treatment effects (ATEs) of the data. |
dgp | dict | The true data generating processes of the treatments and outcomes. Contains the design matrix formula, parameters, noise, raw_scores, and function used to generate the data. |
Examples
from caml.extensions.synthetic_data import SyntheticDataGenerator

data_generator = SyntheticDataGenerator(n_cont_outcomes=1,
                                        n_binary_treatments=1,
                                        n_cont_confounders=2,
                                        n_cont_modifiers=2,
                                        seed=10)
data_generator.df
 | W1_continuous | W2_continuous | X1_continuous | X2_continuous | T1_binary | Y1_continuous |
---|---|---|---|---|---|---|
0 | 0.354380 | -3.252276 | 2.715662 | -3.578800 | 1 | -10.372900 |
1 | 0.568499 | 2.484069 | -6.402235 | -2.611815 | 0 | 13.437245 |
2 | 0.162715 | 8.842902 | 1.288770 | -3.788545 | 0 | -51.695014 |
3 | 0.362944 | -0.959538 | 1.080988 | -3.542550 | 1 | -10.163549 |
4 | 0.612101 | 1.417536 | 4.143630 | -4.112453 | 0 | -33.613222 |
... | ... | ... | ... | ... | ... | ... |
9995 | 0.340436 | 0.241095 | -6.524222 | -3.188783 | 1 | 28.300943 |
9996 | 0.019523 | 1.338152 | -2.555492 | -3.643733 | 1 | -0.252336 |
9997 | 0.325401 | 1.258659 | -3.340546 | -4.255203 | 1 | 5.992318 |
9998 | 0.586715 | 1.263264 | -2.826709 | -4.149383 | 1 | 1.543645 |
9999 | 0.003002 | 6.723381 | 1.260782 | -3.660600 | 1 | -44.114285 |
10000 rows × 6 columns
data_generator.cates
 | CATE_of_T1_binary_on_Y1_continuous |
---|---|
0 | -2.446437 |
1 | 6.601527 |
2 | -1.289209 |
3 | -0.922547 |
4 | -4.137320 |
... | ... |
9995 | 6.299893 |
9996 | 2.337202 |
9997 | 2.618441 |
9998 | 2.223415 |
9999 | -1.171886 |
10000 rows × 1 columns
data_generator.ates
 | Treatment | ATE |
---|---|---|
0 | T1_binary_on_Y1_continuous | 0.678957 |
for t, df in data_generator.dgp.items():
    print(f"\nDGP for {t}:")
    print(df)
DGP for T1_binary:
{'formula': '1 + W1_continuous + W2_continuous', 'params': array([ 0.4609703 , 0.2566887 , -0.03896251]), 'noise': array([ 0.14476544, -0.51949108, -1.88624383, ..., -0.59020672,
0.87157749, 0.0697439 ]), 'raw_scores': array([0.69496136, 0.49765524, 0.15083742, ..., 0.47633017, 0.80751307,
0.5669763 ]), 'function': <function SyntheticDataGenerator._create_dgp_function.<locals>.f_binary at 0x7f5c121b6f80>}
DGP for Y1_continuous:
{'formula': '1 + W1_continuous + W2_continuous + X1_continuous + X2_continuous + T1_binary + T1_binary*X1_continuous + T1_binary*X2_continuous', 'params': array([-1.80242342, 1.11129512, -4.1263484 , -4.82709212, 1.87319625,
2.60635605, -0.91633948, 0.71653213]), 'noise': array([-0.12533653, -1.15370094, -0.26681987, ..., 1.85405702,
-0.18887322, -0.45736583]), 'raw_scores': array([-10.3729005 , 13.43724492, -51.69501383, ..., 5.99231789,
1.54364473, -44.11428464]), 'function': <function SyntheticDataGenerator._create_dgp_function.<locals>.f_cont at 0x7f5c121b7490>}
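The printed values can be checked by hand. The sketch below copies row 0 of `data_generator.df` and the first entries of the printed `params` and `noise` arrays, then recomputes the treatment score, the outcome, and the CATE for that row. Two points are assumptions inferred from the numbers rather than stated in the text: that `raw_scores` for the binary treatment is the sigmoid of the linear predictor plus noise, and that the CATE of a binary treatment in the linear model is the `T1_binary` coefficient plus its interaction terms.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Row 0 of data_generator.df, copied from the printed output above.
W1, W2, X1, X2, T1 = 0.354380, -3.252276, 2.715662, -3.578800, 1

# Treatment DGP, formula '1 + W1_continuous + W2_continuous'.
t_params = np.array([0.4609703, 0.2566887, -0.03896251])
t_noise_0 = 0.14476544
# Assumption: raw score = sigmoid(linear predictor + noise).
t_score_0 = sigmoid(t_params @ np.array([1.0, W1, W2]) + t_noise_0)

# Outcome DGP, coefficients in the order of the printed formula:
# intercept, W1, W2, X1, X2, T1, T1:X1, T1:X2.
y_params = np.array([-1.80242342, 1.11129512, -4.1263484, -4.82709212,
                     1.87319625, 2.60635605, -0.91633948, 0.71653213])
y_noise_0 = -0.12533653
y_row_0 = np.array([1.0, W1, W2, X1, X2, T1, T1 * X1, T1 * X2])
y_0 = y_params @ y_row_0 + y_noise_0

# Assumption: CATE = effect of switching T1 from 0 to 1 at this X.
cate_0 = y_params[5] + y_params[6] * X1 + y_params[7] * X2

print(t_score_0, y_0, cate_0)
```

Up to the rounding of the printed inputs, the three results agree with the first entry of the treatment `raw_scores`, the first value of `Y1_continuous`, and the first row of `data_generator.cates` shown above.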
Methods
Name | Description |
---|---|
create_design_matrix | Create a design matrix from a formula and data. |
create_design_matrix
extensions.synthetic_data.SyntheticDataGenerator.create_design_matrix(
    df,
    formula,
    return_type='dataframe',
    **kwargs,
)
Create a design matrix from a formula and data.
This method can be used to reconstruct the design matrices used to generate the treatment and outcome variables. Furthermore, by combining the returned design matrix with the dgp attribute, one can regenerate the original treatment and outcome variables. See the example below.
Parameters
Name | Type | Description | Default |
---|---|---|---|
df | pd.DataFrame | The input data. | required |
formula | str | The formula to be used with patsy. | required |
return_type | str | The type of the returned design matrix. Can be either “dataframe” or “matrix”. Default is “dataframe”. | 'dataframe' |
**kwargs | | Additional keyword arguments to be passed to patsy.dmatrix. | {} |
Returns
Name | Type | Description |
---|---|---|
 | pd.DataFrame \| np.ndarray | The design matrix. |
Examples
import numpy as np

df = data_generator.df
dgp = data_generator.dgp['Y1_continuous']

design_matrix = data_generator.create_design_matrix(df, formula=dgp['formula'])

print(design_matrix.columns)

# Recreate Y1_continuous
params = dgp['params']
noise = dgp['noise']
f = dgp['function']

f(design_matrix, params, noise)

assert np.allclose(f(design_matrix, params, noise), df['Y1_continuous'])
Index(['Intercept', 'W1_continuous', 'W2_continuous', 'X1_continuous',
'X2_continuous', 'T1_binary', 'T1_binary:X1_continuous',
'T1_binary:X2_continuous'],
dtype='object')
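Since the formula is documented as being used with patsy, the column layout above can also be explored without CaML by calling `patsy.dmatrix` directly. The toy data frame and short column names below are hypothetical, and whether `create_design_matrix` does anything beyond forwarding to `patsy.dmatrix` is not stated here, so treat this as a sketch of the formula semantics only.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data with one confounder, one modifier, one treatment.
toy = pd.DataFrame({
    "W1": [0.1, 0.2, 0.3],
    "X1": [1.0, 2.0, 3.0],
    "T1": [0, 1, 1],
})

# Same formula shape as the outcome DGP shown earlier.
formula = "1 + W1 + X1 + T1 + T1*X1"

# 'T1*X1' expands to T1 + X1 + T1:X1; repeated main effects are dropped.
dm = dmatrix(formula, data=toy, return_type="dataframe")
print(list(dm.columns))
```

The `T1:X1` column is the elementwise product of `T1` and `X1`, matching the `T1_binary:X1_continuous` style columns in the index printed above.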