make_partially_linear_dataset_constant

make_partially_linear_dataset_constant(
    n_obs=1000,
    ate=4.0,
    n_confounders=10,
    dgp='make_plr_CCDDHNR2018',
    seed=None,
    **doubleml_kwargs,
)

Simulate a data generating process from a partially linear model with a constant treatment effect (ATE only).

The outcome and treatment are both continuous. The dataset is generated using either the make_plr_CCDDHNR2018 or the make_plr_turrell2018 function from the doubleml package.

The general form of the data generating process is:

\[ y_i = \tau_0 d_i + g(\mathbf{W_i}) + \epsilon_i \]

\[ d_i = f(\mathbf{W_i}) + \eta_i \]

where \(y_i\) is the outcome, \(d_i\) is the treatment, \(\mathbf{W_i}\) are the confounders, \(\epsilon_i\) and \(\eta_i\) are the error terms, \(\tau_0\) is the ATE parameter, \(g\) is the outcome function, and \(f\) is the treatment function.

See the doubleml documentation for more details on the specific functional forms of the data generating process.
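
To make the general form concrete, the following is a minimal NumPy sketch of the two structural equations. The functional forms chosen here for \(g\) and \(f\) are placeholders assumed purely for illustration; they are not the forms used by make_plr_CCDDHNR2018 or make_plr_turrell2018. The helper name simulate_plr_sketch and the weight vector beta are hypothetical and not part of this package.

```python
import numpy as np


def simulate_plr_sketch(n_obs=1000, ate=4.0, n_confounders=10, seed=None):
    """Illustrative partially linear DGP with a constant treatment effect.

    The forms of f and g below are assumptions for this sketch only.
    """
    rng = np.random.default_rng(seed)

    # Confounders W_i: independent standard normal draws (assumed).
    W = rng.normal(size=(n_obs, n_confounders))

    # Hypothetical decaying weights shared by f and g.
    beta = 1.0 / np.arange(1, n_confounders + 1)

    f_W = W @ beta                            # treatment function f(W), assumed linear
    g_W = np.sin(W[:, 0]) + 0.5 * (W @ beta)  # outcome function g(W), assumed form

    eta = rng.normal(size=n_obs)  # treatment noise
    eps = rng.normal(size=n_obs)  # outcome noise

    d = f_W + eta            # d_i = f(W_i) + eta_i
    y = ate * d + g_W + eps  # y_i = tau_0 * d_i + g(W_i) + eps_i

    # With a constant effect, every unit's CATE equals the ATE.
    true_cates = np.full(n_obs, ate)
    return W, d, y, true_cates


W, d, y, true_cates = simulate_plr_sketch(n_obs=1000, ate=4.0, seed=1)
print(W.shape, d.shape, y.shape)
print(true_cates[:5])
```

Because the treatment effect enters only through the term \(\tau_0 d_i\), the vector of conditional effects is constant, which is why the sketch returns an array filled with the ATE.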

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    W((W))-->d((d));
    W((W))-->y((y));
    d((d))-->|"τ0"|y((y));
    linkStyle 0,1,2 stroke:black,stroke-width:2px

Parameters

Name	Type	Description	Default
n_obs	int	The number of observations to generate.	`1000`
ate	float	The average treatment effect \(\tau_0\).	`4.0`
n_confounders	int	The number of confounders \(\mathbf{W_i}\) to generate.	`10`
dgp	str	The data generating process to use. Can be “make_plr_CCDDHNR2018” or “make_plr_turrell2018”.	`'make_plr_CCDDHNR2018'`
seed	int \| None	The seed to use for the random number generator.	`None`
**doubleml_kwargs		Additional keyword arguments to pass to the data generating process.	`{}`

Returns

Name	Type	Description
df	pandas.DataFrame	The generated dataset where y is the outcome, d is the treatment, and W are the confounders.
true_cates	numpy.ndarray	The true conditional average treatment effects, which are all equal to the ATE here.
true_ate	float	The true average treatment effect.

Examples

from caml.extensions.synthetic_data import make_partially_linear_dataset_constant

df, true_cates, true_ate = make_partially_linear_dataset_constant(
    n_obs=1000,
    ate=4.0,
    n_confounders=10,
    dgp="make_plr_CCDDHNR2018",
    seed=1,
)

print(f"True CATES: {true_cates[:5]}")
print(f"True ATE: {true_ate}")
print(df.head())

True CATES: [4. 4. 4. 4. 4.]
True ATE: 4.0
         W1        W2        W3        W4        W5        W6        W7  \
0 -1.799808 -0.830362 -0.775800 -2.430475 -1.759428 -0.196538 -0.392579   
1 -2.238925 -2.107779 -1.619264 -1.816121 -2.084809 -0.456936  0.118781   
2  1.069028  1.616054  1.959420  1.398880  0.058545  0.370891  0.161045   
3  0.497020 -0.399126 -0.019305  0.230080  0.640361  1.233185  0.906313   
4 -1.749809 -0.315699 -0.283176  0.439451  0.819941  0.156514  0.059722   

         W8        W9       W10         y         d  
0 -0.827537 -0.735652 -1.127103 -6.074658 -1.843476  
1  0.270647  0.199401  0.049088 -8.534573 -1.969429  
2  0.118180  0.438721  0.280880  4.915427  0.935840  
3  1.031123 -0.373092  0.442367 -0.037117 -0.209740  
4  0.472781  0.030157  1.174463 -7.922597 -1.903480
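
The returned true_ate is useful as a benchmark for an estimator. The sketch below illustrates the idea on stand-in arrays built with NumPy rather than on the output of this function, so that it runs without the package installed. It assumes linear forms for \(g\) and \(f\) (an assumption of this sketch, not of the doubleml generators) so that ordinary least squares with the confounders included is correctly specified. With nonlinear nuisance functions, a flexible learner would likely be needed instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_confounders, true_ate = 5000, 10, 4.0

# Stand-in data with the same roles as the columns W1..W10, d, and y.
W = rng.normal(size=(n_obs, n_confounders))
beta = 1.0 / np.arange(1, n_confounders + 1)  # hypothetical weights
d = W @ beta + rng.normal(size=n_obs)         # d = f(W) + eta, f assumed linear
y = true_ate * d + 2.0 * (W @ beta) + rng.normal(size=n_obs)  # g assumed linear

# Naive estimate: regress y on d alone, ignoring the confounders.
X_naive = np.column_stack([np.ones(n_obs), d])
naive_ate = np.linalg.lstsq(X_naive, y, rcond=None)[0][1]

# Adjusted estimate: regress y on d and W, then read off the coefficient on d.
X_adj = np.column_stack([np.ones(n_obs), d, W])
adjusted_ate = np.linalg.lstsq(X_adj, y, rcond=None)[0][1]

print(f"True ATE:     {true_ate}")
print(f"Naive ATE:    {naive_ate:.3f}")
print(f"Adjusted ATE: {adjusted_ate:.3f}")
```

Since W drives both d and y, the naive coefficient absorbs part of the confounders' effect and overshoots, while the adjusted coefficient should land close to the true value.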