make_partially_linear_dataset_constant

make_partially_linear_dataset_constant(
    n_obs=1000,
    ate=4.0,
    n_confounders=10,
    dgp='make_plr_CCDDHNR2018',
    seed=None,
    **doubleml_kwargs,
)

Simulate a data generating process from a partially linear model with a constant treatment effect (ATE only).

The outcome and treatment are both continuous. Depending on the dgp argument, the dataset is generated using either the make_plr_CCDDHNR2018 or the make_plr_turrell2018 function from the doubleml package.

The general form of the data generating process is:

\[ y_i = \tau_0 d_i + g(\mathbf{W_i}) + \epsilon_i \]
\[ d_i = f(\mathbf{W_i}) + \eta_i \]

where \(y_i\) is the outcome, \(d_i\) is the treatment, \(\mathbf{W_i}\) are the confounders, \(\epsilon_i\) and \(\eta_i\) are the error terms, \(\tau_0\) is the ATE parameter, \(g\) is the outcome function, and \(f\) is the treatment function.

See the doubleml documentation for more details on the specific functional forms of the data generating process.
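
To make the general form above concrete, here is a minimal NumPy sketch of a partially linear DGP. The functions f and g below are hypothetical stand-ins chosen only for illustration; they are not the functional forms used by doubleml, and simulate_plr is not part of caml. The sketch also shows why the confounders matter: a plain regression slope of y on d does not recover \(\tau_0\) because W drives both variables.

```python
import numpy as np


def simulate_plr(n_obs=1000, tau0=4.0, n_confounders=10, seed=0):
    """Draw (W, d, y) from y = tau0*d + g(W) + eps and d = f(W) + eta.

    f and g are illustrative assumptions (needs n_confounders >= 3),
    not the doubleml functional forms.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_obs, n_confounders))
    f_W = np.tanh(W[:, 0]) + 0.25 * W[:, 1]      # hypothetical treatment function f
    g_W = np.sin(W[:, 0]) + 0.5 * W[:, 2] ** 2   # hypothetical outcome function g
    eta = rng.normal(size=n_obs)                 # treatment noise
    eps = rng.normal(size=n_obs)                 # outcome noise
    d = f_W + eta
    y = tau0 * d + g_W + eps
    return W, d, y


W, d, y = simulate_plr(n_obs=5000, tau0=4.0)

# Slope of a simple regression of y on d, ignoring W.
# It is biased because f and g both depend on W[:, 0].
naive_slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
print(f"naive slope of y on d: {naive_slope:.3f} (true tau0 = 4.0)")
```

Because the first confounder enters both f and g with the same sign in this sketch, the naive slope overshoots the true effect.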

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    W((W))-->d((d));
    W((W))-->y((y));
    d((d))-->|"τ0"|y((y));
    linkStyle 0,1 stroke:black,stroke-width:2px
    linkStyle 1,2 stroke:black,stroke-width:2px

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_obs | int | The number of observations to generate. | 1000 |
| ate | float | The average treatment effect \(\tau_0\). | 4.0 |
| n_confounders | int | The number of confounders \(\mathbf{W_i}\) to generate. | 10 |
| dgp | str | The data generating process to use. Can be “make_plr_CCDDHNR2018” or “make_plr_turrell2018”. | 'make_plr_CCDDHNR2018' |
| seed | int \| None | The seed to use for the random number generator. | None |
| **doubleml_kwargs | | Additional keyword arguments to pass to the data generating process. | {} |

Returns

| Name | Type | Description |
| --- | --- | --- |
| df | pandas.DataFrame | The generated dataset where y is the outcome, d is the treatment, and W are the confounders. |
| true_cates | numpy.ndarray | The true conditional average treatment effects, which are all equal to the ATE here. |
| true_ate | float | The true average treatment effect. |

Examples

from caml.extensions.synthetic_data import make_partially_linear_dataset_constant

df, true_cates, true_ate = make_partially_linear_dataset_constant(
    n_obs=1000,
    ate=4.0,
    n_confounders=10,
    dgp="make_plr_CCDDHNR2018",
    seed=1,
)

print(f"True CATES: {true_cates[:5]}")
print(f"True ATE: {true_ate}")
print(df.head())
True CATES: [4. 4. 4. 4. 4.]
True ATE: 4.0
         W1        W2        W3        W4        W5        W6        W7  \
0 -1.799808 -0.830362 -0.775800 -2.430475 -1.759428 -0.196538 -0.392579   
1 -2.238925 -2.107779 -1.619264 -1.816121 -2.084809 -0.456936  0.118781   
2  1.069028  1.616054  1.959420  1.398880  0.058545  0.370891  0.161045   
3  0.497020 -0.399126 -0.019305  0.230080  0.640361  1.233185  0.906313   
4 -1.749809 -0.315699 -0.283176  0.439451  0.819941  0.156514  0.059722   

         W8        W9       W10         y         d  
0 -0.827537 -0.735652 -1.127103 -6.074658 -1.843476  
1  0.270647  0.199401  0.049088 -8.534573 -1.969429  
2  0.118180  0.438721  0.280880  4.915427  0.935840  
3  1.031123 -0.373092  0.442367 -0.037117 -0.209740  
4  0.472781  0.030157  1.174463 -7.922597 -1.903480  
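
A dataset in this layout (columns W1 to W10, y, d) is typically used to check whether an estimator recovers true_ate. The sketch below illustrates one such check with a partialling-out estimate: residualize d and y on the confounders, then regress the residuals on each other. To stay self-contained it builds a stand-in DataFrame with NumPy rather than calling caml, and it assumes linear nuisance functions so that linear residualization is correctly specified. The helper partial_out_ate is hypothetical, and data from the doubleml DGPs is nonlinear in W, so it would generally call for more flexible learners.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same column layout as the example output.
# Assumption: linear f and g, unlike the doubleml DGPs.
rng = np.random.default_rng(1)
n_obs, n_confounders, true_ate = 2000, 10, 4.0
W = rng.normal(size=(n_obs, n_confounders))
beta_d = rng.normal(size=n_confounders)  # confounder effects on treatment
beta_y = rng.normal(size=n_confounders)  # confounder effects on outcome
d = W @ beta_d + rng.normal(size=n_obs)
y = true_ate * d + W @ beta_y + rng.normal(size=n_obs)

df = pd.DataFrame(W, columns=[f"W{j + 1}" for j in range(n_confounders)])
df["y"] = y
df["d"] = d


def partial_out_ate(data):
    """Estimate a constant treatment effect by linear partialling-out."""
    w_cols = [c for c in data.columns if c.startswith("W")]
    X = np.column_stack([np.ones(len(data)), data[w_cols].to_numpy()])

    def residualize(v):
        # Remove the part of v that is linearly explained by the confounders.
        coef, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ coef

    d_res = residualize(data["d"].to_numpy())
    y_res = residualize(data["y"].to_numpy())
    # Slope of the residual-on-residual regression (no intercept needed,
    # since both residual vectors have mean zero).
    return float(d_res @ y_res / (d_res @ d_res))


ate_hat = partial_out_ate(df)
print(f"Estimated ATE: {ate_hat:.3f} (true ATE: {true_ate})")
```

The same comparison against true_ate applies to any estimator; only the residualization step would change if more flexible models replaced the linear fits.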