make_partially_linear_dataset_simple

make_partially_linear_dataset_simple(
    n_obs=1000,
    n_confounders=5,
    dim_heterogeneity=2,
    binary_treatment=True,
    seed=None,
)

Simulate a data generating process from a partially linear model with a simple 1- or 2-dimensional CATE function.

The outcome is continuous and the treatment can be binary or continuous. The dataset is generated using the make_heterogeneous_data function from the doubleml package.

The general form of the data generating process is, in the case of dim_heterogeneity=1:

\[ y_i= \tau (x_0) d_i + g(\mathbf{X_i})+\epsilon_i \] \[ d_i=f(\mathbf{X_i})+\eta_i \]

or, in the case of dim_heterogeneity=2:

\[ y_i= \tau (x_0,x_1) d_i + g(\mathbf{X_i})+\epsilon_i \] \[ d_i=f(\mathbf{X_i})+\eta_i \]

where \(y_i\) is the outcome, \(d_i\) is the treatment, \(\mathbf{X_i}\) are the confounders, \(\epsilon_i\) and \(\eta_i\) are the error terms, \(\tau\) is the CATE function, \(g\) is the outcome function, and \(f\) is the treatment function.
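To make the structure of these equations concrete, the following is a minimal sketch of the dim_heterogeneity=1 case with a continuous treatment. The function name simulate_plm, the uniform confounders, the unit-variance Gaussian errors, and the specific forms chosen for \(\tau\), \(g\), and \(f\) are all assumptions made for illustration; they are not the functional forms used by doubleml.

```python
import math
import random


def simulate_plm(n_obs=1000, n_confounders=5, seed=0):
    """Toy partially linear DGP with a 1-d CATE.

    All functional forms below are hypothetical placeholders.
    Requires n_confounders >= 3 because g uses x[1] and x[2].
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        # Confounders X_i, assumed uniform on [0, 1)
        x = [rng.random() for _ in range(n_confounders)]
        tau = 1.0 + 2.0 * x[0]          # hypothetical CATE tau(x_0)
        g = math.sin(x[1]) + x[2] ** 2  # hypothetical outcome function g(X_i)
        f = 0.5 * x[0] + 0.5 * x[1]     # hypothetical treatment function f(X_i)
        d = f + rng.gauss(0.0, 1.0)     # d_i = f(X_i) + eta_i
        y = tau * d + g + rng.gauss(0.0, 1.0)  # y_i = tau(x_0) d_i + g(X_i) + eps_i
        rows.append({"y": y, "d": d, "x": x, "tau": tau})
    return rows


rows = simulate_plm(n_obs=5, seed=1)
for row in rows:
    print(round(row["y"], 3), round(row["d"], 3), round(row["tau"], 3))
```

Because \(\tau\) depends only on \(x_0\) here, the treatment effect varies across observations through a single confounder, while all confounders can still enter \(g\) and \(f\).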

See the doubleml documentation for more details on the specific functional forms of the data generating process.

Here, the ATE is defined as the average of the CATE function over all observations: \(\mathbb{E}[\tau (\cdot)]\).
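As a small illustration of this definition, the sketch below evaluates a CATE function at every observation and averages the results. The 2-dimensional function tau and the uniform draws for \((x_0, x_1)\) are hypothetical stand-ins, not the forms used by doubleml.

```python
import random
from statistics import fmean


def tau(x0, x1):
    """Hypothetical 2-d CATE function, for illustration only."""
    return 1.0 + 2.0 * x0 - x1


rng = random.Random(42)

# Heterogeneity features (x_0, x_1), assumed uniform on [0, 1)
features = [(rng.random(), rng.random()) for _ in range(1000)]

# One true CATE per observation
true_cates = [tau(x0, x1) for x0, x1 in features]

# The ATE is the average of the CATEs over all observations
true_ate = fmean(true_cates)

print(f"True ATE: {true_ate}")
```

Under these assumptions the population value is \(1 + 2 \cdot 0.5 - 0.5 = 1.5\), so the sample average should land close to 1.5 for a sample of this size.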

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    Xn((X))-->d((d));
    Xn((X))-->y((y));
    d((d))-->|"τ(x0,x1)"|y((y));

    linkStyle 0,1,2 stroke:black,stroke-width:2px

Parameters

Name	Type	Description	Default
n_obs	int	The number of observations to generate.	`1000`
n_confounders	int	The number of confounders \(X\).	`5`
dim_heterogeneity	int	The dimension of the heterogeneity \(x_0\) or \((x_0,x_1)\). Can only be 1 or 2.	`2`
binary_treatment	bool	Whether the treatment \(d\) is binary or continuous.	`True`
seed	int | None	The seed to use for the random number generator.	`None`

Returns

Name	Type	Description
df	pandas.DataFrame	The generated dataset, where y is the outcome, d is the treatment, and the X columns are the confounders. A 1- or 2-dimensional subset of the confounders (\(x_0\) or \((x_0,x_1)\)) drives the treatment effect heterogeneity.
true_cates	numpy.ndarray	The true conditional average treatment effects.
true_ate	float	The true average treatment effect.

Examples

from caml.extensions.synthetic_data import make_partially_linear_dataset_simple
df, true_cates, true_ate = make_partially_linear_dataset_simple(n_obs=1000,
                                                                n_confounders=5,
                                                                dim_heterogeneity=2,
                                                                binary_treatment=True,
                                                                seed=1)

print(f"True CATES: {true_cates[:5]}")
print(f"True ATE: {true_ate}")
print(df.head())

True CATES: [5.07318438 4.22638341 4.84246206 5.02852819 7.30906609]
True ATE: 4.434805144050488
          y    d        X0        X1        X2        X3        X4
0  5.814804  1.0  0.560647  0.182920  0.938085  0.721671  0.209634
1  4.593199  1.0  0.113353  0.358469  0.271148  0.908152  0.497946
2  1.489081  0.0  0.970009  0.981170  0.319852  0.034913  0.003447
3  6.569753  1.0  0.386105  0.317130  0.339849  0.232991  0.463512
4  8.249305  1.0  0.733222  0.360575  0.903222  0.600965  0.110013