make_fully_heterogeneous_dataset

make_fully_heterogeneous_dataset(
    n_obs=1000,
    n_confounders=5,
    theta=4.0,
    seed=None,
    **doubleml_kwargs,
)

Simulate data from an interactive regression model with fully heterogeneous treatment effects.

The outcome is continuous and the treatment is binary. The dataset is generated using a modified version of the make_irm_data function from the doubleml package.

The general form of the data generating process is:

\[ y_i = g(d_i,\mathbf{X_i})+\epsilon_i \]

\[ d_i = f(\mathbf{X_i})+\eta_i \]

where \(y_i\) is the outcome, \(d_i\) is the treatment, \(\mathbf{X_i}\) are the confounders utilized for full effect heterogeneity, \(\epsilon_i\) and \(\eta_i\) are the error terms, \(g\) is the outcome function, and \(f\) is the treatment function.
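The general form above can be sketched in a few lines of plain Python. This is a minimal illustration only: the functional forms of `g` and `f` below are hypothetical stand-ins, not the ones used by doubleml or by this function, and `simulate` is an illustrative helper name. For a binary treatment, \(f\) is read here as the propensity \(P(d_i = 1 \mid \mathbf{X_i})\), so that \(\eta_i = d_i - f(\mathbf{X_i})\) has mean zero given \(\mathbf{X_i}\).

```python
import math
import random


def g(d, x):
    """Hypothetical outcome function: a baseline plus an effect that varies with x."""
    baseline = sum(x)
    effect = 4.0 + x[0]  # assumed form: the effect shifts with the first confounder
    return baseline + d * effect


def f(x):
    """Hypothetical treatment function: propensity P(d = 1 | x) via a logistic link."""
    return 1.0 / (1.0 + math.exp(-x[0]))


def simulate(n_obs=1000, n_confounders=5, seed=None):
    """Draw (x, d, y) triples following y = g(d, x) + eps and d = f(x) + eta."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_confounders)]
        # Binary treatment drawn with probability f(x).
        d = 1.0 if rng.random() < f(x) else 0.0
        # Continuous outcome with standard normal noise.
        y = g(d, x) + rng.gauss(0.0, 1.0)
        rows.append((x, d, y))
    return rows


for x, d, y in simulate(n_obs=3, n_confounders=2, seed=0):
    print(x, d, y)
```

Passing the same `seed` reproduces the same draws, which mirrors the role of the `seed` argument in the signature above.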

See the doubleml documentation for more details on the specific functional forms of the data generating process.

Note that the treatment effect is fully heterogeneous, so the CATE is defined as \(\tau(\mathbf{X}) = \mathbb{E}[g(1,\mathbf{X}) - g(0,\mathbf{X})|\mathbf{X}]\) for any \(\mathbf{X}\).

The ATE is defined as the average of the CATE function over all observations: \(\mathbb{E}[\tau(\mathbf{X})]\).
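To make the two definitions concrete, the sketch below evaluates the CATE as the difference between the outcome function at \(d = 1\) and \(d = 0\), then averages it over the sample to get the ATE. The outcome function `g` is a hypothetical toy form chosen for illustration, not the one doubleml uses, and the helper name `cate` is likewise an assumption.

```python
import random
import statistics


def g(d, x):
    """Hypothetical outcome function (an assumption, not doubleml's form)."""
    return sum(x) + d * (4.0 + x[0])


def cate(x):
    """tau(x) = g(1, x) - g(0, x); the mean-zero noise term drops out."""
    return g(1, x) - g(0, x)


rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(1000)]

true_cates = [cate(x) for x in X]          # one effect per observation
true_ate = statistics.fmean(true_cates)    # average of the CATE over the sample

print(true_cates[:3])
print(true_ate)
```

In this toy setup the base effect is 4.0, yet the sample average of the CATEs deviates from it slightly because it depends on the drawn confounders.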

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    X((X))-->d((d));
    X((X))-->y((y));
    d((d))-->|"τ(X)"|y((y));
    linkStyle 0,1,2 stroke:black,stroke-width:2px
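The role of the X → d and X → y edges can be illustrated with a small simulation. The sketch below uses assumed toy functional forms (a logistic propensity and an effect of `4.0 + x`), not the library's actual data generating process, and compares a naive difference in mean outcomes between treated and control units with the average of \(\tau(\mathbf{X})\).

```python
import math
import random
import statistics

rng = random.Random(0)
treated, control, cates = [], [], []

for _ in range(5000):
    x = rng.gauss(0.0, 1.0)                  # confounder X
    p = 1.0 / (1.0 + math.exp(-x))           # edge X -> d (assumed logistic propensity)
    d = rng.random() < p                     # binary treatment
    tau = 4.0 + x                            # effect tau(X) carried by the edge d -> y
    # Edges X -> y and d -> y, plus standard normal noise.
    y = x + (tau if d else 0.0) + rng.gauss(0.0, 1.0)
    (treated if d else control).append(y)
    cates.append(tau)

naive = statistics.fmean(treated) - statistics.fmean(control)
ate = statistics.fmean(cates)
print(f"naive difference in means: {naive:.2f}")
print(f"average of tau(X):         {ate:.2f}")
```

Because X raises both the chance of treatment and the outcome in this toy example, the naive comparison overstates the average effect, which is why an estimator applied to such data needs to adjust for X.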

Parameters

Name	Type	Description	Default
n_obs	int	The number of observations to generate.	`1000`
n_confounders	int	The number of confounders \(\mathbf{X_i}\) to generate (all of them contribute to effect heterogeneity).	`5`
theta	float	The base parameter for the treatment effect. Note this can differ slightly from the true ATE.	`4.0`
seed	int \| None	The seed to use for the random number generator.	`None`
**doubleml_kwargs		Additional keyword arguments to pass to the data generating process.	`{}`

Returns

Name	Type	Description
df	pandas.DataFrame	The generated dataset, where y is the outcome, d is the treatment, and the X columns are the confounders, all of which contribute to effect heterogeneity.
true_cates	numpy.ndarray	The true conditional average treatment effects.
true_ate	float	The true average treatment effect.

Examples

from caml.extensions.synthetic_data import make_fully_heterogeneous_dataset
df, true_cates, true_ate = make_fully_heterogeneous_dataset(n_obs=1000,
                                                            n_confounders=5,
                                                            theta=4.0,
                                                            seed=1)

print(f"True CATEs: {true_cates[:5]}")
print(f"True ATE: {true_ate}")
print(df.head())

True CATEs: [5.10338083 5.0918794  1.93444292 4.36046179 3.89521828]
True ATE: 3.9499484248360175
         X1        X2        X3        X4        X5         y    d
0  1.682368 -0.422572 -1.219871 -0.941586 -1.270241  5.828931  1.0
1  0.684154  1.125168  2.601475  0.441070  0.889493  4.767675  1.0
2 -2.035148 -1.386116 -0.770108 -0.070788 -0.524494  2.748786  1.0
3  0.429364 -0.125604 -0.095252 -0.033939  1.243388  5.140932  1.0
4  0.240024 -0.069628 -1.722948 -1.565808 -1.494064  2.431165  1.0