CamlCATE

CamlCATE(
    df,
    Y,
    T,
    X,
    W=None,
    *,
    discrete_treatment=True,
    discrete_outcome=False,
    seed=None,
)

The CamlCATE class is an opinionated framework of causal machine learning techniques for estimating conditional average treatment effects (CATEs).

CamlCATE is experimental and may change significantly in future versions.

The CATE is defined formally as \(\mathbb{E}[\tau|\mathbf{X}]\) where \(\tau\) is the treatment effect and \(\mathbf{X}\) is the set of covariates.
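To make the definition concrete, the following minimal sketch (not part of CaML) simulates a randomized binary treatment whose effect depends on one binary covariate, then recovers the CATE for each covariate value with a difference in means. The data-generating process and the helper name `cate_by_group` are assumptions made for illustration only.

```python
import random
from statistics import mean


def cate_by_group(rows):
    """Estimate E[tau | X = x] as treated-minus-control mean outcome per value of x."""
    estimates = {}
    for x in sorted({r["x"] for r in rows}):
        treated = [r["y"] for r in rows if r["x"] == x and r["t"] == 1]
        control = [r["y"] for r in rows if r["x"] == x and r["t"] == 0]
        estimates[x] = mean(treated) - mean(control)
    return estimates


rng = random.Random(0)
rows = []
for _ in range(20_000):
    x = rng.randint(0, 1)
    t = rng.randint(0, 1)      # randomized treatment, so unconfoundedness holds by design
    tau = 1.0 + 2.0 * x        # assumed true effect: 1 when x == 0, 3 when x == 1
    y = 0.5 * x + tau * t + rng.gauss(0, 1)
    rows.append({"x": x, "t": t, "y": y})

cate = cate_by_group(rows)
print(cate)  # estimates should land close to 1 for x == 0 and 3 for x == 1
```

With observational data and many covariates this simple stratification no longer works, which is the setting the estimators wrapped by this class are meant for.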

This class is built on top of the EconML library and provides a high-level API for fitting, validating, and making inference with CATE models, with best practices built directly into the API. It is designed to be easy to use and understand while still offering flexibility for advanced users. It accepts pandas, polars, or pyspark DataFrames, which are converted to NumPy arrays under the hood for interoperability across data processing frameworks.

The primary workflow for the CamlCATE class is as follows:

1. Initialize the class with the input DataFrame and the necessary columns.
2. Use FLAML AutoML to find the nuisance functions (propensity and regression models) used in the EconML estimators.
3. Fit the CATE models on the training set and select the top performer based on the RScore from the validation set.
4. Validate the fitted CATE model on the test set to check generalization performance.
5. Fit the final estimator on the entire dataset, after validation and testing.
6. Predict the CATE with the fitted final estimator for either the internal dataset or an out-of-sample dataset.
7. Summarize population summary statistics for the CATE predictions, for either the internal dataset or out-of-sample predictions.
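The workflow can be sketched as a single helper that calls the documented methods in order on an already-initialized object. The helper name `run_cate_workflow` is hypothetical, and every method is called with its defaults; in practice you would likely pass arguments such as `flaml_Y_kwargs` or `cate_estimators`.

```python
def run_cate_workflow(caml_obj):
    """Call the documented CamlCATE methods in workflow order.

    `caml_obj` is assumed to be an initialized CamlCATE instance
    (or any object exposing the same method names).
    """
    caml_obj.auto_nuisance_functions()     # AutoML search for nuisance models
    caml_obj.fit_validator()               # fit candidates, pick the best by RScore
    caml_obj.validate()                    # check generalization on the test set
    caml_obj.fit_final()                   # refit the chosen estimator on all data
    cate_predictions = caml_obj.predict()  # CATE for the internal dataset
    summary = caml_obj.summarize()         # summary statistics of those predictions
    return cate_predictions, summary
```

Keeping the order fixed matters: `validate` needs the estimator chosen by `fit_validator`, and `predict` needs the estimator set by `fit_final`.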

For technical details on conditional average treatment effects, see:

CaML Documentation
EconML documentation

Note: All the standard assumptions of causal inference apply to this class (e.g., exogeneity/unconfoundedness and overlap/positivity). The class does not check these assumptions; it assumes the user has already thought them through before using it.
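Because the class does not check these assumptions, a quick diagnostic before fitting can be useful. The sketch below is one simple way to look for overlap problems with a discrete stratifying variable: it flags strata where almost everyone (or almost no one) is treated. The helper names and the `eps` threshold are assumptions for illustration, not part of CaML.

```python
from collections import defaultdict


def treated_share_by_stratum(strata, treatments):
    """Return the share of treated units within each stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for s, t in zip(strata, treatments):
        counts[s][0] += int(t == 1)
        counts[s][1] += 1
    return {s: n_treated / n_total for s, (n_treated, n_total) in counts.items()}


def overlap_violations(strata, treatments, eps=0.05):
    """Return strata whose treated share is below eps or above 1 - eps."""
    shares = treated_share_by_stratum(strata, treatments)
    return {s: p for s, p in shares.items() if p < eps or p > 1 - eps}


strata = ["a", "a", "a", "a", "b", "b", "b", "b"]
treatments = [1, 0, 1, 0, 1, 1, 1, 1]
print(overlap_violations(strata, treatments))  # {'b': 1.0}
```

Stratum `b` has no control units, so no treatment effect can be learned there from the data alone.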

For outcome/treatment support, see matrix.

For a more detailed working example, see CamlCATE Example.

Parameters

Name	Type	Description	Default
df	pandas.DataFrame \| polars.DataFrame \| pyspark.sql.DataFrame	The input DataFrame representing the data for the CamlCATE instance.	required
Y	str	The str representing the column name for the outcome variable.	required
T	str	The str representing the column name(s) for the treatment variable(s).	required
X	str \| list[str]	The str (if a single feature) or list of feature names representing the feature set to be utilized for estimating heterogeneity/CATE.	required
W	str \| list[str] \| None	The str (if a single feature) or list of feature names representing the confounder/control feature set to be utilized only for nuisance function estimation. When W is passed, only Orthogonal learners will be leveraged.	`None`
discrete_treatment	bool	A boolean indicating whether the treatment is discrete/categorical or continuous.	`True`
discrete_outcome	bool	A boolean indicating whether the outcome is binary or continuous.	`False`
seed	int \| None	The seed to use for the random number generator.	`None`

Attributes

Name	Type	Description
df	pandas.DataFrame \| polars.DataFrame \| pyspark.sql.DataFrame	The input DataFrame representing the data for the CamlCATE instance.
Y	str	The str representing the column name for the outcome variable.
T	str	The str representing the column name(s) for the treatment variable(s).
X	Iterable[str]	The str (if a single variable) or list of variable names representing the confounder/control feature set to be utilized for estimating heterogeneity/CATE and nuisance function estimation where applicable.
W	Iterable[str] \| None	The str (if a single variable) or list of variable names representing the confounder/control feature set to be utilized only for nuisance function estimation, where applicable. These will be included by default in Meta-Learners.
discrete_treatment	bool	A boolean indicating whether the treatment is discrete/categorical or continuous.
discrete_outcome	bool	A boolean indicating whether the outcome is binary or continuous.
available_estimators	str	A list of the available CATE estimators out of the box. Validity of estimator at runtime will depend on the outcome and treatment types and be automatically selected.
model_Y_X_W	sklearn.base.BaseEstimator	The fitted nuisance function for the outcome variable.
model_Y_X_W_T	sklearn.base.BaseEstimator	The fitted nuisance function for the outcome variable with treatment variable.
model_T_X_W	sklearn.base.BaseEstimator	The fitted nuisance function for the treatment variable.
cate_estimators	dict[str, econml._cate_estimator.BaseCateEstimator \| econml.score.EnsembleCateEstimator]	Dictionary of fitted cate estimator objects.
rscores	dict[str, float]	Dictionary of RScore values for each fitted cate estimator.
validation_estimator	econml._cate_estimator.BaseCateEstimator \| econml.score.EnsembleCateEstimator	The fitted EconML estimator object for validation.
validator_results	econml.validate.results.EvaluationResults	The validation results object.
final_estimator	econml._cate_estimator.BaseCateEstimator \| econml.score.EnsembleCateEstimator	The fitted EconML estimator object on the entire dataset after validation.
input_names	dict[str, list[str]]	The feature, outcome, and treatment names used in the CATE estimators.

Examples

from caml import CamlCATE
from caml.extensions.synthetic_data import SyntheticDataGenerator

data_generator = SyntheticDataGenerator(seed=10, n_cont_modifiers=1, n_cont_confounders=1)
df = data_generator.df

caml_obj = CamlCATE(
    df = df,
    Y="Y1_continuous",
    T="T1_binary",
    X=[c for c in df.columns if "X" in c or "W" in c],
    discrete_treatment=True,
    discrete_outcome=False,
    seed=0,
)

print(caml_obj)

================== CamlCATE Object ==================
Data Backend: pandas
No. of Observations: 10,000
Outcome Variable: Y1_continuous
Discrete Outcome: False
Treatment Variable: T1_binary
Discrete Treatment: True
Features/Confounders for Heterogeneity (X): ['W1_continuous', 'X1_continuous']
Features/Confounders as Controls (W): []
Random Seed: 0

Methods

Name	Description
auto_nuisance_functions	Leverages AutoML to find optimal nuisance functions/regression & propensity models for use in EconML CATE estimators.
fit_validator	Fits the CATE models on the training set and evaluates them & ensembles based on the validation set.
validate	Validates the fitted CATE models on the test set to check for generalization performance.
fit_final	Fits the final estimator on the entire dataset, after validation and testing.
predict	Predicts the CATE based on the fitted final estimator for either the internal dataset or provided data.
summarize	Provides population summary statistics for the CATE predictions for either the internal results or provided results.

auto_nuisance_functions

CamlCATE.auto_nuisance_functions(
    flaml_Y_kwargs=None,
    flaml_T_kwargs=None,
    use_ray=False,
    use_spark=False,
)

Leverages AutoML to find optimal nuisance functions/regression & propensity models for use in EconML CATE estimators.

Sets the model_Y_X_W, model_Y_X_W_T, and model_T_X_W attributes to the fitted nuisance functions.

Parameters

Name	Type	Description	Default
flaml_Y_kwargs	dict \| None	The keyword arguments for the FLAML AutoML search for the outcome model. Default implies the base parameters in CamlBase.	`None`
flaml_T_kwargs	dict \| None	The keyword arguments for the FLAML AutoML search for the treatment model. Default implies the base parameters in CamlBase.	`None`
use_ray	bool	A boolean indicating whether to use Ray for parallel processing.	`False`
use_spark	bool	A boolean indicating whether to use Spark for parallel processing.	`False`

Examples

flaml_Y_kwargs = {
    "n_jobs": -1,
    "time_budget": 10,
    "verbose": 0
}

flaml_T_kwargs = {
    "n_jobs": -1,
    "time_budget": 10,
    "verbose": 0
}

caml_obj.auto_nuisance_functions(
    flaml_Y_kwargs=flaml_Y_kwargs,
    flaml_T_kwargs=flaml_T_kwargs,
    use_ray=False,
    use_spark=False,
)

print(caml_obj.model_Y_X_W)
print(caml_obj.model_Y_X_W_T)
print(caml_obj.model_T_X_W)

ExtraTreesRegressor(max_leaf_nodes=18, n_estimators=24, n_jobs=-1,
                    random_state=12032022)
ExtraTreesRegressor(max_leaf_nodes=18, n_estimators=24, n_jobs=-1,
                    random_state=12032022)
ExtraTreesClassifier(criterion='entropy', max_features=0.6334496470801398,
                     max_leaf_nodes=5, n_estimators=12, n_jobs=-1,
                     random_state=12032022)

fit_validator

CamlCATE.fit_validator(
    cate_estimators=['LinearDML', 'CausalForestDML', 'NonParamDML', 'SparseLinearDML-2D', 'DRLearner', 'ForestDRLearner', 'LinearDRLearner', 'DomainAdaptationLearner', 'SLearner', 'TLearner', 'XLearner'],
    additional_cate_estimators=[],
    ensemble=False,
    rscorer_kwargs={},
    use_ray=False,
    ray_remote_func_options_kwargs={},
    validation_size=0.2,
    test_size=0.2,
    sample_size=1.0,
    n_jobs=-1,
)

Fits the CATE models on the training set and evaluates them & ensembles based on the validation set.

Sets the validation_estimator attribute to the best fitted EconML estimator and cate_estimators attribute to all the fitted CATE models.
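To give some intuition for how an R-loss-based score ranks models, the sketch below compares a candidate CATE against the best constant effect on residualized outcome and treatment values. It assumes the residuals are already computed and uses hypothetical helper names; `econml.score.RScorer` handles residualization and cross-fitting internally and may differ in detail.

```python
def r_loss(y_res, t_res, tau):
    """Mean squared error of predicting the outcome residual by tau(X) * treatment residual."""
    return sum((y - tau_i * t) ** 2 for y, t, tau_i in zip(y_res, t_res, tau)) / len(y_res)


def r_score(y_res, t_res, tau):
    """1 - (R-loss of the candidate CATE) / (R-loss of the best constant effect)."""
    # Best constant effect: least-squares slope of y_res on t_res, without intercept.
    const = sum(y * t for y, t in zip(y_res, t_res)) / sum(t * t for t in t_res)
    base_loss = r_loss(y_res, t_res, [const] * len(y_res))
    return 1 - r_loss(y_res, t_res, tau) / base_loss


# Toy residuals generated from a heterogeneous effect of 1 or 3, with no noise.
t_res = [0.5, -0.5, 0.5, -0.5]
true_tau = [1.0, 1.0, 3.0, 3.0]
y_res = [tau_i * t for tau_i, t in zip(true_tau, t_res)]

print(r_score(y_res, t_res, true_tau))              # 1.0: captures all heterogeneity
print(r_score(y_res, t_res, [2.0, 2.0, 2.0, 2.0]))  # 0.0: no better than a constant effect
```

A higher score means the candidate explains more of the residual outcome variation than a constant treatment effect does.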

Parameters

Name	Type	Description	Default
cate_estimators	Iterable[str]	The list of CATE estimators to fit and ensemble. Default implies all available models as defined by class.	`['LinearDML', 'CausalForestDML', 'NonParamDML', 'SparseLinearDML-2D', 'DRLearner', 'ForestDRLearner', 'LinearDRLearner', 'DomainAdaptationLearner', 'SLearner', 'TLearner', 'XLearner']`
additional_cate_estimators	list[tuple[str, BaseCateEstimator]]	The list of additional CATE estimators to fit and ensemble	`[]`
ensemble	bool	The boolean indicating whether to ensemble the CATE models & score.	`False`
rscorer_kwargs	dict	The keyword arguments for the econml.score.RScorer object.	`{}`
use_ray	bool	A boolean indicating whether to use Ray for parallel processing.	`False`
ray_remote_func_options_kwargs	dict	The keyword arguments for the Ray remote function options.	`{}`
validation_size	float	The fraction of the dataset to use for model scoring via RScorer.	`0.2`
test_size	float	The fraction of the dataset to hold out for final evaluation in the `validate()` method.	`0.2`
sample_size	float	The fraction of the dataset to use. Useful for quick testing when the DataFrame is large. Default implies full training data.	`1.0`
n_jobs	int	The number of parallel jobs to run.	`-1`
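The arithmetic behind the three size parameters can be sketched as follows. This assumes that `validation_size` and `test_size` are both fractions of the (optionally subsampled) dataset and that the splits are disjoint; the helper `split_sizes` is hypothetical, and CaML's actual splitting and rounding may differ.

```python
def split_sizes(n_obs, validation_size=0.2, test_size=0.2, sample_size=1.0):
    """Approximate row counts for the train, validation, and test splits."""
    if not 0 < validation_size + test_size < 1:
        raise ValueError("validation_size + test_size must be between 0 and 1")
    n_used = round(n_obs * sample_size)        # rows kept after optional subsampling
    n_validation = round(n_used * validation_size)
    n_test = round(n_used * test_size)
    n_train = n_used - n_validation - n_test   # whatever remains is used for fitting
    return {"train": n_train, "validation": n_validation, "test": n_test}


print(split_sizes(10_000))
# {'train': 6000, 'validation': 2000, 'test': 2000}
```

Under these assumptions, the defaults leave 60% of the rows for fitting the candidate CATE models.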

Examples

from econml.dr import LinearDRLearner

rscorer_kwargs = {
    "cv": 3,
    "mc_iters": 3,
}
cate_estimators = ["LinearDML", "NonParamDML", "CausalForestDML"]
additional_cate_estimators = [
    (
        "LinearDRLearner",
        LinearDRLearner(
            model_propensity=caml_obj.model_T_X_W,
            model_regression=caml_obj.model_Y_X_W_T,
            discrete_outcome=caml_obj.discrete_outcome,
            cv=3,
            random_state=0,
        ),
    )
]

caml_obj.fit_validator(
    cate_estimators=cate_estimators,
    additional_cate_estimators=additional_cate_estimators,
    rscorer_kwargs=rscorer_kwargs,
    validation_size=0.2,
    test_size=0.2
)

print(caml_obj.validation_estimator)
print(caml_obj.cate_estimators)

<econml.dr._drlearner.LinearDRLearner object at 0x7f0733bbb280>
[('LinearDML', <econml.dml.dml.LinearDML object at 0x7f0733bb8a00>), ('NonParamDML', <econml.dml.dml.NonParamDML object at 0x7f0733bba200>), ('CausalForestDML', <econml.dml.causal_forest.CausalForestDML object at 0x7f0733bbba00>), ('LinearDRLearner', <econml.dr._drlearner.LinearDRLearner object at 0x7f0733bbb280>)]

validate

CamlCATE.validate(
    n_groups=4,
    n_bootstrap=100,
    estimator=None,
    print_full_report=True,
)

Validates the fitted CATE models on the test set to check for generalization performance.

Uses the DRTester class from EconML to obtain the Best Linear Predictor (BLP), Calibration, AUTOC, and QINI. See EconML documentation for more details. In short, we are checking for the ability of the model to find statistically significant heterogeneity in a “well-calibrated” fashion.
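The idea behind the calibration check can be illustrated with a simplified sketch: sort units by predicted CATE, split them into `n_groups` quantile groups, and compare the mean prediction in each group with an effect estimate from the data. The sketch uses a plain difference in means on simulated randomized data and a hypothetical helper `calibration_table`; it does not reproduce what DRTester computes.

```python
import random
from statistics import mean


def calibration_table(pred, t, y, n_groups=4):
    """Return (mean predicted CATE, treated-minus-control mean outcome) per quantile group."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    size = len(order) // n_groups
    table = []
    for g in range(n_groups):
        start = g * size
        stop = (g + 1) * size if g < n_groups - 1 else len(order)
        idx = order[start:stop]
        treated = [y[i] for i in idx if t[i] == 1]
        control = [y[i] for i in idx if t[i] == 0]
        table.append((mean(pred[i] for i in idx), mean(treated) - mean(control)))
    return table


rng = random.Random(1)
x = [rng.uniform(0, 1) for _ in range(8_000)]
t = [rng.randint(0, 1) for _ in x]                      # randomized treatment
y = [4 * xi * ti + rng.gauss(0, 1) for xi, ti in zip(x, t)]
pred = [4 * xi for xi in x]                             # a correctly specified CATE, for illustration

table = calibration_table(pred, t, y)
for predicted, observed in table:
    print(f"predicted={predicted:.2f}  observed={observed:.2f}")
```

For a well-calibrated model, the predicted and observed columns track each other across groups; a model that finds no real heterogeneity would show flat observed effects.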

Sets the validator_results attribute to the validation results.

Parameters

Name	Type	Description	Default
n_groups	int	The number of quantile based groups used to calculate calibration scores.	`4`
n_bootstrap	int	The number of bootstrap samples to run when calculating confidence bands.	`100`
estimator	BaseCateEstimator \| EnsembleCateEstimator \| None	The estimator to validate. Default implies the best estimator from the validation set.	`None`
print_full_report	bool	A boolean indicating whether to print the full validation report.	`True`

Examples

caml_obj.validate()

caml_obj.validator_results

   treatment  blp_est  blp_se  blp_pval  qini_est  qini_se  qini_pval  autoc_est  autoc_se  autoc_pval  cal_r_squared
0          1    0.995   0.006       0.0     2.061    0.061        0.0      5.984     0.222         0.0          0.993

fit_final

CamlCATE.fit_final()

Fits the final estimator on the entire dataset, after validation and testing.

Sets the input_names and final_estimator class attributes.

Examples

caml_obj.fit_final()

print(caml_obj.final_estimator)
print(caml_obj.input_names)

<econml.dr._drlearner.LinearDRLearner object at 0x7f0733063910>
{'feature_names': ['W1_continuous', 'X1_continuous'], 'output_names': 'Y1_continuous', 'treatment_names': 'T1_binary'}

predict

CamlCATE.predict(X=None, T0=0, T1=1, T=None)

Predicts the CATE based on the fitted final estimator for either the internal dataset or provided data.

For a binary treatment, the CATE is the estimated effect of receiving the treatment. For a continuous treatment, it is the estimated effect of a one-unit increase in the treatment. Set the T0 and T1 parameters to evaluate the effect between other treatment levels.
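The role of T0 and T1 can be shown with a toy outcome model: the effect is the expected outcome at the target treatment minus the expected outcome at the base treatment, at the same covariates. The functions `effect` and `toy_outcome_model` below are hypothetical stand-ins, not CaML or EconML code.

```python
def effect(outcome_model, x, T0=0, T1=1):
    """Expected change in outcome when moving treatment from T0 to T1 at covariates x."""
    return outcome_model(x, T1) - outcome_model(x, T0)


def toy_outcome_model(x, t):
    # Assumed structural model: baseline 2*x plus a per-unit treatment effect of (1 + x).
    return 2.0 * x + (1.0 + x) * t


print(effect(toy_outcome_model, x=0.5))              # 1.5 (default one-unit increase)
print(effect(toy_outcome_model, x=0.5, T0=1, T1=3))  # 3.0 (two-unit increase)
```

In this toy model the effect is linear in the treatment, so doubling the treatment change doubles the effect; that need not hold for nonlinear models.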

Parameters

Name	Type	Description	Default
X	pandas.DataFrame \| np.ndarray \| None	The DataFrame containing the features (X) for which CATE needs to be predicted. If not provided, defaults to the internal dataset.	`None`
T0	int	Base treatment for each sample.	`0`
T1	int	Target treatment for each sample.	`1`
T	pandas.DataFrame \| np.ndarray \| None	Treatment vector if continuous treatment is leveraged for computing marginal effects around treatments for each individual.	`None`

Returns

Name	Type	Description
	np.ndarray	The predicted CATE values.

Examples

caml_obj.predict()

array([ 11.50064573,  -9.15542281, -32.14314537, ...,  -4.77048878,
        -4.7504708 , -24.5222364 ])

summarize

CamlCATE.summarize(cate_predictions=None)

Provides population summary statistics for the CATE predictions for either the internal results or provided results.

Parameters

Name	Type	Description	Default
cate_predictions	np.ndarray \| None	The CATE predictions for which summary statistics will be generated. If not provided, defaults to internal CATE predictions generated by `predict()` method with X=None.	`None`

Returns

Name	Type	Description
	pandas.DataFrame \| pandas.Series	The summary statistics for the CATE predictions.
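The statistics reported in the example output (count, mean, standard deviation, minimum, quartiles, maximum) can be reproduced for any sequence of predictions with the standard library. The helper `summarize_predictions` is a hypothetical stand-in; how `summarize` computes its table internally is not stated here.

```python
from statistics import mean, quantiles, stdev


def summarize_predictions(cate_predictions):
    """Return describe-style summary statistics for a sequence of CATE predictions."""
    # method="inclusive" interpolates linearly between the observed min and max.
    q1, q2, q3 = quantiles(cate_predictions, n=4, method="inclusive")
    return {
        "count": len(cate_predictions),
        "mean": mean(cate_predictions),
        "std": stdev(cate_predictions),  # sample standard deviation (n - 1 denominator)
        "min": min(cate_predictions),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(cate_predictions),
    }


summary = summarize_predictions([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in summary.items():
    print(f"{name}\t{value}")
```

Reading the mean together with the quartiles gives a first impression of both the average effect and how strongly it varies across the population.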

Examples

caml_obj.summarize()

	cate_predictions_0_1
count	10000.000000
mean	-7.680559
std	8.277390
min	-54.741286
25%	-11.685537
50%	-7.773742
75%	-3.670348
max	47.041980