CamlCATE API Usage

Here we’ll walk through an example of generating synthetic data, running CamlCATE, and visualizing results using the ground truth as reference.

Generate Synthetic Data

Here we’ll leverage the CamlSyntheticDataGenerator class to generate a linear synthetic data generating process with a binary treatment, a continuous outcome, and a mix of confounding and effect-modifying continuous covariates.

from caml.extensions.synthetic_data import CamlSyntheticDataGenerator

data = CamlSyntheticDataGenerator(
    n_obs=10_000,
    n_cont_outcomes=1,
    n_binary_treatments=1,
    n_cont_confounders=2,
    n_cont_modifiers=2,
    n_confounding_modifiers=1,
    causal_model_functional_form="linear",
    n_nonlinear_transformations=5,
    n_nonlinear_interactions=2,
    seed=1,
)

We can print our simulated data via:

data.df
W1_continuous W2_continuous X1_continuous X2_continuous T1_binary Y1_continuous
0 0.212951 2.427782 4.855579 1.899164 1 -15.626184
1 15.593752 7.556136 4.087682 -0.574265 1 -6.393739
2 1.062978 3.644116 4.970670 0.091263 1 -13.628719
3 0.334657 4.581727 3.524831 0.235195 1 -9.546798
4 5.221081 3.886017 0.487610 -0.677476 1 0.070117
... ... ... ... ... ... ...
9995 0.408012 4.600472 1.411209 1.882209 1 -3.247035
9996 2.639800 6.699876 0.001684 2.976786 1 2.582407
9997 1.151358 7.453297 1.133279 2.428612 1 -0.491604
9998 1.073735 8.631265 1.431656 2.058397 1 -0.964414
9999 0.662806 5.282546 0.214802 2.999912 1 1.087057

10000 rows × 6 columns

To inspect our true data generating process, we can call data.dgp. Furthermore, our true CATEs and ATEs are at our disposal via data.cates and data.ates, respectively. We’ll use these as our source of truth for evaluating the performance of our CATE estimator.

for t, df in data.dgp.items():
    print(f"\nDGP for {t}:")
    print(df)

DGP for T1_binary:
      covariates    params global_transformation
0  W1_continuous  1.003484               Sigmoid
1  W2_continuous  2.968150               Sigmoid
2  X1_continuous  1.551445               Sigmoid

DGP for Y1_continuous:
                    covariates    params global_transformation
0                W1_continuous  0.431376                  None
1                W2_continuous  0.287855                  None
2                X1_continuous -2.663734                  None
3                X2_continuous  1.842291                  None
4                    T1_binary -0.656965                  None
5  int_T1_binary_X1_continuous -0.549627                  None
6  int_T1_binary_X2_continuous -1.580467                  None

data.cates
CATE_of_T1_binary_on_Y1_continuous
0 -6.327288
1 -1.996057
2 -3.533216
3 -2.966024
4 0.145760
... ...
9995 -4.407371
9996 -5.362603
9997 -5.118186
9998 -4.697070
9999 -5.516288

10000 rows × 1 columns

data.ates
Treatment ATE
0 T1_binary_on_Y1_continuous -3.593735
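Before fitting anything, it’s worth seeing why the causal machinery is needed at all: with confounding present, a naive difference in means between treated and untreated units is biased away from the true ATE. A minimal self-contained sketch with toy data and hypothetical column names standing in for data.df:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
w = rng.normal(size=n)                                       # confounder: drives both T and Y
t = (rng.random(n) < 1 / (1 + np.exp(-2 * w))).astype(int)   # propensity increases with w
y = -3.5 * t + 2.0 * w + rng.normal(size=n)                  # true treatment effect is -3.5

df = pd.DataFrame({"W": w, "T": t, "Y": y})
naive_ate = df.loc[df["T"] == 1, "Y"].mean() - df.loc[df["T"] == 0, "Y"].mean()
```

Here the naive estimate lands well above the true effect of -3.5, because the confounder pushes high-w units into treatment while also raising their outcomes.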

Running CamlCATE

Class Instantiation

We can instantiate and observe our CamlCATE object via:

💡 Tip: W can be leveraged if we want certain covariates to be used only in the nuisance functions to control for confounding, and not in the final CATE estimator. This is useful when a confounder must be included for identification, but, for compliance reasons, we don’t want the CATE model itself to leverage that feature (e.g., gender). However, this restricts the available CATE estimators to orthogonal learners, since metalearners necessarily include all covariates. If you don’t mind W appearing in the final CATE estimator, pass it as part of X, as done below.

from caml import CamlCATE

caml_obj = CamlCATE(
    df=data.df,
    Y="Y1_continuous",
    T="T1_binary",
    X=[c for c in data.df.columns if "X" in c]
      + [c for c in data.df.columns if "W" in c],
    discrete_treatment=True,
    discrete_outcome=False,
    verbose=1,
)
[03/31/25 18:55:16] INFO     Logging has been set up.                                                 logging.py:50
print(caml_obj)
================== CamlCATE Object ==================
Data Backend: pandas
No. of Observations: 10,000
Outcome Variable: Y1_continuous
Discrete Outcome: False
Treatment Variable: T1_binary
Discrete Treatment: True
Features/Confounders for Heterogeneity (X): ['X1_continuous', 'X2_continuous', 'W1_continuous', 'W2_continuous']
Features/Confounders as Controls (W): []
Random Seed: None

Nuisance Function AutoML

We can then obtain our nuisance functions (regression and propensity models) via FLAML AutoML:

caml_obj.auto_nuisance_functions(
    flaml_Y_kwargs={"time_budget": 30,
                    "verbose": 0,
                    "estimator_list": ["rf", "extra_tree", "xgb_limitdepth"]},
    flaml_T_kwargs={"time_budget": 30,
                    "verbose": 0,
                    "estimator_list": ["rf", "extra_tree", "xgb_limitdepth"]},
)
print(caml_obj.model_Y_X_W)
print(caml_obj.model_Y_X_W_T)
print(caml_obj.model_T_X_W)
ExtraTreesRegressor(max_features=0.8120734525770129, max_leaf_nodes=877,
                    n_estimators=344, n_jobs=-1, random_state=12032022)
ExtraTreesRegressor(max_features=0.8120734525770129, max_leaf_nodes=877,
                    n_estimators=344, n_jobs=-1, random_state=12032022)
ExtraTreesClassifier(max_features=0.48476586538023114, max_leaf_nodes=4,
                     n_estimators=5, n_jobs=-1, random_state=12032022)
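Under the hood, these nuisance functions are ordinary supervised learners: an outcome model of Y on the covariates and a propensity model of T on the covariates. A minimal sklearn sketch of the same idea on toy data (this mirrors the concept, not caml’s internal implementation):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                               # toy covariates (X and W together)
T = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)        # toy binary treatment
Y = X @ np.array([0.4, 0.3, -2.7, 1.8]) - 0.7 * T + rng.normal(size=1000)

model_y = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, Y)   # outcome model E[Y | X, W]
model_t = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, T)  # propensity model P(T=1 | X, W)
propensities = model_t.predict_proba(X)[:, 1]
```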

Fit CATE Estimators

Now that we have obtained our first-stage models, we can fit our CATE estimators via:

📝 Note: The selected model defaults to the one with the highest RScore. All fitted models remain accessible via the cate_estimators attribute, and if you want to change the default estimator, you can set caml_obj._validation_estimator = {different_model}.

🚀 Forthcoming: Additional scoring techniques & AutoML for CATE estimators are on our roadmap.
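To make the selection rule concrete, “best estimator” is simply the argmax over the per-estimator scores. A tiny sketch (the mapping below is hypothetical, shaped like the RScore log printed by fit_validator):

```python
# Hypothetical RScore mapping; the real one is logged by fit_validator.
rscores = {
    "LinearDML": 0.1361,
    "CausalForestDML": 0.1340,
    "DomainAdaptationLearner": 0.1390,
    "TLearner": 0.1287,
}
best_name = max(rscores, key=rscores.get)  # estimator with the highest RScore
```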

caml_obj.fit_validator(
    cate_estimators=[
        "LinearDML",
        "CausalForestDML",
        "ForestDRLearner",
        "LinearDRLearner",
        "DomainAdaptationLearner",
        "SLearner",
        "TLearner",
        "XLearner",
    ],
    validation_size=0.2,
    test_size=0.2,
    n_jobs=-1,
)
[03/31/25 18:57:14] INFO     Best Estimator: DomainAdaptationLearner                                    cate.py:867
                    INFO     Estimator RScores: {'LinearDML': 0.13605192393186216, 'CausalForestDML':   cate.py:868
                             0.13401438074400696, 'ForestDRLearner': 0.13334836796315974,                          
                             'LinearDRLearner': 0.13348360299921525, 'DomainAdaptationLearner':                    
                             0.13902672426789076, 'SLearner': 0.13392781656169428, 'TLearner':                     
                             0.1286775854604263, 'XLearner': 0.1362410478945144}                                   
caml_obj.validation_estimator
<econml.metalearners._metalearners.DomainAdaptationLearner at 0x7efd6ad9a470>
caml_obj.cate_estimators
[('LinearDML', <econml.dml.dml.LinearDML at 0x7efd6aff0280>),
 ('CausalForestDML',
  <econml.dml.causal_forest.CausalForestDML at 0x7efd6aff6ad0>),
 ('ForestDRLearner', <econml.dr._drlearner.ForestDRLearner at 0x7efd6aff6e60>),
 ('LinearDRLearner', <econml.dr._drlearner.LinearDRLearner at 0x7efd6a9788b0>),
 ('DomainAdaptationLearner',
  <econml.metalearners._metalearners.DomainAdaptationLearner at 0x7efd6ad9e740>),
 ('SLearner', <econml.metalearners._metalearners.SLearner at 0x7efd6ad9e3e0>),
 ('TLearner', <econml.metalearners._metalearners.TLearner at 0x7efd6ad9f790>),
 ('XLearner', <econml.metalearners._metalearners.XLearner at 0x7efd6ad9c6d0>)]

Validate Model on the Test Holdout Set

Here we validate our model on the held-out test set. Currently, this is only available for continuous outcomes with binary treatments.

caml_obj.validate()
[03/31/25 18:57:17] INFO     All validation results suggest that the model has found statistically      cate.py:513
                             significant heterogeneity.                                                            
   treatment  blp_est  blp_se  blp_pval  qini_est  qini_se  qini_pval  autoc_est  autoc_se  autoc_pval  cal_r_squared
0          1    0.621   0.074       0.0     0.551    0.064        0.0      1.474     0.153         0.0          0.773
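For intuition on the blp_est column: the best-linear-predictor test regresses a doubly robust outcome proxy on the CATE predictions, and a slope near 1 with a low p-value indicates well-calibrated heterogeneity. A toy oracle version of the idea, regressing true effects directly on predictions (an illustration only, not econml’s actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(size=500)                # hypothetical CATE predictions
true = pred + 0.1 * rng.normal(size=500)   # true effects closely tracked by predictions

slope, intercept = np.polyfit(pred, true, 1)
# a slope near 1 means predicted heterogeneity lines up with the true effects
```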

Refit the Selected Model on the Entire Dataset

Now that we have selected our top performer and validated results on the test set, we can fit our final model on the entire dataset.

caml_obj.fit_final()
caml_obj.final_estimator
<econml.metalearners._metalearners.DomainAdaptationLearner at 0x7efdbcedaa70>

Validating Results with Ground Truth

First, we will obtain our predictions.

cate_predictions = caml_obj.predict()

Average Treatment Effect (ATE)

We’ll use the summarize() method after obtaining our predictions above, where the displayed mean represents our Average Treatment Effect (ATE).

caml_obj.summarize()
cate_predictions_0_1
count 10000.000000
mean -3.488162
std 2.452214
min -51.386351
25% -5.075448
50% -3.487215
75% -1.615648
max 7.114962

Now comparing this to our ground truth, we see the model recovered the true ATE well:

data.ates
Treatment ATE
0 T1_binary_on_Y1_continuous -3.593735

Conditional Average Treatment Effect (CATE)

Now we want to see how the estimator performed in modeling the true CATEs.

First, we can compute the Precision in Estimating Heterogeneous Effects (PEHE), which is simply the Mean Squared Error (MSE) between the true and estimated CATEs:

from sklearn.metrics import mean_squared_error

true_cates = data.cates.iloc[:, 0]
mean_squared_error(true_cates, cate_predictions)
0.827839455681178
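Equivalently, PEHE can be computed by hand as the mean of squared differences between true and estimated effects (toy arrays below for illustration):

```python
import numpy as np

# Toy stand-ins for true_cates and cate_predictions.
true = np.array([-6.3, -2.0, -3.5, -3.0, 0.15])
est = np.array([-6.0, -2.2, -3.4, -2.8, 0.30])

pehe = np.mean((true - est) ** 2)  # identical to sklearn's mean_squared_error(true, est)
```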

Not bad! Now let’s use some visualization techniques:

from caml.extensions.plots import cate_true_vs_estimated_plot

cate_true_vs_estimated_plot(true_cates=true_cates, estimated_cates=cate_predictions)

from caml.extensions.plots import cate_histogram_plot

cate_histogram_plot(true_cates=true_cates, estimated_cates=cate_predictions)

from caml.extensions.plots import cate_line_plot

cate_line_plot(true_cates=true_cates, estimated_cates=cate_predictions, window=20)
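If you prefer rolling your own figures, the true-vs-estimated comparison is easy to reproduce with matplotlib directly (a minimal sketch on toy arrays; the caml plot helpers above presumably wrap similar logic):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
true = rng.normal(-3.5, 2.0, size=1000)       # toy stand-ins for the CATE arrays
est = true + rng.normal(0.0, 0.9, size=1000)

fig, ax = plt.subplots()
ax.scatter(true, est, s=4, alpha=0.4)
lims = [min(true.min(), est.min()), max(true.max(), est.max())]
ax.plot(lims, lims, color="red", linewidth=1)  # 45-degree reference line
ax.set_xlabel("True CATE")
ax.set_ylabel("Estimated CATE")
fig.savefig(os.path.join(tempfile.gettempdir(), "cate_true_vs_estimated.png"))
```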

Overall, we can see the model performed remarkably well!

Obtaining Model Objects & Artifacts for Production Systems

In many production settings, we will want to store our model, information on the features used, etc. We provide attributes to pull this key information (more will be added as the class evolves).

Grabbing final model object:

caml_obj.final_estimator
<econml.metalearners._metalearners.DomainAdaptationLearner at 0x7efdbcedaa70>

Grabbing input features:

caml_obj.input_names
{'feature_names': ['X1_continuous',
  'X2_continuous',
  'W1_continuous',
  'W2_continuous'],
 'output_names': 'Y1_continuous',
 'treatment_names': 'T1_binary'}
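For production hand-off, the estimator and its input-name metadata can be serialized together, e.g. with joblib (a sketch using a plain sklearn model as a hypothetical stand-in for caml_obj.final_estimator):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Stand-in model and toy data; names mirror the input_names dict above.
X = np.random.default_rng(0).normal(size=(200, 4))
y = X[:, 0] - 2 * X[:, 2]
model = ExtraTreesRegressor(n_estimators=20, random_state=0).fit(X, y)

# Bundle the estimator with its input-name metadata into one artifact.
artifact = {
    "model": model,
    "input_names": {
        "feature_names": ["X1_continuous", "X2_continuous", "W1_continuous", "W2_continuous"],
        "output_names": "Y1_continuous",
        "treatment_names": "T1_binary",
    },
}
path = os.path.join(tempfile.mkdtemp(), "cate_model.joblib")
joblib.dump(artifact, path)

loaded = joblib.load(path)  # round-trips the model and metadata together
```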

Grabbing all fitted CATE estimators:

caml_obj.cate_estimators
[('LinearDML', <econml.dml.dml.LinearDML at 0x7efd6aff0280>),
 ('CausalForestDML',
  <econml.dml.causal_forest.CausalForestDML at 0x7efd6aff6ad0>),
 ('ForestDRLearner', <econml.dr._drlearner.ForestDRLearner at 0x7efd6aff6e60>),
 ('LinearDRLearner', <econml.dr._drlearner.LinearDRLearner at 0x7efd6a9788b0>),
 ('DomainAdaptationLearner',
  <econml.metalearners._metalearners.DomainAdaptationLearner at 0x7efd6ad9e740>),
 ('SLearner', <econml.metalearners._metalearners.SLearner at 0x7efd6ad9e3e0>),
 ('TLearner', <econml.metalearners._metalearners.TLearner at 0x7efd6ad9f790>),
 ('XLearner', <econml.metalearners._metalearners.XLearner at 0x7efd6ad9c6d0>)]