DoubleML Trainings: Getting Started

Welcome to the DoubleML Trainings!

We are very happy to welcome you to our Trainings in Causal Machine Learning with DoubleML! Please have a look at the following instructions to get ready for the sessions.

Virtual Meetings and Communication 💻

We will send you the invite links to our virtual meetings via the email address that you provided during sign-up on Eventbrite. Our sessions will be hosted via Microsoft Teams. You can either install the Microsoft Teams app on your machine or join our meetings from the browser.

We will use a Slack workspace for communication during the training. You will receive an invite link or be added by the course organizers.

Materials: Slides and Notebooks

You will receive a link to the materials (slides, notebooks, etc.) in the days before our training starts.

Installation

During our trainings, we will work with DoubleML and other Python packages, so please make sure you have access to a working Python environment on your local machine or on a cloud service.
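
If you want to verify which interpreter you are working with, you can print the Python version of your environment. Recent DoubleML releases require a reasonably recent Python 3; please consult the installation guide for the exact requirement.

# Print the Python version of the active environment
import sys
print(sys.version)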

Installing DoubleML for Python

Please read the installation instructions and make sure you have installed the latest release of DoubleML (>= 0.7.0) on your local machine prior to our tutorial. If you have an earlier version of DoubleML installed, please update your installation.

To install DoubleML via pip or conda without a virtual environment, type

pip install -U DoubleML

or use conda

conda install -c conda-forge doubleml

DoubleML Version

Please check that you have installed DoubleML version 0.7.0 or higher by typing

import doubleml
print(doubleml.__version__)

More detailed installation instructions

For more information on installing DoubleML, read our online installation guide.

Installing Additional Packages

In addition to DoubleML and its dependencies, we will use the packages xgboost, lightgbm, and networkx. To install these packages, please run

pip install xgboost lightgbm networkx
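
To verify that these additional packages were installed correctly, you can import them and print their versions (the exact version numbers will depend on your environment):

# Check that the additional packages are importable and print their versions
import xgboost
import lightgbm
import networkx

print(xgboost.__version__)
print(lightgbm.__version__)
print(networkx.__version__)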

Getting Ready for the Tutorial

Run the following example to check whether you are ready for the tutorial.

Load the DoubleML package after completing the installation.

import doubleml as dml

Load the Bonus data set.

from doubleml.datasets import fetch_bonus

# Load bonus data
df_bonus = fetch_bonus('DataFrame')
print(df_bonus.head(5))

   index   abdt  tg  inuidur1  inuidur2  female  black  hispanic  othrace  \
0      0  10824   0  2.890372        18       0      0         0        0   
1      3  10824   0  0.000000         1       0      0         0        0   
2      4  10747   0  3.295837        27       0      0         0        0   
3     11  10607   1  2.197225         9       0      0         0        0   
4     12  10831   0  3.295837        27       0      0         0        0   

   dep  ...  recall  agelt35  agegt54  durable  nondurable  lusd  husd  muld  \
0    2  ...       0        0        0        0           0     0     1     0   
1    0  ...       0        0        0        0           0     1     0     0   
2    0  ...       0        0        0        0           0     1     0     0   
3    0  ...       0        1        0        0           0     0     0     1   
4    1  ...       0        0        1        1           0     1     0     0   

   dep1  dep2  
0   0.0   1.0  
1   0.0   0.0  
2   0.0   0.0  
3   0.0   0.0  
4   1.0   0.0  

[5 rows x 26 columns]

Create a data backend.

# Specify the data and variables for the causal model
from doubleml import DoubleMLData

dml_data_bonus = DoubleMLData(df_bonus,
                              y_col='inuidur1',
                              d_cols='tg',
                              x_cols=['female', 'black', 'othrace', 'dep1', 'dep2',
                                      'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54',
                                      'durable', 'lusd', 'husd'])
print(dml_data_bonus)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Columns: 26 entries, index to dep2
dtypes: float64(3), int64(23)
memory usage: 1.0 MB
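
As a side note, if your data is available as numpy arrays rather than as a pandas DataFrame, a data backend can also be built with DoubleMLData.from_arrays. The following is a minimal sketch on simulated data; the variable names and data-generating process are purely illustrative.

# Illustrative only: build a data backend from numpy arrays with simulated data
import numpy as np
from doubleml import DoubleMLData

np.random.seed(42)
x = np.random.normal(size=(100, 5))
d = np.random.binomial(1, 0.5, size=100)
y = 0.5 * d + x[:, 0] + np.random.normal(size=100)

dml_data_sim = DoubleMLData.from_arrays(x, y, d)
print(dml_data_sim)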

Create two learners for the nuisance components using scikit-learn.

from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=5)

ml_l_bonus = clone(learner)
ml_m_bonus = clone(learner)
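
The nuisance learners do not have to be random forests. As a purely illustrative alternative, any suitable scikit-learn regressor works, for example cross-validated lasso learners (the variable names below are only placeholders):

# Illustrative alternative: cross-validated lasso learners for the nuisance components
from sklearn.base import clone
from sklearn.linear_model import LassoCV

lasso_learner = LassoCV()

ml_l_lasso = clone(lasso_learner)
ml_m_lasso = clone(lasso_learner)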

Create a new instance of a causal model, here a partially linear regression model via DoubleMLPLR.

import numpy as np
from doubleml import DoubleMLPLR

np.random.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)
obj_dml_plr_bonus.fit();
print(obj_dml_plr_bonus)
================== DoubleMLPLR Object ==================

------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=5, max_features='sqrt', n_estimators=500)
Learner ml_m: RandomForestRegressor(max_depth=5, max_features='sqrt', n_estimators=500)
Out-of-sample Performance:
Learner ml_l RMSE: [[1.20032322]]
Learner ml_m RMSE: [[0.47419634]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
       coef   std err         t     P>|t|     2.5 %    97.5 %
tg -0.07669  0.035412 -2.165689  0.030335 -0.146096 -0.007285
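
Beyond the printed summary, the fitted model also exposes its results programmatically, for example via its coef, se, summary, and confint members (a short sketch; the printed numbers correspond to the fit summary above):

# Access the estimated coefficient, its standard error, the summary table,
# and a 95% confidence interval of the fitted model
print(obj_dml_plr_bonus.coef)
print(obj_dml_plr_bonus.se)
print(obj_dml_plr_bonus.summary)
print(obj_dml_plr_bonus.confint(level=0.95))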

Ready to Go 🚀

Once you are able to run this code, you are ready for our tutorial!

Questions and Contact

In case you have any questions, please contact us.