Lessons for machine learning from Econometrics
Hal Varian is the chief economist at Google and gave a talk to the Electronic Support Group at the EECS Department at the University of California, Berkeley in November 2013.
The talk was titled Machine Learning and Econometrics and focused on the lessons that machine learning can take away from the field of econometrics.
Hal began by summarizing a research paper of his entitled “Big Data: New Tricks for Econometrics” (PDF), which describes what the econometrics community can learn from the machine learning community, specifically:
- Train-test-validate splits to avoid overfitting
- Cross validation
- Nonlinear estimation (trees, forests, SVMs, neural nets, etc.)
- Bootstrap, bagging, boosting
- Variable selection (lasso and friends)
- Model averaging
- Computational Bayesian methods (MCMC)
- Tools for manipulation of big data (SQL, NoSQL databases)
- Textual analysis (not discussed)
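The cross-validation item above can be illustrated in a few lines. This is a minimal sketch, not anything from the talk: the data is synthetic and the "model" is a trivial one that predicts the mean of the training targets, just to show the fold mechanics.

```python
import random

def k_fold_cv(xs, ys, k=5, seed=0):
    """Estimate out-of-sample error by averaging MSE over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        train = [i for i in idx if i not in held_out]
        # Toy "model": predict the mean of the training targets.
        y_hat = sum(ys[i] for i in train) / len(train)
        # Mean squared error on the held-out fold.
        mse = sum((ys[i] - y_hat) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / k

# Synthetic data: y roughly tracks x with a little noise.
xs = list(range(20))
ys = [x + random.Random(x).uniform(-1, 1) for x in xs]
print(k_fold_cv(xs, ys))
```

The same shuffling-and-folding pattern applies unchanged when the toy mean predictor is swapped for a real estimator.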
He went on to talk about non-i.i.d. data such as time series and panel data. This is data on which cross-validation typically performs poorly. He suggests decomposing the data into trend + seasonal components and observing deviations from expected behaviour. An example is given from Google Correlate, showing that auto dealer sales data correlates well with searches for Indian restaurants.
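The decomposition idea can be sketched without any libraries: fit a linear trend, average the detrended values by month to get a seasonal component, and treat what remains as the deviation to watch. This is a minimal illustration on synthetic monthly data, not the method used in the talk:

```python
import math

def decompose(series, period=12):
    """Split a series into linear trend + periodic seasonal + residual."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    # Least-squares slope of the linear trend.
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) \
            / sum((t - t_mean) ** 2 for t in range(n))
    trend = [y_mean + slope * (t - t_mean) for t in range(n)]
    detrended = [y - tr for y, tr in zip(series, trend)]
    # Seasonal component: average detrended value at each phase of the cycle.
    seasonal = [sum(detrended[i::period]) / len(detrended[i::period])
                for i in range(period)]
    # Residual: deviation from expected (trend + seasonal) behaviour.
    residual = [y - (tr + seasonal[t % period])
                for t, (y, tr) in enumerate(zip(series, trend))]
    return trend, seasonal, residual

# Synthetic series: upward trend plus a yearly cycle over four years.
series = [0.5 * t + 3 * math.sin(2 * math.pi * t / 12) for t in range(48)]
trend, seasonal, residual = decompose(series)
print(max(abs(r) for r in residual))  # maximum deviation from expected behaviour
```

On real data, a large residual at some date is exactly the "deviation from expected behaviour" worth investigating.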
The focus of the talk was causal inference, a big topic in econometrics. He covers:
1] Counterfactuals: What would have happened to the treated if they had not been treated? Would they look like the control group on average? Read more about counterfactuals in empirical testing.
2] Confounding variables: Unobserved variables that correlate with both x and y (the other stuff). Usually a problem when human choice is involved.
3] Natural experiments: May or may not be randomized. An example is the draft lottery. Read more about natural experiments.
4] Regression discontinuity: A cut-off or threshold above or below which the treatment is applied. When randomization is not feasible, you can compare cases close to the arbitrary threshold to estimate the average treatment effect. Once you can model the causal relationship, you can tune the threshold and play what-ifs (don't leave randomization to chance). Read more on regression discontinuity design (RDD).
5] Difference in differences (DiD): It is not enough to compare before and after the treatment; you need to adjust the treated group by the control group, because the treatment may not be randomly assigned.
6] Instrumental variables: Variation in x that is independent of the error. Something that changes x (correlates with x) but does not affect the error term. Provides a control lever. Randomization is an instrumental variable.
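Of the techniques above, difference in differences is the easiest to show numerically: subtract the control group's before/after change (the shared trend) from the treated group's change. A minimal sketch with hypothetical group means, not figures from the talk:

```python
def did(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of the average treatment effect."""
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical means: both groups drift upward by ~2 on their own,
# and the treatment adds ~3 on top for the treated group.
effect = did(treated_before=10.0, treated_after=15.0,
             control_before=11.0, control_after=13.0)
print(effect)  # 3.0: the naive before/after difference (5) minus the shared trend (2)
```

The naive before/after comparison would report an effect of 5; adjusting by the control group's change recovers the treatment effect of 3.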
He summed up the lessons for the machine learning community from econometrics as follows:
- Observational data typically cannot determine causality, no matter how big it is (big data is not enough)
- Causal inference is what you want for policy
- Treatment-control with random assignment is the gold standard
- Sometimes you can find natural experiments, discontinuities, etc.
- Prediction is central to causal inference, for both the selection problem and the counterfactual
- There is very interesting research in systems for continual testing
Hal finished his talk with two book recommendations:
- Mostly Harmless Econometrics: An Empiricist’s Companion
- An Introduction to Statistical Learning: with Applications in R
The talk was also given at the Stanford University Department of Electrical Engineering in 2014, under the title What Machine Learning Can Learn From Econometrics and Vice Versa.