Description
What
Currently, the synthetic control functionality is constrained to a single treated unit. One treated unit is the minimum needed for a working synthetic control solution, and it already offers non-trivial functionality: we have docs with a generic example using simulated data, and another on the effects of Brexit (where the UK is the only treated unit).
However, there are many situations where you will have more than one treated unit. This could happen in many different domains, but it is particularly common in marketing, in geolift settings. We have a docs page on geolift with a single treated geo, and another on multi-cell geolift analysis where we have multiple treated geos. The latter currently walks through a pooled analysis approach, where we simply take the average of the outcome variable across the treated geos and then model it as a single treated unit case of synthetic control. The alternative is to treat the geos as unpooled, in which case we simply run multiple independent single treated unit synthetic control analyses.
Why
This issue proposes that we add the ability to model multiple treated units (or geos). This has a number of motivations:
- it is a more general solution
- it would allow a single modeling approach to geo testing (or any other multiple treatment unit situation)
- it would allow the full flexibility of both pooled and unpooled analysis approaches, and would additionally enable partially pooled analysis, where information could be shared across weights.
- it will lay the foundation for implementing synthetic difference-in-differences (see Add Synthetic Difference-in-Differences #47)
Changes
Changes to the WeightedSumFitter class
This PyMC model class would need to be changed so that we have a weight matrix, rather than a weight vector.
CausalPy/causalpy/pymc_models.py, lines 254 to 271 in 4227edf
So rather than dims="coeffs" (where coeffs correspond to control units), it would be dims=("control_units", "treated_units"). This would give us an unpooled set of weights over the control units for each of the treated units. A later step could then implement partial pooling over these weights (across the treated_units dimension).
The WeightedSumFitter.build_model method would also change to reflect the fact that the raw data would no longer be long form, so the incoming data (currently a design matrix X) would now be a 2D matrix, probably with shape ("time", "unit").
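To make the shape change concrete, here is a minimal NumPy sketch (not the PyMC model itself). It assumes the weights for each treated unit are non-negative and sum to one, and the sizes and variable names (X, W, mu) are purely illustrative. It shows that a weight matrix with dims ("control_units", "treated_units") produces one synthetic control series per treated unit, and that with unpooled weights this matches running a separate single treated unit weighted sum for each column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from CausalPy).
n_time, n_control, n_treated = 50, 4, 2

# Control-unit outcomes in wide form: one row per time point,
# one column per control unit.
X = rng.normal(size=(n_time, n_control))

# Unpooled weights: one column of control-unit weights per treated unit.
# Dirichlet draws lie on the simplex along the last axis, so we draw with
# shape (treated_units, control_units) and transpose to
# (control_units, treated_units).
W = rng.dirichlet(np.ones(n_control), size=n_treated).T

# Synthetic control prediction for every treated unit at once.
mu = X @ W  # shape (time, treated_units)

# Column j of mu equals the single-treated-unit weighted sum using W[:, j].
mu_single = np.stack([X @ W[:, j] for j in range(n_treated)], axis=1)
```

One detail this surfaces: a simplex-constrained distribution typically normalises along the last axis, so if the weights are declared with dims ("control_units", "treated_units") the implementation would need to make sure the sum-to-one constraint runs over control_units, for example by transposing as above.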
Changes to the SyntheticControl class
SyntheticControl would no longer inherit from the PrePostFit class, so all the logic currently in PrePostFit.__init__ would move to the new SyntheticControl.__init__. This will leave InterruptedTimeSeries as the only class that does inherit from PrePostFit, so there would be an opportunity to collapse that class hierarchy, but that is a peripheral issue. The core thing is that SyntheticControl would change a lot.
- The incoming dataframe is still split into pre and post treatment
- Remove the formula argument and no longer use a design matrix approach (with patsy). This would result in quite a lot of change to the logic in SyntheticControl.__init__
- Update the _bayesian_plot method.
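As a rough sketch of what the formula-free data handling could look like, the snippet below splits a wide-format dataframe into pre and post treatment periods and then selects control and treated columns directly. The column names and the control_units, treated_units and treatment_time variables are hypothetical and only illustrate one possible replacement for the formula argument; this is not the existing or final CausalPy API.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per time point, one column per unit.
rng = np.random.default_rng(1)
time = pd.date_range("2022-01-01", periods=10, freq="D")
df = pd.DataFrame(
    rng.normal(size=(10, 5)),
    index=time,
    columns=["c1", "c2", "c3", "t1", "t2"],
)

# Hypothetical arguments that would take the place of the formula.
control_units = ["c1", "c2", "c3"]
treated_units = ["t1", "t2"]
treatment_time = time[6]

# The dataframe is still split into pre and post treatment periods.
pre = df[df.index < treatment_time]
post = df[df.index >= treatment_time]

# No design matrix: select columns directly to get 2D (time, unit) arrays.
X_pre, y_pre = pre[control_units].to_numpy(), pre[treated_units].to_numpy()
X_post, y_post = post[control_units].to_numpy(), post[treated_units].to_numpy()
```

With a single treated unit, y_pre and y_post would simply have one column, so the same code path could cover the existing use case.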
Changes to tests
- Update all the integration tests to deal with the changed API
- Add new tests to cover the new multiple treated unit case
Changes to docs
- We'd have to update the docs to use the new API.
- We would also want to update the existing multi-cell geolift analysis docs.