Synthetic control provides a powerful approach at establishing a counterfactual for policy changes. One such example of a policy change is in Russian surrounding alcohol consumption. A recent Guardian article1 chronicled the fall of alcohol consumption in Russia based policy changes made by Mr Putin’s governments. These policy changes all occurred roughly around 2006 and included reclassifying what counted as a “foodstuff” and levying of hefty taxes on alcohol (Levintova 2007). Based on recent trends, observers say that these policies are having the desired effect. But how much of this drop in alcohol consumption is due to the policy implementation and not changes in consumer preference?2
This underlines the need for a counterfactual. We would like to know what would have been the alcohol consumption in Russia but for these policy changes. Here is where Synthetic Control (and the SCtools package) can help us answer these questions.
In order to understand the policy impact on Russian Alcohol Consumption, we will first load the SCtools, Synth, and some of the tidyverse packages.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(SCtools)
#> Loading required package: future
library(Synth)
#> ##
#> ## Synth Package: Implements Synthetic Control Methods.
#> ## See https://web.stanford.edu/~jhain/synthpage.html for additional information.
theme_set(theme_minimal())
After those are loaded, we can look at the alcohol
data
that is inside of the SCtools
package. We can build the
graphic in the Guardian using these data to see the trend in alcohol
consumption.
alcohol %>%
filter(country_name == "Russian Federation") %>%
ggplot(aes(year, consumption))+
geom_line()+
geom_vline(xintercept = 2006, color = "orange")+
labs(
title = "Per Capita Alcohol Consumption in the Russian Federation Since 1990",
subtitle = "The Russian government began a series of policy changes in 2003\nto reduce alcohol consumption",
y = "Per Capita Consumption (L/person)",
x = NULL,
caption = "Data: World Health Organization"
)
We can also examine some of the trends in our predictors.
alcohol %>%
filter(country_name == "Russian Federation") %>%
select(year, consumption,labor_force_participation_rate:manufacturing) %>%
tidyr::gather(predictor, value, -year) %>%
ggplot(aes(year, value))+
geom_line()+
facet_wrap(~predictor, scales = "free")
One feature of SCM is selecting donor states. These donor states help us to generate our synthetic control (synthetic version of Russian in the absence of the policy change). There are several key assumptions regarding choosing donor states including
In the absence of evidence of other changes in other states, we can look at the potential donor states graphically:
p1 <- alcohol %>%
mutate(my_color = ifelse(country_code == "RUS", "Russia", "Other")) %>%
ggplot(aes(year, consumption, group = country_code, color = my_color))+
geom_line() +
scale_color_manual(values = c("grey", "black", "white"))+
ylim(5,15)+
xlim(1990,2005)
p1
#> Warning: Removed 3579 rows containing missing values or values outside the scale range
#> (`geom_line()`).
When we look at the potential donors we see the usual suspects: France, United States, as well as some of the neighbouring Baltic states. This makes some intuitive sense. Baltic states share similar cultural habits. SCM is insensitive to having donors that are not neighbouring, which will allow us to include countries like Great Britain and the United States in the donor pool.3
Now that we have an idea of the donors, we can build the pool.
comparison_states <- c("USA", "UK", "UKR", "KAZ",
"GBR", "ESP", "DEU", "POL",
"FIN", "FRA", "GRC", "IRL",
"LTU", "ROU", "GEO", "MDA",
"SWE", "BEL", "BLR", "KGZ",
"CZE", "MEX", "SVN")
control_ids <- alcohol %>%
select(country_code,country_num) %>%
filter(country_code %in% comparison_states) %>%
distinct() %>%
pull(country_num)
Now we will use the dataprep
function from Synth to
format our data.
dataprep.out<-dataprep(
foo = as.data.frame(alcohol),
predictors = c("labor_force_participation_rate",
"inflation",
"mobile_cellular_subscriptions",
"manufacturing"),
predictors.op = "mean",
dependent = "consumption",
unit.variable = "country_num",
time.variable = "year",
treatment.identifier = 142,
controls.identifier = control_ids,
time.predictors.prior = c(1991:2005),
time.optimize.ssr = c(2000:2005),
special.predictors = list(
list("consumption", 2000:2005 ,"mean")),
unit.names.variable = "country_code",
time.plot = 1992:2015
)
#>
#> Missing data- treated unit; predictor: inflation ; for period: 1991
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: inflation ; for period: 1992
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1991
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1992
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1993
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1994
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1995
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1996
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1997
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1998
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 1999
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 2000
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data- treated unit; predictor: manufacturing ; for period: 2001
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data - control unit: 16 ; predictor: inflation ; for period: 1991
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
#>
#> Missing data - control unit: 16 ; predictor: inflation ; for period: 1992
#> We ignore (na.rm = TRUE) all missing values for predictors.op.
There are some missing values, but that should be ok. Now we will run
the synth
function to generate our synthetic controls.
After running the SCM, we can look at the weights from the donors and see if we see anything that is surprising.
solution <- out$solution.w %>%
as.data.frame()
solution$country_num <- rownames(solution)
solution %>%
mutate(country_num = as.numeric(country_num)) %>%
left_join(alcohol %>%
select(country_code,country_num) %>%
filter(country_code %in% comparison_states) %>%
distinct(), by = "country_num") %>%
ggplot(aes(reorder(country_code, w.weight), w.weight))+
geom_col()+
coord_flip()+
labs(
title = "Donor Weights by Country",
y = NULL,
x = "Weight"
)
Unsurprisingly, we can see that Kazakstan and Lithuania contribute to counterfactual Russia, however, interestingly Great Britain and the Chzech Republic are amongst the top donors to the counterfactual. This highlights one of the powerful features of Synthetic Control - the donors do not need to be geographically connected, Let’s continue with our analysis.
Now we can plot our real Russia vs our Synthetic Russia using the
plot.path
function from Synth. Here we see that the
synthetic control matches 2000-2005 fairly well (which is what we
specified to optimize), but there is some opportunity to more closely
match actual Russia.
We can also look at the delta between our real and synthetic Russia.
delta_out <- (dataprep.out$Y1plot - (dataprep.out$Y0plot %*% out$solution.w)) %>%
as.data.frame()
delta_out$year <- rownames(delta_out)
delta_out%>%
knitr::kable(caption = "Difference in Alcohol Consumption Between Synthetic and Actual Russia", digits = 2)
142 | year | |
---|---|---|
1992 | -3.28 | 1992 |
1993 | -1.71 | 1993 |
1994 | -1.24 | 1994 |
1995 | 1.49 | 1995 |
1996 | -0.41 | 1996 |
1997 | -0.60 | 1997 |
1998 | -0.10 | 1998 |
1999 | 0.77 | 1999 |
2000 | -0.50 | 2000 |
2001 | -0.44 | 2001 |
2002 | -0.06 | 2002 |
2003 | 0.34 | 2003 |
2004 | 0.35 | 2004 |
2005 | 0.53 | 2005 |
2006 | 1.02 | 2006 |
2007 | 1.19 | 2007 |
2008 | 1.27 | 2008 |
2009 | 1.00 | 2009 |
2010 | 0.76 | 2010 |
2011 | 0.86 | 2011 |
2012 | 0.92 | 2012 |
2013 | 0.09 | 2013 |
2014 | -1.00 | 2014 |
2015 | -1.46 | 2015 |
It appears that the policy took a while to take effect, with a decrease in alcohol consumed per capita beginning in 2014. However, it must be mentioned that this isn’t the “40%” that the Guardian reports and that one may glean from the WHO graph at first glance.
The next steps in a SCM controls analysis is to permute the data set
to understand the sensitivity of our Synthetic controls to assess if the
results observed are plausible (e.g. are the results we obtained due
from chance, or reflect the true effect). The SCtools provides the
generate.placebos
function to automate this process. This
includes the option to parallelise the operation (which greatly speeds
up processing time).
Now that we have our placebo object, we can represent it graphically
with plot_placebos
. Here we can see our donors (control
unit) and our actual treated group.
Equally important we can test the Mean Squared Prediction Error (MSPE). Additionally, because we generated the controls for the placebos, we can see how extreme our values are and thus generate a pseudo p-value to see if our results are significant. Now, it should be mentioned that this test is relatively underpowered as it looks only at the donor states.
Why look at single number summary, when you can look at a plot? With
mspe.plot
we can visualise the ratios of MSPE for each
donor. Here we see very little difference in Russia, which supports our
higher p-value.
In the case of this study, it doesn’t look like the policy had an impact. However, this brings into question the model specification. The predictors in the model did not necessarily generate the best pre-treatment MSPE. Different predictors may help us better construct a Synthetic Russia. Additionally, this model could be tuned by including some lagged predictors. We won’t go into details regarding this process, but with SCtools, you now have tools to help you with your model inference and tuning.
Available at https://www.theguardian.com/world/2019/oct/01/russian-alcohol-consumption-down-40-since-2003-who↩︎
Yes, we all know as good economists that as goods get more expensive, consumers will buy less of them. However, how much less is another question.↩︎
I would recommend using the plotly
package
which allows you to hover over the different lines and see the country
codes.↩︎