Written by Anja Kilibarda, Ph.D.
May 3, 2022 at 7:00 pm ET
- The challenge with comparing opinion data across countries
- How we have tried to mitigate differential item functioning
- Leveraging anchoring vignettes
- Outcomes and limitations
- Abandoning clunky assumptions to form our current approach
The challenge with comparing public opinion data across countries
Morning Consult conducts daily brand tracking surveys in more than 40 countries, in addition to running hundreds of custom surveys in many more. Naturally, then, we want to be able to compare public opinion data across countries. Since the questions in some tracking surveys are identical, one might think this is a straightforward task: compare the means or distributions of responses to a question among people living in one country with those among people living in another. One would be wrong. Differences in culture, norms, and institutional contexts can lead people to understand and respond to identical questions in different ways (Brady 1985).
For reasons unrelated to the survey or question at hand, survey respondents in some countries tend strongly toward agreement on Likert-scale questions, for example, while respondents in others tend toward disagreement. Similarly, intangible concepts like ‘democracy’ and ‘political efficacy’ can elicit different reactions in different countries given heterogeneous understandings of what the concepts mean (Sen 2002; King et al. 2004). The most widely used term for this interpersonal incomparability is differential item functioning (DIF). The result of DIF, of course, is measurement error that is correlated with the fielding country, and therefore biased comparative estimates.
How we have tried to mitigate differential item functioning
Half a century of scientific work has not managed to do away with DIF. More recently, researchers have tried to use anchoring vignettes to measure differences in the standards respondents apply when asked to evaluate themselves in survey questions. In this framework, survey respondents read several short vignettes, each describing a situation in which a greater or lesser degree of a concept is present. They then order the vignettes according to how much or how little of the concept they believe is present in each, and subsequently answer a question in which they place themselves on the same scale. Using vignette ratings, King et al. (2004), for example, found that Mexicans have much higher standards for what constitutes political efficacy than the Chinese do, resolving the puzzle of why Chinese respondents, who live in an autocracy, report higher levels of efficacy than Mexicans, who live in a democracy.
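The intuition behind the vignette correction can be illustrated with King et al.'s simple nonparametric recoding: a respondent's self-rating is replaced by its position relative to that same respondent's vignette ratings. A minimal sketch (the function name and toy ratings are ours, and it assumes the respondent rates the vignettes in the intended order, an assumption discussed later in this post):

```python
def rescale_self_assessment(self_rating, vignette_ratings):
    """Recode a self-rating relative to the respondent's own vignette
    ratings, in the spirit of King et al.'s nonparametric approach.

    `vignette_ratings` lists the respondent's ratings of the vignettes
    from "least" to "most" of the concept, on the same scale as the
    self-rating. For k vignettes the result lies on a 2k+1 scale:
    below vignette 1, tied with vignette 1, between 1 and 2, and so on.
    """
    position = 1
    for v in vignette_ratings:
        if self_rating < v:
            return position       # strictly below this vignette
        if self_rating == v:
            return position + 1   # tied with this vignette
        position += 2             # strictly above: move past it
    return position               # above all vignettes
```

Two respondents who give themselves the same raw score can thus land in different rescaled categories if their vignette ratings reveal different standards.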
Leveraging anchoring vignettes
Morning Consult’s Research Science department set out to replicate King et al.’s study across three countries using our own surveys, to explore whether the rescaling that anchoring vignettes offer could be leveraged to reweight the daily tracking survey we field in 40+ countries. Specifically, we wanted to know whether the vignette approach could render comparable across countries not just responses to a single question, but all questions in our daily tracking surveys in one shot.
We fielded a survey replicating King et al.’s design in the US, India and Germany and estimated between-country differences using a compound hierarchical ordered probit (CHOPIT) model, in which self-placement questions were regressed on vignette evaluations, individual-level sociodemographic characteristics and country indicators. Based on this model, we predicted each respondent’s probability of falling into each of the five self-placement categories for each question and recoded respondents into the category to which they had the highest probability of belonging.
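The recoding step amounts to taking the most probable category per respondent. A minimal sketch, with hypothetical predicted probabilities standing in for actual CHOPIT output:

```python
import numpy as np

def recode_to_modal_category(probs):
    """Assign each respondent to their most probable category.

    probs: (n_respondents, n_categories) array of predicted
    probabilities from a fitted model (rows sum to 1). Returns
    1-indexed labels, mirroring a 5-point self-placement scale.
    """
    probs = np.asarray(probs)
    return probs.argmax(axis=1) + 1

# Hypothetical predicted probabilities for three respondents
probs = [[0.10, 0.20, 0.40, 0.20, 0.10],
         [0.05, 0.50, 0.30, 0.10, 0.05],
         [0.02, 0.08, 0.15, 0.45, 0.30]]
```

Note that if no respondent's row peaks at a given category, that category ends up empty after recoding, a pitfall we ran into below.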
We then experimented with using the proportions of this recategorized variable as targets for the original, non-vignette-adjusted responses when constructing new raking weights. While our existing weighting approach rakes to demographic targets usually provided by government census or statistical offices, we hoped that raking opinion questions to their CHOPIT-rescaled versions would help us eliminate DIF from our surveys.
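Raking itself is standard iterative proportional fitting: weights are scaled so that the weighted category shares hit the targets. A one-variable sketch, with illustrative data rather than our production weighting code:

```python
import numpy as np

def rake_weights(sample, targets, max_iter=100, tol=1e-8):
    """Adjust weights so the weighted distribution of a categorical
    variable matches target proportions (iterative proportional
    fitting). `sample` holds each respondent's category; `targets`
    maps category -> target share (e.g., CHOPIT-rescaled shares).
    With several raking variables, the same update is cycled over
    each variable's margins until all weighted margins match.
    """
    sample = np.asarray(sample)
    w = np.ones(len(sample))
    for _ in range(max_iter):
        moved = 0.0
        for cat, target in targets.items():
            mask = sample == cat
            current = w[mask].sum() / w.sum()
            if current > 0:
                factor = target / current
                moved = max(moved, abs(factor - 1))
                w[mask] *= factor
        if moved < tol:
            break
    return w / w.mean()  # normalize to mean weight 1
```

The failure mode described below follows directly from this mechanic: a target share for a category that is empty (or nearly empty) after recoding forces extreme or undefined weight adjustments.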
Outcomes and limitations
Unfortunately, this approach failed for a variety of reasons. To begin with, it was not clear, nor readily testable, how many anchoring vignettes would be necessary to correct an entire survey, and we quickly observed violations of the assumption that respondents perceive the vignettes as ordered.
We also found that the CHOPIT model tended to ‘overcorrect’: some categories were not the most probable for any respondent and thus remained empty. The reweighted data were also generally sensitive to categorization decisions when respondents had nearly equal probabilities of belonging to two categories.
Finally, we noticed that even once we resolved these issues to the extent possible, raking to the CHOPIT-derived targets changed the proportions of other variables in unexpected ways. Ultimately, the assumptions required to use anchoring vignettes this way did not appear any more defensible than the assumptions underpinning comparisons of basic unadjusted responses.
Abandoning clunky assumptions to form our current approach
We thus pivoted to the data and to largely assumption-free, tried-and-true methods we have at hand. Relying on theories of survey “response styles” (the idea that certain people tend to respond to survey questions in certain patterns regardless of question content), we first examined how people in all 40+ daily tracking countries respond to favorability questions about multinational brands that are asked everywhere.
We found consistent country-based response-style patterns: people in Latin American countries, for instance, tended to select the positive ends of scales much more often than people in Northern European countries, who had more pessimistic views regardless of the subject at hand (see, for example, Figure 1).
Figure 1: Median Brand Favorability by Country
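The country-level pattern behind Figure 1 is a simple pooled median per country. A minimal sketch with hypothetical ratings on an illustrative 1-5 scale:

```python
import statistics
from collections import defaultdict

def median_favorability_by_country(ratings):
    """Median brand-favorability rating per country, pooled across
    the multinational brands asked about in every country.
    `ratings` is a list of (country, rating) pairs.
    """
    by_country = defaultdict(list)
    for country, rating in ratings:
        by_country[country].append(rating)
    return {c: statistics.median(v) for c, v in by_country.items()}

# Hypothetical pooled ratings
data = [("Brazil", 5), ("Brazil", 4), ("Brazil", 5),
        ("Sweden", 3), ("Sweden", 2), ("Sweden", 3)]
```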
Relying on both brand and political/economic opinion data, we also mapped out differences in the propensity to choose ‘Don’t know/No opinion’ options across countries. Additionally, we evaluated how people in different countries rank KPIs, to track whether certain KPIs (e.g. favorability, trust) are consistently ranked high or low in certain countries compared to others. We wrapped the results of all these analyses into a series of country profiles, one for each of the 40+ daily tracking countries. Reporting at Morning Consult now incorporates this country response-pattern information into any analysis that compares public opinion across two or more countries. When differences are observed between countries, we have a characterization of how DIF may be driving them.
Figure 2: ‘No Opinion’ Responses by Country
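The ‘Don’t know/No opinion’ propensity is likewise a per-country share of pooled responses. A minimal sketch with hypothetical data (the answer string is illustrative):

```python
from collections import defaultdict

def dk_rate_by_country(responses):
    """Share of 'Don't know / No opinion' answers per country.

    `responses` is a list of (country, answer) pairs pooled across
    questions and surveys.
    """
    totals, dks = defaultdict(int), defaultdict(int)
    for country, answer in responses:
        totals[country] += 1
        if answer == "Don't know/No opinion":
            dks[country] += 1
    return {c: dks[c] / totals[c] for c in totals}

# Hypothetical pooled responses
data = [("Japan", "Don't know/No opinion"), ("Japan", "Favorable"),
        ("Spain", "Favorable"), ("Spain", "Unfavorable")]
```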
How the scale and speed of our tracking surveys differentiate our approach to country comparison
Our approach also exploits the fact that we collect data in so many countries on a daily basis. Analyses focus on trends over time and compare how opinions change across countries, rather than how they might differ at a given point in time. While DIF might bias comparisons at a given point in time, it is largely stable over time, so we can evaluate the size of changes in opinions period over period in one country versus another.
For instance, we might observe a 25% change on a given metric in Spain but only a 2% change in Japan, allowing us to put our insights in relative terms. We also focus on relative volatility within and across countries over time, and contextualize analyses by considering a given data point within its historical context. For example, while we can’t really compare leader evaluations in a static way across countries, not least because countries have different leaders, we can compare evaluations today within each country to the range of that country’s evaluations over the last year.
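This within-country contextualization can be sketched as a position within the country's own historical range, so that comparisons across countries use each country's own scale rather than raw levels. The function name and numbers are illustrative:

```python
def position_in_historical_range(current, history):
    """Where today's value sits within a country's own historical
    range: 0 = the period low, 1 = the period high. Comparing these
    within-country positions across countries sidesteps level DIF,
    which is assumed to be roughly stable over time.
    """
    lo, hi = min(history), max(history)
    if hi == lo:
        return 0.5  # flat series: no range to position within
    return (current - lo) / (hi - lo)
```

A leader evaluation of 75 in a country whose past year ranged from 50 to 100 sits at 0.5, directly comparable to the same statistic computed for any other country.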
Ultimately, rendering meaningfully comparable insights from multi-country data involves rigorously tackling the issue on both fronts; that is, writing better questions and applying appropriate statistical adjustments. Morning Consult’s approach does this by:
- Ensuring questions and answer scales are as concrete, standardized and contextually localized as possible
- Applying insights from comprehensive analyses of response styles across countries, while also leveraging the fact that we collect data daily to focus on relative changes rather than absolute levels
Brady, Henry E. 1985. “The Perils of Survey Research: Inter-Personally Incomparable Responses.” Political Methodology 11 (June): 269–90.
King, Gary, Christopher J. L. Murray, Joshua A. Salomon, and Ajay Tandon. 2004. “Enhancing the Validity and Cross-Cultural Comparability of Measurement in Survey Research.” American Political Science Review 98 (February): 191–207.
Sen, Amartya. 2002. “Health: Perception versus Observation.” British Medical Journal 324 (April): 860–61.