Thursday, November 7, 2013

The Subtle Joys of Selecting on the Dependent Variable

Academic research in the social sciences has a variety of aims, but much of it seeks to explain or elucidate a phenomenon or condition and the relationships within it. In research parlance, this phenomenon or condition is the dependent variable. One should not select cases based on their value on the dependent variable; doing so is called selecting on the dependent variable, a form of selection bias, and it can lead to incorrect conclusions.

To wit, here is an example of selection bias from my former field of study, political science.
"Analysts trying to explain why some developing countries have grown so much more rapidly than others regularly select a few successful new industrializing countries (NICs) for study, most often Taiwan, South Korea, Singapore, Brazil, and Mexico. In all these countries, during the periods of most rapid growth, governments exerted extensive controls over labor and prevented most expressions of worker discontent. Having noted this similarity, analysts argue that the repression, cooptation, discipline, or quiescence of labor contributes to high growth." (Geddes, 134)
If one were to make policy recommendations based on this research, one might advocate that developing countries repress labor unions to achieve economic growth, the dependent variable.

Reaction Gifs, as always. And Clueless. 
As it turns out, Alicia Silverstone is right to be skeptical about this claim.
"In order to establish the plausibility of the claim that labor repression contributes to development, it is necessary to select a sample of cases without reference to their position on the dependent variable, rate each on its level of labor repression, and show that, on average, countries with higher levels of repression grow faster.
"The two tasks crucial to testing any hypothesis are to identify the universe of cases to which the hypothesis should apply, and to find or develop measures of the variables. A sample of cases to examine then needs to be selected from the universe in such a way as to insure that the criteria for selecting cases are uncorrelated with the placement of cases on the dependent variable." (Geddes, 134-5)
A random sample from a given universe is one way to meet this requirement when testing a hypothesis or a relationship. Selecting on the dependent variable is not random, so research findings based on it may be biased.
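To make Geddes's point concrete, here is a minimal Python sketch. It is a simulation of my own, not her analysis: the countries are made up, and the 80 percent repression rate, the growth distribution, and every function name are assumptions chosen for illustration. Growth is drawn independently of labor repression, so any link that shows up among the fastest growers is an artifact of how they were picked.

```python
import random
import statistics


def simulate_universe(n=1000, repression_rate=0.8, seed=0):
    """Build a hypothetical universe of countries in which growth is,
    by construction, independent of labor repression."""
    rng = random.Random(seed)
    countries = []
    for _ in range(n):
        represses = rng.random() < repression_rate  # assumed prevalence
        growth = rng.gauss(3.0, 2.0)  # drawn without reference to repression
        countries.append({"represses": represses, "growth": growth})
    return countries


def mean_growth_gap(countries):
    """Mean growth of repressive countries minus mean growth of the rest."""
    repressive = [c["growth"] for c in countries if c["represses"]]
    others = [c["growth"] for c in countries if not c["represses"]]
    return statistics.mean(repressive) - statistics.mean(others)


def select_on_dependent_variable(countries, k=5):
    """Keep only the k fastest growers, mimicking a study of success stories."""
    return sorted(countries, key=lambda c: c["growth"], reverse=True)[:k]


universe = simulate_universe()
stars = select_on_dependent_variable(universe)

# What the case-study analyst sees: how many of the star performers repress labor.
share = sum(c["represses"] for c in stars) / len(stars)
print(f"Share of top {len(stars)} growers that repress labor: {share:.0%}")

# What the full universe shows: the growth gap between the two groups.
print(f"Growth gap across all countries: {mean_growth_gap(universe):+.2f} points")
```

Because most of the simulated countries repress labor, most of the star performers are likely to as well, even though the gap across the whole universe should sit close to zero. Only the second comparison tests the hypothesis.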

However, there is a flip-side to selecting on the dependent variable: the results are often not only relevant, but highly entertaining.

To wit, James Scott's Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed is, in my mind, a towering achievement and an immensely absorbing piece of research. Of course, he selects on the schemes that have failed.

And that brings us to library and information science.

Stanford University's Jacqueline Hettel and Chris Bourg are conducting research on "assessing library impact by text mining acknowledgements" from Google Books (Source). It is an impressive and creative way to measure how libraries can positively affect scholars. At present it is in the "proof of concept" stage, so it is still early. Information and early data on the project are available at the following links.

These scholars appear to have a robustly defined and measured dependent variable: acknowledgements that thank libraries and librarians for their help with research. The acknowledgements are evidence of the impact of libraries, but they do not reveal what caused them, and as a fellow librarian, I am after the causes. Those causes could lead to a new metric of academic library success in scholarly communication. As of now, this work appears to be called "Measuring Thanks," a title that may hint at possible selection bias. I look forward to hearing more about the project, and I hope that they have not selected on the dependent variable by focusing on it at this early stage. As noted above, drawing a random sample of books, and examining the acknowledgements therein, is one way to avoid this bias.

Academic researchers are not supposed to select on the dependent variable, but doing so can lead to interesting and entertaining finds. More research that satisfies these latter conditions, please.
