Correlation, Causation, and Confounds

Last week, Inside Higher Ed posted a very short article about a Purdue University study that investigated the relationship between use of the campus gym and student GPA. The study found that students who visited the gym roughly once a week throughout the semester had higher GPAs, and the gym is now seeking to determine “whether there is a cause and effect.”

As Meredith Farkas pointed out on Twitter, this kind of study is not alien to librarians, who are likewise collecting user data in an effort to demonstrate (or prove) their value. The “Library Cube” studies were some of the first that I read (there are several others, but this one is open access), and they’re part of a growing trend in the literature that aims to link some aspect of “library use” with some aspect of “student success” (like this one, this one, or this one).

While I appreciate the sentiment behind these kinds of studies, and think campus departments should attempt to evaluate if, and how, they help students, studies looking for correlates between the use of a library service (research consultations, instruction sessions) and some other metric (GPA, retention, graduation rate) seem to suffer from an alarming degree of omitted-variable bias. By that I mean that there are dozens upon dozens of confounding variables that can and do have an effect on student success, and simply comparing something like how often a student visits the library with their GPA is akin to viewing the world through a keyhole: you get a picture, but it’s certainly not the whole picture.

For instance, it could be that just hanging out in a library (or a gym) all day will get someone better grades by the sheer fact of being there… but that’s probably not the case. Perhaps those students with the luxury of spare time to spend in the library aren’t working a full-time job, or aren’t taking care of children or siblings or parents. Perhaps the library didn’t magically give them better grades; rather, they are in a privileged circumstance which gives them the time to study and, in turn, succeed in class. It’s also entirely conceivable that our libraries, like many of our systems of higher education, have been created and maintained in an image that is off-putting to a significant number of our students. These students don’t use the library because they don’t feel welcome there; that feeling of exclusion extends to their time in class as well, dragging down their grades until they leave school. I know it feels good to say “people who use the library are more likely to succeed!” but that’s a disingenuous form of evaluation, and it implies a causation that isn’t there.
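
To make that keyhole problem concrete, here’s a minimal simulation (my own illustration, with made-up numbers, not drawn from any of the studies above) in which spare time drives both library visits and GPA, and visiting the library has zero causal effect on grades:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical confound: hours of discretionary time per week.
spare_time = rng.normal(20, 5, n)

# Library visits depend on spare time, not on anything that affects grades.
library_visits = np.clip(0.4 * spare_time + rng.normal(0, 2, n), 0, None)

# GPA also depends on spare time (more time to study), not on visits.
gpa = np.clip(2.0 + 0.05 * spare_time + rng.normal(0, 0.3, n), 0.0, 4.0)

# The raw correlation looks impressive even though visits have
# zero causal effect on GPA.
print(np.corrcoef(library_visits, gpa)[0, 1])  # clearly positive, ~0.45

def residualize(y, x):
    """Remove the linear effect of x from y."""
    return y - np.polyval(np.polyfit(x, y, 1), x)

# Once the confound is partialled out, the "effect" all but vanishes.
r_visits = residualize(library_visits, spare_time)
r_gpa = residualize(gpa, spare_time)
print(np.corrcoef(r_visits, r_gpa)[0, 1])  # ~0
```

The point isn’t that spare time is the confound; it’s that any unmeasured variable with this structure can manufacture a correlation out of nothing.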

My main problem with the current state of affairs in library assessment isn’t that researchers don’t always control for these variables (more on that in a moment); it’s that, more often than not, we don’t even acknowledge them. Running a search in LISTA for phrases like “confounding variables” or “selection bias” turns up articles primarily from psychology or computer science that happen to mention information-seeking behavior, but I’ve been hard-pressed to find many studies of library usage that openly acknowledge how complicated, and how limited, social research can be. If we’re going to engage in these practices, and really evaluate our users and how the library influences them, it means integrating all of social research methodology, not just the parts we like. Ignoring the confounds and simplifying our methods will still yield quantitative results, and maybe even show a correlation between library use and student success, but I’m dubious of how valuable those results will be, and I strongly oppose the notion that one “causes” the other.

So what’s the solution? How do we identify and control for the myriad variables present whenever we evaluate human behavior? The answer of the day seems to be “more data.” Instead of just looking at library use and GPA, let’s find out their major, year in school, age, and sex. Now we can make some crosstabs… but you know what? We left out marital status, household income, and race/ethnicity, so let’s add those. And I know we’ve been saying “library use,” but let’s look at how many books they checked out. Well, not fiction; let’s just look at how many non-fiction books related to their major they checked out. So I guess we should just pry open their circulation records…
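
For the record, here’s roughly what that “just add more variables” model looks like in code (a sketch on entirely synthetic data; every column name here is hypothetical). Notice how every new control term is another piece of personal information we have to collect and keep:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000

# Entirely synthetic stand-in for the kind of record this approach demands.
df = pd.DataFrame({
    "gpa": rng.uniform(1.5, 4.0, n),
    "library_visits": rng.poisson(8, n),
    "major": rng.choice(["STEM", "humanities", "business"], n),
    "year": rng.choice([1, 2, 3, 4], n),
    "age": rng.integers(18, 40, n),
    "sex": rng.choice(["F", "M"], n),
    "marital_status": rng.choice(["single", "married"], n),
    "household_income": rng.normal(50_000, 15_000, n),
    "race_ethnicity": rng.choice(["A", "B", "C", "D"], n),
    "nonfiction_checkouts": rng.poisson(3, n),  # pried from circ records
})

# Each term "controls for" one more confound; the list never actually ends.
model = smf.ols(
    "gpa ~ library_visits + C(major) + C(year) + age + C(sex)"
    " + C(marital_status) + household_income + C(race_ethnicity)"
    " + nonfiction_checkouts",
    data=df,
).fit()
print(model.params["library_visits"])
```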

Uncomfortable yet?

I don’t want to wade too far into the “big data” debate, but let me just say that I understand the benevolent reasoning behind collecting large amounts of user data while remaining deeply opposed to it. Perhaps mining circ records and database usage and class attendance and retention rates would yield some great understanding of how libraries contribute to student success. My suspicion, however, is that no matter how invasive our data collection becomes, there will always be underlying confounds calling our findings into question, and I’m not comfortable with betraying my professional ethics (n.b. number 3) in an attempt to prove value.

Ultimately, I’m really happy that libraries have become so involved in assessment; now I’m ready for us to be critical about how we do it. Personally, I do think that positivism has its place, and I understand the benefits of quantitative research, but I fundamentally reject the assertion that qualitative research methods are somehow inferior, or that theory can’t guide our practice just as well as an IRB-approved research study. (I say this as someone in the middle of an IRB-approved research study.) If our profession is to move forward, we need to be comfortable with and welcoming of a variety of methodologies, and we should strive not to be forced into one mode of social research because we think data are somehow “more true.” I know enough about data to know that they’re not a panacea, and sometimes there’s just no substitute for a good critical theory.