Worshiping False Metrics (not what you think)

I’ve been ploughing through research data lately in several different formats, and I’m forcing myself to take great care with certain types of data, most notably grade data, particularly comparisons between cohorts. It’s one of the key differences between carrying out pedagogical research and chemistry research. If I run the same sample (prepared to requirements each time) on the same NMR instrument on three different days, I expect the same result, within certain tolerances. If I carry out a teaching activity and then evaluate it with three different cohorts, I have to consider what ‘the same result’ even means. I have to consider factors outside of my control that affect an experiment with live humans. These may include (but are not limited to):

  • entry grades, prior subject knowledge
  • timing of the teaching session and of other activities around it (it can be as simple as the difference between teaching at 11 am and teaching at 4 pm, or on a Tuesday rather than a Friday)
  • Hawthorne effect (behaviour changing simply because participants know they are part of something being studied)
  • nature of the cohort – repeat year students, different degree route students, different year group students

This post is largely to help me clarify my thinking on the matter and moot a few scenarios. If it helps anyone else, that’s a bonus.

Grades of three fictitious cohorts of students.

So let’s consider some fictitious data. Cohort 1 (blue, average 63%, range: 42 – 79), cohort 2 (red, average 58%, range: 40 – 73) and cohort 3 (green, average 62%, range: 30 – 80). The first cohort has quite a nice profile of marks: a decent range that covers a good bit of the marking spectrum, and no fails or capped work (marks of 40% for late submission or capped after reassessment). The teaching activity that produces these grades seems to have worked in its first year. Cohort 2 causes some issues, however. There’s a bump in the distribution that corresponds to a large number of fails or capped pieces of work: 8 marks of 40%. The top end of the range is also a bit lower. The average isn’t too bad, however. Cohort 3 has a broader range, including a fail, and an average comparable to cohort 1. How can these data be interpreted meaningfully?
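To keep myself honest about what those summary numbers actually say, here is a minimal sketch in Python of the kind of summary I mean. The mark list in it is invented purely for illustration (it is not the data behind the fictitious cohorts above); the 40% threshold follows the definition of capped work given above.

```python
# Minimal sketch: summarising a mark list by average, range, and number of
# marks sitting at the cap. The marks below are invented for illustration only.

def summarise(marks, cap=40):
    """Return the average, range, and count of marks equal to the cap."""
    average = sum(marks) / len(marks)
    at_cap = sum(1 for m in marks if m == cap)
    return {
        "average": round(average, 1),
        "range": (min(marks), max(marks)),
        "at_cap_or_fail": at_cap,
    }

# Hypothetical example: a handful of marks, two of them capped at 40%.
example_cohort = [40, 40, 52, 58, 61, 65, 70, 73]
print(summarise(example_cohort))
# -> {'average': 57.4, 'range': (40, 73), 'at_cap_or_fail': 2}
```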

How far should we go in attributing such variation to circumstances beyond the control of our teaching session?

I like to see a few years’ worth of data of this sort. Firstly, it reassures me that Hawthorne effects have dissipated and that the activity in question is decently embedded into practice. Longevity of teaching innovations is one of my personal bugbears – I see lots of great ideas, but limited evidence of how many endure beyond the initial enthusiasm and effort. So I like to see the average marks or evaluation scores as these things reach a steady state. I would consider average marks varying within about 5% to be reasonable within a data set, so in that context my fictitious cohort 2 doesn’t bother me that much.
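For what that rule of thumb looks like in practice, here is a rough sketch using the averages quoted above. Treating cohort 1 as the baseline is my assumption for the example, not anything prescribed.

```python
# Rough sketch of the "within about 5%" rule of thumb, using the cohort
# averages quoted above. Baseline choice (cohort 1) is an assumption.

TOLERANCE = 5  # percentage points

yearly_averages = {"Cohort 1": 63, "Cohort 2": 58, "Cohort 3": 62}
baseline = yearly_averages["Cohort 1"]

for cohort, avg in yearly_averages.items():
    drift = avg - baseline
    verdict = "within tolerance" if abs(drift) <= TOLERANCE else "worth a closer look"
    print(f"{cohort}: {avg}% ({drift:+d} points vs baseline) -> {verdict}")
```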

I need reassurance that the circumstances of the three cohorts were roughly similar in terms of what was going on elsewhere, and that the cohorts started out reasonably similar. What happens if cohort 2 was a year in which a great number of students were taken through clearing and a specific entry requirement had been lowered? For example, if we were looking at test scores from a maths module, what if some maths requirement was waived during the admissions process for cohort 2 to get more rears on chairs? Or what if a new cohort joined the module with different prior experience? How do we account for that? A simple way would be to discount the ‘different’ students, but without getting into a discussion of statistical validity that is difficult and beyond me, that may not be entirely appropriate. I’ll just note that if, in cohort 2, we exclude the 40% marks due to caps and reassessment, the average is 61% and the range 40 – 73. What if these were laboratory report marks and the experiment for cohort 2 was different to cohort 1, but we went back to the original for cohort 3? And what if the person teaching the session, or the test format, changed between the three cohorts? There are lots of structural explanations for variations like this, and they should be accounted for.
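On the exclusion point, here is a minimal sketch, again with invented marks rather than the fictitious cohort 2 data. It assumes each mark carries a flag saying whether it was capped, because a genuine 40% and a capped 40% are indistinguishable as bare numbers.

```python
# Minimal sketch of excluding capped/reassessed marks before recomputing the
# summary. The (mark, capped) pairs are invented for illustration; in real data
# the "capped" flag would have to come from the records system.

marks = [(40, True), (40, True), (40, False), (55, False),
         (61, False), (66, False), (73, False)]

all_marks = [m for m, _ in marks]
uncapped = [m for m, capped in marks if not capped]

print("All marks:      avg", round(sum(all_marks) / len(all_marks), 1),
      "range", (min(all_marks), max(all_marks)))
print("Excluding caps: avg", round(sum(uncapped) / len(uncapped), 1),
      "range", (min(uncapped), max(uncapped)))
# Note: the overall range can be unchanged after the exclusion if one genuine,
# uncapped 40% remains in the cohort.
```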

I guess what I’m circling around here is the idea that quoting some data in isolation is disingenuous at best and outright deceptive at worst. It’s even worse when comparing the impact of a new intervention, a newly designed teaching session for example. It’s essential to be aware that your session may not be the only thing that has changed, and it’s wrong to spin data to present an intervention in a better-than-realistic light. Should we take it as read that other stuff has changed when viewing these types of data? How much change is acceptable before it has to be noted? And should we query it when we see it? What if the person presenting the data hasn’t considered this?

Comments please!