Sampling bias: identifying and avoiding bias in data collection

Bias in evaluation is inevitable. Reflection helps us recognize our biases, and when we do, we need to identify their sources in our processes, eliminate the bias we can, and acknowledge the bias we cannot.

For simplicity, I will focus on the data collection process and the associated sampling bias: ‘bias that results from certain outcomes or responses being favoured over others.’ Sampling bias arises when groups of individuals are over- or under-represented during data collection, which leads to weak or incorrect conclusions being drawn from the data. Incomplete data lead to incomplete results.


Identifying sampling biases

Sampling bias encompasses any biases that originate during the data collection process. Here we’ll focus on a few common sources of bias in data collection.

1. Undercoverage

Undercoverage occurs when specific groups of individuals are underrepresented in our data. Often minority or marginalized groups are absent from the collected data, or they are so poorly represented that their responses are diluted. This leads to weak conclusions about specific populations, as underrepresented groups may provide insights not captured by the majority.

Key points:

  • A segment of the population of interest is not represented (or is underrepresented)

  • People captured in the data differ systematically from those not captured (e.g., demographic, social, or cultural differences)

Example: You want to understand how newcomers to Canada navigate the Canadian healthcare system by conducting English-language interviews. However, this will result in undercoverage as newcomers to Canada may be less than confident in their English fluency or may not speak English at all. The resulting data will skew towards English-speaking newcomers only and will not properly represent all newcomers’ thoughts and opinions on navigating the Canadian healthcare system. Conclusions drawn from these results may not generalize to or represent the underrepresented population, thereby reinforcing barriers and continuing to marginalize these groups.
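
To make the effect concrete, here is a minimal simulation sketch in Python. All of the numbers below (group sizes, difficulty rates, sample sizes) are invented purely for illustration, not drawn from any real study.

```python
import random

random.seed(42)

# Hypothetical population of 10,000 newcomers: 60% are comfortable doing an
# English-language interview, 40% are not. The rates of reporting difficulty
# navigating the healthcare system are invented for illustration only.
population = (
    [{"english": True, "difficulty": random.random() < 0.35} for _ in range(6_000)]
    + [{"english": False, "difficulty": random.random() < 0.70} for _ in range(4_000)]
)

def pct_reporting_difficulty(people):
    """Percentage of a group reporting difficulty navigating the system."""
    return 100 * sum(p["difficulty"] for p in people) / len(people)

# Compare the whole population with an English-only sample (undercoverage).
random_sample = random.sample(population, 500)
english_only = random.sample([p for p in population if p["english"]], 500)

print(f"Whole population:    {pct_reporting_difficulty(population):.1f}% report difficulty")
print(f"Random sample:       {pct_reporting_difficulty(random_sample):.1f}%")
print(f"English-only sample: {pct_reporting_difficulty(english_only):.1f}%")
```

The English-only sample understates the difficulty experienced across all newcomers, which is exactly the skew described above.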

2. Non-response bias

Inevitably, some individuals will not respond to a survey or will back out of a scheduled interview at the last minute. Like undercoverage, non-response bias may leave some groups underrepresented, but the difference is that non-respondents are given the opportunity to respond and do not take it (whether they forget or refuse), whereas in undercoverage specific groups are missed and never given that opportunity. If non-respondents differ from respondents in some meaningful way, the resulting conclusions will be inaccurate.

Key points:

  • Some people may forget to respond or be unwilling to respond to a survey (or participate in an interview)

  • Survey fatigue may result in increasing non-responses to later survey questions

  • Technology is convenient but has limitations (e.g., some people do not have access to an email account or, if they do, emails are too often redirected to the spam folder)

Example: You are evaluating patient satisfaction using a survey that is emailed to patients after visits with a family doctor, to be completed at their convenience. Guaranteed, there will be several non-responses as surveys are lost (email addresses are incorrect; emails are sent to spam) or patients forget to respond. Non-respondents may also represent a specific group, such as single parents with limited time to complete the survey or elderly people who are less likely to check email. Again, conclusions drawn from these results would not necessarily generalize to the population of interest, because non-respondents may differ in some meaningful way from respondents.
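
Here is a similar rough sketch for non-response. The response rates and satisfaction scores are invented assumptions (older patients assumed less likely to respond by email), not real data.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical patient population with satisfaction scored 1-5. Assume older
# patients are somewhat less satisfied and far less likely to respond to an
# emailed survey; every number here is invented for illustration.
patients = []
for _ in range(5_000):
    older = random.random() < 0.30
    satisfaction = random.choice([2, 3, 3, 4] if older else [3, 4, 4, 5])
    responds = random.random() < (0.15 if older else 0.55)
    patients.append({"older": older, "satisfaction": satisfaction, "responds": responds})

respondents = [p for p in patients if p["responds"]]

print(f"All patients: mean satisfaction {mean(p['satisfaction'] for p in patients):.2f}")
print(f"Respondents:  mean satisfaction {mean(p['satisfaction'] for p in respondents):.2f} "
      f"({len(respondents)} of {len(patients)} responded)")
```

Because the group that responds less often also differs on the outcome, the respondent-only mean drifts away from the true population mean.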

3. Voluntary response bias

Voluntary response bias occurs when individuals self-select to participate in a survey or interview. These individuals usually have a strong interest in the survey or interview topic, either positive or negative. The result is a biased sample that is crucially missing people with middling opinions.

Key points:

  • Participants self-select to complete a survey or interview

  • Strong opinions, positive or negative, are disproportionately represented in the data

Example: You want feedback on ways to improve your organization, so you provide an anonymous survey link to all employees. Most responses will come from people who are completely satisfied or completely dissatisfied with the organization. Responses will skew to the extremes and valuable information in the middle will be lost, which could result in changes to the program or organization that are unwarranted if most staff actually feel neutral.
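
As a rough illustration of that skew, here is a small sketch in which the opinion distribution and the assumption that extreme views are more likely to volunteer are both made up:

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical staff opinions on a 1-5 scale, mostly neutral. Assume people
# with extreme views (1 or 5) are far more likely to volunteer a response to
# an open, anonymous survey link; all weights are invented for illustration.
staff = [random.choices([1, 2, 3, 4, 5], weights=[5, 15, 45, 25, 10])[0]
         for _ in range(2_000)]
volunteers = [score for score in staff
              if random.random() < (0.60 if score in (1, 5) else 0.10)]

def share_by_score(scores):
    """Percentage of responses at each score, 1 (dissatisfied) to 5 (satisfied)."""
    counts = Counter(scores)
    return {score: round(100 * counts[score] / len(scores)) for score in range(1, 6)}

print("All staff (%):          ", share_by_score(staff))
print("Voluntary responses (%):", share_by_score(volunteers))
```

The voluntary responses over-represent scores of 1 and 5 and under-represent the neutral middle, even though that is where most staff sit.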

4. Survivorship bias

Survivorship bias results when we consider only the surviving (i.e., existing) observations and fail to consider all other potential observations. It can arise from focusing only on the successes of an evaluation while ignoring the failures or roadblocks that built the foundation for that success. That is, we may be prone to focus on people who start and finish a program but neglect information from those who did not start or did not complete it. Results would therefore reflect the thoughts and opinions of those who completed the program and potentially miss critical information on why others did not.

Key points:

  • Reliance on existing data, “the survivors”, at the expense of other data (e.g., historical data, attrition data, etc.)

  • Generally, overemphasizes positive outcomes over negative or null outcomes

Example: You could administer a survey to participants of a physical fitness and healthy eating program to gauge their engagement and satisfaction with the program. You may find that these participants are engaged and satisfied, but the data fail to capture the participants who started but did not complete the program. There may be underlying reasons why they did not finish, which could provide insight into how to improve the program to reach more people in the future.
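
A minimal sketch of that gap, under the invented assumption that less satisfied participants are more likely to drop out before the end-of-program survey:

```python
import random
from statistics import mean

random.seed(3)

# Hypothetical cohort for the fitness and healthy-eating program. Assume
# participants who are less satisfied are more likely to drop out; the
# satisfaction scores and completion probabilities are invented.
cohort = []
for _ in range(1_000):
    satisfaction = random.randint(1, 10)  # 1 = very dissatisfied, 10 = very satisfied
    completed = random.random() < (0.90 if satisfaction >= 6 else 0.30)
    cohort.append({"satisfaction": satisfaction, "completed": completed})

completers = [p for p in cohort if p["completed"]]
dropouts = len(cohort) - len(completers)

print(f"All enrollees:   mean satisfaction {mean(p['satisfaction'] for p in cohort):.1f}/10")
print(f"Completers only: mean satisfaction {mean(p['satisfaction'] for p in completers):.1f}/10 "
      f"({dropouts} dropouts never surveyed)")
```

Surveying only the "survivors" paints a rosier picture than the full cohort would.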

5. Recall bias

Our recollection of past experiences diminishes over time, which introduces inaccuracies when we are asked to recall them. The longer the gap between an experience and when an individual is asked about it, the more likely those recollections are to be inaccurate and influenced by current events and experiences.

Key points:

  • Everyone forgets details of past experiences over time

  • The longer the gap between an experience and its recollection, the greater the likelihood of misremembering specific details

Example: You wrap up a mental health initiative for seniors with mild depression and conduct follow-up interviews with the participants. Immediately after the program concludes, participants provide detailed accounts of their experiences. However, six months down the road, you interview these same participants again. They may have vague impressions of the program but are no longer able to provide concrete details of their experience.


Avoiding sampling biases

Sampling biases will always be present during data collection and, oftentimes, multiple biases will be at play. The goal is to eliminate the bias we can and reduce the rest. Here are a few steps to reduce bias in your next evaluation:

  • If in doubt, try to sample more people than you think you need. For minimum estimates, consider using a sample size calculator (a minimal sketch of this calculation, and of drawing a random sample, follows this list).

  • Use random sampling protocols. Random samples improve the likelihood of capturing a sample representative of your population of interest.

    • You may also consider maximum variation sampling to avoid sampling the same homogenous groups and better capture underrepresented or marginalized groups.

  • Keep your surveys and interviews concise and accessible. People are more likely to participate if the barrier for participation is low.

  • Follow up with non-respondents to increase the robustness of your data. If you are still having trouble getting responses, consider providing incentives for participation.
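
To illustrate the first two suggestions, here is a minimal Python sketch: a standard minimum-sample-size calculation (Cochran's formula with a finite-population correction, assuming 95% confidence and a 5% margin of error) followed by a simple random draw from a sampling frame. The frame and population size are placeholders, not recommendations.

```python
import math
import random

def minimum_sample_size(population_size, margin_of_error=0.05,
                        z=1.96, proportion=0.5):
    """Cochran's formula with a finite-population correction.

    z = 1.96 corresponds to 95% confidence; proportion = 0.5 is the most
    conservative assumption about how varied responses will be.
    """
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))

# Placeholder sampling frame: an ID for everyone in the population of interest.
frame = [f"participant_{i:04d}" for i in range(2_500)]

n = minimum_sample_size(len(frame))
sample = random.sample(frame, n)  # simple random sample, without replacement

print(f"Population: {len(frame)}, minimum sample size: {n}")
print("First five sampled IDs:", sample[:5])
```

Keep in mind that the calculation tells you how many completed responses you need; with non-response, you will usually have to invite more people than this to end up with that many.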