The question of the minimum number of participants you need in a usability study comes up regularly, and there are plenty of resources that answer it (e.g. https://measuringu.com). What these sites do not mention, however, is that many of the problems you observe are likely to be trivial issues (affecting a low proportion of real users) rather than the important issues (affecting a high proportion of users) that you are actually trying to detect.
Let us imagine a study in which you want to be very sure to identify any problem that might affect 20% or more of the general user population (I am going to call that a high-impact problem). If you use a sample size calculator and specify a 95% probability of identifying high-impact problems, it will tell you to use a sample size of 14 participants. The green line in Figure 1 shows how the probability of observing a particular high-impact problem grows from 20% if we use only one participant, to 96% if we use 14 participants (all these graphs are based on the binomial probability function).
Figure 1. Probability of observing a given user interface issue at least once.
The black dashed line in Figure 1 shows the probability of identifying a low-impact problem affecting only 5% of the general user population. As we can read from Figure 1, there is more than a 50% chance of observing this low-impact problem in our 14-participant study.
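For readers who want to reproduce these numbers, here is a minimal Python sketch of the Figure 1 calculation. It assumes the binomial model described above (each participant independently hits a problem affecting a proportion p of users); the helper name is my own.

```python
def p_at_least_once(p, n):
    """Probability that at least one of n participants hits a
    problem affecting a proportion p of the user population."""
    return 1 - (1 - p) ** n

# Smallest sample size giving a 95% chance of seeing a 20%-impact problem:
n = 1
while p_at_least_once(0.20, n) < 0.95:
    n += 1
print(n)                          # 14
print(p_at_least_once(0.20, 14))  # ~0.96, the green line at n = 14
print(p_at_least_once(0.05, 14))  # ~0.51, the dashed line (low-impact)
```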
In a real-world environment we probably want to address the high-impact problems first and postpone the low-impact problems until later. The difficulty is that if you observe both of these hypothetical problems in your study, there is no objective way to know which one is high-impact and which is low-impact. The expert opinion of the usability specialist may not be sufficient in an environment that stresses ‘evidence-based decision making.’ Furthermore, in the real world there are usually many more low-impact problems than high-impact ones, so the low-impact problems you observe will typically outnumber the high-impact ones. You run the risk of “fixing” a whole lot of non-problems, or, if you pick and choose from the problems observed, of neglecting important ones.
There is an answer, fortunately. Let’s change the rules: we won’t pay any attention to an issue unless at least TWO participants demonstrate it. Figure 2 shows the result. Now with 14 participants there is still an 80% probability of identifying the high-impact problem, but the probability of flagging the low-impact problem drops to 15%.
Figure 2. Probability of observing specific usability problems at least TWICE.
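The two-observation numbers can be checked the same way. This sketch computes P(X ≥ 2) for a binomial X; again, the helper name is mine.

```python
from math import comb

def p_at_least(k, p, n):
    """Probability that k or more of n participants hit a problem of impact p."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

print(p_at_least(2, 0.20, 14))  # ~0.80: high-impact problem still flagged
print(p_at_least(2, 0.05, 14))  # ~0.15: low-impact problem rarely flagged
```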
Figure 3 shows the same thing when we require THREE observations of an issue before we define it as a problem. Now with 14 participants we are down to a 3% probability of flagging the low-impact problem (though the probability of flagging the high-impact problem has also fallen, to 55%)!
Figure 3. Probability of observing specific usability problems at least THREE TIMES.
If we still want that 95% probability of identifying any particular high-impact problem, Figure 3 shows we must increase the sample size to 29; at that sample size the low-impact problem will still be flagged with only an 18% probability. Things are definitely improving.
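The same helper reproduces the Figure 3 numbers, both at the original 14 participants and at the enlarged sample of 29:

```python
from math import comb

def p_at_least(k, p, n):
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

print(p_at_least(3, 0.20, 14))  # ~0.55: high-impact problem, n = 14
print(p_at_least(3, 0.05, 14))  # ~0.03: low-impact problem, n = 14
print(p_at_least(3, 0.20, 29))  # ~0.95: high-impact problem, n = 29
print(p_at_least(3, 0.05, 29))  # ~0.18: low-impact problem, n = 29
```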
Important caveat: the probability levels discussed relate to the probability of observing one particular problem. If there were, for example, three high-impact problems, the probabilities apply to each of them individually, not to all of them together. So whereas, under the three-observation rule, there is a 55% probability of identifying a particular high-impact problem with 14 participants, the probability of identifying all three high-impact problems is only 17% (0.55³ ≈ 0.17)!
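To make the caveat concrete: if each of three high-impact problems is caught with probability 0.55, and we treat the problems as occurring independently (my simplifying assumption), the chance of catching all three is just the product:

```python
p_single = 0.55       # per-problem detection probability (three-observation rule, n = 14)
print(p_single ** 3)  # ~0.17: probability of catching all three problems
```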
Depending on the percent success rates you use to define a high-impact versus a low-impact problem, you will get different curves, and may need to choose a different number of observations required to be flagged as a problem. I provide a few more curves in the addendum.
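As a rough illustration of how the curves shift, here is a small sweep over impact levels and observation thresholds at a fixed sample size of 14 (the grid of values is my own choice, not the exact curves in the addendum):

```python
from math import comb

def p_at_least(k, p, n):
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# Flagging probability by problem impact p and observation threshold k, n = 14
print("impact   k=1    k=2    k=3")
for p in (0.05, 0.10, 0.20, 0.30):
    probs = [p_at_least(k, p, 14) for k in (1, 2, 3)]
    print(f"{p:.2f}    " + "   ".join(f"{x:.2f}" for x in probs))
```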
Discussion
Adopting this approach does push one toward larger samples. However, if you compare the cost of, say, 10 extra participants against the political cost of incorrectly identifying trivial issues as usability problems, and the development cost of “fixing” issues that do not need fixing, the decision becomes easier.
Even without increasing the sample size, the same general principle applies: if you find yourself having observed too many issues, give emphasis to those that were observed more than once.
Addendum
The charts below show the probability of identifying specific problems as a function of sample size, problem impact (percent of users affected), and the number of observations required before an issue is flagged.
These data are all based on the binomial probability function, calculated using Excel’s BINOM.DIST function.
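For anyone working outside Excel, the same quantities are available from Python's scipy library: binom.sf(k - 1, n, p) gives P(X ≥ k), matching 1 - BINOM.DIST(k - 1, n, p, TRUE).

```python
from scipy.stats import binom

print(binom.sf(0, 14, 0.20))  # P(X >= 1) ~0.96, Figure 1
print(binom.sf(1, 14, 0.20))  # P(X >= 2) ~0.80, Figure 2
print(binom.sf(2, 14, 0.20))  # P(X >= 3) ~0.55, Figure 3
```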