Standardized usability questionnaires: which one to use?

Part 1: post-test questionnaires

Dec 16, 2021

Questionnaires are a self-reporting data collection technique. The questions (items) in a questionnaire are usually closed-ended and presented as multiple-choice. Respondents have to choose from a set of alternatives or points on a rating scale (e.g., very satisfied to very dissatisfied, strongly agree to strongly disagree).

A standardized questionnaire is a questionnaire that is written and administered so that all participants are asked precisely the same questions in an identical format, and responses are recorded and scored in a specific, consistent manner (Boynton et al., 2004). Standardizing a questionnaire takes effort; it requires repeated testing with a large sample and extensive data analysis.

Standardized measures offer many advantages to practitioners (Nunnally, 1978):

Objectivity: they allow usability practitioners to independently verify the measurement statements of other practitioners.
Reliability: this refers to how consistent responses to the questions are. If a measure has high reliability, the same or similar users are expected to give similar responses when evaluating the same product. The most common measure of reliability for questionnaires is Cronbach’s alpha, a measure of internal consistency. It typically ranges from 0 (poor reliability) to 1 (perfect reliability). Measures with an alpha over .70 are generally considered sufficiently reliable.
Validity: this refers to whether a questionnaire measures what it is intended to measure. For example, a survey designed to explore learnability that actually measures system capabilities would not be considered valid.
Replicability: standardized methodology allows practitioners to replicate studies (either their own or other researchers’).
Quantification: Results can be reported in finer detail using predefined methods. More advanced statistics can be used to allow us to better understand the results.
Scientific generalization: They allow us to generalize a finding from a sample to the greater population.
Economy: The development of standardized measures requires a substantial amount of work as discussed earlier. However, once developed, they can be reused multiple times without the need to re-standardize.
Communication: Findings from standardized measures are easier to communicate and interpret because the metrics are well defined. For example, the score of a questionnaire can be compared to scores reported in previous studies.
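
To make the reliability point above more concrete, here is a minimal sketch of how Cronbach’s alpha could be computed from raw responses, using the usual formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of respondents’ total scores). The function name and the sample ratings are invented for illustration, and the sketch assumes that all items are already scored in the same direction (negatively worded items reversed beforehand).

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a table of ratings.

    responses: list of rows, one per respondent; each row holds that
    respondent's rating for every item (same item order in every row).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings from five respondents on a three-item scale.
sample = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
print(round(cronbach_alpha(sample), 2))
```

With real data, the result would be compared against the .70 threshold mentioned above.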

There are two categories of questionnaires used during usability testing:

Post-task questionnaires: These measures are completed immediately after users finish a task and they capture their impressions of that task (e.g., Overall, this task was…?). A question is usually presented after the end of each task, which results in multiple answers collected within a session.
Post-test (post-study) questionnaires: They are administered at the end of a session (or can be used after a user has interacted with a product). They measure the user’s overall impressions of an app or a website.

Post-task and post-test questionnaires are not incompatible; both can be used in the same usability study if required. In this article, I’ll be focusing on post-test questionnaires. Part 2 of this series covers post-task measures.

The most widely used post-test standardized usability questionnaires are the following:

Questionnaire for User Interaction Satisfaction (QUIS) (Chin et al., 1988)
Software Usability Measurement Inventory (SUMI) (Kirakowski and Corbett, 1993)
Post-Study System Usability Questionnaire (PSSUQ) (Lewis, 1992)
System Usability Scale (SUS) (Brooke, 1986)
Standardized User Experience Percentile Rank Questionnaire (SUPR-Q) (Sauro, 2015)

QUIS

The QUIS was developed by Chin, Diehl, and Norman in 1988 and is one of the earliest questionnaires for evaluating user satisfaction. The QUIS is organized around general categories such as screen, terminology and system information, learning, and system capabilities. Practitioners often use only the categories relevant to the product they are testing and can supplement the QUIS with some of their own questions, specific to the design being evaluated. The questionnaire has been updated multiple times since its release. The current version is QUIS 7, which is available in five languages (English, German, Italian, Brazilian Portuguese, and Spanish) and two lengths, short (41 items) and long (122 items), using nine-point bipolar scales for each item. The shorter version is the most popular one. The questionnaire is licensed and can be found here.

SUMI

The SUMI was developed by the Human Factors Research Group (HFRG) at University College Cork in Ireland, led by Jurek Kirakowski. It is a 50-item questionnaire with a Global scale based on 25 of the items and five subscales for Efficiency, Affect, Helpfulness, Control, and Learnability (10 items each). As shown in the figure below, users can choose one of three options (Agree, Undecided, Disagree). The SUMI contains a mixture of positive and negative statements (e.g., “The instructions and prompts are helpful”; “I sometimes don’t know what to do next with this system”).

The SUMI is currently available in 12 languages (Dutch, English, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Swedish, and Spanish) and is licensed. To view more information about SUMI and buy a license, see here. You can view the English version of the questionnaire here.

Some statements from the SUMI, for example “I would recommend this software to my colleagues” and “If this software stops it is not easy to restart it”

PSSUQ

The PSSUQ is a questionnaire designed to measure users’ perceived satisfaction with computer systems or applications. It was originally developed by IBM and it is based on an internal IBM project called SUMS (System Usability MetricS).

A few rounds of improvements have resulted in PSSUQ Version 3, which is the one used today. The original version had 18 questions, but Version 3 consists of 16 questions with a Likert scale (ranging from Strongly Agree to Strongly Disagree).

The PSSUQ items produce four scores — one overall and three subscales. To score it:

Overall score: Calculate the average of all the items of the questionnaire (items 1 through 16)
System Quality subscale: Calculate the average of items 1 through 6
Information Quality subscale: Calculate the average of items 7 through 12
Interface Quality subscale: Calculate the average of items 13 through 15
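
The scoring rules above can be sketched in a few lines of Python. This is only an illustration: the function and dictionary names are made up, the ratings are assumed to be the numeric codes of the scale points, and the sketch assumes that every one of the 16 items has been answered.

```python
from statistics import mean

# Item ranges (first item, last item) for each PSSUQ Version 3 score,
# as listed in the scoring rules above.
PSSUQ_SCALES = {
    "overall": (1, 16),
    "system_quality": (1, 6),
    "information_quality": (7, 12),
    "interface_quality": (13, 15),
}


def score_pssuq(responses):
    """Return the four PSSUQ scores for one participant.

    responses: list of 16 numeric ratings, in item order (item 1 first).
    """
    if len(responses) != 16:
        raise ValueError("PSSUQ Version 3 has 16 items")
    return {
        name: mean(responses[first - 1:last])
        for name, (first, last) in PSSUQ_SCALES.items()
    }


# Hypothetical ratings from a single participant.
ratings = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4]
print(score_pssuq(ratings))
```

To report results for a study, these per-participant scores would then be averaged across participants.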

The 16 items in version 3 of the PSSUQ (image from Sauro and Lewis)

The PSSUQ can be used with both large samples (more than 100) and smaller samples (fewer than 15). The main difference is the level of precision obtained. In a 2004 study, Tullis and Stetson used the CSUQ (the Computer System Usability Questionnaire, a variant of the PSSUQ) to compare two financial websites, and they found that a sample size of 12 generated the same results as a larger sample 90% of the time.

The PSSUQ scores correlate significantly with task-based measures and completion rates (r = .4) (Sauro, 2019). However, the PSSUQ should be used carefully as it is susceptible to acquiescence bias (also known as agreement bias), the tendency for survey respondents to agree with research statements. This is because all the items in the PSSUQ are positively worded.

SUS

The SUS is the best-known questionnaire used in UX research. It was created by John Brooke in 1986 and is still widely used by UX researchers today. Its effectiveness is supported by academic research: it has high validity (it actually measures what it intends to measure), reliability (users consistently answer the questions in the same way), and sensitivity (it can detect meaningful differences). Scales with high validity such as the SUS produce more trustworthy results, which support well-informed decision making. In a 2012 analysis of a collection of unpublished usability studies, the SUS accounted for 43% of post-test questionnaire usage (Sauro, 2012).

The SUS consists of 10 questions, each rated on a five-point scale from 1 (strongly disagree) to 5 (strongly agree), and produces a score from 0–100. The odd-numbered items have a positive tone; the tone of the even-numbered items is negative. Researchers have benchmarked SUS scores extensively on many different systems and found an average score of 68 across 500 studies. A score of 80 or higher indicates high usability. There is a large amount of industry-wide data available to help benchmark a product’s score and understand it in the context of competitors.

Research has shown that SUS scores correlate with user performance, although the correlation is modest (around r = .24 for completion rates and time). This means that the SUS score reflects real-life performance on a particular system only loosely, so it complements performance measures rather than replacing them.
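
To give a feel for what a correlation of this size means, the sketch below computes a Pearson correlation coefficient by hand and then squares r = .24 to get the share of variance the two measures have in common. The function name and the five pairs of values are invented purely for illustration; they are not data from any study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariation / (spread_x * spread_y)


# Hypothetical SUS scores and task completion rates for five participants.
sus_scores = [55.0, 70.0, 82.5, 62.5, 90.0]
completion_rates = [0.6, 0.8, 0.7, 0.9, 0.8]
print(round(pearson_r(sus_scores, completion_rates), 2))

# Squaring r gives the proportion of shared variance: about 6% for r = .24.
print(round(0.24 ** 2, 4))  # 0.0576
```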

The SUS scoring method requires participants to provide a response to all 10 items. If for some reason participants can’t respond to an item, they should select the center point of the scale (Barnum, 2021).

To score the SUS:

For odd items: subtract one from the user response.
For even-numbered items: subtract the user responses from 5 (reverse scoring)
Add up the converted responses for each user and multiply that total by 2.5. This converts the range of possible values from 0–40 to 0–100.
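
The three scoring steps above can be sketched as a short Python function. The function name is made up for illustration, and the sketch assumes each response is coded from 1 (strongly disagree) to 5 (strongly agree), with all 10 items answered.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) for one participant.

    responses: list of 10 ratings coded 1-5, in item order (item 1 first).
    """
    if len(responses) != 10:
        raise ValueError("the SUS requires a response to all 10 items")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
        if item_number % 2 == 1:
            total += rating - 1   # odd items: subtract one from the response
        else:
            total += 5 - rating   # even items: subtract the response from 5
    return total * 2.5            # rescale the 0-40 total to 0-100


print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
print(sus_score([3] * 10))                        # 50.0
```

Note that a participant who picks the center point for every item ends up with a score of 50, not 68, so 50 should not be read as "average" usability.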

While the SUS was initially intended to measure perceived ease-of-use (a single dimension), Lewis and Sauro found that it provides a global measure of system satisfaction and sub-scales of usability and learnability. Items 4 and 10 provide the learnability dimension and the other 8 items provide the usability dimension.

Scores for individual questions can also be calculated to give us more insight into usability issues. This is achieved by taking the normalized score of each question (0 to 4, after the odd/even conversion described above) and multiplying it by 25 to align with the scale used for the overall SUS score.
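
As a rough sketch, item-level scores and the two sub-scales could be computed as follows. The function names are invented for illustration. Averaging the 0–100 item scores within each sub-scale is an assumption made here to keep the sub-scales on the same 0–100 range as the overall score; ratings are assumed to be coded 1 to 5.

```python
from statistics import mean

LEARNABILITY_ITEMS = (4, 10)  # the remaining eight items form the usability sub-scale


def sus_item_scores(responses):
    """Return a dict mapping item number (1-10) to its score on a 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("the SUS requires a response to all 10 items")
    scores = {}
    for item_number, rating in enumerate(responses, start=1):
        # Normalize to 0-4: odd items are positive, even items are reversed.
        normalized = rating - 1 if item_number % 2 == 1 else 5 - rating
        scores[item_number] = normalized * 25
    return scores


def sus_subscales(responses):
    """Average the item scores within the learnability and usability sub-scales."""
    items = sus_item_scores(responses)
    return {
        "learnability": mean(items[n] for n in LEARNABILITY_ITEMS),
        "usability": mean(
            score for n, score in items.items() if n not in LEARNABILITY_ITEMS
        ),
    }


# Hypothetical participant who found the system usable but hard to learn.
ratings = [4, 2, 4, 4, 4, 2, 4, 2, 4, 3]
print(sus_item_scores(ratings))
print(sus_subscales(ratings))
```

A useful check: the mean of the ten item-level scores equals the overall SUS score for the same participant (67.5 for the ratings above).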

Evidence suggests that SUS scores can predict customer loyalty. In particular, there is a significant positive correlation (r=.61) between SUS scores and Net Promoter Score (NPS). The NPS has become a popular metric of customer loyalty in the industry.

SUPR-Q

The SUPR-Q is a questionnaire consisting of 8 questions, which was developed by Jeff Sauro in 2015. Users respond to the first 7 questions using a 5-point Likert scale, where 1 represents “Strongly disagree” and 5 represents “Strongly agree”. The final item is the Net Promoter Score (NPS), a single question often used as a standalone survey for measuring users’ loyalty. This question uses a scale from 0 (“Not at all likely”) to 10 (“Extremely likely”).
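
For readers unfamiliar with the NPS, the sketch below shows how it is usually derived from answers to the 0–10 question. The promoter (9–10) and detractor (0–6) cutoffs are the standard NPS convention rather than something specific to the SUPR-Q, and the function name and ratings are made up for illustration.

```python
def net_promoter_score(ratings):
    """NPS = percentage of promoters minus percentage of detractors.

    ratings: list of answers to the likelihood-to-recommend question (0-10).
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")

    promoters = sum(1 for r in ratings if r >= 9)   # answered 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # answered 0 through 6
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical answers from ten respondents: 4 promoters, 3 detractors.
print(net_promoter_score([10, 9, 8, 7, 6, 3, 10, 9, 5, 8]))  # 10.0
```

The result ranges from -100 (all detractors) to 100 (all promoters).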

The SUPR-Q contains four factors: usability (can users accomplish what they want to do?), trust (do users trust the product?), appearance (how do users feel about the UI?), and loyalty (are users loyal to the brand?).

Example SUPR-Q for Amazon (from MeasuringU)

The SUPR-Q requires a license fee, which allows practitioners to compare a specific domain to over 200 websites.

Others

The questionnaires discussed above are the most commonly used ones. Researchers have developed and validated many more, and many of them were designed for specific types of applications or products. Some examples are described below:

The Usability Metric for User Experience (UMUX): It is a four-item Likert scale used for the subjective assessment of an application’s perceived usability. It is designed to provide results similar to those obtained with the SUS.
The mHealth App Usability Questionnaire (MAUQ): It is a 21-item questionnaire for measuring the usability of mobile health apps, developed by Zhou and colleagues.
The Chatbot Usability Questionnaire (CUQ): It includes 16 balanced questions related to different aspects of chatbot usability. Eight of these relate to positive aspects of chatbot usability, and eight relate to negative aspects. Scores are calculated out of 100.
AttrakDiff: It measures the pragmatic and hedonic quality of the user experience and is widely used in Germany (Hassenzahl et al., 2003).

Which one should I use?

There is not a simple answer to this question. Determining which questionnaire to use depends on various factors such as the nature of the project, the stage of the research, the goal of the study, and the budget.

Table comparing the most popular standardized post-study questionnaires

Here are some things to consider:

Has the product been extensively tested? If yes, the SUPR-Q or the PSSUQ are good tools for fine-tuning an already tested product.
Are any of the sub-scores of a questionnaire especially interesting or relevant for your research? For example, if you are interested in the learnability of a product, then the SUS is a good choice.
How long is the usability session you are running? Consider tester fatigue. Some questionnaires, such as the PSSUQ, are longer and more complex, thus increasing tester fatigue. Shorter questionnaires are more likely to be completed and are better for benchmarking studies.
Is cost a concern? Then choose one of the free questionnaires that do not require a license fee (e.g., SUS, PSSUQ).
Are you particularly interested in how your product’s perceived usability compares to competitors? Look for questionnaires providing normative data (e.g., SUPR-Q, SUMI).
How big is your sample? If you can only recruit a small number of users, it is best to choose a measure that can provide valid results with smaller sample sizes (e.g., SUS, PSSUQ).

Ben Harper

Nov 13, 2023

I'm going to suggest that people not use QUIS at this point. I am one of the co-inventors and the scale was developed to be tied closely to the UI technology. This made it nicely diagnostic when it was fresh, but it hasn't been refreshed since Laura Slaughter and I did it in the 1990's.


Oct 14

Thank you for this post, very helpful 😇!

UX Psychology
