“Utility of Human-Computer Interactions: Toward a Science of Preference Measurement”, Michael Toomim, Travis Kriplean, Claus Pörtner, James Landay (2011-05-07):

The success of a computer system depends upon a user choosing it, but the field of Human-Computer Interaction has little ability to predict this user choice.

We present a new method that measures user choice, and quantifies it as a measure of utility. Our method has two core features. First, it introduces an economic definition of utility, one that we can operationalize through economic experiments. Second, we employ a novel method of crowdsourcing that enables the collection of thousands of economic judgments from real users.

Figure 1: Fittsʼ law models the time required to click a widget of a given size and distance—our technique can model how much people prefer to use a widget. Participants were assigned one of 3 index-of-difficulty conditions. Each point is the number of clicks a participant completed before quitting (points jittered to show spread). Participants preferred big buttons to small buttons (p < 0.10). Participants were allowed a maximum of 3,060 clicks each. The regression line accounts for this maximum using a Tobit analysis.
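The paper does not show its regression code, but the Tobit analysis it names handles the 3,060-click ceiling by treating capped observations as right-censored. A minimal sketch of that likelihood, fit to synthetic data with hypothetical parameter values (all numbers here are illustrative, not the paper's):

```python
import numpy as np
from scipy import optimize, stats

CAP = 3060  # maximum clicks allowed per participant

def tobit_negloglik(params, x, y):
    """Negative log-likelihood for a linear model right-censored at CAP."""
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)          # optimize log(sigma) to keep sigma > 0
    mu = b0 + b1 * x
    censored = y >= CAP
    ll = np.where(
        censored,
        stats.norm.logsf((CAP - mu) / sigma),       # P(latent clicks >= CAP)
        stats.norm.logpdf(y, loc=mu, scale=sigma),  # density of observed clicks
    )
    return -ll.sum()

# Synthetic participants: clicks decline with index of difficulty, capped at CAP.
rng = np.random.default_rng(0)
x = rng.choice([2.0, 4.0, 6.0], size=400)   # hypothetical difficulty conditions
y_latent = 3500 - 400 * x + rng.normal(0, 300, size=400)
y = np.minimum(y_latent, CAP)

fit = optimize.minimize(
    tobit_negloglik, x0=[3000.0, -100.0, np.log(200.0)], args=(x, y),
    method="Nelder-Mead", options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6})
b0_hat, b1_hat, _ = fit.x
```

Ignoring the censoring and running ordinary least squares instead would bias the slope toward zero, since capped participants look less enthusiastic than they were.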

…Unfortunately, the measures used by the CHI research community—time-on-task, the number of errors, and subjective interpretations of think-aloud and survey reports—only indirectly predict whether an interface will be preferred over other alternatives. We usually do not directly measure user choice itself.

…In this paper, we take some first steps towards establishing the language, methods, and analytical tools for evaluating choice and preference of different tasks and interfaces. The core technique we introduce is a semi-automated method for posting different interfaces and tasks to a crowdsourced labor market, such as Amazon’s Mechanical Turk. These labor markets are websites where anyone can post a small task for someone else to accomplish for a small price. Our method is to create thousands of such tasks that systematically vary the interface, the price, and instructions. We then observe how many workers choose to complete the tasks, and how many times they do so. With this data, we can apply various analytical techniques to characterize user preferences for the given interfaces and tasks.
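The method above is a factorial design expanded into many individual postings. A sketch of that expansion step, with hypothetical condition levels and field names (the paper does not specify its grid or its posting format):

```python
from itertools import product

# Hypothetical factors; the paper varies the interface, the price, and instructions.
interfaces = ["variant_a", "variant_b"]
prices_usd = [0.01, 0.02, 0.05]
instructions = ["plain", "mystery"]

def make_task_batch(n_per_condition=10):
    """Expand the factorial design into individual task postings."""
    tasks = []
    for interface, price, instr in product(interfaces, prices_usd, instructions):
        for replicate in range(n_per_condition):
            tasks.append({
                "interface": interface,
                "reward_usd": price,
                "instructions": instr,
                "replicate": replicate,
            })
    return tasks

batch = make_task_batch()
print(len(batch))  # 2 interfaces x 3 prices x 2 instruction sets x 10 = 120 postings
```

Each dictionary would then be posted as one task to the labor market; acceptance counts per condition become the dependent variable.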

…We posted the standard Fitts’ law task to Mechanical Turk, asking workers to click back and forth between a rectangle that switched sides on the screen. Our experiment manipulated the task’s index of difficulty by changing the size of the rectangle and its distance from the cursor. We expected users to prefer easy tasks to difficult tasks, and indeed the data display this trend (see Figure 1).
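The paper does not state which formulation of the index of difficulty it used; the widely used Shannon formulation, ID = log₂(D/W + 1), with hypothetical distance and width values:

```python
import math

def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance / width + 1)

# Hypothetical conditions: same movement distance, different target widths.
easy = index_of_difficulty(distance=256, width=64)  # large target
hard = index_of_difficulty(distance=256, width=8)   # small target
print(f"easy ID = {easy:.2f} bits, hard ID = {hard:.2f} bits")
```

Shrinking the target (or lengthening the movement) raises the ID, so the three conditions in Figure 1 correspond to three distinct distance/width ratios.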

…Our approach is a between-subjects design, and minimizes the explicitness of workers’ reasoning about their choices. We call jobs “Mystery Tasks”, presenting them as a surprise or a game rather than an explicit auction (detailed later). Workers do not know that their activities are being aggregated to infer net utility. This technique is simple, direct, and requires few assumptions. The downside is that it requires a large amount of data, because every completed job provides only one bit of information: whether the user accepted the job, or not. Luckily, obtaining this amount of data is feasible with Mechanical Turk.
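Since each job yields one accept/reject bit, the natural summary per condition is a binomial acceptance rate with a confidence interval. A sketch using the Wilson score interval on hypothetical counts (the paper does not report these numbers or name its interval method):

```python
import math

def wilson_interval(accepted, shown, z=1.96):
    """95% Wilson score interval for the acceptance probability.
    Each viewed job contributes one Bernoulli observation: accepted or not."""
    p = accepted / shown
    denom = 1 + z**2 / shown
    center = (p + z**2 / (2 * shown)) / denom
    half = z * math.sqrt(p * (1 - p) / shown + z**2 / (4 * shown**2)) / denom
    return center - half, center + half

# Hypothetical: of 1,000 workers shown each interface variant...
lo_a, hi_a = wilson_interval(accepted=380, shown=1000)  # interface A
lo_b, hi_b = wilson_interval(accepted=310, shown=1000)  # interface B
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
```

With thousands of one-bit observations the intervals become tight enough to separate conditions, which is why the method needs, and Mechanical Turk supplies, so much data.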

Figure 7: Survival graph for the Aesthetics & Feedback study. We made two interfaces for answering CAPTCHAs: one “pretty” (A), one “ugly” (B), but identical in behavior. The survival graph shows how many workers made it through how many tasks in each of our 4 experimental conditions. The shaded regions are 95% confidence intervals. At the far left, 100% of these workers looked at the task, but only 10% to 40% completed 10 tasks (100 CAPTCHAs). Note that the Pretty and Ugly lines are separated at the left but converge toward the right. This suggests either that the utility effect of aesthetics fades over time, or that the types of users who complete many CAPTCHAs are more concerned with pay than aesthetics.
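The survival curve in Figure 7 is just the fraction of workers who reached at least k tasks, for each k. A sketch of that computation on hypothetical completion counts (not the study's data):

```python
import numpy as np

def survival_curve(tasks_completed, max_tasks=10):
    """Fraction of workers who completed at least k tasks, for k = 0..max_tasks."""
    counts = np.asarray(tasks_completed)
    return np.array([(counts >= k).mean() for k in range(max_tasks + 1)])

# Hypothetical completion counts per worker for two conditions.
pretty = [10, 10, 7, 4, 2, 1, 0, 0, 9, 3]
ugly   = [10, 5, 2, 1, 0, 0, 0, 0, 1, 2]
s_pretty = survival_curve(pretty)
s_ugly = survival_curve(ugly)
print(s_pretty[0], s_pretty[10])  # everyone "survives" 0 tasks; far fewer reach 10
```

A curve that sits higher at every k indicates more sustained use; the convergence the caption describes would show up as the gap between the two curves shrinking at large k.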

…One interface had a clear, minimalist design, and the other had gaudy colors, small fonts, and a distracting animated GIF advertisement (Figure 7A & Figure 7B)…We also estimated the effect of aesthetics on labor supply, as we did with the Fitts’ law study. The results show that the effect of aesthetics and feedback is substantial: all else equal, the Pretty style of interface produces 58% more use. This is statistically significant at p = 0.02.