Statistical Significance Brian Christiansen: Usability Tools Podcast, Episode nine - Statistical Significance, recorded from the studios of User Interface Engineering, October 19, 2007. [music] Brian: Welcome. I am Brian Christiansen, Producer of UIE podcasts. Each week we will be sharing tools for improving your site's user experience based on our research at User Interface Engineering. If you are interested in the topics Christine and Jared discuss in the podcast, sign up for our popular free newsletter, UIE Tips. I'll have more details on how at the end of the podcast. And now, here are Christine and Jared. [music] Brian: Hello and welcome, everybody, to another edition of the Usability Tools Podcast. This is Brian Christiansen sitting in this week for Christine Perfetti, and I'm here with Jared Spool, our Founder and CEO of UIE. Hello, Jared. Jared Spool: Hey there, Brian. What did we do with Christine? Did we lose her somewhere? Brian: I heard some hollering out there, but it was really muffled. Jared: Yeah. I think she's up to her elbows in conference preparations. Brian: Indeed. Jared: That's what she is. Brian: For all of our listeners out there who have attended or are planning to attend one of our conferences, she is the main cog that runs that machine. Jared: Absolutely. We actually don't do anything. Brian: Let me tell you - it is great. Jared: Yeah. Brian: So, we've an interesting topic this week that was actually suggested by one of our listeners. Colin wrote in with this suggestion. He asked if we could talk about statistical significance. What do you think? Jared: Oh, yeah, absolutely. I love talking for stuff like this. I live for these conversations. Brian: All right. Jared: Let's talk about statistical significance. Hopefully, we'll have a statistically significant conversation about statistical significance. Brian: I'll try to be significant. Jared: OK. Brian: So, the question goes... Jared: I'll be just average. 
Brian: So, market research experts will tell you that you need hundreds, potentially thousands of data points to have statistically significant results. Yet, a typical usability test only has five to eight users. How do we convince the marketing research guys at our company that what we do is meaningful, says Colin. What do you think, Jared? Jared: The basics of this have to do with the fact that the definition of statistical significance is something that not as many people know as use the term. So, there are a lot of people who go around and say, "Well, I don't think that's statistically significant", without actually knowing what that term actually means. There is sort of a very simple way to define it. If we were going to conduct some sort of survey where we just asked people their opinion about something important, like: do you like potato chips - what you're looking for basically, how many people do you have to ask in order to make sure that the answer you get from the sub-set of people that you are asking is representative of what you are going to get from everybody. Another way to think about it is the way that I learned about it which is, if you have a bag with a thousand marbles in it and you pull some small number of marbles out, let's say 10 marbles out of the bag. Five of them are blue, and five of them are red. Does that mean that the bag is half blue and half red? A lot of things are going to depend on how you select the marbles or how you select the people you ask the questions of and a variety of other things. There are basically formulas that are called confidence formulas that basically give you confidence ratings. For instance, sometimes you'll see a statistic that says: this is with a confidence rating of 85% or 90% or 95%. Basically, what that's telling you is how statistically significant the data is in order to be able to decide if the data you have is correct. Another way that people often express it is with margin of error. 
So, for instance, when you hear political polls they will often say, "Well, this has a three percent margin of error", which means if they say that a candidate was the chosen candidate for 54 percent, with a three percent margin of error, that could be as high as 57 or 58 percent. It could be as low as 50 or 51 percent. So, what you are talking about there is this notion of confidence and it has to do with asking questions and getting answers. The question is what kind of confidence can we have with something like a usability test? Brian: So, for a survey the people we would have to talk to, you know, the number of people who eat potato chips, or the number of people who vote in elections would necessarily be large, in the thousands. Jared: Well, it's going to depend on how many people vote in the election, right? If you are sampling stuff, if you are sampling a small community for their local sheriff... Brian: Sure. Jared: ...you don't have to survey nearly as many people as if you are trying to get something that is across the country. Just because you have a statistically significant number doesn't necessarily mean you have gotten a representative sample. If you only measure Massachusetts residents about the Presidential Election it's going to have a particular Massachusetts bent to it, so that may not be representative of the entire country. So, you have to look at your samples very carefully. When you then say, "OK, we want to sample every state", you have 50 states. Now you need to get a statistically significant number from each state, which ups the number of people that you have to sample. Brian: Some of our clients have websites that serve thousands or even millions of people. If they do testing or surveying would they then have to have larger numbers in that to make their test significant? Jared: Well, if their goal is to get a significant result the answer is absolutely yes, you have to test large numbers of folks. 
You have to survey large numbers of folks. If you are going to ask everybody who comes to the site a question and you have a million people who come to your site every day. You are going to need to get probably a couple of thousand answers to make sure that you do the right thing. You are probably going to have to spread those answers across different hours of the day on the assumption that people who come to your site at night would answer differently... Brian: They would have different habits.. Jared: Right. So if you are going to ask for instance whether a part of the site is fast OK? You are going to want to spread that out throughout the day, because people are going to have different speed experience based on load on the servers, load on their backbone. If they are doing it from their office they may get a different result if they are doing it from their home. So, to get something that is statistical if you are asking that question, "do you think the site is fast?" you are going to want to spread it across. Brian: So, if I am following this right, you need large numbers for surveys, but that is not the case for a usability study. Jared: Right. A usability study is different. A usability study has a completely different experience to it. Part of the reason you need it for surveys...you only need to ask one person a question for surveys if you have discovered that, that person predicts what every other person does, right? So, there is this small little town in Northern New Hampshire that when it is time for the New Hampshire primaries to vote or when it is the New Hampshire general elections, they all vote at midnight. So all 48 residents come out of their house and go cast their ballots and by 12:15 on Election Day they have all cast their ballots. And, apparently they predicted that every election winner for the last 70 or 80 years or something like that. Brian: I've seen that. The networks go and cover it. 
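Editor's sketch: the margin-of-error arithmetic behind the polling and site-survey examples above can be illustrated with a few lines of Python. This is a minimal sketch and not something from the podcast. It assumes a simple random sample, a yes/no question, the textbook normal approximation, and a 95% confidence level (z = 1.96). The function names and the example population sizes are hypothetical.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level (assumption, not from the podcast)


def margin_of_error(n, p=0.5, z=Z_95, population=None):
    """Approximate margin of error for a proportion from a simple random sample.

    p=0.5 is the worst case (widest margin). If a population size is given,
    the finite population correction is applied.
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None and population > 1:
        # Finite population correction: sampling a large share of a small
        # population shrinks the margin.
        moe *= math.sqrt(max(population - n, 0) / (population - 1))
    return moe


def sample_size(moe, p=0.5, z=Z_95, population=None):
    """Smallest sample size that achieves the requested margin of error."""
    n0 = (z ** 2) * p * (1 - p) / moe ** 2
    if population is not None:
        # Adjust the infinite-population answer for a finite population.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    # Hypothetical audiences: a very large one, a million visitors, a small town.
    print("no population limit:", sample_size(0.03))
    print("1,000,000 visitors: ", sample_size(0.03, population=1_000_000))
    print("5,000 residents:    ", sample_size(0.03, population=5_000))
    # Margin of error if eight people were treated as a survey sample.
    print("8 respondents: +/- %.0f points" % (100 * margin_of_error(8)))
```

Under these assumptions, a three percent margin of error takes a little over a thousand answers whether the audience is a million people or far more, while a small community needs somewhat fewer. With eight respondents the margin is roughly plus or minus 35 points, which is why a five-to-eight person usability test cannot be read the way a survey is.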
Jared: That's right, and that is because they are just as reliable as any other survey tool out there, right? So if you could find that person, right? You could just ask them that question and then you have this prediction. You are there; you don't need statistical significance. Brian: You save a lot of money. Jared: You would. You would. [laughter] And the whole point of any of these things is to predict whether the whole audience... because, you know, what you don't want to do is you don't want to have a small group of people point out some problem. And then you spend a lot of money fixing that problem only to discover it had no effect on anybody else's experience. So, the reason that we even care about this notion of statistical reliability is because we're going to make investments based on the information we collect, and given those investments we want to make the right investments. With usability tests you get other types of information. You could just ask people questions, but you're actually watching behavior. Every time you watch behavior you see something about which you can ask the question: is this a problem that people would actually run into? So, for example, we did a study years and years ago at the Harvard Business School. We were studying participants in one of their educational programs, which was this program for CEOs - so you had to be a CEO or Executive Vice President of a company that was basically a billion dollars in revenue or more. Brian: I thought it was already hard enough to get into Harvard. Jared: Yeah, I know. This was a 13-week program for these guys. They lived... They stopped working, right. They actually had to get a letter from their Board of Directors that says they are excused from their job for 13 weeks. They take a leave from their job, and they go and study with the Harvard professors in this program. There were 160 CEOs and Executive Vice Presidents in this program, all these overachievers, right. 
One of the things they do is they do these business simulations, and they compete against other students. So, you take these overachievers; you put them in a highly competitive thing; and it got really tense. We were watching the Executive Vice President of Finance for Sun Oil, and he was doing this business simulation. It was a 10 year simulation that you do in about an hour, and you basically make all these decisions. Then it tells you how you did that year, and then you make more decisions. He was working with this thing, and up comes this prompt. And this prompt says something like, "Enter next year's growth factor", right. That wasn't the actual prompt. It was some question along those lines. He was supposed to enter a number, but underneath the prompt it said, "Press the enter key to default". The Vice President stared at the screen for a minute. Now, again, it's a very stressful moment for him because he's competing in real time against 10 other people. He really wants to win. He's staring at the screen, and he reads this thing. Then, he slams his fist on the table, and he says, "Default? I've never defaulted before in my life, and I'm not going to start now". Right? So, instantly, we realized that the word 'default' meant something different to a finance guy than to us, the designers, usability guys, whatever. We knew at that point that we probably had to change the wording. The thing is we didn't have to get 15 other people's opinions to know that. As soon as we saw the problem we knew that we could probably come up with a way so people wouldn't confuse the meaning of the word 'default'. Brian: He was representative of the segment of people who would be using it. Jared: He was. He was. But, the thing was even if he was one of a handful of people who get confused it's an easy thing to fix. We could just take care of it and make it right, and it would just make it better. 
It made us realize that with people who are dealing with finance stuff, we have to be careful of the language we choose because the words have different meaning there. So, when you're doing usability work you don't need to see a statistically representative sample to know if you're going to run into the problem. What you need is to know whether the problem is worth fixing or not. Will you see any advantage to it? And so you can get away with smaller numbers. The other thing that happens with the usability is that you spend an hour and a half with these people. There isn't a survey out there that you spend an hour and a half with people. If you spent an hour and a half with every survey applicant and you asked them the same questions seven different ways, like you do in a usability test where you are watching people do the same thing sort of over and over again, then you wouldn't need as large a sample, because it would become very clear what things are problems and what things are not. Brian: So, it sounds like a quantitative versus a qualitative, where you need quantities to produce results, but usability tests are more of the qualitative, where you have that depth. Jared: Yeah, so that is really what this is: when you're talking about quantitative stuff you need to have those confidence intervals. When you are doing qualitative work you just need to ask the question, "is this a problem we think people will run into?" If you are not sure then you go out and talk to more people until you are sure. Brian: So, how else are surveys and usability tests or field studies different? Jared: Well, basically what makes them really different is that, A, you are looking at behavioral stuff, where surveys are almost always attitudinal, or if they are behavioral they are behavioral in a retrospective form. So I will say to you for instance, "would you call an 800 number when you have a problem with check out?" Brian: Right. 
Jared: Which was the question somebody asked one of our users in a usability test that we had been looking... we had been working for an e-commerce company and we had been looking at their check out process and the guy really struggled with the check out. So the vice president of customer support was sitting next to me in the study. It came time to ask questions and he basically said, "if you had problems like this at home, would you have called the 800 number that was there on the screen?" The guy thought about it for a second and he said, "Yes I would, I would definitely call." Knowing that people are not very good at these types of questions and they definitely want to promote their best face forward - it's like, "yeah, a good person would call so I would call" - I asked a question, which was, "Have you ever called the 800 number?" I knew this guy was a customer on the site and he had shopped there a long time. So, "have you ever called the 800 number before?" and he thought about it for a second and said, "no, I never have." So now we have this sort of conflicting data that we have to deal with and this is the problem with a lot of surveys. This is why you need these confidence intervals... Brian: The numbers to... Jared: ...because these people are sort of making stuff up. You know, they... some number of people who answer the questions don't actually answer the questions right, or another problem with surveys is they don't understand the question, right? You don't know if the question you are asking is in fact the question that they are answering. One of my favorite questions that falls into that category is, "how satisfied were you with the shopping experience?" If I say I was very satisfied do you actually know what that means? If I say that I was somewhat satisfied do you actually know what it would take to be different? Do I even know what it would take to make it different? 
Those are such meaningless questions and they don't really help anybody, but we ask them all the time. The thing is in a usability study you can watch people. Now, I have had people who have done nothing but struggle, and failed to complete their tasks, and at the end still told us they were satisfied. I don't understand that. I have had people who have just whizzed through something that was a work tool, like expense reporting, right? They whizzed through it, got it done, and I asked if they were satisfied with it and they said, "No, it sucked." But, they got it all done and frankly this was a tool where I didn't care whether they liked it or not because it was just a tool they needed for work. So, when you are observing people actually using stuff versus asking their opinion it is a completely different ball game. Brian: So, do we even care about statistical significance in doing this kind of research? Jared: When we are trying to find problems or trying to identify opportunities by observing behavior the answer is no. However, with that being said, some studies require more users than others. It is because you have to understand how you are going to run into problems. A few weeks back we talked about interview-based tasks. Brian: I heard that one I think. Jared: Yes, and because you edited it you probably heard it multiple times. [laughter] Brian: I can still hear it in my head. [laughter] Jared: That's right. [laughter] So, when you are doing interview-based tasks the users themselves are making up the tasks. The question is, "are you going to get coverage of all the problems and areas with just a few users when they are making up their own tasks?" So, an example is: we did a study on how people purchase music online. We found that in fact the more users we added to the study the more data we found. While theoretically we shouldn't have seen any new problems after user eight or nine, we kept seeing new problems. 
In fact in user 16 and 17 of this particular study, we did twenty users, in user 16 and 17 we saw problems that we had never ever seen before. The reason we saw them in user 16 and 17 was those were the first users we found that were primarily only interested in classical music. Shopping for classical music online is a very different experience than shopping for popular music online. You know, you don't put in the name of an artist. It doesn't mean the same thing. Looking for everything Britney Spears has recorded is very different than Mozart... so it's a completely different experience shopping, and this system we were testing was not very good at classical music. It had been designed primarily for popular music and when you started to look for classical... Even though they had a huge catalog of classical music, when you started to look for anything it fell apart very fast. The thing with the first 15 users: no interest in classical music. They were shopping for things... we were giving them money to buy things they wanted to buy. It didn't occur to us upfront, because we didn't know to expect the problem, to recruit people especially interested in classical music. So therefore we just left it to chance and we were pulling marbles out of the bag, and all of a sudden, after pulling out fifteen marbles that were either red or blue, we found a yellow one. Now, what does that mean? Are there more than two colors in the bag? So, at that point how do you decide, "OK, we are at fifteen, do we go to twenty?" I mean, this means you need experience with who your audience is. You need to know your audience. Once we discovered this problem it became very simple to solve it. From that point on we knew that we needed in any study to have some number of people who were always classical, right? Brian: Hmm. Jared: Who were primarily classical shoppers, and we knew now to do what we call 'balance' the study for those differences. 
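Editor's sketch: the marble analogy can be made concrete with a small probability calculation. The Python below is a minimal illustration under stated assumptions and is not the method used in the study. It assumes participants are recruited independently and at random, and the 10% share of classical-music shoppers is a hypothetical figure chosen only for the example (the podcast gives no such number). The function names are likewise made up for illustration.

```python
import math


def chance_of_seeing(share, participants):
    """Probability that at least one of `participants` randomly recruited
    people belongs to a segment making up `share` of the audience."""
    return 1 - (1 - share) ** participants


def participants_needed(share, target):
    """Smallest number of random participants so that the chance of
    including at least one segment member reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - share))


if __name__ == "__main__":
    hypothetical_share = 0.10  # assumed share of classical shoppers
    for n in (5, 8, 15, 20):
        pct = 100 * chance_of_seeing(hypothetical_share, n)
        print(f"{n:2d} participants: {pct:.0f}% chance of at least one")
    print("needed for 95%:", participants_needed(hypothetical_share, 0.95))
```

With that hypothetical 10% share, 15 random participants would include at least one such shopper only about 79% of the time, so going 15 users without meeting one is not a surprising outcome. Leaving it to chance would take 29 participants to be 95% sure of seeing the segment at all, which is the practical argument for recruiting the segment deliberately once you know it exists.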
So, when you are putting together a study there are three things you can do. You can screen people in or out. You can say, "well, I only want people who have a credit card and have a history of shopping online, because I don't want to test for people who are not that because they are not our market." The people who buy from us have bought from other places. We only want people who do that. So, we are going to eliminate people. That is called screening. Brian: Right. Jared: Or, you can balance. We can balance so for every four popular music shoppers we are going to make sure that we have two classical music shoppers. Just so we get good coverage. Or, you can measure. Measure means we are just going to pull randomly out of our sample. We are just going to grab marbles randomly out of the bag. And we are just going to happen to take note of how they come out. We are not going to do anything special to create situations, to make sure we have those people and screen them in or out. So those are three things you can do. Part of the problem is you have to know to do those things. It's an experience thing. Brian: Right. Jared: As you do more you will get better at it. So the way we recommend people approach this stuff is to not think in terms of small batches of studies where you have to do this, but instead think in terms of a long-term experience. Say, "well, what would happen if we brought in just two users every week", or six users every month or something like that. From now until we think this is a stupid idea. We are just going to keep bringing people in. Well, what will happen is, at first you will just be pulling random marbles out of the bag and you will start to discover things like people who shop for classical music do something different. Then you will start to balance. Then you will start to screen, "well, in this study all we want are classical folks because that is all we are working on is the classical stuff." Brian: OK. 
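Editor's sketch: the three recruiting options described above (screen, balance, measure) can be expressed as three small functions over a candidate pool. This is a minimal sketch under stated assumptions, not UIE's actual recruiting process. The candidate fields (`has_credit_card`, `shops_online`, `genre`), the made-up pool, and the function names are all hypothetical. Only the four-popular-to-two-classical quota comes from the conversation.

```python
import random
from collections import Counter


def screen(pool):
    """Screening: keep only candidates who match the market we care about."""
    return [c for c in pool if c["has_credit_card"] and c["shops_online"]]


def balance(pool, quotas, rng):
    """Balancing: recruit a fixed number of people from each segment."""
    chosen = []
    for genre, count in quotas.items():
        matching = [c for c in pool if c["genre"] == genre]
        chosen.extend(rng.sample(matching, count))
    return chosen


def measure(pool, n, rng):
    """Measuring: recruit at random and simply record who turned up."""
    chosen = rng.sample(pool, n)
    tally = Counter(c["genre"] for c in chosen)
    return chosen, tally


# A made-up candidate pool; every sixth candidate is a classical shopper.
pool = [
    {
        "id": i,
        "has_credit_card": i % 5 != 0,
        "shops_online": i % 4 != 0,
        "genre": "classical" if i % 6 == 0 else "popular",
    }
    for i in range(60)
]

rng = random.Random(2007)  # fixed seed so the sketch is repeatable
eligible = screen(pool)
balanced = balance(eligible, {"popular": 4, "classical": 2}, rng)
measured, tally = measure(eligible, 6, rng)

if __name__ == "__main__":
    print("eligible after screening:", len(eligible))
    print("balanced genres:", [c["genre"] for c in balanced])
    print("measured tally:", dict(tally))
```

The difference between the last two functions is the point: `balance` guarantees the classical shoppers are in the room, while `measure` only tells you afterwards whether any happened to show up.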
Jared: And over time, you'll notice other things about these people. For example, one of the things we notice about people as we study is that people who have a history of shopping on eBay a lot actually behave very differently in a shopping environment than people who don't. They, for instance, are more likely to read the small print on things. So they all go and check out the return stuff, and they'll check out the terms and conditions. They'll read all that stuff seriously, whereas people who don't have a history of shopping on eBay, they don't read that stuff. Brian: A little more trusting of their sources. Jared: Yeah, well, this is correlation, not causation, so I can't say why they do this, though I would hypothesize that people on eBay, because they're buying from random sellers and not trusted brands, typically are trained to check out details and make sure they understand the full terms of the deal, and they just get into that mindset. But it could be that they already were those types of people; that if you buy on eBay you already have to be that type of person anyways. So I'm not going to say that eBay causes it. Brian: Sure, another explanation. Jared: Yeah, right. So it could be any sort of thing. But it is an interesting fact, right? So that tells us that if we're doing a study where we have products where the terms and conditions are getting us into trouble, or people are complaining about our return policies, things like that, then we want to have a certain number of these people with what we call "high eBayness" - that means they spend a lot on eBay - to come into our study, right? So we're going to balance for that. 
The only reason we learned this was because we were measuring it in the first place. So that's basically how this stuff works. Brian: Well, that's great. I think that's a fascinating look at exactly how to judge these things, and it seems that we get better at this every time we do it, based on prior experience. And since we suggest testing as you work, if you do a small test and it didn't get what you want, you could always adapt. Jared: Absolutely, absolutely. You just said in two sentences what it took me 15 minutes to say. Brian: [laughs] Jared: This is why you're here. This is why I don't do this by myself. Brian: Great. Well, I think Colin got his money's worth there. Jared: It was exactly what he paid for it. Brian: Yes, exactly. Please put your quarter directly into the C drive [laughs]. No, don't do that. [laughs] Don't put quarters in your computer. So, we welcome the input of our listeners. You're making these shows even better. Thanks again, Colin. For other listeners out there, two things. First of all, if you'd like to contribute something to the show, you can reach us at mailbag@uie.com. And the other thing to note is that if you're not already a subscriber, you want to check out UIEtips, where we talk about these types of things every week. We put out an article on anything that tickles our fancy. Jared: That's right. This week we're talking about web apps. By the time this comes out, we'll have shipped the first part of our article - actually not web apps, just application design in general; just considerations for application design. It has been a lot of fun putting these articles out, and I think people really enjoy them. Brian: Yeah, we get great feedback. So, signing up is easy: uie.com; you'll see it right at the top of the page, and if it isn't for you, unsubscribing is easy, too. We won't fill up your inbox. 
We think you'll like it, so check it out. So that's all for today. Jared, thanks for taking the time. Jared: Hey, Brian, thanks for doing this, and it'll be fun doing more with you when Christine is off doing her other work. Brian: All right. Well, goodbye for now. Take care. [music] Announcer: We hope you have enjoyed this Usability Tools podcast. If you're interested in more of UIE's research, sign up for our free email newsletter. You can subscribe easily at uie.com. If you'd like to attend our next virtual seminar for free, just fill out our short podcasting survey at uie.com/audio. We'll give away free admission to one lucky respondent each week. We love hearing from you. Send us your comments at mailbag@uie.com. That's all for this week. Thanks for listening. Goodbye. [music]