A Short List Of Common Statistical Pitfalls To Avoid

Adapted from my critically acclaimed answer on Quora to the question of what common statistical pitfalls one should avoid.

  1. Forgetting the golden rule: “All models are wrong, some are useful.”
  2. The other golden rule: “Correlation does NOT equal causation.”
  3. Blindly trusting the p-value in hypothesis testing.
  4. NOT EVERY GOD DAMNED THING IS NORMALLY DISTRIBUTED!!!!!!!!!!!!
  5. Time series data is not the same as time-independent data.
  6. Statistics is a tool, not an avenue to truth. Like any tool, it can be put to good and bad uses. The more you learn, the easier it becomes to find subtle ways to get the results you want to believe. Never forget that the author of any study you read is a human being, not an unbiased machine. You’re not unbiased either; if you think you are, then you are a fool. Always consider what could bias a study, and give some thought to what you might be doing subconsciously that could bias your own results.
  7. If you see any statistic in any mainstream news source (Fox, MSNBC, The New York Times, Bloomberg, CNBC,…), assume that it is at best a horrific misunderstanding by the journalist, and more likely that it’s so incredibly misleading as to be an outright lie. You need to actually go to the source of the statistic; secondhand sources suck, to put it mildly.
  8. Some things cannot be quantified. End. Of. Story. Psychology is one of the worst fields when it comes to this. Take IQ, for example. IQ is measuring something, but if you really think it measures intelligence you’re an idiot. At best, it is an extremely poor proxy for some specific types of intelligence. Yes, it does measure something and give you a number, but last I checked MENSA isn’t exactly running the world*. Formally, you need to understand that just because you can find a measurable proxy for something does not mean the proxy is a good representation of it, and it is misleading not to be extremely clear about what you are actually measuring.
  9. “The plural of anecdote is not data.” This is another one of those all-too-common logical fallacies people are guilty of (if you’re very clever you might spot me breaking this rule somewhere in point #8). I’ve seen this a lot; for example, when the Federal Reserve and inflation come up, someone inevitably points to the price of gas. Well yes, maybe you paid a lot for gas and the price just keeps going up, but you and the people you know are not representative of the entire damn economy. Likewise, just because your experience in the stock market sucked doesn’t say anything about how the market actually works. Anecdotes are often a problem when coming up with a hypothesis and can lead you to never finding anything that’s actually meaningful.
  10. Understand the assumptions you make in your analysis. You always have to make some, and just because an assumption doesn’t seem like a big one doesn’t mean it can’t matter. Those are usually the things that come back to haunt you.
  11. You should really try to avoid using loops when you write code in R. I’ll let you figure this one out on your own.
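
Fine, here’s a hint for #11 (the computation below is arbitrary; the pattern is the point). R’s vectorized operations do their looping in compiled code, so the idiomatic version is both shorter and dramatically faster than an explicit loop:

    # The same arbitrary computation (a sum of squares) written both ways.
    x <- runif(1e6)

    # The explicit loop: verbose, slow, and un-idiomatic in R.
    total <- 0
    for (i in seq_along(x)) {
      total <- total + x[i]^2
    }

    # The vectorized version: one expression, with the looping done in C.
    total_vec <- sum(x^2)

    all.equal(total, total_vec)  # TRUE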

* Fine, this bit of handwaving breaks my own rules; however, the details, while interesting to discuss, don’t exactly assist in answering the question. Let’s just assume I’m correct…

Quick Update – Two Margins

I’ve been really busy the last few weeks, so the updates have been a bit infrequent. Nothing I can do about that. Anyway, here are a few things worthy of your attention.

Two Margins – It’s earnings season (do try to contain your excitement) and I’ve been meaning to mention that I’m still contributing to Two Margins, the startup that allows crowdsourcing of financial document annotation and analysis. I’ve contributed to Amazon’s quarterly report and Intel’s quarterly report so far for Q3. If you’re interested, Twitter’s 8-K is currently online. I understand that LNKD (LinkedIn), GRPN (Groupon), GPRO (GoPro), and HPQ (Hewlett-Packard) will become available on the following dates:

  • October 30: LNKD and GRPN
  • November 3: GPRO
  • November 25: HPQ

Speaking of GoPro…

[Chart: GPRO price since trading started on 6/26/2014, ranging from $28 to $96. This is the textbook example of insanity. Everyone who’s not me is crazy, or so it seems.]

That is the insanity of a hot IPO. It’s now 141.88% above its IPO opening price of $28.65 and 29.62% below the all-time high from just a few weeks ago. I personally think it’s best (as an individual) to stay the hell away from IPOs. Before you let the huge IPO returns go to your head, consider the following questions:

  1. How did I hear about this IPO? Would I have found out about the IPO if not for a major mainstream media source pushing the news?
  2. Does this fit in with my overall financial goals?
  3. Can I really afford to lose all/most of the money I put in if it flops?
  4. Am I investing in the business or trying to find the greater fool?

Just a few things to think about.

The Price of Ignorance Is High

In my junior year of college I worked part-time in the school convenience store located near Fenway. Said store did not sell cigarettes, but it did sell Massachusetts State Lottery tickets. I always found this ironic. It was clear that the decision not to sell cigarettes was made because selling them was considered wrong, not because it wouldn’t be profitable, yet the university never thought it was wrong to sell lottery tickets. On top of that, of that chain of school stores, the one I worked in was one of the few locations that sold lottery tickets. If it’s not already obvious, students were not the ones buying lottery tickets.


Reading 10/17/2014 – End Of The World Again

Yes, I’m playing around with the blog’s title. I never get around to reading about physics like I had planned, so the title wasn’t making much sense. I may change it a bit more; it depends on how I feel after a day or two…I know, you’re on the edge of your seat. Try to keep calm during these difficult times.

Anyway, here are the articles/blog posts worth reading from the last few days. If you’re not entirely satisfied with the selection I will refund the full cost of your purchase.

Finance

  • Russia is learning that economic warfare can, in fact, hurt. Attempts to support the Ruble are not working so far. (Bloomberg)
  • The mighty have fallen, or rather, stumbled a little. Google missed Q3 profit and revenue estimates. You may begin guessing what this means about Google’s future now. Please make sure to ignore that it’s only a single isolated data point; more people listen to you that way. (Bloomberg)
  • Because no one could have foreseen that the iPad’s problems would be cheaper competitors and limited usefulness. Serious work is still going to be done on a laptop, and my phone can do anything I’d want from a mobile device. Tablets are not universally needed, so…make it thinner and faster, right? (The New York Times – Apple’s iPad Problem)
  • Josh Brown was nice enough to let everyone know the Dow is negative for the year. Please ignore his sensible conclusion that this is not so unusual. The correct response is to run in circles screaming about the end of the world and how Wall Street’s plan is about to reach its final objective and we’re all going to die. QED. (The Reformed Broker)
  • I don’t have anything to add to this piece by Professor Aswath Damodaran on GoPro’s valuation. I still think it’s insanely overvalued, but I know enough to know it only feels that way to me and “feels” doesn’t mean a God damned thing in this context. In any case, the analysis is high quality and worth reading even if you’re not interested in GoPro because God forbid you learn about something new. (Musings on Markets)

Ebola

  • Yes, only one thing. Yes, it’s the most important thing to know about Ebola. In a stunning show of either irony or actual rational thought, The Verge has managed to impress me. They take the time to point out that the media sucks at explaining all things Ebola. Never forget that the news media exists to make money and there’s no reason to assume that they actually care about the accuracy of their reporting or the consequences of their mistakes. Of course, we’ll continue to ignore that because our favorite source says bad things about the other political party. I may be getting off topic here…(The Verge)

That’s all for this morning’s reading. If I have time I’ll post some afternoon reading. Don’t hold your breath.

No, Nate Silver’s Model Doesn’t Have A Metaphysics Problem

Well, at least not the sort of metaphysical problems that Vox author Matthew Yglesias suggests it has.

“Except there’s a huge problem — we’re never going to know which model is correct.

…To anyone who understands probabilities, of course, this is nonsense….If you sit down at the blackjack table and play for a while, you will probably lose money. But you might not. Even the Washington Post’s current forecast that the GOP has a 95 percent chance of obtaining a Senate majority won’t genuinely be debunked by a Democratic hold. Five percent is unlikely, but unlikely things happen….

But in an epistemological sense, the way we check probabilistic statements is to run the experiment over and over again. Flipping a coin twice doesn’t really prove anything. But if you flip it ten or twenty or a thousand times you’ll see that “it comes up heads half the time” is a good forecasting principle…

…we’re just never going to get the kind of sample sizes that would let us tell whose method of calculation is best.”

Cutting out all the background, that’s the heart of Yglesias’ argument (emphasis added). I’ll start by addressing the “problem” that we’ll never know which model is correct; we do have an answer to that.

“All models are wrong; some are useful.” -George E. P. Box

The correct question to ask is not which model is correct, but which model is more useful. Whether a given model is useful is highly subjective, to say the least. Even when we know that a model is deeply flawed, it may still be considered useful. Take the Black-Scholes option pricing model, for example. We know that the Black-Scholes model has significant problems, all the way down to the underlying assumptions not matching reality, but it’s still widely used for pricing options1. Why? Because it’s good enough for most investors: the results are known to be close enough to reality to be useful, even though it is known in advance that they are wrong.
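
To make that concrete, here’s a minimal sketch of the Black-Scholes call price in R. The inputs are made-up illustrative numbers, and this is just the textbook closed-form formula, nothing more:

    # The textbook Black-Scholes price of a European call option.
    # S: spot price, K: strike, r: risk-free rate,
    # sigma: annualized volatility, tau: time to expiry in years.
    black_scholes_call <- function(S, K, r, sigma, tau) {
      d1 <- (log(S / K) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
      d2 <- d1 - sigma * sqrt(tau)
      S * pnorm(d1) - K * exp(-r * tau) * pnorm(d2)
    }

    # Made-up inputs: a six-month call struck 5% above spot.
    black_scholes_call(S = 100, K = 105, r = 0.02, sigma = 0.25, tau = 0.5)  # ~5.38

Every one of those inputs hides an assumption (constant volatility, lognormal returns, continuous frictionless hedging), which is exactly where the model and reality part ways.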

Now Yglesias is correct that observing an unlikely outcome does not, in itself, prove that a model is worse than another model that happened to predict the correct result this time. Yes, unlikely things do happen in the real world2, but why are you assuming that the assumptions that went into constructing the model are realistic3?

The assumptions underlying any model are simultaneously its strength and its weakness. We use models because we accept that the real world is too complicated to allow us to accommodate every single aspect of the system being modeled. The election forecasts use polls of a small subset of the voting population to attempt to make predictions about the election in the future. There are two possible sources of error. First, the election happens at some point in the future, and events can, and do, occur that cause a significant number of people to change who they decide to vote for4.

The other possible source of error is that you are only polling a subset of voters and you don’t know whether or not they are representative of the entire population. If you had unlimited resources, you could in theory poll every single voter and likely achieve much greater accuracy (barring unforeseen events between your poll and the election). Needless to say, that’s not practical because that would amount to holding a poll that was effectively an election. Expensive and pointless.
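
To put a number on the sampling piece of that error, here’s a back-of-envelope sketch in R of the textbook margin of error for a simple random sample. The poll numbers are invented, and real polls face far more problems (non-response, likely-voter screens, herding) than this clean math covers:

    # Margin of error for a simple random sample at a given confidence level.
    poll_margin_of_error <- function(p, n, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)  # critical value, ~1.96 for 95%
      z * sqrt(p * (1 - p) / n)       # standard error scaled by z
    }

    # A hypothetical 1,000-person poll showing a candidate at 52%:
    poll_margin_of_error(p = 0.52, n = 1000)  # ~0.031, i.e. +/- 3.1 points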

I don’t follow election forecasts so I can’t say what exactly they do to attempt to improve the accuracy of the models. I can say that it is likely easy to find problems with the underlying assumptions of any poll that is 95% sure of the outcome. So the model can be debunked without needing to worry about the epistemological nature of probability. Now, given that such biased polls are put forward by the likes of The Washington Post, I’d still say an argument could be made that the model is useful, even if it’s stupid. After all, it’s making them money, isn’t it?

On a psychological level, most people are interpreting the forecast probability incorrectly. It doesn’t say that candidate X has a 60% chance of winning the election. It should be read as saying: candidate X would have a 60% chance of winning in the hypothetical universe of the model, based on our observations of the real world and subject to the assumptions of the model. It’s telling you that IF the underlying assumptions hold, THEN a particular outcome has the given chance of occurring.

So what does this mean as far as how you should view poll-based election forecasts? Honestly, I’d say you should always avoid using any model where you don’t understand the underlying assumptions and the model’s construction. You also need to know where the data used to fit the model parameters came from because that’s another possible source of bias. If you don’t know that much about the model you have no way to interpret what it is telling you, except to trust what others are saying that the model says. Your level of trust should be 0 when dealing with…really anyone who has either a financial interest in the model, or an ideological commitment to a particular result.

Really, if they don’t have a very long answer to the question “What’s wrong with this model?” then you shouldn’t trust them.

1. Yes, I know the binomial model is more commonly used than Black-Scholes. The underlying assumptions are effectively the same between the two models and, for European-style options at least, the binomial model will converge to the Black-Scholes model as the number of steps grows (see the quick demonstration after these notes).
2. This glosses over the question of how unlikely something has to be before it can be considered effectively impossible. Like most subjective things, going with your gut is not a good way to answer this question. A royal flush in poker is indeed unlikely, but it's not so unlikely as to have never happened in history. Contrast that with a perfect bridge deal (assuming a fair deck), which has a probability of about 4.47*10^{-28}. As my stats professor put it years ago, "if everyone who ever existed played bridge continuously, the probability of ever seeing a perfect deal is still much less than one millionth of a percent." I'll leave it as an exercise for the reader to get a more specific result (or see the sanity check after these notes).
3. Yes, realistic is a rather soft term, but it's accurate. What's considered realistic has, to my surprise, turned out to be extremely subjective. Of course, from my point of view, I'd say that many people's idea of realistic has nothing to do with the real world.
4. There's a difference between uncertainty that can be measured as probability and actual uncertainty, that is, the risk that things we cannot anticipate will occur. You can never entirely eliminate uncertainty, but we try anyway. A great example of this is a psychology experiment I read about a long time ago. There are two urns, A and B, filled with colored balls (red and blue). You are first instructed to pick a color; it doesn't matter whether you pick red or blue. You next need to pick an urn, and you win a prize (say, money) if the ball you pull out is the color you picked. You're told that urn A has 50 red and 50 blue balls in it. You are told nothing about urn B other than that it has 100 balls in it. Which urn do you pick?

A majority of people selected urn A, even though there's no advantage to doing so. Mathematically speaking you cannot make an optimal choice because you have no information about the distribution of balls in urn B. We pick urn A because we at least know the odds, even though it doesn't help us to know the odds. We hate uncertainty, something to keep in mind when thinking about probability and more generally when thinking about forecasting.
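
As the quick demonstration promised in footnote 1, here's a bare-bones Cox-Ross-Rubinstein binomial pricer for a European call in R. The inputs are the same made-up numbers as any textbook example; the point is only that the price settles toward the Black-Scholes value as the number of steps grows:

    # A bare-bones CRR binomial pricer for a European call.
    crr_call <- function(S, K, r, sigma, tau, n) {
      dt <- tau / n
      u  <- exp(sigma * sqrt(dt))        # up-move factor
      d  <- 1 / u                        # down-move factor
      q  <- (exp(r * dt) - d) / (u - d)  # risk-neutral up probability
      j  <- 0:n                          # number of up moves at expiry
      payoff <- pmax(S * u^j * d^(n - j) - K, 0)
      exp(-r * tau) * sum(dbinom(j, n, q) * payoff)
    }

    # With these illustrative inputs the prices approach ~5.38,
    # the Black-Scholes value for the same parameters.
    sapply(c(10, 100, 1000), function(n)
      crr_call(S = 100, K = 105, r = 0.02, sigma = 0.25, tau = 0.5, n = n))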
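
And the sanity check promised in footnote 2, done in log space to keep the factorials manageable (this treats all 52!/(13!)^4 possible deals as equally likely):

    # Probability of a perfect bridge deal: each player gets a full suit.
    # There are 4! suit-to-player assignments out of 52!/(13!)^4 deals.
    log_p <- lfactorial(4) + 4 * lfactorial(13) - lfactorial(52)
    exp(log_p)  # ~4.47e-28

    # For contrast, the odds of being dealt a five-card royal flush:
    4 / choose(52, 5)  # ~1.54e-06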


I’ve Been Asked: “What knowledge do I need to start investing in the stock market?”

When I looked at my email this morning, I noticed that, for the first time in a while I had a request from Quora. The question I was asked to answer is, of course, the question in the title of this post. I can’t say I was ready to answer first thing on Saturday morning and, for technical reasons1 I’m not going to go into, I’ve decided to post my response here rather than on Quora.


Civil Forfeiture: Video

There are a lot of things to be said about civil forfeiture, but the video says it better than I can. This is a great example of why there need to be rules standing between people and easy money. It doesn’t matter whether you’re a police officer, a Wall Street banker, a mid-level manager at a major company, or whatever else you can think of. There is nothing that makes any one of those groups special in this respect.