Summary: Naked Statistics

My One-Sentence Summary

Statistics is a major component of the scientific method, and its goal is to help us make better decisions on how to live our lives.

Capture

  • The point of statistics is to make better decisions about how to live our lives

  • When you say who’s the “best” at something, that’s subjective and could mean many things

  • An index is a single number that represents multiple descriptive metrics

  • Statistics is the foundation of the scientific method because it’s how we know whether to accept or discard a hypothesis based on experiments

  • It’s very easy to deceive with statistics because there are so many ways to say true things that aren’t what the question wanted answered

  • A p-value is the probability of obtaining the observed result (or something at least as extreme) if the null hypothesis is correct, i.e., if there is no real effect

  • A correlation coefficient is a numerical measure of how strongly two variables move together in a straight line, and it goes from -1 to 1. 0 means there’s no linear correlation. 1 means that whenever X goes up, Z goes up by a perfectly predictable, proportional amount (not necessarily the same amount). And -1 means that whenever X goes up, Z goes down by a perfectly predictable, proportional amount.

  • A confidence interval for the mean is a way of estimating the true population mean. Instead of a single number for the mean, a confidence interval gives you a lower estimate and an upper estimate at a stated confidence level, commonly 95%. For example, instead of “6” as the mean you might get {5,7}, where 5 is the lower estimate and 7 is the upper.

  • The null hypothesis is the default, uninteresting state of “nothing to see here”, and is analogous to innocence until proven guilty

  • Mean is all observations added up and divided by the number of observations

  • Median is the value that has an equal number of observations above and below it

  • Correlation is when a change in one variable tends to come with a change in another

  • A normal distribution is where values are distributed symmetrically to the left and right of the mean, with the largest number clustered around the mean itself

  • The standard deviation is a measure of how spread out the values are around the mean

  • The standard error is the standard deviation of the sample means

  • In a normal distribution, around 68% of values fall within 1 SD of the mean, around 95% within 2 SD, and around 99.7% within 3 SD

  • The central limit theorem says that as you collect large random samples from a population, the means of those samples will fall into a bell curve (normal distribution) around the actual mean of the population, even if the population itself isn’t normally distributed

  • As a rule of thumb, you need at least 30 observations in each sample for the central limit theorem to work

  • When you do regression analysis you try to get to a coefficient, and that coefficient is the slope of the line that best fits the data

  • If you have bad data, bad samples, etc., there’s little statistics can do to help you

  • When you do regression analysis you are trying to say that the null hypothesis is not likely. You aren’t proving it’s wrong, you’re using the central limit theorem and other fundamentals to show that it’s unlikely to a certain percent. But you can still be wrong

  • Something called the sum of squares helps you understand how far off you are. Because something can be above or below the line (positive or negative) you need to square the values before you add them

  • Regression is the process of isolating variables to find relationships between them

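  • Multivariate regression analysis is how we can start to tell whether something is merely correlated with another thing vs. caused by it, because it lets us control for other variables. It can’t prove causation on its own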

  • A visual way to think about regression is to plot home prices on a graph with square feet on one axis and sale price on the other. If you plot a bunch of known house sale prices based on their square footage, and then draw a line that approximates going through them, you can then predict the price or the square footage of a house where you only know one of the values. You simply look for the point on the line that corresponds to the data point you do have, and you will get a reasonable guess about where the other variable is. Regression, and ML as well, is simply doing this for hundreds, or thousands, or millions of variables, and looking for the relationships that allow you to make predictions.
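
Here's a minimal sketch of the house-price picture in Python, to make the "draw a line through the points" idea concrete. The five houses are made-up numbers, and `fit_line` and `sum_of_squared_residuals` are names I picked for illustration, not anything from the book. It fits the line by ordinary least squares, which is the usual way to pick the line with the smallest sum of squares.

```python
# Hypothetical data, invented for illustration: five houses.
square_feet = [1000, 1500, 2000, 2500, 3000]
sale_price = [200_000, 290_000, 410_000, 500_000, 600_000]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor. Returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squared_residuals(xs, ys, slope, intercept):
    """Square each miss so points above and below the line don't cancel out."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = fit_line(square_feet, sale_price)
ssr = sum_of_squared_residuals(square_feet, sale_price, slope, intercept)

# Predict the price of a house we only know the size of.
predicted_price = slope * 1800 + intercept

print(f"each extra square foot adds about ${slope:,.0f}")
print(f"an 1,800 sq ft house is predicted at about ${predicted_price:,.0f}")
print(f"sum of squared residuals: {ssr:,.0f}")
```

The slope is the regression coefficient: the predicted change in price for one more square foot. Any other line through these points would give a larger sum of squared residuals.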

Takeaways, Ideas, and Analysis

  • Central Limit Theorem seems to be one of the most important things in statistics

  • It’s key to the scientific method because it tells you how much you should believe something based on the evidence you have

  • There are SO MANY traps to doing statistics correctly, such as assuming that spending more money in schools gets better results when it might just be a bunch of rich and smart students and/or parents

  • One of them is reverse causality, such as more golf lessons causing bad golf vs. bad golf causing golf lessons

  • Machine Learning (and specifically deep learning) is basically doing regression but on the scale of many, many variables
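
Since the central limit theorem keeps coming up, here's a small simulation sketch of it. Everything in it is an assumption for illustration: the population is a made-up, heavily skewed one (exponential with a mean of about 10), the sample size is 30, and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the numbers are repeatable

# Hypothetical skewed population: lots of small values, a long tail of big ones.
population = [random.expovariate(1 / 10) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Draw many random samples of 30 and record the mean of each one.
sample_size = 30
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(2_000)
]

mean_of_means = statistics.fmean(sample_means)

# The standard error is the standard deviation of the sample means.
observed_se = statistics.stdev(sample_means)
predicted_se = pop_sd / sample_size ** 0.5

# Share of sample means within 2 standard errors of the population mean.
within_2se = sum(
    abs(m - pop_mean) <= 2 * predicted_se for m in sample_means
) / len(sample_means)

print(f"population mean {pop_mean:.2f}, mean of sample means {mean_of_means:.2f}")
print(f"predicted SE {predicted_se:.2f}, observed SE {observed_se:.2f}")
print(f"share of sample means within 2 SE: {within_2se:.1%}")
```

Even though the population is nowhere near bell shaped, the sample means cluster around the population mean, their spread is close to the population SD divided by the square root of the sample size, and roughly 95% of them land within 2 standard errors.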

Summary

  1. Statistics is the underlying math that helps us determine truth in real-life applications

  2. Correlation goes from -1 to 1, where -1 is perfectly inversely correlated, and +1 is perfectly positively correlated

  3. A p-value is the chance of getting a result at least as extreme as yours if there is nothing interesting going on

  4. The null hypothesis is the default state of “nothing to see here”

  5. The central limit theorem allows you to figure out what a population looks like by taking samples

  6. Standard deviations and standard errors allow you to look at samples and tell you how likely they are to come from different populations

  7. Multivariate regression analysis allows you to isolate the effect of one variable in a large collection of variables, e.g., the effect of education while controlling for parental wealth and intelligence

  8. Ultimately, statistics isn’t boring at all. It’s the real-world math that lets us understand our world
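
To make points 2 and 3 concrete, here's a rough sketch of a correlation coefficient and a p-value computed by hand. The data and the coin-flip scenario are invented for illustration, and `pearson_r` and `one_sided_p_value` are just names I chose. The p-value part assumes the null hypothesis is "the coin is fair" and asks how often a fair coin would give at least that many heads.

```python
from math import comb, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, always between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def one_sided_p_value(heads, flips):
    """Chance that a fair coin gives at least `heads` heads in `flips` flips."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips


hours = [1, 2, 3, 4, 5]
r_up = pearson_r(hours, [2, 4, 6, 8, 10])    # rises in lockstep with hours
r_down = pearson_r(hours, [10, 8, 6, 4, 2])  # falls in lockstep with hours

# Hypothetical experiment: 60 heads in 100 flips of a coin we suspect is biased.
p = one_sided_p_value(60, 100)

print(f"r (moving together): {r_up:.2f}")
print(f"r (moving opposite): {r_down:.2f}")
print(f"p-value for 60+ heads in 100 fair flips: {p:.3f}")
```

Note that the first correlation comes out as 1 even though the second variable rises by 2 for every 1 the first rises: correlation measures how consistent the relationship is, not how big the change is. The p-value comes out at about 0.03, so a fair coin would rarely produce a result this lopsided, but "rarely" is not "never".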