Monday, May 18, 2020

Understanding Quantiles: Definitions and Uses

Summary statistics such as the median, first quartile, and third quartile are measurements of position: each one indicates where a specified proportion of the distribution of data lies. For instance, the median is the middle position of the data under investigation. Half of the data have values less than the median. Similarly, 25% of the data have values less than the first quartile and 75% of the data have values less than the third quartile.

This concept can be generalized. One way to do this is to consider percentiles. The 90th percentile is the point below which 90% of the data lie. More generally, the pth percentile is the number n for which p% of the data is less than n.

Continuous Random Variables

Although the order statistics of median, first quartile, and third quartile are typically introduced in a setting with a discrete set of data, these statistics can also be defined for a continuous random variable. Since we are working with a continuous distribution, we use the integral. The pth percentile is a number n such that:

$$\int_{-\infty}^{n} f(x)\,dx = \frac{p}{100}$$

Here f(x) is a probability density function. Thus we can obtain any percentile that we want for a continuous distribution.

Quantiles

A further generalization is to note that our order statistics are splitting the distribution that we are working with. The median splits the data set in half, and the median, or 50th percentile, of a continuous distribution splits the distribution in half in terms of area. The first quartile, median, and third quartile partition our data into four pieces with the same count in each. We can use the above integral to obtain the 25th, 50th, and 75th percentiles, and split a continuous distribution into four portions of equal area.

We can generalize this procedure. The question that we can start with is: given a natural number n, how can we split the distribution of a variable into n equally sized pieces? (From here on, n denotes the number of pieces rather than a percentile value.)
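Before turning to that question, the percentile integral above can be made concrete. The following is a minimal sketch, not a definitive method: it assumes a standard normal distribution as the example, uses Python's standard-library `statistics.NormalDist` for the cumulative distribution function (the integral of f from negative infinity up to a point), and finds the 90th percentile by bisection. The helper name `percentile_by_bisection` and the search bounds are illustrative choices, not part of the original discussion.

```python
from statistics import NormalDist


def percentile_by_bisection(cdf, p, lo, hi, tol=1e-10):
    """Find the point where cdf equals p/100 by bisection.

    Assumes cdf is increasing on [lo, hi] and that the answer lies
    inside that interval (both are assumptions of this sketch).
    """
    target = p / 100
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < target:
            lo = mid  # too little area to the left: move right
        else:
            hi = mid  # enough area to the left: move left
    return (lo + hi) / 2


# Example distribution (an assumption): the standard normal.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# 90th percentile: the point with 90% of the area to its left.
n90 = percentile_by_bisection(std_normal.cdf, 90, -10.0, 10.0)

# The library's own inverse CDF, for comparison.
n90_lib = std_normal.inv_cdf(0.90)

print(n90, n90_lib)
```

Any distribution with a computable cumulative distribution function could be substituted for `std_normal.cdf`; the bisection step only relies on the area to the left growing as the point moves right.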
Splitting a distribution into n equally sized pieces is exactly the idea behind quantiles. The n quantiles for a data set are found approximately by ranking the data in order and then splitting this ranking through n - 1 equally spaced points on the interval. If we have a probability density function for a continuous random variable, we use the above integral to find the quantiles. For n quantiles, we want:

The first to have 1/n of the area of the distribution to the left of it.
The second to have 2/n of the area of the distribution to the left of it.
The rth to have r/n of the area of the distribution to the left of it.
The last to have (n - 1)/n of the area of the distribution to the left of it.

We see that for any natural number n, the n quantiles correspond to the 100r/nth percentiles, where r can be any natural number from 1 to n - 1.

Common Quantiles

Certain types of quantiles are used commonly enough to have specific names. Below is a list of these:

The 2-quantile is called the median
The 3-quantiles are called terciles
The 4-quantiles are called quartiles
The 5-quantiles are called quintiles
The 6-quantiles are called sextiles
The 7-quantiles are called septiles
The 8-quantiles are called octiles
The 10-quantiles are called deciles
The 12-quantiles are called duodeciles
The 20-quantiles are called vigintiles
The 100-quantiles are called percentiles
The 1000-quantiles are called permilles

Of course, other quantiles exist beyond the ones in the list above. In practice, the number of quantiles used often matches the size of the sample drawn from a continuous distribution.

Use of Quantiles

Besides specifying the position of a set of data, quantiles are helpful in other ways. Suppose we have a simple random sample from a population, and the distribution of the population is unknown. To help determine whether a model, such as a normal distribution or Weibull distribution, is a good fit for the population we sampled from, we can look at the quantiles of our data and the model.
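As a rough illustration of looking at the quantiles of both the data and a model, the sketch below computes the quartiles (n = 4) of a small sample and of a normal model fitted to it, using Python's standard-library `statistics` module. The sample values are made up for the example, and the normal model is an assumption. Note that `statistics.quantiles` uses one of several conventions for estimating sample quantiles, so other tools may return slightly different cut points.

```python
from statistics import NormalDist, quantiles

# Made-up sample, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 3.6]

n = 4  # quartiles: split into four pieces using n - 1 = 3 cut points

# Sample quantiles: the n - 1 cut points estimated from the ranked data.
sample_cuts = quantiles(data, n=n)

# Model (an assumption): a normal distribution fitted to the sample.
model = NormalDist.from_samples(data)

# Model quantiles: the n - 1 cut points of the fitted distribution.
model_cuts = model.quantiles(n=n)

# The same cut points via the area rule: the rth one has r/n of the
# area to its left, i.e. it is the (100 r / n)th percentile.
by_area = [model.inv_cdf(r / n) for r in range(1, n)]

for r, (s, m) in enumerate(zip(sample_cuts, model_cuts), start=1):
    print(f"cut {r} of {n - 1}: sample={s:.3f}  model={m:.3f}")
```

Changing `n` to 10 or 100 gives deciles or percentiles in the same way.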
Matching the quantiles from our sample data to the quantiles from a particular probability distribution yields a collection of paired data. We plot these data in a scatterplot, known as a quantile-quantile plot or q-q plot. If the resulting scatterplot is roughly linear, the model is a plausible fit for our data; systematic curvature suggests that the sample's distribution departs from the model.
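The pairing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sample is simulated from a normal distribution with a fixed seed, the model is a normal distribution fitted to that sample, and the theoretical quantiles are taken at the positions (i - 0.5)/m, which is one common convention among several. The helper names `qq_pairs` and `pearson_r` are hypothetical, and the correlation coefficient is used here only as a crude numeric stand-in for judging linearity by eye.

```python
import random
from statistics import NormalDist, mean


def qq_pairs(sample, model):
    """Pair each ordered sample value with a model quantile."""
    ordered = sorted(sample)
    m = len(ordered)
    # Positions (i - 0.5) / m: one common convention (an assumption).
    theoretical = [model.inv_cdf((i - 0.5) / m) for i in range(1, m + 1)]
    return list(zip(theoretical, ordered))


def pearson_r(pairs):
    """Pearson correlation of the paired points."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Simulated sample (an assumption): 200 draws from a normal distribution.
rng = random.Random(0)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]

# Candidate model (an assumption): a normal fitted to the sample.
pairs = qq_pairs(sample, NormalDist.from_samples(sample))

# A value close to 1 is consistent with a roughly linear q-q plot.
r = pearson_r(pairs)
print(f"{len(pairs)} pairs, correlation {r:.4f}")
```

The list `pairs` holds the points of the q-q plot, so it can be handed to any plotting tool to draw the scatterplot itself.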
