Describing Data Sets

Statisticians describe a frequency distribution or data set using four facets: skewness, central tendency, spread, and kurtosis.

Skewness or Shape
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.
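
As a rough illustration of the sign convention, the sketch below computes a moment-based (population) skewness. The function name `skewness` and the sample lists are made up for this example, and this is only one of several skewness formulas in use.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        # All values are equal, so the ratio is undefined.
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical data: one large value pulls the tail to the right.
right_tailed = [1, 2, 2, 3, 3, 3, 10]
# Mirroring the data moves the tail to the left.
left_tailed = [-x for x in right_tailed]

print(skewness(right_tailed) > 0)  # True: tail on the right
print(skewness(left_tailed) < 0)   # True: tail on the left
```

A perfectly symmetric list such as [1, 2, 3, 4, 5] gives a skewness of zero under this formula.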

Centrality or Measures of Central Tendency
A measure of central tendency is a central or typical value for a probability distribution. The most common measures of central tendency are the arithmetic mean, the median, and the mode.

Mean
Also known as the arithmetic average: the sum of the data values divided by the total number of data values.

Weighted Mean
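Used when values are not all equally represented. Each value is multiplied by its weight (for example, how many times it occurs), and the sum of these products is divided by the sum of the weights.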


 * For example, pretend you have a car lot and you carry three different car models: A, B, and C. Each model is worth a different amount of money: A is worth $12,000, B is worth $14,000, and C is worth $10,000. You have 8 of A, 10 of B, and 12 of C. What is the average value of the cars you have on your lot?

[Figure: weighted mean example]
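
The car lot question can be worked through in a few lines. This is a minimal sketch: the function name `weighted_mean` and the variable names are chosen for illustration, and the counts of each model serve as the weights.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


prices = [12_000, 14_000, 10_000]  # value of models A, B, C
counts = [8, 10, 12]               # number of each model on the lot

# (8*12000 + 10*14000 + 12*10000) / (8 + 10 + 12) = 356000 / 30
average_value = weighted_mean(prices, counts)
print(round(average_value, 2))  # 11866.67
```

The plain mean of the three prices would be $12,000. The weighted mean is lower because the cheapest model, C, is the most common on the lot.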

Median
The "halfway" point in a data set when the values are arranged in ascending order. If the number of values is even, the median is the mean of the two middle values.
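
A minimal sketch of finding the halfway point, assuming the usual convention that an even-sized data set uses the mean of its two middle values. The function name `median` is illustrative; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)  # arrange in ascending order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```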

Mode
The value that occurs most often. If two values are tied for most often, the data set is bimodal. If three or more values are tied for most often, the data set is multimodal. If no value occurs more than once, the data set has no mode.

This can be applied to categorical data. For example, when examining the number of students in each major, the mode is the major with the most students.
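
One possible way to find the mode is to count occurrences and keep whatever is tied for the highest count. In this sketch the function name `modes` and the sample data are hypothetical; it returns every tied value and leaves the labelling of ties to the reader. It works for categorical data as well as numbers.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest count, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())  # assumes values is not empty
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]

# Categorical data: one entry per student, giving that student's major.
majors = ["math", "biology", "math", "history", "math", "biology"]
print(modes(majors))  # ['math']
```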

Midrange
A rough estimate of the middle. It is calculated by dividing the sum of the lowest and highest values by 2.
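
The midrange calculation is short enough to sketch directly. The function name `midrange` and the sample values are illustrative only.

```python
def midrange(values):
    """Halfway between the lowest and highest values."""
    return (min(values) + max(values)) / 2


print(midrange([2, 9, 4]))        # 5.5
print(midrange([2, 9, 4, 1000]))  # 501.0
```

Because only the two extreme values are used, a single outlier moves the midrange a long way, which is why it is only a rough estimate.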

Measures of Variation or Spread
There are three common measures of variation: the range, the variance, and the standard deviation. The range is the highest value minus the lowest value.

Variance
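The average of the squared distances between each data value and the mean. For a sample rather than a whole population, the sum of the squared distances is usually divided by one less than the number of values.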

Standard Deviation
The square root of the variance.
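
The two definitions above can be sketched as follows, assuming the data are a whole population so that the sum of squared distances is divided by the number of values. The function names and the sample data are illustrative. Python's `statistics` module provides `pvariance` and `pstdev` for this case, and `variance` and `stdev` for samples, which divide by one less than the number of values.

```python
import math


def population_variance(values):
    """Average of the squared distances from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def population_std_dev(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5
# Squared distances: 9, 1, 1, 1, 0, 0, 4, 16, which sum to 32
print(population_variance(data))  # 4.0
print(population_std_dev(data))   # 2.0
```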

Quartiles
Quartiles divide the distribution into four parts so that each part contains the same number of scores. To determine them, find the median of the data set, then find the median of each half. The median of the lower half is called Q1 and the median of the upper half is called Q3.
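
A minimal sketch of the median-of-halves procedure. It assumes that when the number of values is odd, the overall median is left out of both halves; the text does not specify this, and other conventions exist. The names `median_of_sorted` and `quartiles` are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3); needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: skip the middle value when n is odd.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (median_of_sorted(lower),
            median_of_sorted(ordered),
            median_of_sorted(upper))


print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2, 4, 6)
```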

5 Number Summary
To describe quartiles, you may report a 5 Number Summary. This includes the lowest data value, Q1, the median, Q3, and the highest data value.
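
As a sketch, the five values can be collected in one pass. This version leans on `statistics.median` from the standard library and, as an assumption, leaves the overall median out of both halves when the number of values is odd. The function name `five_number_summary` and the scores are made up for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest); needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])


scores = [7, 15, 36, 39, 40, 41]
print(five_number_summary(scores))  # (7, 15, 37.5, 40, 41)
```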

Box Plot
Using the 5 Number Summary, you can create a box plot to provide a visual of the data.