A population is the large group of individuals to which a law of nature applies
A sample is a subset of a given population intended to represent that population
Participants are those measured in a sample; the term always refers to individuals (e.g., students, children, prisoners)
A sample is supposed to generalize to a given population
The term participants is preferred over the older term subjects
Statistics use English letters to denote values computed from a sample
Parameters use Greek letters to denote values of a population
Statistic = Sample
Parameter = Population
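A quick sketch of the statistic/parameter distinction. The population here is simulated for illustration (the numbers are hypothetical, not from the lecture data):

```r
# Hypothetical example: a simulated "population" and a sample drawn from it
set.seed(1)
population <- rnorm(100000, mean = 100, sd = 15)  # the whole population of scores

mu    <- mean(population)        # parameter: a population value (Greek letter mu)
s     <- sample(population, 50)  # a sample of 50 participants
x_bar <- mean(s)                 # statistic: a sample value (English letter X-bar)

mu
x_bar  # the sample statistic estimates the population parameter
```

The statistic will differ from the parameter by sampling error, but with a representative sample it should land close.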
A raw score is the score given to a participant
Frequency denoted as f
Below we see the number of cars that have 4, 6, and 8 cylinders
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
Note. Lowercase f, not F; capital F denotes the F statistic
A Frequency Distribution is a distribution showing each score and the number of times that score occurs
We can use frequencies to get the proportion of cars that have 4, 6, and 8 cylinders
cyl_prop <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(),
            prop = n / nrow(mtcars)) %>%
  ungroup()
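The same proportions can also be obtained in base R with prop.table(); this is a sketch of an equivalent, dependency-free version of the dplyr pipeline above:

```r
# Proportion of cars with 4, 6, and 8 cylinders, base R version
cyl_prop <- prop.table(table(mtcars$cyl))
cyl_prop

# Proportions across all categories always sum to 1
sum(cyl_prop)
```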
One reliable way of visualizing frequencies is by using a bar graph
A Bar graph is a visualization with a vertical bar over each nominal/ordinal category
mtcars %>%
  ggplot(aes(as.factor(cyl))) +
  geom_bar(fill = 'dodgerblue')
The same frequencies can also be shown as a pie chart
mtcars %>%
  ggplot(aes(x = '',
             y = cyl,
             fill = as.factor(cyl))) +
  geom_bar(stat = 'identity',
           width = 1) +
  coord_polar('y', start = 0) +
  theme_void()
A histogram shows the frequency distribution of a continuous variable, such as mpg
mtcars %>%
  ggplot(aes(mpg)) +
  geom_histogram(color = 'white',
                 fill = 'dodgerblue',
                 bins = 15)
A line graph connects the frequency of each value with a line
mtcars %>%
  group_by(cyl) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  ggplot(aes(cyl, n)) +
  geom_line()
Negative Skew is an asymmetrical, non-normal distribution with a long tail toward the lower scores
Positive Skew is an asymmetrical, non-normal distribution with a long tail toward the higher scores
Common thresholds for acceptable skewness are \(\pm2\) or \(\pm3\)
Kurtosis describes how peaked or flat a frequency distribution is: very skinny and tall, or very flat and wide
A bimodal distribution is a distribution with two modes (we’ll cover the mode shortly)
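Base R has no built-in skewness or kurtosis function, so this sketch computes the standard moment-based versions by hand (packages such as e1071 or psych provide the same formulas; the helper names here are my own):

```r
# Moment-based skewness: 0 for a symmetric distribution,
# positive for a right tail, negative for a left tail
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

# Excess kurtosis: 0 for a normal curve,
# positive = tall/skinny, negative = flat/wide
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / (mean((x - m)^2))^2 - 3
}

skewness(mtcars$mpg)  # mildly positive: tail toward high mpg
kurtosis(mtcars$mpg)
```

Both values for mpg fall well inside the \(\pm2\) threshold mentioned above.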
\[ Relative\;frequency = \frac{f}{N} \]
f <- 10
N <- 200
freq <- f/N
freq
## [1] 0.05
percent <- freq*100
percent
## [1] 5
The proportion of area under the curve between two points is the proportion of all scores that fall between those points on the normal curve
A percentile is the percentage of all scores in the sample below a particular score
Cumulative Frequency is the number of scores in the data at or below a particular score
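Cumulative frequencies and percentiles can be sketched in R with cumsum() on a frequency table (the scores vector here is a small made-up example):

```r
scores <- c(3, 5, 4, 6, 6, 3, 5, 1, 8, 10)

# Cumulative frequency: number of scores at or below each value
freq <- table(scores)
cumsum(freq)

# Percentile of a particular score: percent of all scores below it
mean(scores < 6) * 100  # 6 of 10 scores fall below 6, i.e., the 60th percentile
```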
Three measures of central tendency
Measures of Central Tendency are statistics that summarize the location of a distribution on a variable by indicating where the center is
A Normal distribution has its central point right at the center, where mean, median, and mode coincide
In a skewed distribution, the peak sits where the frequency of scores is highest, while the mean is pulled toward the tail
The mode is the value with the highest frequency
calc <- c(3, 5, 4, 6, 6, 3, 5, 1, 8, 10)
calc
## [1] 3 5 4 6 6 3 5 1 8 10
table(calc)
## calc
## 1 3 4 5 6 8 10
## 1 2 1 2 2 1 1
The highest frequency (2) is actually shared by 3, 5, and 6, so this distribution is multimodal
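Base R's mode() returns an object's storage mode, not the statistical mode, so a small helper is needed. A sketch (stat_mode is a name I made up, not a base function):

```r
# Statistical mode: the value(s) with the highest frequency
# (base R's mode() does something unrelated)
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

calc <- c(3, 5, 4, 6, 6, 3, 5, 1, 8, 10)
stat_mode(calc)  # returns 3, 5, and 6 -- a multimodal distribution
```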
The median is the middle value
Preferred for ordinal/ordered data
More reliable for skewed data
To calculate the median, sort the scores from lowest to highest and take the middle value (with an even number of scores, average the two middle values)
We can see below that the two middle positions hold 6 and 3, but the data are not sorted; once sorted, both middle values are 5, so the median is 5
calc
## [1] 3 5 4 6 6 3 5 1 8 10
sort(calc)
## [1] 1 3 3 4 5 5 6 6 8 10
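With an even N the median is the average of the two middle sorted values; this is a sketch of the calculation that R's median() performs for us:

```r
calc   <- c(3, 5, 4, 6, 6, 3, 5, 1, 8, 10)
sorted <- sort(calc)
n      <- length(sorted)

# N is even (10), so average the 5th and 6th sorted values
manual_median <- mean(sorted[c(n / 2, n / 2 + 1)])
manual_median                  # 5
manual_median == median(calc)  # TRUE
```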
The Mean/Average
The mean is the score located at the mathematical center of a distribution
The formula below is to calculate the mean
\[ \overline{X} = \frac{\Sigma\;X}{N} \]
The mean is the basis for most inferential statistics
I’ll use the values we created for the mode
calc
## [1] 3 5 4 6 6 3 5 1 8 10
sum_x <- sum(calc)
mean_n <- length(calc)
average <- sum_x/mean_n
average
## [1] 5.1
median(calc)
## [1] 5
mean(calc)
## [1] 5.1
A deviation is the distance a participant’s score/value is from the mean
Deviations can be positive or negative
To get the deviation, you subtract the mean from each participant’s score
X is the value that corresponds to each participant (their raw score) and \(\overline{X}\) is the average for the sample
\[ X - \overline{X} \]
The larger the absolute value of the deviation, the farther the score/value is from the mean
Sum of the deviations around the mean is the sum of all differences between the scores and the mean
Deviations is the start for upcoming lectures and statistical tests, especially the sum of the deviations
\[ \Sigma(X - \overline{X}) \]
The sum of the deviations around the mean is always 0 (within rounding error), so checking that it comes out to 0 is a good way to verify your math
Deviation of each score/value from the mean is often referred to as error/residual in statistical tests
set.seed(06062022)
devs <- tibble(x = rnorm(n = 100, mean = 60, sd = 5)) %>%
  rowid_to_column()
devs$mean <- mean(devs$x)
devs$deviations <- devs$x - devs$mean
head(devs)
## # A tibble: 6 x 4
## rowid x mean deviations
## <int> <dbl> <dbl> <dbl>
## 1 1 73.2 60.0 13.2
## 2 2 56.2 60.0 -3.78
## 3 3 57.7 60.0 -2.27
## 4 4 59.8 60.0 -0.172
## 5 5 60.9 60.0 0.909
## 6 6 61.8 60.0 1.76
devs %>%
  ggplot(aes(rowid, x)) +
  geom_point(aes(color = deviations)) +
  geom_hline(yintercept = mean(devs$x),
             size = 1.25,
             color = 'red')
sum(devs$deviations)
## [1] -2.629008e-13
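The printed sum is not exactly 0 but about -2.6e-13: floating-point rounding, not a math error. A sketch of the idiomatic check that the deviations sum to (effectively) zero:

```r
set.seed(1)
x <- rnorm(100, mean = 60, sd = 5)
deviations <- x - mean(x)

sum(deviations)                        # tiny, but rarely exactly 0
isTRUE(all.equal(sum(deviations), 0))  # TRUE: zero within tolerance
```

all.equal() compares within a small numerical tolerance, which is the right test for sums that should be zero in exact arithmetic.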