Introduction

Common Designs

Experimental
- manipulates independent variable to see the effect it has on the dependent variable
Correlational
- does not manipulate anything
- looks at the relationship between two variables

Populations & Samples

A population is a large group of individuals, which a law of nature applies
A sample is a group of a given population intended to represent that population
Participants are those measured in a sample. Participants always refers to individuals (e.g., students, children, prisoners)
A sample is supposed to generalize to a given population
Participants > Subjects

Statistics vs Parameters

Statistics use English letters to get values for a sample
Parameter use Greek letters for values of a population
Statistic = Sample
Parameter = Population

Frequencies

A raw score is the score given to a participant
Frequency denoted as f
- number of times a score occurs
Below we see the amount of cars that have 4, 6, and 8 cylinders

table(mtcars$cyl)

## 
##  4  6  8 
## 11  7 14

Note. Not F, that is a
Frequency Distribution is a distribution of each score and the number of times the score has occurred
We can use frequencies to get the proportion of cars that have 4, 6, and 8 cylinders

cyl_prop <- mtcars %>% 
  group_by(cyl) %>% 
  summarize(n = n(),
            prop = n/nrow(mtcars)) %>% 
  ungroup()

We can see that there are mostly 8 cylinder cars (0.4375)

Visualizing Frequencies

One reliable way of visualizing frequencies is by using a bar graph
A Bar graph is a visualization with vertical bar over each nominal/ordinal category

mtcars %>% 
  ggplot(aes(as.factor(cyl))) + 
  geom_bar(fill = 'dodgerblue')

Note A pie graph will always be inferior to a bar graph/any other visual

mtcars %>% 
  ggplot(aes(x = '',
             y = cyl,
             fill = as.factor(cyl))) + 
  geom_bar(stat = 'identity',
           width = 1) + 
  coord_polar('y', start = 0) +
  theme_void()

A Histogram is a frequency visualization used primarily for interval or ratio scores

mtcars %>% 
  ggplot(aes(mpg)) + 
  geom_histogram(color = 'white',
                 fill = 'dodgerblue',
                 bins = 15)

A Frequency Polygon is similar to a histogram, which shows data points connected with straight lines

mtcars %>% 
  group_by(cyl) %>% 
  summarize(n = n()) %>% 
  ungroup() %>% 
  ggplot(aes(cyl, n)) + 
  geom_line()

Normal Distribution

The normal curve is often called the bell-shaped curve
- Symmetrical
A normal distribution is the same as the normal curve
- Represents the population
The distribution tails are the ends of the distribution
- We’ll come back to these

Skewed Distributions

Negative Skew is not normal and is asymmetrical
- Indicates higher frequency of middle and higher scores
Positive Skew is also not normal and is asymmetrical
- Indicates higher frequency of low and middle scores
Some thresholds for level of acceptable skewness is $\\pm2$ or $\\pm3$
- So if your value is smaller than this, you probably have acceptable skewness
Kurtosis is when your frequency are really skinny and tall or really flat and wide
A bimodal distribution is when your distribution has two medians (we’ll cover this shortly)
- Has two distributions that should be addressed

Frequency Calculations

$Relative\\;frequency = \\frac{f}{N}$

f <- 10
N <- 200

freq <- f/N
freq

## [1] 0.05

percent <- freq*100
percent

## [1] 5

Relative Frequency Using Normal Curve

The proportion of area under the curve is the proportion of total area under the normal curve
A percentile is the percentage of all scores in the sample below a particular score
Cumulative Frequency is the number of scores in the data at or below a particular score

Central Tendency

Three measures of central tendency
- Mode
- Median
- Mean
Measures of Central Tendency are statistics that summarize the location of a distribution on a variable by indicating where the center is
A Normal distribution will have the central point right at the center
A skewed distribution will have the central point where the frequency of scores is the highest
The mode is the value with the highest frequency

calc <- c(3, 5, 4, 6, 6, 3, 5, 1, 8, 10)
calc

##  [1]  3  5  4  6  6  3  5  1  8 10

table(calc)

## calc
##  1  3  4  5  6  8 10 
##  1  2  1  2  2  1  1

The value with the highest frequency is actually both 5 & 6
- So for this example, we actually have two modes
The median is the middle value
- The 50th percentile
Preferred for ordinal/ordered data
More reliable for skewed data
Calculating the median is taking the middle value in the order from lowest to highest scores
We can see below that the middle value is between 6 and 3, but these are not sorted, so we can sort them to get the correct median

calc

##  [1]  3  5  4  6  6  3  5  1  8 10

Sorted, we can see that the mode is between 5 and 5, which is 5
- If other values are included, then you would get the average of the two middle values
- In this case it is redundant but it would be $\\frac{5 + 5}{2} = 5$

sort(calc)

##  [1]  1  3  3  4  5  5  6  6  8 10

The Mean/Average
The mean is the score located at the mathematical center of a distribution
- Often denoted as $Mean = \\overline{X}$
The formula below is to calculate the mean
- Where you would take all the values for each participant and divide by the number of participants

$\\overline{X} = \\frac{\\Sigma\\;X}{N}$

The mean is the basis for most inferential statistics
I’ll use the values we randomly created for the mode

calc

##  [1]  3  5  4  6  6  3  5  1  8 10

sum_x <- sum(calc)
mean_n <- length(calc)

average <- sum_x/mean_n
average

## [1] 5.1

We could also simply use the R base functions, like mean and/or median

median(calc)

## [1] 5

mean(calc)

## [1] 5.1

Deviation

The distance a participant’s score/value is from the mean
Deviations can be positive or negative
- Participants can score lower (negative) than the mean and higher (positive) than the mean
To get the deviation, you subtract the mean from each participant’s score
X is the value that corresponds to each participant (their raw score) and $\\overline{X}$ is the average for the sample

$X - \\overline{X}$

The larger the value the farther away from the mean the score/value is
Sum of the deviations around the mean is the sum of all differences between the scores and the mean

In the Near Future

Deviations is the start for upcoming lectures and statistical tests, especially the sum of the deviations
$\\Sigma(X - \\overline{X})$
If the sum of the deviations is 0 then that means your math is good
Deviation of each score/value from the mean is often referred to as error/residual in statistical tests

set.seed(06062022)
devs <- tibble(x = rnorm(60, 5, n = 100)) %>% 
  rowid_to_column()

devs$mean <- mean(devs$x)
devs$deviations <- devs$x - devs$mean

head(devs)

## # A tibble: 6 x 4
##   rowid     x  mean deviations
##   <int> <dbl> <dbl>      <dbl>
## 1     1  73.2  60.0     13.2  
## 2     2  56.2  60.0     -3.78 
## 3     3  57.7  60.0     -2.27 
## 4     4  59.8  60.0     -0.172
## 5     5  60.9  60.0      0.909
## 6     6  61.8  60.0      1.76

We can also visualize the deviations away from the average
- We can see that some scores are farther from the average score, while some are extremely close to the mean

devs %>% 
  ggplot(aes(rowid, x)) + 
  geom_point(aes(color = deviations)) +
  geom_hline(yintercept = mean(devs$x),
             size = 1.25, 
             color = 'red')

We can also get the sum of the deviations, by adding everything together
- If our math is correct, it should be zero (or an extremely small value since R is a computer) to show that we calculated our deviations correctly

sum(devs$deviations)

## [1] -2.629008e-13

Last updated on Jun 6, 2022