Salary Project
This assignment was done in collaboration with Nereid Kwok, Nisa Ozer, Thomas Giannetti-Fakhouri, and Kazmer Nagy-Betegh as a part of a homework assignment in our Applied Statistics class at London Business School taught by Kostis Christodoulou.
Omega Group plc - Pay Discrimination
At the last board meeting of Omega Group plc, the headquarters of a large multinational company, the issue was raised that women were being discriminated against in the company, in the sense that salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.
You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.
Loading the data
omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
skim(omega)
| Name | omega |
| Number of rows | 50 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| salary | 0 | 1 | 68717 | 8638.2 | 47033 | 63303.16 | 68847 | 74777.7 | 84576 | ▁▃▇▆▅ |
| experience | 0 | 1 | 14 | 11.9 | 0 | 2.25 | 15 | 20.8 | 44 | ▇▃▅▂▁ |
Relationship Salary - Gender ?
The data frame omega contains the salaries for the sample of 50 executives in the company. Can you conclude that there is a significant difference between the salaries of the male and female executives?
Note that you can perform different types of analyses and check whether they all lead to the same conclusion: confidence intervals, hypothesis testing, correlation analysis, and regression.
Calculate summary statistics on salary by gender. Also, create and print a dataframe where, for each gender, you show the mean, SD, sample size, the t-critical value, the SE, the margin of error, and the low/high endpoints of a 95% confidence interval.
# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 47033 60338 64618 70033 78800 64543 7567 26 0
## 2 male 54768 68331 74675 78568 84576 73239 7463 24 0
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size, the t-critical value, the standard error, the margin of error, and the low/high endpoints of a 95% confidence interval
sumstat <- omega %>%
  group_by(gender) %>%
  summarise(mean = mean(salary),
            n = n(), # group size; count() expects a data frame and fails inside summarise()
            sd = sd(salary),
            t_critical = qt(0.975, n - 1),
            se = sd / sqrt(n),
            margin_of_error = t_critical * se,
            low_CI = mean - margin_of_error,
            high_CI = mean + margin_of_error)
sumstat
What can you conclude from your analysis? A couple of sentences would be enough
The confidence intervals for mean salary by gender do not overlap, so we have sufficient statistical evidence to conclude that mean salary differs by gender. We therefore expect a hypothesis test to give a p-value below 0.05, and we run one next to confirm.
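The interval arithmetic behind this conclusion can be reproduced from the rounded `favstats()` output alone. The following is a minimal base R sketch, not the tidyverse pipeline used in the assignment; the `stats` data frame and the `intervals_disjoint` flag are names introduced here for illustration, and the inputs are the rounded means, SDs, and sample sizes printed above, so the endpoints may differ slightly from those computed on the raw data.

```r
# Group summaries copied from the favstats() output (rounded values)
stats <- data.frame(
  gender = c("female", "male"),
  mean   = c(64543, 73239),
  sd     = c(7567, 7463),
  n      = c(26, 24)
)

# 95% t-based confidence interval for each group mean
stats$t_critical      <- qt(0.975, df = stats$n - 1)
stats$se              <- stats$sd / sqrt(stats$n)
stats$margin_of_error <- stats$t_critical * stats$se
stats$low_CI          <- stats$mean - stats$margin_of_error
stats$high_CI         <- stats$mean + stats$margin_of_error

print(stats)

# TRUE when the female upper endpoint lies below the male lower endpoint
intervals_disjoint <- stats$high_CI[1] < stats$low_CI[2]
print(intervals_disjoint)
```

Non-overlap of two separate intervals is a conservative check: when they do not overlap the difference is significant at that level, but overlapping intervals would not by themselves rule out a significant difference.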
You can also run a hypothesis test, taking as the null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money. You should run your hypothesis test using t.test() and with the simulation method from the infer package.
# hypothesis testing using t.test()
t.test(salary ~ gender, data = omega)
##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
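As a cross-check on the output above, the Welch statistic, its degrees of freedom, the p-value, and the confidence interval can be recomputed by hand. This sketch assumes only the rounded group summaries from `favstats()`; the variable names are introduced here for illustration, and small discrepancies from `t.test()` on the raw data are expected because of rounding.

```r
# Rounded group summaries from the favstats() output
m_f <- 64543; s_f <- 7567; n_f <- 26   # female
m_m <- 73239; s_m <- 7463; n_m <- 24   # male

# Per-group variance of the sample mean
v_f <- s_f^2 / n_f
v_m <- s_m^2 / n_m

# Standard error of the difference and the Welch t statistic
se_diff <- sqrt(v_f + v_m)
t_stat  <- (m_f - m_m) / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci      <- (m_f - m_m) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

print(c(t = t_stat, df = df_welch, p = p_value))
print(ci)
```

The printed values can be compared directly with the `t`, `df`, `p-value`, and interval lines reported by `t.test()`.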
# hypothesis testing using infer package
set.seed(1234)
null_dist_salary <- omega %>%
# specify variables
specify(salary ~ gender) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
calculate(stat = "diff in means", order = c("female", "male"))
obs_diff3 <- omega %>%
specify(salary ~ gender) %>%
calculate(stat = "diff in means", order = c("female", "male"))
obs_diff3
## Response: salary (numeric)
## Explanatory: gender (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -8696.
#visualise the null distribution
null_dist_salary %>% visualize() +
shade_p_value(obs_stat = obs_diff3, direction = "two-sided")

null_dist_salary %>%
get_p_value(obs_stat = obs_diff3, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
What can you conclude from your analysis? A couple of sentences would be enough
Although the infer package returns a p-value of 0, a true p-value cannot be exactly 0. With 1,000 permutations, a reported 0 means that none of the simulated differences was as extreme as the observed one, so the p-value is very small (below roughly 1/1000) rather than zero. The t-test gives a p-value of 2e-04, which is consistent with this. We therefore reject the null hypothesis and conclude that there is a difference between the mean salaries of men and women.
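To see why a simulated p-value of 0 is a resolution limit and not a true zero, here is a minimal base R sketch of a permutation test. `perm_p_value()` is a hypothetical helper written for this illustration. It uses a simplified absolute-difference rule, not infer's exact algorithm, and it is run on synthetic data, not on omega.csv. Alongside the raw proportion it reports the commonly used (k + 1) / (reps + 1) adjustment, which can never be 0.

```r
# Hypothetical helper: two-sided permutation p-value for a difference in means.
# Simplified illustration; this is not how infer computes its p-value internally.
perm_p_value <- function(y, group, reps = 1000) {
  groups <- sort(unique(group))
  diff_means <- function(g) mean(y[g == groups[1]]) - mean(y[g == groups[2]])

  observed <- diff_means(group)
  # Shuffle the group labels to simulate the null of no association
  permuted <- replicate(reps, diff_means(sample(group)))

  # Count permuted differences at least as extreme as the observed one
  extreme <- sum(abs(permuted) >= abs(observed))
  c(raw = extreme / reps, adjusted = (extreme + 1) / (reps + 1))
}

# Synthetic data for illustration only (NOT the omega salaries)
set.seed(1234)
toy_y     <- c(rnorm(26, mean = 0), rnorm(24, mean = 3))
toy_group <- rep(c("female", "male"), times = c(26, 24))

res <- perm_p_value(toy_y, toy_group)
print(res)
```

When no permuted difference is as extreme as the observed one, the raw proportion is 0 while the adjusted value is 1 / (reps + 1), which makes the lower bound imposed by the number of repetitions explicit.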
Relationship Experience - Gender?
At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire sent out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women have only about 7 years of experience on average (see table below).
# Summary Statistics of experience by gender
favstats (experience ~ gender, data=omega)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?
H0: There is no difference between the mean experience of males and females, mean(f) - mean(m) = 0.
HA: There is a difference between the mean experience of males and females, mean(f) - mean(m) != 0.
# manual CI calculation:
sumstatExp <- omega %>%
  group_by(gender) %>%
  summarise(mean = mean(experience),
            n = n(), # group size; count() expects a data frame and fails inside summarise()
            sd = sd(experience),
            t_critical = qt(0.975, n - 1),
            se = sd / sqrt(n),
            margin_of_error = t_critical * se,
            low_CI = mean - margin_of_error,
            high_CI = mean + margin_of_error)
sumstatExp
#t-test:
t.test(experience ~ gender, data = omega)
##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
The confidence intervals do not overlap and the t-test gives a p-value of 1e-05, so we can reject the null hypothesis: there is sufficient statistical evidence to conclude that the mean experience of men and women differs.
Relationship Salary - Experience ?
Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.
Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data
ggplot(omega, aes(x=experience, y=salary )) +
geom_point()+
geom_smooth()+
labs(title="Scatterplot of Experience vs Salary")+
theme_bw()
There’s a positive relationship between experience and salary up until 30 years, and then the trend line seems to flatten.
Check correlations between the data
You can use GGally::ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order in which our variables appear and put the dependent variable (Y), salary, last in the list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make the plots somewhat transparent (alpha = 0.3).
omega %>%
select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()

Look at the salary vs experience scatterplot. What can you infer from this plot? Explain in a couple of sentences
Women have much less experience and lower salaries than men, but salary is more strongly correlated with experience among women than among men. Men have more experience, yet their salaries are less closely tied to it. It is notable that women have not worked at the company as long. We could look into why women have less experience at the company: is it because they leave to have children, or because they are not being promoted to management roles?
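The within-group correlations that ggpairs() displays can also be computed directly. Below is a small base R sketch; `cor_by_group()` is a hypothetical helper introduced here, and the `toy` data frame contains made-up rows purely to show the mechanics. Assuming the columns are named as in omega.csv, calling `cor_by_group(omega)` would give the figures for the actual sample.

```r
# Hypothetical helper: Pearson correlation of experience and salary within each gender.
# Assumes the data frame has columns named gender, experience, and salary.
cor_by_group <- function(df) {
  sapply(split(df, df$gender), function(d) cor(d$experience, d$salary))
}

# Made-up toy rows for illustration only (NOT taken from omega.csv)
toy <- data.frame(
  gender     = rep(c("female", "male"), each = 4),
  experience = c(0, 2, 4, 6, 10, 20, 30, 40),
  salary     = c(50, 54, 58, 62, 70, 72, 71, 73)
)

cors <- cor_by_group(toy)
print(cors)
```

Splitting by gender before correlating matters because a pooled correlation mixes the between-group gap in experience with the within-group relationship.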