library(GenderInfer)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
GenderInfer is a package for investigating gender differences within a data set. It is based on the work of Dr. A. Day et al., Chem. Sci., 2020, 11, 2277-2301, and was developed for analysing differences in publishing authorship by gender. It can also be useful for other analyses in which the female and male percentages may differ from a specified baseline. Gender is assigned from the first name, using the following data set as a corpus: https://github.com/OpenGenderTracking/globalnamedata. The corpus draws on several underlying data sources; see the repository for details.
In this vignette the example data frame `authors` contains random names (first and last name in each row), a country code and a publication year from 2016 to 2020. This data set lets us examine gender differences in article submissions to a journal.
head(authors)
#> first_name last_name country_code publication_years
#> 1 Claire Driver UK 2019
#> 2 Jedidiah el-Mansoor CH 2018
#> 3 Hayden Kim FR 2016
#> 4 Diana Sanchez CH 2018
#> 5 Cody Sterett CH 2020
#> 6 Shanea al-Rahaman FR 2017
The function `assign_gender` assigns a plausible gender to each row of the supplied data frame (`data_df`), based on the first name stored in the column specified by `first_name_col`. It returns a data frame similar to the input one, with a new column `gender` containing the values M (male), F (female) or U (unknown).
authors_df <- assign_gender(data_df = authors, first_name_col = "first_name")
head(authors_df)
#> first_name last_name country_code publication_years gender
#> 1 Sakeena Mcneal UK 2019 U
#> 2 A Aliyah Terrazas IT 2020 U
#> 3 Aakif al-Mussa IT 2019 U
#> 4 Aanisa Guo FR 2016 U
#> 5 Aaqil Mark FR 2019 M
#> 6 Aaron Rozinski US 2016 M
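As a rough illustration of name-based assignment, the sketch below matches first names against a small lookup table in base R. The table `name_corpus` and the data frame `toy_authors` are hypothetical and made up for this example. This is not the implementation of `assign_gender`, only the general idea of a case-insensitive lookup that falls back to U when a name is not found.

```r
# Hypothetical mini-corpus (assumption); the package itself relies on
# the globalnamedata corpus.
name_corpus <- data.frame(
  name = c("claire", "diana", "cody", "aaron"),
  gender = c("F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical input data with one name that is absent from the corpus.
toy_authors <- data.frame(
  first_name = c("Claire", "Cody", "Shanea"),
  stringsAsFactors = FALSE
)

# Case-insensitive lookup; names without a match are labelled "U".
idx <- match(tolower(toy_authors$first_name), name_corpus$name)
toy_authors$gender <- ifelse(is.na(idx), "U", name_corpus$gender[idx])
toy_authors
```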
We can now count how many female, male and unknown entries there are in the data frame, using the function `count` from the `dplyr` package.
## Count how many female, male and unknown gender there are in the data
authors_df %>% count(gender)
#> gender n
#> 1 F 396
#> 2 M 428
#> 3 U 176
## Count per gender and country
authors_df %>% count(gender, country_code)
#> gender country_code n
#> 1 F CH 81
#> 2 F FR 84
#> 3 F IT 71
#> 4 F UK 92
#> 5 F US 68
#> 6 M CH 91
#> 7 M FR 83
#> 8 M IT 77
#> 9 M UK 79
#> 10 M US 98
#> 11 U CH 32
#> 12 U FR 28
#> 13 U IT 35
#> 14 U UK 35
#> 15 U US 46
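To see how much of the data ends up without an assigned gender, the overall counts printed above (396 F, 428 M, 176 U) can be turned into shares. This is a small base R sketch; the vector name `gender_counts` is introduced only for this illustration.

```r
# Overall counts copied from the count(gender) output above.
gender_counts <- c(F = 396, M = 428, U = 176)

# Share of each category in percent, rounded to one decimal place.
gender_share <- round(100 * gender_counts / sum(gender_counts), 1)
gender_share
```

The share labelled U is the part of the data that is left out of every percentage calculation that follows.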
`GenderInfer` calculates the female baseline with the function `baseline`; this value is used in the subsequent statistical calculations and in the graphics. The baseline female percentage is calculated as:
\[baseline = \frac{Female}{Female + Male} \times 100 \]
Note that, following the methodology discussed in the paper, the Unknown totals are omitted when calculating any percentage (both the baselines and the female percentages compared with them). The analysis compares the female percentage of various sub-populations with this baseline in order to find those where the difference is significant. The baseline can also be calculated for different levels, such as year, country or other variables. The level is the variable we want to use for the comparison.
In the following case we calculate the baseline for the year range 2016-2019 to compare with 2020 for the whole data set.
## Calculate the baseline for the year range 2016-2019
baseline_female <- baseline(data_df = authors_df %>%
                              filter(publication_years %in% seq(2016, 2019)),
                            gender_col = "gender")
baseline_female
#> [1] 49
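The baseline value can be cross-checked by hand from counts reported in this vignette. The sketch below assumes that the 2016-2019 counts equal the overall totals (396 F, 428 M) minus the 2020 counts (71 F, 90 M), which holds if the data set only covers 2016 to 2020. The variable names are introduced for this example only.

```r
# Counts for 2016-2019, derived as overall totals minus the 2020 counts
# (assumption: the data set contains no other years).
female_2016_2019 <- 396 - 71
male_2016_2019 <- 428 - 90

# Baseline female percentage; unknown entries are left out of the denominator.
manual_baseline <- round(
  100 * female_2016_2019 / (female_2016_2019 + male_2016_2019), 1
)
manual_baseline
```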
The package provides the function `calculate_binom_baseline`, which applies a binomial test in which the number of females is the number of successes in a series of Bernoulli trials and the baseline value is the expected probability of success. The function tests whether the observed female percentage differs significantly from the baseline. Before the binomial test is run, the input data frame is reshaped into a new form.
First we count the genders for 2020. The variable we want to compare on in this case is `publication_years`, which allows a comparison with the previous year range. In this package the variable used for comparison is called `level`. The function `reshape_for_binomials` creates a new data frame containing the female and male percentages, the total for the level (`total_for_level`), which is the sum of female, male and unknown, and the sum of female and male (`total_female_male`).
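The derived columns described above can be reproduced with plain arithmetic. This sketch uses the 2020 counts reported in this vignette (71 female, 90 male, 31 unknown) and only mirrors the column definitions given in the text; it is not the package code.

```r
# 2020 counts as reported in this vignette.
female <- 71
male <- 90
unknown <- 31

# Totals with and without the unknown entries.
total_for_level <- female + male + unknown
total_female_male <- female + male

# Percentages are taken over female + male only.
female_percentage <- round(100 * female / total_female_male, 1)
male_percentage <- round(100 * male / total_female_male, 1)

c(total_for_level, total_female_male, female_percentage, male_percentage)
```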
## Create a data frame containing only the data from 2020 and
## the count of the variable gender.
female_count_2020 <- authors_df %>%
  filter(publication_years == 2020) %>%
  count(gender)
## Create a new data frame to be used for the binomial calculation.
df_gender <- reshape_for_binomials(data_df = female_count_2020,
                                   gender_col = "gender",
                                   level = 2020)
#df_gender <- test(female_count_2020, "gender", 2020)
df_gender
#> level female male unknown total_for_level total_female_male female_percentage
#> 1 2020 71 90 31 192 161 44.1
#> male_percentage
#> 1 55.9
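The binomial test described earlier can be sketched with `binom.test` from base R, using the 2020 counts just printed (71 females out of 161 female and male authors) and the baseline of 49%. This is an illustration under stated assumptions: the package may use a different confidence interval method or p-value adjustment internally, so the numbers need not match its output exactly.

```r
# Successes = number of females; trials = females + males (unknown excluded).
n_female <- 71
n_female_male <- 161

# Expected probability of success = baseline expressed as a proportion.
baseline_prop <- 49 / 100

# Exact two-sided binomial test at the default 0.95 confidence level.
res <- binom.test(x = n_female, n = n_female_male,
                  p = baseline_prop, conf.level = 0.95)

res$p.value          # p-value of the test
100 * res$conf.int   # confidence interval in percent
```

A p-value above 0.05, or a confidence interval that contains the baseline, indicates that the 2020 female percentage is not significantly different from the 2016-2019 baseline.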
The function `calculate_binom_baseline` also calculates the lower and upper bounds of the confidence interval (CI) and the significance. The default confidence level is 0.95. Before plotting the results, the function `total_gender_df` pivots the data into a longer format: the data frame now has more rows and fewer columns, with a column `gender` that holds the values female, male and unknown. The function `bar_chart` then creates a bar chart showing the number of female, male and unknown.
## Calculate the binomial
## Create a new column with the baseline and calculate the binomial.
df_gender <- calculate_binom_baseline(data_df = df_gender,
                                      baseline_female = baseline_female)
df_gender
#> level female male unknown total_for_level total_female_male female_percentage
#> 1 2020 71 90 31 192 161 44.1
#> male_percentage lower_CI upper_CI lower_CI_count upper_CI_count
#> 1 55.9 36.65 51.82 59.01 83.43
#> adjusted_p_value significance baseline
#> 1 0.2370283 49
## First reshape the data frame using `total_gender_df`, then create a bar
## chart showing the number of male, female and unknown gender with `bar_chart`
gender_total <- total_gender_df(data_df = df_gender, level = "level")

bar_chart(data_df = gender_total, x_label = "Year",
          y_label = "Total number")
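To make the idea of the longer format concrete, here is a base R sketch that turns one wide row of counts into one row per gender. The column names `gender` and `total` in the result are assumptions chosen for this illustration and may differ from what `total_gender_df` actually returns.

```r
# One wide row with the 2020 counts shown earlier in this vignette.
wide <- data.frame(level = 2020, female = 71, male = 90, unknown = 31)

# Long format: one row per gender category (hypothetical column names).
long <- data.frame(
  level = rep(wide$level, times = 3),
  gender = rep(c("female", "male", "unknown"), each = nrow(wide)),
  total = c(wide$female, wide$male, wide$unknown),
  stringsAsFactors = FALSE
)
long
```

The long layout has three rows instead of three count columns, which is the shape a grouped bar chart usually needs.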
The function `stacked_bar_chart` creates a stacked bar chart of the percentages. The chart shows the baseline together with the percentages of males and females.
## Reshape the data frame using the function `percent_df`.
## Add `coord_flip()` from ggplot2 to `stacked_bar_chart` to swap the x and y axes.
# percent_df(data_df = df_gender)
percent_data <- percent_df(data_df = df_gender)
stacked_bar_chart(percent_data, baseline_female = baseline_female,
                  x_label = "Year", y_label = "Percentage of authors",
                  baseline_label = "Female baseline 2016-2019:") +
  coord_flip()
We can now see how to calculate the baseline for several levels of the same variable and how to generate the graphics. In the example below we use the function `sapply` to generate the baseline values for `c("UK", "US")`. This produces a numeric vector containing two values, the first for "UK" and the second for "US". As before, we reshape the data with the function `reshape_for_binomials` and afterwards apply `calculate_binom_baseline`.
## Calculate binomials for US and UK.
## Filter the data frame for the countries UK and US and the year 2020, count
## gender per country, and reshape the result.
# as.data.frame(t(with(authors_df, tapply(n, list(gender), c))))
UK_US_df <- reshape_for_binomials(data_df = authors_df %>%
                                    filter(country_code %in% c("UK", "US"),
                                           publication_years == 2020) %>%
                                    count(gender, country_code),
                                  gender_col = "gender", level = "country_code")
## To calculate the baseline for each country we can use the function `sapply`
baseline_uk_us <- sapply(UK_US_df$level, function(x) {
  baseline(data_df = authors_df %>%
             filter(country_code %in% x, publication_years %in% seq(2016, 2019)),
           gender_col = "gender")
})
baseline_uk_us
#> [1] 54.0 41.4
UK_US_binom <- calculate_binom_baseline(data_df = UK_US_df,
                                        baseline_female = baseline_uk_us)
UK_US_binom
#> level female male unknown total_for_level total_female_male female_percentage
#> 1 UK 18 16 6 40 34 52.9
#> 2 US 15 23 7 45 38 39.5
#> male_percentage lower_CI upper_CI lower_CI_count upper_CI_count
#> 1 47.1 36.73 68.55 12.49 23.31
#> 2 60.5 25.57 55.31 9.72 21.02
#> adjusted_p_value significance baseline
#> 1 1.0000000 54.0
#> 2 0.8703157 41.4
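The pattern of computing one baseline per level with `sapply` can be tried on a small made-up data set without the package. In this sketch `toy` and the helper `female_pct` are hypothetical; the helper applies the baseline formula from this vignette (females over females plus males, unknown excluded).

```r
# Made-up data: four UK rows and five US rows (assumption for illustration).
toy <- data.frame(
  country_code = c("UK", "UK", "UK", "UK", "US", "US", "US", "US", "US"),
  gender = c("F", "F", "M", "U", "F", "M", "M", "M", "U"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: female percentage, ignoring unknown entries.
female_pct <- function(g) {
  round(100 * sum(g == "F") / sum(g %in% c("F", "M")), 1)
}

# One value per country, returned in the order of the input vector.
toy_baselines <- sapply(c("UK", "US"), function(cc) {
  female_pct(toy$gender[toy$country_code == cc])
})
toy_baselines
```

Because `sapply` preserves the order of its input, the resulting vector lines up row by row with a data frame whose `level` column has the same order.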
A bullet chart displays the baseline together with the female and male percentages for the UK and the US.
percent_uk_us <- percent_df(UK_US_binom)

bullet_chart <- bullet_chart(data_df = percent_uk_us,
                             baseline_female = baseline_uk_us,
                             x_label = "Countries", y_label = "% Authors",
                             baseline_label = "Female baseline for 2016-2019")
bullet_chart
With the `GenderInfer` package it is possible to combine a bullet chart and a line chart in the same graph. The bullet chart in this example shows the difference for the UK for each year from 2016 to 2020. Each bar is compared with the baseline of a reference country, here France, for the same year.
## Reshape the UK data per publication year
UK_df <- reshape_for_binomials(data_df = authors_df %>%
                                 filter(country_code == "UK") %>%
                                 count(gender, publication_years),
                               "gender", "publication_years")
UK_df
#> level female male unknown total_for_level total_female_male female_percentage
#> 1 2016 22 15 9 46 37 59.5
#> 2 2017 20 17 8 45 37 54.1
#> 3 2018 16 17 3 36 33 48.5
#> 4 2019 16 14 9 39 30 53.3
#> 5 2020 18 16 6 40 34 52.9
#> male_percentage
#> 1 40.5
#> 2 45.9
#> 3 51.5
#> 4 46.7
#> 5 47.1
## Create a baseline vector containing one value for each year from 2016 to 2020,
## using France as the country to compare with.
baseline_fr <- sapply(seq(2016, 2020), function(x) {
  baseline(data_df = authors_df %>%
             filter(country_code == "FR", publication_years %in% x),
           gender_col = "gender")
})
baseline_fr
#> [1] 65.5 48.6 53.3 43.9 43.3
UK_binom <- calculate_binom_baseline(UK_df, baseline_female = baseline_fr)

UK_binom
#> level female male unknown total_for_level total_female_male female_percentage
#> 1 2016 22 15 9 46 37 59.5
#> 2 2017 20 17 8 45 37 54.1
#> 3 2018 16 17 3 36 33 48.5
#> 4 2019 16 14 9 39 30 53.3
#> 5 2020 18 16 6 40 34 52.9
#> male_percentage lower_CI upper_CI lower_CI_count upper_CI_count
#> 1 40.5 43.46 73.68 16.08 27.26
#> 2 45.9 38.38 68.97 14.20 25.52
#> 3 51.5 32.50 64.78 10.73 21.38
#> 4 46.7 36.14 69.77 10.84 20.93
#> 5 47.1 36.73 68.55 12.49 23.31
#> adjusted_p_value significance baseline
#> 1 0.4896931 65.5
#> 2 0.5161637 48.6
#> 3 0.6045459 53.3
#> 4 0.3583854 43.9
#> 5 0.2998639 43.3
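One way to read the table above is to check, year by year, whether the baseline lies inside the confidence interval. The sketch below copies the `lower_CI`, `upper_CI` and `baseline` columns from the output. Treating an empty `significance` column as "not significant" is an assumption made here, not something the output states explicitly.

```r
# Values copied from the UK_binom output above (years 2016 to 2020).
years <- 2016:2020
lower_ci <- c(43.46, 38.38, 32.50, 36.14, 36.73)
upper_ci <- c(73.68, 68.97, 64.78, 69.77, 68.55)
baseline_values <- c(65.5, 48.6, 53.3, 43.9, 43.3)

# TRUE where the French baseline falls within the UK confidence interval.
baseline_inside <- baseline_values >= lower_ci & baseline_values <= upper_ci
setNames(baseline_inside, years)
```

Every year returns `TRUE`, which agrees with the adjusted p-values in the table all being well above 0.05.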
The line chart drawn on top of the bullet chart shows the total number of authors, across all gender categories, in this case per year.
## Reshape the binomial results using `percent_df`
percent_uk <- percent_df(UK_binom)
## Calculate the total number of submissions from the UK per year
total_uk <- authors_df %>%
  filter(country_code == "UK") %>%
  count(publication_years) %>%
  mutate(x_values = factor(publication_years,
                           levels = publication_years))
## Conversion factor to create the second y-axis
c <- min(total_uk$n) / 100
bullet_line_chart(data_df = percent_uk, baseline_female = baseline_fr,
                  x_label = "year", y_bullet_chart_label = "Authors submission (%)",
                  baseline_label = "French Female baseline",
                  line_chart_df = total_uk,
                  line_chart_scaling = c, y_line_chart_label = "Total number",
                  line_label = "Total submission UK")
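The conversion factor can be worked through with the UK yearly totals that appear as `total_for_level` in the output above (46, 45, 36, 39, 40). How `bullet_line_chart` applies the factor internally is not shown in this vignette, so the division in the last step is only an assumption about a typical secondary-axis rescaling, in which the counts are divided by the factor to place them on the percentage axis.

```r
# UK totals per year (2016 to 2020), copied from the total_for_level column.
uk_totals <- c(46, 45, 36, 39, 40)

# Same formula as in the chunk above: smallest yearly total divided by 100.
scaling <- min(uk_totals) / 100
scaling

# Assumed rescaling: counts expressed on the scale of the percentage axis.
rescaled <- uk_totals / scaling
round(rescaled, 1)
```

With this choice the smallest yearly total is mapped to 100 on the percentage axis, and the other years sit proportionally above it.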