The first step in data exploration usually consists of univariate, descriptive analysis of all variables of interest. Tidycomm offers four basic functions to quickly output relevant statistics:
describe() for continuous variables
tab_percentiles() for continuous variables
describe_cat() for categorical variables
tab_frequencies() for categorical variables
For demonstration purposes, we will use sample data from the Worlds of Journalism 2012-16 study included in tidycomm.
WoJ
#> # A tibble: 1,200 × 15
#> country reach employment temp_contract autonomy_selection autonomy_emphasis
#> <fct> <fct> <chr> <fct> <dbl> <dbl>
#> 1 Germany Nati… Full-time Permanent 5 4
#> 2 Germany Nati… Full-time Permanent 3 4
#> 3 Switzerl… Regi… Full-time Permanent 4 4
#> 4 Switzerl… Local Part-time Permanent 4 5
#> 5 Austria Nati… Part-time Permanent 4 4
#> 6 Switzerl… Local Freelancer <NA> 4 4
#> 7 Germany Local Full-time Permanent 4 4
#> 8 Denmark Nati… Full-time Permanent 3 3
#> 9 Switzerl… Local Full-time Permanent 5 5
#> 10 Denmark Nati… Full-time Permanent 2 4
#> # ℹ 1,190 more rows
#> # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
#> # ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
#> # trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
describe() outputs several measures of central tendency and variability for all variables named in the function call:
WoJ %>%
  describe(autonomy_selection, autonomy_emphasis, work_experience)
#> # A tibble: 3 × 15
#> Variable N Missing M SD Min Q25 Mdn Q75 Max Range
#> * <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonomy_selec… 1197 3 3.88 0.803 1 4 4 4 5 4
#> 2 autonomy_empha… 1195 5 4.08 0.793 1 4 4 5 5 4
#> 3 work_experience 1187 13 17.8 10.9 1 8 17 25 53 52
#> # ℹ 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
#> # Kurtosis <dbl>
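To make the columns above concrete, here is a minimal base R sketch that recomputes a few of the reported measures (N, Missing, M, SD, Min, Mdn, Max, Range) for a single toy vector. The vector `x` and the object `toy_summary` are hypothetical and are not part of tidycomm or the WoJ data. The sketch only illustrates what the statistics mean; it is not a description of how describe() works internally.

```r
# Toy vector (hypothetical values, not taken from WoJ); NA marks a missing answer
x <- c(1, 3, 4, 4, 5, NA)

# Recompute some of the measures by hand, ignoring missing values
toy_summary <- data.frame(
  N       = sum(!is.na(x)),          # number of valid cases
  Missing = sum(is.na(x)),           # number of missing cases
  M       = mean(x, na.rm = TRUE),   # arithmetic mean
  SD      = sd(x, na.rm = TRUE),     # sample standard deviation
  Min     = min(x, na.rm = TRUE),
  Mdn     = median(x, na.rm = TRUE),
  Max     = max(x, na.rm = TRUE)
)
toy_summary$Range <- toy_summary$Max - toy_summary$Min

toy_summary
```

Missing values are dropped before each statistic is computed, which is why N and Missing are reported side by side.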
If no variables are passed to describe(), all numeric variables in the data are described:
WoJ %>%
  describe()
#> # A tibble: 11 × 15
#> Variable N Missing M SD Min Q25 Mdn Q75 Max Range
#> * <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonomy_sele… 1197 3 3.88 0.803 1 4 4 4 5 4
#> 2 autonomy_emph… 1195 5 4.08 0.793 1 4 4 5 5 4
#> 3 ethics_1 1200 0 1.63 0.892 1 1 1 2 5 4
#> 4 ethics_2 1200 0 3.21 1.26 1 2 4 4 5 4
#> 5 ethics_3 1200 0 2.39 1.13 1 2 2 3 5 4
#> 6 ethics_4 1200 0 2.58 1.25 1 1.75 2 4 5 4
#> 7 work_experien… 1187 13 17.8 10.9 1 8 17 25 53 52
#> 8 trust_parliam… 1200 0 3.05 0.811 1 3 3 4 5 4
#> 9 trust_governm… 1200 0 2.82 0.854 1 2 3 3 5 4
#> 10 trust_parties 1200 0 2.42 0.736 1 2 2 3 4 3
#> 11 trust_politic… 1200 0 2.52 0.712 1 2 3 3 4 3
#> # ℹ 4 more variables: CI_95_LL <dbl>, CI_95_UL <dbl>, Skewness <dbl>,
#> # Kurtosis <dbl>
Data can be grouped before describing:
WoJ %>%
  dplyr::group_by(country) %>%
  describe(autonomy_emphasis, autonomy_selection)
#> # A tibble: 10 × 16
#> # Groups: country [5]
#> country Variable N Missing M SD Min Q25 Mdn Q75 Max
#> * <fct> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria autonomy… 205 2 4.19 0.614 2 4 4 5 5
#> 2 Denmark autonomy… 375 1 3.90 0.856 1 4 4 4 5
#> 3 Germany autonomy… 172 1 4.34 0.818 1 4 5 5 5
#> 4 Switzerland autonomy… 233 0 4.07 0.694 1 4 4 4 5
#> 5 UK autonomy… 210 1 4.08 0.838 2 4 4 5 5
#> 6 Austria autonomy… 207 0 3.92 0.637 2 4 4 4 5
#> 7 Denmark autonomy… 376 0 3.76 0.892 1 3 4 4 5
#> 8 Germany autonomy… 172 1 3.97 0.881 1 3 4 5 5
#> 9 Switzerland autonomy… 233 0 3.92 0.628 1 4 4 4 5
#> 10 UK autonomy… 209 2 3.91 0.867 1 3 4 5 5
#> # ℹ 5 more variables: Range <dbl>, CI_95_LL <dbl>, CI_95_UL <dbl>,
#> # Skewness <dbl>, Kurtosis <dbl>
The results returned by describe() can also be visualized.
In addition, percentiles can easily be extracted from continuous variables:
WoJ %>%
  tab_percentiles()
#> # A tibble: 11 × 11
#> Variable p10 p20 p30 p40 p50 p60 p70 p80 p90 p100
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 autonomy_selecti… 3 3 4 4 4 4 4 4 5 5
#> 2 autonomy_emphasis 3 4 4 4 4 4 4 5 5 5
#> 3 ethics_1 1 1 1 1 1 2 2 2 3 5
#> 4 ethics_2 1 2 2 3 4 4 4 4 5 5
#> 5 ethics_3 1 1 2 2 2 2 3 4 4 5
#> 6 ethics_4 1 1 2 2 2 3 3 4 4 5
#> 7 work_experience 4 7 10 14 17 20 25 28 33 53
#> 8 trust_parliament 2 2 3 3 3 3 3 4 4 5
#> 9 trust_government 2 2 2 3 3 3 3 4 4 5
#> 10 trust_parties 1 2 2 2 2 3 3 3 3 4
#> 11 trust_politicians 2 2 2 2 3 3 3 3 3 4
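As a rough illustration of what the p10 to p100 columns represent, the following base R sketch computes deciles for a toy vector with `quantile()`. The vector `x` and the names `pct` and `p` are hypothetical. The sketch uses `quantile()`'s default interpolation method, which is an assumption and may differ from the method tab_percentiles() uses.

```r
# Toy vector (hypothetical values, not taken from WoJ)
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 20)

# Percentile levels 10, 20, ..., 100
pct <- seq(10, 100, by = 10)

# quantile() expects probabilities between 0 and 1
p <- quantile(x, probs = pct / 100, na.rm = TRUE)
names(p) <- paste0("p", pct)

p
```

By construction, p50 is the median and p100 is the maximum, which matches the Mdn and Max columns reported by describe().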
Percentiles can also be visualized.
describe_cat() outputs a short summary (number of unique values, mode, N of mode) of all categorical variables named in the function call:
WoJ %>%
  describe_cat(reach, employment, temp_contract)
#> # A tibble: 3 × 6
#> Variable N Missing Unique Mode Mode_N
#> * <chr> <int> <int> <dbl> <chr> <int>
#> 1 reach 1200 0 4 National 617
#> 2 employment 1200 0 3 Full-time 902
#> 3 temp_contract 1001 199 2 Permanent 948
If no variables are passed to describe_cat(), all categorical variables (i.e., character and factor variables) in the data are described:
WoJ %>%
  describe_cat()
#> # A tibble: 4 × 6
#> Variable N Missing Unique Mode Mode_N
#> * <chr> <int> <int> <dbl> <chr> <int>
#> 1 country 1200 0 5 Denmark 376
#> 2 reach 1200 0 4 National 617
#> 3 employment 1200 0 3 Full-time 902
#> 4 temp_contract 1001 199 2 Permanent 948
Data can be grouped before describing:
WoJ %>%
  dplyr::group_by(reach) %>%
  describe_cat(country, employment)
#> # A tibble: 8 × 7
#> # Groups: reach [4]
#> reach Variable N Missing Unique Mode Mode_N
#> * <fct> <chr> <int> <int> <dbl> <chr> <int>
#> 1 Local country 149 0 5 Germany 47
#> 2 Regional country 355 0 5 Switzerland 90
#> 3 National country 617 0 5 Denmark 262
#> 4 Transnational country 79 0 4 UK 72
#> 5 Local employment 149 0 3 Full-time 111
#> 6 Regional employment 355 0 3 Full-time 287
#> 7 National employment 617 0 3 Full-time 438
#> 8 Transnational employment 79 0 3 Full-time 66
Again, the results from describe_cat() can be visualized.
tab_frequencies() outputs absolute and relative frequencies of all unique values of one or more categorical variables:
WoJ %>%
  tab_frequencies(employment)
#> # A tibble: 3 × 5
#> employment n percent cum_n cum_percent
#> * <chr> <int> <dbl> <int> <dbl>
#> 1 Freelancer 172 0.143 172 0.143
#> 2 Full-time 902 0.752 1074 0.895
#> 3 Part-time 126 0.105 1200 1
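The relationship between the n, percent, cum_n and cum_percent columns can be sketched in base R. The toy vector `employment` and the data frame `freq` below are hypothetical and only mirror the column names of the output above. This is not tidycomm's implementation.

```r
# Toy categorical variable (hypothetical values, not taken from WoJ)
employment <- c("Full-time", "Part-time", "Full-time", "Freelancer", "Full-time")

# Absolute frequencies per unique value (sorted alphabetically by table())
n <- table(employment)

freq <- data.frame(
  employment = names(n),
  n          = as.integer(n),
  percent    = as.numeric(prop.table(n))  # share of all cases
)

# Cumulative versions are running sums over the rows
freq$cum_n       <- cumsum(freq$n)
freq$cum_percent <- cumsum(freq$percent)

freq
```

The last row of cum_n always equals the total number of cases, and the last row of cum_percent equals 1.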
Passing more than one variable will compute relative frequencies based on all combinations of unique values:
WoJ %>%
  tab_frequencies(employment, country)
#> # A tibble: 15 × 6
#> employment country n percent cum_n cum_percent
#> * <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Freelancer Austria 16 0.0133 16 0.0133
#> 2 Freelancer Denmark 85 0.0708 101 0.0842
#> 3 Freelancer Germany 29 0.0242 130 0.108
#> 4 Freelancer Switzerland 10 0.00833 140 0.117
#> 5 Freelancer UK 32 0.0267 172 0.143
#> 6 Full-time Austria 165 0.138 337 0.281
#> 7 Full-time Denmark 275 0.229 612 0.51
#> 8 Full-time Germany 139 0.116 751 0.626
#> 9 Full-time Switzerland 154 0.128 905 0.754
#> 10 Full-time UK 169 0.141 1074 0.895
#> 11 Part-time Austria 26 0.0217 1100 0.917
#> 12 Part-time Denmark 16 0.0133 1116 0.93
#> 13 Part-time Germany 5 0.00417 1121 0.934
#> 14 Part-time Switzerland 69 0.0575 1190 0.992
#> 15 Part-time UK 10 0.00833 1200 1
You can also group your data beforehand. This yields within-group relative frequencies:
WoJ %>%
  dplyr::group_by(country) %>%
  tab_frequencies(employment)
#> # A tibble: 15 × 6
#> # Groups: country [5]
#> employment country n percent cum_n cum_percent
#> * <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Freelancer Austria 16 0.0773 16 0.0773
#> 2 Full-time Austria 165 0.797 181 0.874
#> 3 Part-time Austria 26 0.126 207 1
#> 4 Freelancer Denmark 85 0.226 85 0.226
#> 5 Full-time Denmark 275 0.731 360 0.957
#> 6 Part-time Denmark 16 0.0426 376 1
#> 7 Freelancer Germany 29 0.168 29 0.168
#> 8 Full-time Germany 139 0.803 168 0.971
#> 9 Part-time Germany 5 0.0289 173 1
#> 10 Freelancer Switzerland 10 0.0429 10 0.0429
#> 11 Full-time Switzerland 154 0.661 164 0.704
#> 12 Part-time Switzerland 69 0.296 233 1
#> 13 Freelancer UK 32 0.152 32 0.152
#> 14 Full-time UK 169 0.801 201 0.953
#> 15 Part-time UK 10 0.0474 211 1
(Compare the columns percent, cum_n and cum_percent with the output above.)
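The difference between overall and within-group relative frequencies comes down to the denominator. A minimal base R sketch with a hypothetical toy data frame (`toy`, with made-up countries "A" and "B") illustrates this. It uses `prop.table()` rather than tidycomm, so it shows the idea only.

```r
# Toy data (hypothetical values, not taken from WoJ)
toy <- data.frame(
  country    = c("A", "A", "A", "B"),
  employment = c("Full-time", "Full-time", "Part-time", "Full-time")
)

# Cross-tabulate: rows are countries, columns are employment types
tab <- table(toy$country, toy$employment)

# Overall shares: each cell divided by the total number of cases
overall <- prop.table(tab)

# Within-group shares: each cell divided by its row (country) total
within <- prop.table(tab, margin = 1)

overall
within
```

In `overall`, all cells together sum to 1, whereas in `within` each country's row sums to 1. This is the same pattern that distinguishes the ungrouped from the grouped output above.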
And of course, the results from tab_frequencies() can easily be visualized as well.
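The visualizations mentioned in this section can be sketched as follows, assuming the installed tidycomm version provides visualize() and that it accepts the output of each of the four functions directly. The object names (`p_describe` and so on) are hypothetical and only used so that the plots can be printed or inspected later.

```r
library(tidycomm)

# Assumption: visualize() is available in the installed tidycomm version
# and returns a plot object for each kind of result.

# Numeric summaries
p_describe <- WoJ %>%
  describe() %>%
  visualize()

# Percentiles
p_percentiles <- WoJ %>%
  tab_percentiles() %>%
  visualize()

# Categorical summaries
p_describe_cat <- WoJ %>%
  describe_cat() %>%
  visualize()

# Frequencies of a single categorical variable
p_frequencies <- WoJ %>%
  tab_frequencies(employment) %>%
  visualize()

# Print one of the plots
p_describe
```

Because visualize() is simply appended to the end of the pipe, the same variable selections and groupings shown earlier can be kept unchanged in front of it.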