Short for automatic knowledge classification, akc is an R package for keyword classification based on network science (mainly community detection techniques), using bibliometric data. The functions it provides are general, however, and could be extended to other text mining tasks as well. The main functions are listed below:
- keyword_clean: Automatic keyword cleaning and conversion to tidy format
- keyword_extract: Extract keywords from raw text
- keyword_merge: Merge keywords that are supposed to have the same meaning
- keyword_group: Construct a network from a tidy table and divide the keywords into groups
- keyword_table: Display a table of the keywords in each group
- keyword_vis: Visualize the grouped keyword co-occurrence network

akc generally provides a tidy data manipulation framework supported by dplyr, and was written in data.table where necessary to guarantee performance for big data analysis. It also uses the text mining functions provided by stringr, tidytext and textstem, and the network analysis functions provided by igraph, tidygraph and ggraph. The pipe %>% is exported from magrittr and can be used directly in akc.
# load packages
library(akc)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# inspect the built-in data
bibli_data_table
#> # A tibble: 1,448 × 4
#> id title keyword abstr…¹
#> <int> <chr> <chr> <chr>
#> 1 1 Keeping the doors open in an age of austerity? Qualita… Auster… "Engli…
#> 2 2 Comparison of Slovenian and Korean library laws Compar… "This …
#> 3 3 Analysis of the factors affecting volunteering, satisf… Contin… "This …
#> 4 4 Redefining Library and Information Science education a… Curric… "The p…
#> 5 5 Can in-house use data of print collections shed new li… Check-… "Libra…
#> 6 6 Practices of community representatives in exploiting i… Commun… "The p…
#> 7 7 Exploring Becoming, Doing, and Relating within the inf… Librar… "Profe…
#> 8 8 Predictors of burnout in public library employees Emotio… "Work …
#> 9 9 The Roma and documentary film: Considerations for coll… Academ… "This …
#> 10 10 Mediation effect of knowledge management on the relati… Job pe… "This …
#> # … with 1,438 more rows, and abbreviated variable name ¹abstract
#> # ℹ Use `print(n = ...)` to see more rows
The data set contains bibliometric data on the topic of “academic library”. It is a data.frame with 4 columns (document ID, article title, keyword and abstract); more information can be found via ?bibli_data_table. Users who want to carry out tasks by simply copying the example code should arrange their data in the same format as bibli_data_table and give the corresponding columns the same names.
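As a minimal sketch of arranging your own data into this layout, the snippet below renames the columns of a made-up table. The object my_records and its original column names are hypothetical and only serve as an illustration; they are not part of akc.

```r
library(dplyr)

# Hypothetical input whose column names differ from bibli_data_table
my_records <- tibble(
  doc_no = 1:2,
  doc_title = c("A study of library use", "Open access in practice"),
  author_keywords = c("Libraries; Usage", "Open access; Publishing"),
  summary = c("This paper studies library use.", "This paper reviews open access.")
)

# Rename the columns so that they match id, title, keyword and abstract
my_data <- my_records %>%
  rename(
    id = doc_no,
    title = doc_title,
    keyword = author_keywords,
    abstract = summary
  )

names(my_data)
```

With the columns named this way, the example code in this document should be usable by replacing bibli_data_table with my_data.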
The cleaning process consists of the following steps:
1. Split the text by separators;
2. Remove the content in parentheses (including the parentheses);
3. Remove whitespace from the start and end of each string and collapse repeated whitespace inside it;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization (not enabled by default, because it is only recommended when a relatively rough result is acceptable; for better merging, use keyword_merge, shown below).
bibli_data_table %>%
  keyword_clean() -> clean_data
clean_data
#> # A tibble: 5,378 × 2
#> id keyword
#> <int> <chr>
#> 1 1 austerity
#> 2 1 community capacity
#> 3 1 library professional
#> 4 1 public libraries
#> 5 1 public service delivery
#> 6 1 volunteer relationship management
#> 7 1 volunteering
#> 8 2 comparative librarianship
#> 9 2 korea
#> 10 2 library legislation
#> # … with 5,368 more rows
#> # ℹ Use `print(n = ...)` to see more rows
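The cleaning steps listed above can be approximated outside akc with tidyr and stringr. The following is a minimal sketch on made-up data, not the implementation used by keyword_clean; the toy table raw and the regular expressions are assumptions for illustration, and the optional lemmatization step is left out.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy input with the same id and keyword columns as the built-in data
raw <- tibble(
  id = 1:2,
  keyword = c("Public Libraries (UK);  Volunteering ; 2016",
              "Open   Access;;Korea")
)

cleaned <- raw %>%
  # 1. split the text on the separator, one keyword per row
  separate_rows(keyword, sep = ";") %>%
  mutate(
    # 2. remove content in parentheses, including the parentheses
    keyword = str_remove_all(keyword, "\\([^)]*\\)"),
    # 3. trim the ends and collapse repeated whitespace
    keyword = str_squish(keyword)
  ) %>%
  # 4. drop empty strings and pure number sequences
  filter(keyword != "", !str_detect(keyword, "^[0-9]+$")) %>%
  # 5. convert to lower case
  mutate(keyword = str_to_lower(keyword))

cleaned
```

Writing the steps out this way makes it easier to see which of them removes a given keyword when the cleaned result looks unexpected.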
Merge keywords that share a common stem or lemma, and return the majority (most frequent) form of each keyword.
clean_data %>%
  keyword_merge() -> merged_data
merged_data
#> # A tibble: 5,372 × 2
#> id keyword
#> <int> <chr>
#> 1 1163 10.7202/1063788ar
#> 2 619 18th century
#> 3 1154 1password
#> 4 81 1science
#> 5 361 second-career librarianship
#> 6 662 second life
#> 7 1424 2016 us presidential election
#> 8 42 21st-century skills
#> 9 1114 21st century skills
#> 10 1051 24-hour opening
#> # … with 5,362 more rows
#> # ℹ Use `print(n = ...)` to see more rows
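To illustrate the idea of merging by stem and keeping the majority form, here is a minimal sketch on made-up data. It uses stem_strings from textstem and is only an approximation under that assumption; it is not the code behind keyword_merge, and the table toy is hypothetical.

```r
library(dplyr)
library(textstem)

# Toy tidy table: one keyword per row
toy <- tibble(
  id = 1:5,
  keyword = c("academic libraries", "academic library", "academic libraries",
              "open access", "open access")
)

merged <- toy %>%
  # reduce every keyword to its stemmed form
  mutate(stem = stem_strings(keyword)) %>%
  # count how often each surface form occurs within its stem
  add_count(stem, keyword, name = "n") %>%
  group_by(stem) %>%
  # replace each keyword by the most frequent form sharing its stem
  mutate(keyword = keyword[which.max(n)]) %>%
  ungroup() %>%
  select(id, keyword)

merged
```

Here the singular and plural variants share one stem, so the less frequent variant is replaced by the more frequent one.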
Create a tbl_graph (a class provided by the tidygraph package) from the tidy table of document IDs and keywords, and divide the keywords into groups. Each entry (row) should contain only one keyword, in tidy format.
merged_data %>%
  keyword_group() -> grouped_data
grouped_data
#> # A tbl_graph: 207 nodes and 1332 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 207 × 3 (active)
#> name freq group
#> <chr> <int> <int>
#> 1 information literacy 58 2
#> 2 academic libraries 145 2
#> 3 archives 12 1
#> 4 open access 32 3
#> 5 bibliometrics 31 3
#> 6 higher education 16 1
#> # … with 201 more rows
#> #
#> # Edge Data: 1,332 × 3
#> from to n
#> <int> <int> <int>
#> 1 1 97 14
#> 2 1 2 12
#> 3 2 14 8
#> # … with 1,329 more rows
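The general technique behind this step, building a keyword co-occurrence network and splitting it with community detection, can be sketched with igraph directly. The example below runs on made-up data and assumes the fast greedy modularity algorithm; the algorithm and settings that keyword_group actually uses may differ, and the names toy, edges and grp are hypothetical.

```r
library(dplyr)

# Toy tidy table: documents 1-2 share keywords a, b, c; documents 3-4 share d, e
toy <- tibble(
  id = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  keyword = c("a", "b", "c", "a", "b", "d", "e", "d", "e")
)

# List every pair of keywords that appears in the same document
pairs <- lapply(split(toy$keyword, toy$id), function(k) {
  k <- sort(unique(k))
  if (length(k) < 2) return(NULL)
  p <- t(combn(k, 2))
  data.frame(from = p[, 1], to = p[, 2], stringsAsFactors = FALSE)
})

# Count how many documents each pair co-occurs in
edges <- bind_rows(pairs) %>%
  count(from, to, name = "n")

# Undirected graph; the n column becomes an edge attribute
g <- igraph::graph_from_data_frame(edges, directed = FALSE)

# Community detection, weighted by co-occurrence counts
comm <- igraph::cluster_fast_greedy(g, weights = igraph::E(g)$n)
grp <- setNames(as.integer(igraph::membership(comm)), igraph::V(g)$name)

grp
```

Keywords that frequently appear together end up with the same group number, which is the information stored in the group column of the node data.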
The output table shows the top 10 keywords (by occurrence) in each group, with their frequencies in parentheses. Keywords are separated by “;”.
grouped_data %>%
  keyword_table(top = 10)
#> # A tibble: 5 × 2
#> Group `Keywords (TOP 10)`
#> <int> <chr>
#> 1 1 public libraries (74); libraries (65); digital libraries (31); library …
#> 2 2 academic libraries (145); information literacy (58); librarians (25); l…
#> 3 3 open access (32); bibliometrics (31); library and information science (…
#> 4 4 university libraries (39); collection management (13); leadership (12);…
#> 5 5 social media (23); spain (9); sustainability (6); disinformation (5); f…
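A table of this shape can be derived from node data with dplyr alone. The sketch below is an assumption about how such a summary might be built, not the code of keyword_table; it uses a handful of rows taken from the output shown above and a hypothetical cut-off top_n_keywords of 2 instead of 10.

```r
library(dplyr)

# A few nodes with frequency and group, as in the node data of the graph
nodes <- tibble(
  name = c("public libraries", "libraries", "academic libraries",
           "information literacy", "archives"),
  freq = c(74, 65, 145, 58, 12),
  group = c(1, 1, 2, 2, 1)
)

top_n_keywords <- 2  # hypothetical cut-off for this toy example

summary_tbl <- nodes %>%
  group_by(group) %>%
  # keep the most frequent keywords of each group, in descending order
  slice_max(freq, n = top_n_keywords, with_ties = FALSE) %>%
  # paste them as "keyword (frequency)" separated by "; "
  summarise(keywords = paste0(name, " (", freq, ")", collapse = "; "))

summary_tbl
```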
Visualize the keyword co-occurrence network by group. Colors indicate the groups, the size of each node is proportional to the keyword frequency, and the alpha (opacity) of each edge is proportional to how often the two keywords co-occur.
grouped_data %>%
  keyword_vis()
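The visual mappings described above (color for group, node size for frequency, edge alpha for co-occurrence) can be reproduced with tidygraph and ggraph. This is a minimal sketch on a made-up graph, assuming a force-directed layout; it does not reproduce the exact layout or faceting of keyword_vis, and the nodes and edges tables are hypothetical.

```r
library(tidygraph)
library(ggraph)

# Toy node and edge tables with the same columns as grouped_data
nodes <- data.frame(
  name = c("a", "b", "c", "d"),
  freq = c(10, 6, 4, 8),
  group = c(1, 1, 2, 2)
)
edges <- data.frame(
  from = c("a", "a", "c"),
  to = c("b", "c", "d"),
  n = c(5, 1, 3)
)

g <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

p <- ggraph(g, layout = "fr") +
  # edge opacity follows the co-occurrence count
  geom_edge_link(aes(alpha = n)) +
  # node size follows frequency, color follows group
  geom_node_point(aes(size = freq, colour = factor(group))) +
  geom_node_text(aes(label = name), repel = TRUE)

p
```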
Keywords can also be extracted from the abstracts, using the existing keywords as a dictionary. Further pre-processing should be applied afterward, such as cleaning, keyword merging, and filtering by term frequency or tf-idf. It is suggested to reduce the number of keywords before using keyword_group.
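# build a dictionary from the cleaned keywords
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword") %>%
  pull(keyword) %>%
  make_dict() -> my_dict

# extract dictionary keywords from the abstracts, then merge variants
bibli_data_table %>%
  keyword_extract(id = "id", text = "abstract", dict = my_dict) %>%
  keyword_merge(keyword = "keyword")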
#> # A tibble: 27,130 × 2
#> id keyword
#> <int> <chr>
#> 1 619 18th century
#> 2 1223 18th century
#> 3 1154 1password
#> 4 81 1science
#> 5 983 1science
#> 6 15 2016 us presidential election
#> 7 662 3d environment
#> 8 910 fourth museum assembly
#> 9 624 55th library week
#> 10 747 aasl standards
#> # … with 27,120 more rows
#> # ℹ Use `print(n = ...)` to see more rows
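Dictionary-based extraction of this kind can be sketched with tidytext: split each abstract into n-grams and keep only those found in the dictionary. The example below is an illustration on made-up data under that assumption, not the implementation of keyword_extract; docs, dict and the n-gram range of 1 to 2 words are hypothetical choices.

```r
library(dplyr)
library(tidytext)

# Toy abstracts and a small keyword dictionary
docs <- tibble(
  id = 1:2,
  abstract = c("Academic libraries promote information literacy.",
               "Open access changes how public libraries share research.")
)
dict <- c("academic libraries", "information literacy",
          "open access", "public libraries")

extracted <- docs %>%
  # split each abstract into lower-cased 1- and 2-word n-grams
  unnest_tokens(keyword, abstract, token = "ngrams", n = 2, n_min = 1) %>%
  # keep only the n-grams that appear in the dictionary
  filter(keyword %in% dict) %>%
  distinct(id, keyword)

extracted
```

Because every matching n-gram is kept, real abstracts tend to yield many more keywords than the author-supplied lists, which is why filtering before grouping is advisable.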