easyScieloPak is an R package that allows you to search and access academic articles from SciELO programmatically.
The main goal of easyScieloPak is to simplify the process of querying SciELO from R by:
- Making queries readable and reproducible.
- Allowing filters like year, collection (country), language, journal, and subject category.
- Handling pagination, data parsing, and cleaning automatically.
- Providing clear and validated feedback when a query is incorrect.
- Minimizing errors due to anti-scraping measures (e.g., 403 HTTP errors).
You can install the development version of easyScieloPak from GitHub using either devtools or remotes:
# Option 1: install with devtools
install.packages("devtools")
devtools::install_github("https://github.com/PabloIxcamparij/easyScieloPack.git")

# Option 2: install with remotes
install.packages("remotes")
remotes::install_github("https://github.com/PabloIxcamparij/easyScieloPack.git")
library(easyScieloPak)
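# Up to 5 Spanish-language articles from the Ecuador collection
df <- search_scielo("salud ambiental", collections = "Ecuador", languages = "es", n_max = 5)
head(df)

# Up to 8 English-language articles from the Chile collection
df <- search_scielo("ecology", collections = "Chile", languages = "en", n_max = 8)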
View(df) # View results in RStudio
Each filter only supports one value at a time (e.g., only one country, language, journal, or category).
Web scraping may be sensitive to structural changes in the SciELO website.
The number of fetched articles is limited by n_max (default fallback is 100).
No official API is available, so the package depends on website scraping.
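Because each filter accepts a single value, a search across several countries has to be issued as separate queries and combined afterwards. The following is a minimal sketch of that workaround, assuming search_scielo() returns a data frame with the same columns on every call (as the usage examples suggest). search_each() is a hypothetical helper, not part of the package; it takes the search function as an argument so it can be tried with a stub.

```r
# Hypothetical helper (not part of easyScieloPak): run one query per
# collection and stack the results into a single data frame.
search_each <- function(query, collections, search_fn, ...) {
  results <- lapply(collections, function(col) {
    df <- search_fn(query, collections = col, ...)
    if (is.null(df) || nrow(df) == 0) return(NULL)
    df$queried_collection <- col  # record which filter value produced the row
    df
  })
  results <- Filter(Negate(is.null), results)  # drop empty queries
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Possible usage (requires the package and network access):
# df <- search_each("ecology", c("Chile", "Ecuador"), search_scielo,
#                   languages = "en", n_max = 8)
```

Each extra query is another request to SciELO, so this pattern may increase the chance of a 403 block; pausing between calls may help.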
Rate-limiting / Blocking (403 errors): In some cases, SciELO may detect automated access and temporarily block the search, resulting in a 403 HTTP error. This is a common limitation of scraping. If this occurs, try the following:
Note: Reinstalling the package has no direct effect on the block.
-Default fallback limit: If the total number of available results cannot be determined, the query will default to fetching a maximum of 100 articles.
Recent Improvements
-Rotating User-Agents: Each request uses a different User-Agent string (Chrome, Firefox, Safari variants) to appear more like a real browser and avoid blocking.
-Random delays between requests reduce server load and minimize scraping detection.
-Retry logic: If a request fails, the package retries automatically with a different User-Agent.
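The three measures above can be illustrated together in a few lines of base R. This is a sketch of the general technique only, not the package's actual implementation: fetch_with_retry(), its parameters, and the placeholder User-Agent strings are all assumptions made for illustration, and fetch_fn stands in for whatever function performs the HTTP request.

```r
# Sketch only: not the package's internal code.
# `fetch_fn` is a hypothetical function(url, user_agent) that returns the
# response body on success and signals an error on failure (e.g. HTTP 403).
fetch_with_retry <- function(url, fetch_fn,
                             user_agents = c("UA-Chrome", "UA-Firefox", "UA-Safari"),
                             max_tries = 3, delay_range = c(1, 3)) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    # Rotate: use the next User-Agent in the list on every attempt.
    ua <- user_agents[(attempt - 1) %% length(user_agents) + 1]
    result <- tryCatch(fetch_fn(url, ua), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    if (attempt < max_tries) {
      # Random pause before the next attempt, so requests are not evenly spaced.
      Sys.sleep(runif(1, delay_range[1], delay_range[2]))
    }
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(last_error))
}
```

Passing the request function as an argument keeps the retry policy separate from the HTTP client, so the same wrapper can be exercised with a stub that fails a fixed number of times.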
The current version of easyScieloPak is fully functional
for basic academic exploration through SciELO. However, the following
enhancements are planned for future versions:
Support for multiple filter values: Currently,
each filter (e.g., language, category, journal) only accepts a single
value. Future versions aim to support multiple values for broader and
more flexible queries (e.g.,
languages("es", "en", "pt")).
Improved scraping resistance: We plan to implement smarter mechanisms to reduce the chances of triggering SciELO’s anti-scraping protections (e.g., rotating user agents, request throttling, caching mechanisms).
Caching and offline mode: Possibility to cache previous search results locally for offline use or repeated queries.
Enhanced error diagnostics: Provide clearer messages and helper functions when 403 or parsing issues occur.
Journal/code normalization functions: Automatic mapping of journal names to their normalized internal identifiers.
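Until caching is built in, repeated queries can be cached on the user side. The sketch below is one possible approach under stated assumptions, not a feature of the package: cached_search() and its cache_dir parameter are hypothetical, and results are stored as .rds files keyed by the query arguments.

```r
# Hypothetical user-side cache (not a feature of easyScieloPak).
# Results are saved as .rds files named after the query arguments.
cached_search <- function(search_fn, ...,
                          cache_dir = file.path(tempdir(), "scielo_cache")) {
  args <- list(...)
  # Build a file-system-safe key from the deparsed argument list.
  key <- gsub("[^A-Za-z0-9]+", "_", paste(deparse(args), collapse = ""))
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: no request is made
  result <- do.call(search_fn, args)
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, path)
  result
}

# Possible usage (requires the package and network access on the first call):
# df <- cached_search(search_scielo, "ecology", collections = "Chile", n_max = 8)
```

The default cache_dir lives under tempdir() and disappears when the R session ends; pass a persistent directory for offline reuse. The sanitized key is deliberately simple, so two queries that differ only in punctuation could map to the same file.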
SciELO is a multidisciplinary open-access platform hosting scientific journals from over 15 countries. It plays a vital role in disseminating research output from Latin America and beyond.
This package provides a lightweight, unofficial method to interact with SciELO’s search interface.
Feel free to open issues or submit pull requests to improve functionality, usability, or documentation.