ojsr allows you to crawl OJS archives, issues, articles, galleys, and search results, and retrieve metadata from articles.
Important Notes:
(from the OJS documentation, as of Jan. 2020)
Open Journal Systems (OJS) is a journal management and publishing system that has been developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research.
OJS assists with every stage of the refereed publishing process, from submissions through to online publication and indexing. Through its management systems, its finely grained indexing of research, and the context it provides for research, OJS seeks to improve both the scholarly and public quality of refereed research.
OJS is open source software made freely available to journals worldwide for the purpose of making open access publishing a viable option for more journals, as open access can increase a journal’s readership as well as its contribution to the public good on a global scale (see PKP Publications).
Since OJS v3.1+ (https://docs.pkp.sfu.ca/dev/api/ojs/3.1) a REST API is provided. We are positive a better R interface should use that API instead of web scraping. So, why ojsr? According to https://pkp.sfu.ca/software/ojs/usage-data//, as of 2019 (when v3.1+ was launched) at least 15,000 journals worldwide were using OJS. OJS is an excellent free publishing solution for institutions that probably could not publish otherwise and, presumably, cannot afford to update constantly. ojsr aims to help crawling and retrieving info from OJS during this legacy period.
Let’s say we want to scrape metadata from a collection of journals to compare them. We start with the journals’ titles and URLs, and can use ojsr to scrape their issues, articles, and metadata.
library(ojsr)
journal <- 'https://revistapsicologia.uchile.cl/index.php/RDP/'
issues <- ojsr::get_issues_from_archive(input_url = journal)
articles <- ojsr::get_articles_from_issue(input_url = issues$output_url[1:2]) # only first 2 issues
metadata <- ojsr::get_html_meta_from_article(input_url = articles$output_url[1:5]) # only first 5 articles
get_issues_from_archive() takes a vector of OJS URLs and scrapes the issue URLs from each journal’s issue archive. You don’t need to provide the actual URL of the issue archive: get_issues_from_archive() parses the URL you provide to compose it. Then, it looks for links containing “/issue/view” in the href. Links are post-processed to comply with OJS routing conventions before returning.
journal <- 'https://revistapsicologia.uchile.cl/index.php/RDP/'
issues <- ojsr::get_issues_from_archive(input_url = journal)
The result is a long-format data frame: one input_url may result in several rows, one for each output_url scraped.
get_articles_from_issue() takes a vector of OJS issue URLs and scrapes the links to articles from the issues’ tables of content. You don’t need to provide the actual URL of the issues’ ToC, but you must provide URLs that include the issue ID (article URLs do not include this info!). get_articles_from_issue() parses the URL you provide to compose the ToC URL. Then, it looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before returning.
issue <- 'https://revistapsicologia.uchile.cl/index.php/RDP/issue/view/6031/'
articles <- ojsr::get_articles_from_issue(input_url = issue)
The result is a long-format data frame: one input_url may result in several rows, one for each output_url scraped.
get_articles_from_search() takes a vector of OJS URLs and a string with search criteria, composes the search result URLs, and scrapes them to retrieve the articles’ URLs. You don’t need to provide the actual URL of the search result pages: get_articles_from_search() parses the URL you provide to compose the URL of the search result page(s). If pagination is involved, the necessary links are also followed. Then, it looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before returning.
journal <- 'https://revistapsicologia.uchile.cl/index.php/RDP/'
criteria <- "psicologia+social"
articles_search <- ojsr::get_articles_from_search(input_url = journal, search_criteria = criteria)
The result is a long-format data frame: one input_url may result in several rows, one for each output_url scraped.
Galleys are the final presentation versions of an article’s content. Most of the time these are the full content in PDF and other reading formats; less often, they are supplementary files (tables, datasets) in different formats.
get_galleys_from_article() takes a vector of OJS URLs and scrapes all the galley URLs from the article view. You may provide any article-level URL (article abstract view, inline view, PDF direct download, etc.). get_galleys_from_article() parses the URL you provide to compose the article view URL. Then, it looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before returning (i.e., having a galley ID).
article <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # inline reader
galleys <- ojsr::get_galleys_from_article(input_url = article)
The result is a long-format data frame: one input_url may result in several rows, one for each output_url scraped.
get_html_meta_from_article() takes a vector of OJS URLs and scrapes all the metadata written in HTML from the article view (e.g., https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/593). You may provide any article-level URL (article abstract view, inline view, PDF direct download, etc.). get_html_meta_from_article() parses the URL you provide to compose the URL of the article view. Then, it looks for <meta> tags in the <head> section of the HTML.
Important! This may retrieve more than bibliographic metadata: any other “meta” property declared in the HTML will also be returned (e.g., descriptions for sharing on social networks, etc.).
article <- 'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178'
metadata <- ojsr::get_html_meta_from_article(input_url = article)
The result is a long-format data frame: one input_url may result in several rows, one for each metadata entry scraped.
An alternative to web scraping metadata from the articles’ HTML pages is to retrieve their OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) records.
get_oai_meta_from_article() will try to access the OAI record within the OJS for any article URL you provide.
article <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # xml galley
metadata_oai <- ojsr::get_oai_meta_from_article(input_url = article)
The result is a long-format data frame: one input_url may result in several rows, one for each metadata entry retrieved.
Note: this function is at a very preliminary stage; you may want to compare its results with those of get_html_meta_from_article().
If you are interested in working with OAI records, you may want to check Scott Chamberlain’s oai package for R (https://CRAN.R-project.org/package=oai). If you only have the OJS home URL and would like to check all of the articles’ OAI records in one shot, an interesting option is to parse it with ojsr::parse_oai_url() and pass the output_url to oai::list_identifiers().
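That combination can be sketched as follows. This is a sketch, not run here: it requires network access and the oai package, and, depending on your ojsr version, parse_oai_url() may return a plain character vector rather than a data frame with an output_url column.

```r
library(ojsr)
library(oai) # Scott Chamberlain's OAI-PMH client: install.packages("oai")

journal <- 'https://revistapsicologia.uchile.cl/index.php/RDP/'

# compose the OAI entry URL from the journal's home URL
oai_url <- ojsr::parse_oai_url(input_url = journal)

# list every record identifier exposed by the journal's OAI endpoint;
# if parse_oai_url() returned a data frame, pass oai_url$output_url instead
identifiers <- oai::list_identifiers(url = oai_url)
```

From there, oai::get_records() can be used to fetch the full metadata for each identifier.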
parse_base_url() takes a vector of OJS URLs and retrieves their base URL, according to OJS routing conventions.
mix_links <- c(
'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178'
)
base_url <- ojsr::parse_base_url(input_url = mix_links)
The result is a vector of the same length as your input.
parse_oai_url() takes a vector of OJS URLs and retrieves their OAI entry URL, according to OJS routing conventions.
mix_links <- c(
'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178'
)
oai_url <- ojsr::parse_oai_url(input_url = mix_links)
The result is a vector of the same length as your input.