Mining data from ChemRxiv

1.2. Mining data from ChemRxiv

Several open-access datasets are available for data mining.
To download full-text documents from open-access databases, the paperscraper tool can be used.

As an example, we download full-text articles from ChemRxiv on the topic of ‘Buchwald-Hartwig coupling’, but the same tool can be used to download other open-access articles.

Warning

Downloading the whole ChemRxiv paper dump can take some time.

import matextract  # noqa: F401

from paperscraper.get_dumps import chemrxiv

# Download of the ChemRxiv paper dump
chemrxiv(save_path="chemrxiv_2020-11-10.jsonl")

Tip

Depending on the keywords, the search may return many articles or none at all. If it returns too few, redefine the keywords to be more general.

from paperscraper.xrxiv.xrxiv_query import XRXivQuery
from paperscraper.pdf import save_pdf_from_dump
import pandas as pd

df = pd.read_json("./chemrxiv_2020-11-10.jsonl", lines=True)

# define keywords for the paper search
synthesis = ["synthesis"]
reaction = ["buchwald-hartwig"]

# combine keywords using "AND" logic, i.e. search for papers that contain both keywords
query = [synthesis, reaction]

# start searching for relevant papers in the ChemRxiv dump
querier = XRXivQuery("./chemrxiv_2020-11-10.jsonl")
querier.search_keywords(
    query, output_filepath="buchwald-hartwig_coupling_ChemRxiv.jsonl"
)

# Save PDFs in the ./PDFs folder and name the files by their DOI
save_pdf_from_dump(
    "./buchwald-hartwig_coupling_ChemRxiv.jsonl", pdf_path="./PDFs", key_to_save="doi"
)
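The keyword logic used in the query can be illustrated with a small standalone sketch. It assumes the convention described in the paperscraper README: terms inside an inner list are alternatives (OR), and the inner lists are combined with AND. The function `matches_query` and the two abstracts are hypothetical and written only for illustration; this is not paperscraper's implementation. The sketch also shows how adding synonyms to a keyword group broadens a search that returns too few articles.

```python
def matches_query(text: str, query: list[list[str]]) -> bool:
    """Return True if every keyword group has at least one term in the text."""
    lowered = text.lower()
    # Outer all() = AND across groups, inner any() = OR within a group
    return all(any(term.lower() in lowered for term in group) for group in query)


# Hypothetical abstracts, made up for illustration only
abstracts = [
    "We report the synthesis of aryl amines via Buchwald-Hartwig amination.",
    "A palladium-catalysed C-N coupling for the preparation of anilines.",
]

# Narrow query: one term per group, as in the search above
narrow_query = [["synthesis"], ["buchwald-hartwig"]]

# Broader query: each group gains a synonym (assumed synonyms, adjust as needed)
broad_query = [["synthesis", "preparation"], ["buchwald-hartwig", "c-n coupling"]]

narrow_hits = [a for a in abstracts if matches_query(a, narrow_query)]
broad_hits = [a for a in abstracts if matches_query(a, broad_query)]

print(len(narrow_hits), len(broad_hits))
```

With these made-up abstracts, the broader query matches more texts than the narrow one, which is the effect intended when generalising the keywords.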

Important

Data annotation is inevitable!

To evaluate the data extraction and find the best hyperparameters, a validation set and a test set are required. It is therefore crucial to annotate at least a small part of the collected articles.

Later steps in the data extraction process need annotated data to evaluate the extraction pipeline. An annotation tool such as doccano can be used for this, as shown in the following data annotation notebook.
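Once some articles are annotated, they need to be divided into a validation set (for tuning hyperparameters) and a test set (for the final evaluation). The following is a minimal sketch of such a split, assuming each annotated article is identified by its DOI. The function `split_annotated`, the placeholder DOIs, and the 50/50 ratio are assumptions made for illustration, not part of the original workflow.

```python
import random


def split_annotated(ids, test_fraction=0.5, seed=0):
    """Split identifiers reproducibly into (validation_ids, test_ids)."""
    # Sort first so the result does not depend on the input order
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item in the test set
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder DOIs, not real articles
annotated_dois = [f"10.0000/example.{i}" for i in range(10)]

validation_ids, test_ids = split_annotated(annotated_dois, test_fraction=0.5, seed=42)

print(len(validation_ids), len(test_ids))
```

Fixing the seed keeps the split stable between runs, so results obtained on the validation set remain comparable while the pipeline is tuned.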