
1.1. Obtaining a set of relevant data sources

At the start of the data extraction process, you have to collect a set of potentially relevant data sources. To do so, you can compile such a set manually or use a tool to automate and speed up the process.

The Crossref API is a very useful tool for collecting the metadata of relevant articles. In addition to the API itself, several Python libraries are available that simplify access to it; one of them is crossrefapi. As an example, ten sources on the topic 'Buchwald-Hartwig coupling', including their metadata, are retrieved and saved to a JSON file.

Note

Instead of using the Crossref API, one could also use a previously compiled data set. Additionally, such data sets can be extended using APIs like this one; a sketch of this is given after the example below.

import matextract  # noqa: F401
from crossref.restful import Works
import json

works = Works(timeout=60)

# Search Crossref for sources on the topic of Buchwald-Hartwig coupling and sample 10 of them
query_result = (
    works.query(bibliographic="buchwald-hartwig coupling")
    .select("DOI", "title", "author", "type", "publisher", "issued")
    .sample(10)
)

results = [item for item in query_result]

# Save the 10 sampled results, including their metadata, to a JSON file
with open("buchwald-hartwig_coupling_results.json", "w") as file:
    json.dump(results, file)

print(results)
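
As mentioned in the note above, a previously compiled data set can also be extended with new records from the Crossref API. The following is a minimal sketch of that idea; the file name existing_dataset.json is hypothetical, and the data set is assumed to contain the same metadata fields as the example above.

import json

from crossref.restful import Works

works = Works(timeout=60)

# Load a previously compiled data set (hypothetical file name)
with open("existing_dataset.json") as file:
    dataset = json.load(file)

# Query the Crossref API for additional records on the same topic
new_records = (
    works.query(bibliographic="buchwald-hartwig coupling")
    .select("DOI", "title", "author", "type", "publisher", "issued")
    .sample(10)
)

# Append only records whose DOI is not already in the data set
known_dois = {entry["DOI"] for entry in dataset}
dataset.extend(item for item in new_records if item["DOI"] not in known_dois)

# Write the extended data set back to disk
with open("existing_dataset.json", "w") as file:
    json.dump(dataset, file)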

With the obtained metadata, one can then filter for relevant or available data sources, which can be downloaded through an API provided by the publishers or obtained from a data dump.
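
For instance, a rough sketch of such a filtering step, assuming the JSON file produced above, could restrict the collection to journal articles using the type field from the Crossref metadata:

import json

# Load the metadata collected above
with open("buchwald-hartwig_coupling_results.json") as file:
    records = json.load(file)

# Keep only records that Crossref labels as journal articles;
# other types (e.g., 'book-chapter' or 'posted-content') are discarded
articles = [record for record in records if record.get("type") == "journal-article"]

print(f"{len(articles)} of {len(records)} records are journal articles")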

An example of using such an article download API is provided in the data mining notebook.