From Text to Insight: Large Language Models for Chemical Data Extraction

About this book

Structured data is at the heart of machine learning. LLMs offer a convenient way to generate structured data based on unstructured inputs. This book gives hands-on examples of the different steps in the extraction workflow using LLMs.

You can find more background on the topics covered in this book in our review article.

How to use this book?

This book is based on Jupyter notebooks, so beyond simply reading along, you can also run the notebooks yourself. There are two ways to do so.

Run it on the matextract JupyterHub

You can start running most parts by clicking on this link. This will take you to the JupyterHub of Base4NFDI, where the notebooks can be run on a small CPU instance. We are working on making it possible to run the GPU-intensive parts there as well.

Run it on your own machine

If you have a reasonably modern computer, you will be able to run many of the notebooks on your own hardware. Note, however, that certain notebooks need a GPU. Those notebooks say so in a note at the top of the notebook.

In addition to hardware, you will also need some software. All relevant dependencies can be installed via the package for this online book.

To set everything up, work through the following steps. Note that we currently support only Linux and macOS. If you want to run the notebooks on Windows, we recommend that you install WSL and then run the notebooks from the Linux environment.

1. Use Python 3.11 (the code might also work on other versions, but we only tested 3.11).

2. Clone the repository:

    git clone https://github.com/lamalab-org/matextract-book.git

   Then, go into the folder:

    cd matextract-book

3. (Optional, but recommended) Create a virtual environment:

    python3 -m venv .venv

   Then activate the environment:

    source .venv/bin/activate

4. Install dependencies:

    cd package && pip install .
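
Before installing, it can help to confirm that your interpreter and operating system match what the book was tested on. The following is a minimal sketch and not part of the matextract package: the function name `check_environment` is hypothetical, and the check only mirrors the requirements stated above (Python 3.11, Linux or macOS).

```python
import platform
import sys

# Requirements as stated in the setup steps; adjust if the book's guidance changes.
TESTED_PYTHON = (3, 11)
# platform.system() reports "Darwin" on macOS and "Linux" inside WSL.
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}


def check_environment(version=None, system=None):
    """Return a list of warning strings (empty if everything matches).

    Hypothetical helper for illustration only. Both arguments default to
    the values of the running interpreter and operating system.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    system = platform.system() if system is None else system

    warnings = []
    if version != TESTED_PYTHON:
        warnings.append(
            f"Python {version[0]}.{version[1]} detected; only 3.11 has been tested."
        )
    if system not in SUPPORTED_SYSTEMS:
        warnings.append(
            f"{system} is not supported directly; on Windows, consider running inside WSL."
        )
    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("Warning:", message)
```

An empty result means the setup matches the tested configuration. A warning about the Python version does not necessarily mean the notebooks will fail, only that this combination is untested.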

`matextract` package

Running the commands above will install a package called matextract. We import it in all notebooks because it sets some plotting styles as well as some useful defaults:

We turn on caching, a very effective way to save money if you use LLMs.
We load some environment variables, such as API keys, that you can edit in the .env file. This .env file needs to be in the root directory of the repository, i.e., where the .env.example file is placed. If you want to know more about how and why to use environment variables and .env files, you can check this article.
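
To make these two defaults concrete, here is a minimal, standard-library-only sketch of the underlying ideas. It is not the matextract implementation: `load_env_file`, `cached_completion`, and `fake_llm` are hypothetical names, the parser handles only simple KEY=VALUE lines, and the in-memory dictionary stands in for whatever cache backend the package actually uses.

```python
import hashlib
import json
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines into os.environ (hypothetical helper).

    Blank lines and lines starting with '#' are skipped. Variables that are
    already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        if key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded


_cache = {}  # in-memory stand-in for a persistent cache


def cached_completion(prompt, model, call_llm):
    """Return the stored response if this exact (model, prompt) pair was seen before."""
    key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt, model)  # the only line that would cost money
    return _cache[key]


calls = []


def fake_llm(prompt, model):
    """Placeholder for a real (paid) API call; records each invocation."""
    calls.append(prompt)
    return f"response to: {prompt}"


cached_completion("Extract the melting point.", "some-model", fake_llm)
cached_completion("Extract the melting point.", "some-model", fake_llm)
print(len(calls))  # prints 1: the second request is served from the cache
```

Keying the cache on both the model name and the prompt means that rerunning a notebook cell with unchanged inputs triggers no new API call, while changing either one does. Keeping API keys in a .env file rather than in the notebooks keeps secrets out of version control.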

Table of Contents

Introduction and background

Overview of the working principles of LLMs

A. Structured Extraction Workflow

B. Case Studies

Acknowledgment

This work was supported by:

Carl-Zeiss Foundation (Mara Schilling-Wilhelmi and Kevin Maik Jablonka)
Intel and Merck (via AWASES programme, Mara Schilling-Wilhelmi and Kevin Maik Jablonka)
FAIRmat (Sherjeel Shabih, Christoph T. Koch, José A. Márquez, and Kevin Maik Jablonka)
Spanish AEI (Martiño Ríos-García and María Victoria Gil)
CSIC (Martiño Ríos-García and María Victoria Gil)

Citation

If you use this book in your research, please cite it as follows:

@article{Schilling_Wilhelmi_2025,
  title={From text to insight: large language models for chemical data extraction},
  ISSN={1460-4744},
  url={http://dx.doi.org/10.1039/D4CS00913D},
  DOI={10.1039/d4cs00913d},
  journal={Chemical Society Reviews},
  publisher={Royal Society of Chemistry (RSC)},
  author={Schilling-Wilhelmi, Mara and Ríos-García, Martiño and Shabih, Sherjeel and Gil, María Victoria and Miret, Santiago and Koch, Christoph T. and Márquez, José A. and Jablonka, Kevin Maik},
  year={2025}
}