From Text to Insight: Large Language Models for Materials Science Data Extraction

About this book

Structured data is at the heart of machine learning. LLMs offer a convenient way to generate structured data from unstructured inputs. This book gives hands-on examples for the different steps of an LLM-based extraction workflow.

You can find more background on the topics covered in this book in our review article.

How to use this book?

This book is based on Jupyter notebooks, so beyond simply reading along, you can also run the notebooks yourself. There are different options for doing so.

Running it on your own machine

If you have a reasonably modern computer, you will be able to run many of the notebooks on your own hardware. Note, however, that certain notebooks need a GPU. Those notebooks say so in a note at the top of the notebook.
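As a rough pre-check before opening a GPU notebook, the following sketch looks for the `nvidia-smi` tool on the `PATH`. This is only a heuristic and not part of the book's own tooling: it indicates that an NVIDIA driver is installed, not that a given notebook's libraries can actually use the GPU. The function name `has_nvidia_smi` is hypothetical.

```python
import shutil


def has_nvidia_smi() -> bool:
    """Return True if the `nvidia-smi` executable is found on the PATH."""
    # shutil.which returns the full path to the executable, or None.
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    if has_nvidia_smi():
        print("nvidia-smi found: an NVIDIA driver appears to be installed.")
    else:
        print("nvidia-smi not found: GPU notebooks may not run on this machine.")
```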

In addition to hardware, you will also need some software. All relevant dependencies can be installed via the package for this online book.

Overall, you will need to run through the following steps. Note that we currently only support Linux and macOS. If you want to run the notebooks on Windows, we recommend that you install WSL and then run the notebooks from the Linux environment.

  1. Use Python 3.11 (the code might also work on other versions, but we only tested 3.11)

  2. Clone the repository

    git clone https://github.com/lamalab-org/matextract-book.git

    Then, go into the folder

    cd matextract-book

  3. (Optional, but recommended) Create a virtual environment:

    python3 -m venv .venv

    Then activate the environment

    source .venv/bin/activate

  4. Install dependencies

    cd package && pip install .
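After working through the steps above, a quick sanity check of the interpreter can catch the most common setup mistakes. The sketch below is an illustration rather than part of the book's package: the function name `check_environment` and the returned keys are hypothetical. It reports the Python version, whether it matches the tested 3.11, and whether a `venv`-style virtual environment is active.

```python
import sys

TESTED_VERSION = (3, 11)  # the only version the book states was tested


def check_environment() -> dict:
    """Collect basic facts about the running Python interpreter."""
    version = tuple(sys.version_info[:2])
    return {
        "python_version": version,
        "is_tested_version": version == TESTED_VERSION,
        # Inside a venv, sys.prefix points at the environment while
        # sys.base_prefix points at the base installation. Other tools
        # (for example conda) are not detected by this comparison.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter you plan to use for the notebooks, for example from within the activated `.venv`.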

matextract package

Running the commands above will install a package called matextract. We import it in all notebooks because it sets some plotting styles as well as useful defaults.
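To confirm that the installation worked before opening a notebook, a check like the following may help. It is a minimal sketch that assumes the package is importable under the name `matextract` and that importing it is what applies the styles and defaults. The helper `matextract_installed` is hypothetical and not part of the package.

```python
import importlib.util


def matextract_installed() -> bool:
    """Return True if a top-level module named `matextract` can be found."""
    return importlib.util.find_spec("matextract") is not None


if matextract_installed():
    # Assumption: the import itself applies the plotting styles and defaults.
    import matextract  # noqa: F401
    print("matextract imported.")
else:
    print("matextract not found; run `cd package && pip install .` first.")
```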

Table of Contents

Introduction and background

Acknowledgment

This work was supported by:

  • Carl-Zeiss Foundation (Mara Schilling-Wilhelmi and Kevin Maik Jablonka)

  • Intel and Merck (via AWASES programme, Mara Schilling-Wilhelmi and Kevin Maik Jablonka)

  • FAIRmat (Sherjeel Shabih, Christoph T. Koch, José A. Márquez, and Kevin Maik Jablonka)

  • Spanish AEI (Martiño Ríos-García and María Victoria Gil)

  • CSIC (Martiño Ríos-García and María Victoria Gil)

Citation

If you use this book in your research, please cite it as follows:

@misc{schillingwilhelmi2024textinsightlargelanguage,
      title={From Text to Insight: Large Language Models for Materials Science Data Extraction},
      author={Mara Schilling-Wilhelmi and Martiño Ríos-García and Sherjeel Shabih and María Victoria Gil and Santiago Miret and Christoph T. Koch and José A. Márquez and Kevin Maik Jablonka},
      year={2024},
      eprint={2407.16867},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2407.16867},
}