From Text to Insight: Large Language Models for Chemical Data Extraction
Introduction and background
Overview of the working principles of LLMs
A. Structured Extraction Workflow
1. Obtaining data
1.1. Obtaining a set of relevant data sources
1.2. Mining data from ChemRxiv
1.3. Data annotation
2. Cleaning
2.1. Document parsing with OCR tools
2.2. Document cleaning
3. Strategies to tackle context window limitations
4. Choosing the learning paradigm
5. Beyond text
6. Agents
7. Constrained generation to guarantee syntactic correctness
8. Evaluations
B. Case Studies
9. Research articles vs datasets in chemistry and materials science
10. Collecting data on the synthesis procedures of bio-based adsorbents
11. Retrieving data from chalcogenide perovskites
12. Validation case study: Matching NMR spectra to composition of the molecule
13. Collecting data for reaction procedures
2. Cleaning