13. Collecting data for reaction procedures
Warning
This notebook can be run on any computer. No special hardware is needed, since the completions are obtained through API calls.
Motivation
In this notebook, we illustrate a simple demo of extracting the different species involved in an organic reaction. The main aim is to develop a method for checking whether the data extracted by the model has the same number of atoms on both sides of the reaction. To show the process, we use three reactions extracted from the USPTO-ORD-100K dataset [Ai et al., 2024].
To this end, we developed a simple pydantic class that we use to constrain the model’s output by using the package instructor. Then, using the tools of the rdkit package, we count the atoms from the SMILES and check whether the condition is fulfilled.
13.1. First steps
We begin by importing all the packages needed.
import matextract # noqa: F401
import json
from collections import defaultdict
from pydantic import BaseModel, Field
from typing import Optional, List
from litellm import OpenAI
import instructor
from rdkit import Chem
from rdkit.Chem.rdmolops import GetFormalCharge
import periodictable
Now we load the data containing the reaction procedures.
Download data
The complete dataset can be easily downloaded by running the following commands:
from datasets import load_dataset
reactions = load_dataset("MrtinoRG/USPTO-ORD-100K", data_files="USPTO-n100k-t2048_exp1-COT.json", split="train")
# Replace "reactions.json" with the path to your JSON file
file_path = "reactions.json"
with open(file_path, "r") as file:
    data = json.load(file)
To constrain the output of the model, we create simple pydantic models for three kinds of species: reactants, solvents and products. For reactants and products we also include the amount as mass or volume, and the units of that amount. A top-level model groups the three lists.
class Reactant(BaseModel):
    name: str
    amount: Optional[float] = Field(
        ..., description="Amount as mass or volume of the reactant"
    )
    # Units are text such as "g" or "mL", so the field is a string
    amount_units: Optional[str]


class Product(BaseModel):
    name: str
    amount: Optional[float] = Field(
        ..., description="Amount as mass or volume of the product"
    )
    # Units are text such as "g" or "mL", so the field is a string
    amount_units: Optional[str]


class Solvent(BaseModel):
    name: str = Field(
        ...,
        description="Name of the species. If another species is contained in it, it is definitely the solvent",
    )


class ReactionSpecies(BaseModel):
    reactant: List[Reactant]
    product: List[Product]
    solvent: Optional[List[Solvent]]
We also need a converter from IUPAC names to SMILES. This is needed since the reaction procedures in the dataset use IUPAC, traditional or commercial names to refer to molecules, while rdkit works preferentially with SMILES.
We defined such a utility in the matextract package, using the Chemical Identifier Resolver.
from matextract.utils import name_to_smiles
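For readers curious what such a lookup involves, the following is a minimal sketch of a name-to-SMILES query against the NCI/CADD Chemical Identifier Resolver. It is not the matextract implementation: the helper names `cir_url` and `resolve_name` are hypothetical, and the network call simply returns `None` on any failure.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

# Public endpoint of the Chemical Identifier Resolver
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_url(name: str) -> str:
    """Build the resolver URL that asks for the SMILES of `name` (hypothetical helper)."""
    # Percent-encode everything, including spaces and commas in chemical names
    return f"{CIR_BASE}/{quote(name, safe='')}/smiles"


def resolve_name(name: str, timeout: float = 10.0) -> Optional[str]:
    """Return a SMILES string for `name`, or None if the lookup fails."""
    try:
        with urlopen(cir_url(name), timeout=timeout) as response:
            return response.read().decode("utf-8").strip() or None
    except (OSError, ValueError):
        # Covers unknown names (HTTP 404), network errors and timeouts
        return None


# Building the URL does not touch the network
print(cir_url("methylene chloride"))
```

Returning `None` instead of raising keeps a batch loop running, but the caller then has to skip or flag unresolved names before counting atoms.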
The last function we need counts the number of atoms of each element in a molecule.
def composition(smiles):
    """Get the composition of a molecule given as a SMILES string:
    atomic counts, including hydrogen atoms, and any charge.
    For example, fluoride ion (chemical formula F-, SMILES string [F-])
    returns {9: 1, 0: -1}.

    Args:
        smiles: SMILES string of the molecule

    Returns:
        dict: Atomic composition, keyed by atomic number
    """
    # Parse the SMILES string into an RDKit molecule
    m = Chem.MolFromSmiles(smiles)
    # Add hydrogen atoms, which RDKit leaves implicit by default
    molecule_with_Hs = Chem.AddHs(m)
    comp = defaultdict(int)
    # Get atom counts
    for atom in molecule_with_Hs.GetAtoms():
        comp[atom.GetAtomicNum()] += 1
    # If charged, add charge as "atomic number" 0
    charge = GetFormalCharge(molecule_with_Hs)
    if charge != 0:
        comp[0] = charge
    return comp
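Compositions of this shape have to be summed over all reactants and over all products before the two sides can be compared. As a sketch of that step using only the standard library, the hypothetical helper `total_composition` below adds several composition dictionaries while keeping a negative net charge (key `0`), which `Counter` addition with `+` would discard.

```python
from collections import Counter
from typing import Dict, Iterable


def total_composition(compositions: Iterable[Dict[int, int]]) -> Dict[int, int]:
    """Sum per-molecule compositions keyed by atomic number (0 = net charge)."""
    total = Counter()
    for comp in compositions:
        # update() adds counts and keeps zero or negative totals,
        # whereas `Counter + Counter` drops every non-positive entry.
        total.update(comp)
    return dict(total)


methane = {6: 1, 1: 4}      # CH4
fluoride = {9: 1, 0: -1}    # the docstring example: F- with charge -1
print(total_composition([methane, fluoride]))
```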
13.2. Prompting and extracting
To extract the data, we are going to prompt the model with a two-shot prompt. In addition, the prompt will contain the reaction procedure to extract from.
system_prompt = "You are an expert in organic chemistry. Your task is to extract information about reactants and products from a given reaction procedure."
user_prompt = """Two examples are provided in order to help you:
Example 1:
Reaction procedure: {reaction1}
Answer: {answer1}
Example 2:
Reaction procedure: {reaction2}
Answer: {answer2}
The reaction procedure is the following:
{reaction_procedure}
Now extract the data from it according to the schema.
"""
reaction1 = "1-(3,4-dichlorobenzyl)-3-(4-(iodomethyl)thiazol-2-yl)urea (Intermediate 6) was taken up in tetrahydrofuran and an excess of the 2,4-dimethoxy-benzylamine (20 eq.) was added. The reaction was allowed to stir overnight at room temperature. The volatiles were removed in vacuo. Resulting oil triturated with water to give a gooey solid. Water was decanted off and resulting residue was purified by column chromatography using 0-8% gradient of 7 N ammonia/MeOH and DCM to give 1-(3,4-Dichloro-benzyl)-3-{4-[(2,4-dimethoxy-benzylamino)-methyl]-thiazol-2-yl}-urea."
answer1 = "reactant=[Reactant(name='1-(3,4-dichlorobenzyl)-3-(4-(iodomethyl)thiazol-2-yl)urea', amount=None, amount_units=None), Reactant(name='2,4-dimethoxy-benzylamine', amount=20.0, amount_units=None)] product=[Product(name='1-(3,4-Dichloro-benzyl)-3-{4-[(2,4-dimethoxy-benzylamino)-methyl]-thiazol-2-yl}-urea', amount=None, amount_units=None)] solvent=[Solvent(name='water'), Solvent(name='tetrahydrofuran')]"
reaction2 = "To a solution of (+)-trans-3-hydroxymethyl-4-phenylcyclopentan-1-one from Example 8, Step F (3.3 g, 16 mmol) in methylene chloride (100 mL) was added t-butyldimethylsilyl chloride (11 g, 49 mmol) and DIPEA (22 mL, 74 mmol). The reaction was stirred at rt for 16 h, poured into dilute aq. hydrochloric acid and extracted twice with ether. The organic layers were washed with brine, dried over sodium sulfate, a combined and concentrated. The residue was purified by FC (5% ethyl acetate in hexanes) to afford of (+)-trans-1-t-butyldimethylsilyloxymethyl-4-oxo-2-phenylcyclopentane (6.3 g) as a oil."
answer2 = "reactant=[Reactant(name='(+)-trans-3-hydroxymethyl-4-phenylcyclopentan-1-one', amount=3.3, amount_units='g'), Reactant(name='t-butyldimethylsilyl chloride', amount=11, amount_units='g'), Reactant(name='DIPEA', amount=22, amount_units='mL')] solvent=[Solvent(name='methylene chloride')] product=[Product(name='(+)-trans-1-t-butyldimethylsilyloxymethyl-4-oxo-2-phenylcyclopentane', amount=6.3, amount_units='g')]"
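A few-shot template is easy to mis-wire, for example by referencing the same example twice. As a sketch, the placeholders of a template can be listed with the standard library before it is formatted. The `template` below is a shortened stand-in for `user_prompt`, and `placeholders` is a hypothetical helper.

```python
from string import Formatter
from typing import List


def placeholders(template: str) -> List[str]:
    """Return the named fields of a str.format template, in order of appearance."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion)
    return [field for _, field, _, _ in Formatter().parse(template) if field]


template = (
    "Example 1:\nReaction procedure: {reaction1}\nAnswer: {answer1}\n"
    "Example 2:\nReaction procedure: {reaction2}\nAnswer: {answer2}\n"
    "The reaction procedure is the following:\n{reaction_procedure}\n"
)
expected = {"reaction1", "answer1", "reaction2", "answer2", "reaction_procedure"}

# Any name listed here is passed to .format() but never used by the template
missing = expected - set(placeholders(template))
print(sorted(missing))
```

An empty list means every example and the target procedure actually reach the model.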
Finally, we perform the extraction for each of the reactions considered.
client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)
for i, reaction in enumerate(data):
    reaction_procedure = reaction["procedure_text"]
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": user_prompt.format(
                reaction1=reaction1,
                answer1=answer1,
                reaction2=reaction2,
                answer2=answer2,
                reaction_procedure=reaction_procedure,
            ),
        },
    ]
    completion = client.chat.completions.create(
        model="gpt-4",
        response_model=ReactionSpecies,
        max_retries=3,
        messages=messages,
        temperature=0,
    )
    reactant_smiles = [
        name_to_smiles(reactant.name) for reactant in completion.reactant
    ]
    product_smiles = [name_to_smiles(product.name) for product in completion.product]
    # Count the atoms of each reactant and product
    reactants_atoms = []
    for reactant in reactant_smiles:
        reactants_atoms.append(composition(reactant))
    products_atoms = []
    for product in product_smiles:
        products_atoms.append(composition(product))
    print(f"Reaction {i+1}\n")
    print(reaction_procedure)
    print(completion)
    print("\n")
    print("Atom counting:")
    # Sum the atom counts over all reactants
    sum_reactant_smiles = defaultdict(int)
    for d in reactants_atoms:
        for key, value in d.items():
            sum_reactant_smiles[key] += value
    # Sum the atom counts over all products
    sum_product_smiles = defaultdict(int)
    for d in products_atoms:
        for key, value in d.items():
            sum_product_smiles[key] += value
    # Compare the summed counts for every element found in the reactants
    for key in sum_reactant_smiles:
        if key in sum_product_smiles:
            print(
                f"{periodictable.elements[key]}: Reactant = {sum_reactant_smiles[key]}, Product = {sum_product_smiles[key]}"
            )
        else:
            print(f"{periodictable.elements[key]} not present in Products")
    print("\n\n")
Reaction 1
A solution of 0.55 g (1.6 mmol) (S)-7-amino-5-(4-methoxy-benzyl)-5H,7H-dibenzo[b,d]azepin-6-one, 3.74 ml (50 mmol) trifluoroacetic acid and 1.4 ml (16 mmol) trifluormethane sulfonic acid in 38 ml dichloromethane was stirred at room temperature for 4 hours. The solvent was distilled off and extraction with aqueous sodium bicarbonate solution/ethyl acetate followed by chromatography on silicagel with ethylacetate/methanol (100-95/0-5) yielded 0.35 g (96%) (S)-7-amino-5H,7H-dibenzo[b,d]azepin-6-one as orange solid; MS: m/e: 225.4 (M+H+).
reactant=[Reactant(name='(S)-7-amino-5-(4-methoxy-benzyl)-5H,7H-dibenzo[b,d]azepin-6-one', amount=0.55, amount_units=None), Reactant(name='trifluoroacetic acid', amount=3.74, amount_units=None), Reactant(name='trifluormethane sulfonic acid', amount=1.4, amount_units=None)] product=[Product(name='(S)-7-amino-5H,7H-dibenzo[b,d]azepin-6-one', amount=0.35, amount_units=None)] solvent=[Solvent(name='dichloromethane'), Solvent(name='aqueous sodium bicarbonate solution'), Solvent(name='ethyl acetate'), Solvent(name='ethylacetate/methanol')]
Atom counting:
C: Reactant = 25, Product = 14
O: Reactant = 7, Product = 1
N: Reactant = 2, Product = 2
H: Reactant = 22, Product = 12
F not present in Products
S not present in Products
Reaction 2
A solution of but-3-ynyl 4-methylbenzenesulfonate (2.0 mL) and piperazine (2.0 g) in EtOH (6 mL) was heated to reflux for 30 min. The mixture was concentrated, diluted with NaOH 2 M (8 mL) and extracted with Et2O (50 mL). Evaporation of the organic layer gave a 2:1 mixture of mono and bis-alkylated piperazine (450 mg) which was discarded. The aqueous layer was further extracted with DCM (100 mL) to give of 1-(but-3-ynyl)piperazine (640 mg).
reactant=[Reactant(name='but-3-ynyl 4-methylbenzenesulfonate', amount=2.0, amount_units=None), Reactant(name='piperazine', amount=2.0, amount_units=None)] product=[Product(name='1-(but-3-ynyl)piperazine', amount=640.0, amount_units=None)] solvent=[Solvent(name='EtOH'), Solvent(name='NaOH 2 M'), Solvent(name='Et2O'), Solvent(name='DCM')]
Atom counting:
C: Reactant = 15, Product = 8
S not present in Products
O not present in Products
H: Reactant = 22, Product = 14
N: Reactant = 2, Product = 2
Reaction 3
A mixture of 5.6 g of 1-(6-iodohexyl)-2,3-bis(phenylmethoxy) benzene, 2.1 g of 3-chloro-4-hydroxybenzoic acid methyl ester and 5.0 g of potassium carbonate in 50 mL of acetone was stirred at reflux for 20 hours. Workup as in Example 16, chromatography on 100 g of silica gel using 15% ethyl acetate-hexane and crystallization from ethyl acetate-hexane gave 3.7 g (59% yield), mp 68°-69°, 3-chloro-4-[6-[2,3-bis(phenylmethoxy)phenyl]hexyloxy]benzoic acid methyl ester.
reactant=[Reactant(name='1-(6-iodohexyl)-2,3-bis(phenylmethoxy) benzene', amount=5.6, amount_units=None), Reactant(name='3-chloro-4-hydroxybenzoic acid methyl ester', amount=2.1, amount_units=None), Reactant(name='potassium carbonate', amount=5.0, amount_units=None)] product=[Product(name='3-chloro-4-[6-[2,3-bis(phenylmethoxy)phenyl]hexyloxy]benzoic acid methyl ester', amount=3.7, amount_units=None)] solvent=[Solvent(name='acetone'), Solvent(name='ethyl acetate-hexane')]
Atom counting:
I not present in Products
C: Reactant = 35, Product = 34
O: Reactant = 8, Product = 5
H: Reactant = 36, Product = 35
Cl: Reactant = 1, Product = 1
K not present in Products
The extraction is effective at identifying species names but struggles with solvents. The model correctly identifies the reactants and products in all three reactions, but it also lists as solvents some species that are used only for separation and purification. It also has difficulty identifying the units of the amounts, especially for volume measurements. Although the species are extracted successfully, the atom counts do not balance for any of the reactions. This is because each reaction procedure reports only the final desired product and omits the other products. To address this issue, one solution could be to incorporate an agent environment that retrieves all species from a reaction database. Alternatively, a simpler approach could be to limit the check to a subset of element types.
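The simpler approach can be sketched as follows. The function `imbalance` is a hypothetical helper, not part of the notebook, and which elements to check is an assumption left to the user. The counts are the ones printed for Reaction 3 above, with I and K left out because their totals are not printed. Unlike the comparison loop in the notebook, the sketch also reports elements that occur only on the product side.

```python
from typing import Dict, Iterable, Optional


def imbalance(
    reactants: Dict[str, int],
    products: Dict[str, int],
    elements: Optional[Iterable[str]] = None,
) -> Dict[str, int]:
    """Return {element: reactant_count - product_count} for unbalanced elements.

    If `elements` is given, only those element symbols are checked.
    """
    keys = set(reactants) | set(products)
    if elements is not None:
        keys &= set(elements)
    diff = {k: reactants.get(k, 0) - products.get(k, 0) for k in keys}
    return {k: v for k, v in diff.items() if v != 0}


# Summed counts printed for Reaction 3
reactants = {"C": 35, "O": 8, "H": 36, "Cl": 1}
products = {"C": 34, "O": 5, "H": 35, "Cl": 1}

print(imbalance(reactants, products))                   # all elements
print(imbalance(reactants, products, elements={"Cl"}))  # restricted check
```

An empty dictionary means the selected elements balance. Restricting the check to elements that are unlikely to be lost in byproducts or introduced by reagents makes it less sensitive to species the procedure does not mention.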
13.3. Bibliography
Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie, and Connor W. Coley. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. ChemRxiv, 2024. doi:10.26434/chemrxiv-2024-979fz.