3. Strategies to tackle context window limitations

Every model has a context window: the maximum number of tokens it can process at once. This becomes a problem when we want to process a text that does not fit in the window. One remedy is to break the text into chunks that do fit. In this notebook, we demonstrate several techniques for doing so.

Text from data mining

The text here is the same one used in the data mining section.

import matextract  # noqa: F401

text = "As a fundamental problem in organic chemistry, synthesis planning aims at designing energy and cost-efficient reaction pathways for target compounds. In synthesis planning, it is crucial to understand regioselectivity, or the preference of a reaction over competing reaction sites. Precisely predicting regioselectivity enables early exclusion of unproductive reactions and paves the way to designing high-yielding synthetic routes with minimal separation and material costs. However, it is still at emerging state to combine chemical knowledge and data-driven methods to make practical predictions for regioselectivity. At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry, and thus become one of the most frequently encountered types of reactions in synthesis planning. In this work, we for the first time introduce a chemical knowledge informed message passing neural network(MPNN) framework that directly identifies the intrinsic major products for metal-catalyzed cross-coupling reactions with regioselective ambiguity. Integrating both first principle methods and data-driven methods, our model achieves an overall accuracy of 95.24\\% on the test set of eight typical metal-catalyzed cross-coupling reaction types, including Suzuki-Miyaura, Stille, Sonogashira, Buchwald-Hartwig, Hiyama, Kumada, Negishi, and Heck reactions, outperforming other commonly used model types. To integrate electronic effects with steric effects in regioselectivity prediction, we propose a quantitative method to measure the steric hindrance effect. Our steric hindrance checker can successfully identify regioselectivity induced solely by steric hindrance. Notably under practical scenarios, our model outperforms 6 experimental organic chemists with an average working experience of over 10 years in the organic synthesis industry in terms of predicting major products in regioselective cases. We have also exemplified the practical usage of our model by fixing routes designed by open-access synthesis planning software and improving reactions by identifying low-cost starting materials. To assist general chemists in making prompt decisions about regioselectivity, we have developed a free web-based AI-empowered tool."

3.1. Fixed size chunking

The simplest approach is to make chunks that fit in your context window without worrying about where the cuts fall. As the example below shows, this is not ideal: some words get chopped, and when the LLM sees these chunks separately, it will likely struggle to infer information correctly.

Choosing a chunk size

The context window (🪟) needs to fit the chunk (🍕), query (❓), and output (🧰).

Example: 🪟 (512) = 🍕 (100) + ❓ (50) + 🧰 (362)
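
The budget is plain subtraction. A minimal sketch, assuming every quantity is measured in tokens; the helper name `max_chunk_tokens` and the numbers used are hypothetical and chosen only for illustration:

```python
def max_chunk_tokens(context_window: int, query_tokens: int, output_tokens: int) -> int:
    """Return the tokens left for the chunk after reserving query and output space."""
    remaining = context_window - query_tokens - output_tokens
    if remaining <= 0:
        raise ValueError("query and output budgets leave no room for a chunk")
    return remaining


# Reserve 96 tokens for the query and 1000 for the output in a 4096-token window.
print(max_chunk_tokens(4096, 96, 1000))  # 3000
```

Context windows are counted in tokens, while slicing a Python string counts characters, so a character-based chunk size is only a rough proxy for the token budget.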

chunk_size = 100  # measured in characters here, not tokens
[text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
['As a fundamental problem in organic chemistry, synthesis planning aims at designing energy and cost-',
 'efficient reaction pathways for target compounds. In synthesis planning, it is crucial to understand',
 ' regioselectivity, or the preference of a reaction over competing reaction sites. Precisely predicti',
 'ng regioselectivity enables early exclusion of unproductive reactions and paves the way to designing',
 ' high-yielding synthetic routes with minimal separation and material costs. However, it is still at ',
 'emerging state to combine chemical knowledge and data-driven methods to make practical predictions f',
 'or regioselectivity. At the same time, metal-catalyzed cross-coupling reactions have profoundly tran',
 'sformed medicinal chemistry, and thus become one of the most frequently encountered types of reactio',
 'ns in synthesis planning. In this work, we for the first time introduce a chemical knowledge informe',
 'd message passing neural network(MPNN) framework that directly identifies the intrinsic major produc',
 'ts for metal-catalyzed cross-coupling reactions with regioselective ambiguity. Integrating both firs',
 't principle methods and data-driven methods, our model achieves an overall accuracy of 95.24\\% on th',
 'e test set of eight typical metal-catalyzed cross-coupling reaction types, including Suzuki-Miyaura,',
 ' Stille, Sonogashira, Buchwald-Hartwig, Hiyama, Kumada, Negishi, and Heck reactions, outperforming o',
 'ther commonly used model types. To integrate electronic effects with steric effects in regioselectiv',
 'ity prediction, we propose a quantitative method to measure the steric hindrance effect. Our steric ',
 'hindrance checker can successfully identify regioselectivity induced solely by steric hindrance. Not',
 'ably under practical scenarios, our model outperforms 6 experimental organic chemists with an averag',
 'e working experience of over 10 years in the organic synthesis industry in terms of predicting major',
 ' products in regioselective cases. We have also exemplified the practical usage of our model by fixi',
 'ng routes designed by open-access synthesis planning software and improving reactions by identifying',
 ' low-cost starting materials. To assist general chemists in making prompt decisions about regioselec',
 'tivity, we have developed a free web-based AI-empowered tool.']
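
If chopped words are the main concern, one option is to keep the size limit but only cut at whitespace. A minimal sketch using `textwrap.wrap` from the Python standard library on a single sentence from the abstract; the width of 40 characters is an arbitrary choice for illustration:

```python
import textwrap

sample = "Precisely predicting regioselectivity enables early exclusion of unproductive reactions."

# Each chunk holds at most 40 characters, and cuts fall only on whitespace.
chunks = textwrap.wrap(sample, width=40, break_on_hyphens=False)
for chunk in chunks:
    print(repr(chunk))
```

With its default settings, `textwrap.wrap` replaces newlines and tabs with spaces and drops the whitespace at each cut, so it suits plain prose better than text whose layout matters.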

3.2. Splitting based on special characters

We can improve on this by splitting on special characters such as . or \n. This keeps most semantic information in the same chunk. However, context can still be lost when a sentence depends on the one before it, and splitting on “.” also breaks decimal numbers: in the output below, 95.24 is split into 95 and 24.

text.split(".")
['As a fundamental problem in organic chemistry, synthesis planning aims at designing energy and cost-efficient reaction pathways for target compounds',
 ' In synthesis planning, it is crucial to understand regioselectivity, or the preference of a reaction over competing reaction sites',
 ' Precisely predicting regioselectivity enables early exclusion of unproductive reactions and paves the way to designing high-yielding synthetic routes with minimal separation and material costs',
 ' However, it is still at emerging state to combine chemical knowledge and data-driven methods to make practical predictions for regioselectivity',
 ' At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry, and thus become one of the most frequently encountered types of reactions in synthesis planning',
 ' In this work, we for the first time introduce a chemical knowledge informed message passing neural network(MPNN) framework that directly identifies the intrinsic major products for metal-catalyzed cross-coupling reactions with regioselective ambiguity',
 ' Integrating both first principle methods and data-driven methods, our model achieves an overall accuracy of 95',
 '24\\% on the test set of eight typical metal-catalyzed cross-coupling reaction types, including Suzuki-Miyaura, Stille, Sonogashira, Buchwald-Hartwig, Hiyama, Kumada, Negishi, and Heck reactions, outperforming other commonly used model types',
 ' To integrate electronic effects with steric effects in regioselectivity prediction, we propose a quantitative method to measure the steric hindrance effect',
 ' Our steric hindrance checker can successfully identify regioselectivity induced solely by steric hindrance',
 ' Notably under practical scenarios, our model outperforms 6 experimental organic chemists with an average working experience of over 10 years in the organic synthesis industry in terms of predicting major products in regioselective cases',
 ' We have also exemplified the practical usage of our model by fixing routes designed by open-access synthesis planning software and improving reactions by identifying low-cost starting materials',
 ' To assist general chemists in making prompt decisions about regioselectivity, we have developed a free web-based AI-empowered tool',
 '']
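
One way to avoid cutting inside numbers is to split only where sentence-ending punctuation is followed by whitespace. A minimal sketch with the standard `re` module on a shortened two-sentence sample; the regular expression is a simple heuristic, not a full sentence tokenizer:

```python
import re

sample = (
    "Our model achieves an overall accuracy of 95.24% on the test set. "
    "It outperforms other commonly used model types."
)

# Split after ".", "!" or "?" only when whitespace follows,
# so the period inside "95.24" is not treated as a boundary.
sentences = re.split(r"(?<=[.!?])\s+", sample)
print(sentences)
```

This keeps the punctuation attached to each sentence and produces no trailing empty string, but it will still split after abbreviations such as "e.g." when they are followed by a space.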

3.3. Overlap between chunks

To limit this loss of semantic information, we can add some overlap between chunks, so that each chunk carries some text from the previous chunk and some from the next. Think of it as reading the last two lines of the previous paragraph and the first two lines of the next alongside the paragraph you are currently reading.

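chunked_sentences = text.split(".")
overlap = 15  # characters borrowed from the end of the previous sentence
lookahead = 5  # characters borrowed from the start of the next sentence
[
    # The first sentence has no predecessor, so it gets no leading overlap.
    (chunked_sentences[i - 1][-overlap:] if i > 0 else "")
    + chunked_sentences[i]
    + chunked_sentences[i + 1][:lookahead]
    # Stop before the trailing empty string produced by the final ".".
    for i in range(len(chunked_sentences) - 1)
]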
['As a fundamental problem in organic chemistry, synthesis planning aims at designing energy and cost-efficient reaction pathways for target compounds In s',
 'arget compounds In synthesis planning, it is crucial to understand regioselectivity, or the preference of a reaction over competing reaction sites Prec',
 ' reaction sites Precisely predicting regioselectivity enables early exclusion of unproductive reactions and paves the way to designing high-yielding synthetic routes with minimal separation and material costs Howe',
 ' material costs However, it is still at emerging state to combine chemical knowledge and data-driven methods to make practical predictions for regioselectivity At t',
 'egioselectivity At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry, and thus become one of the most frequently encountered types of reactions in synthesis planning In t',
 'thesis planning In this work, we for the first time introduce a chemical knowledge informed message passing neural network(MPNN) framework that directly identifies the intrinsic major products for metal-catalyzed cross-coupling reactions with regioselective ambiguity Inte',
 'ctive ambiguity Integrating both first principle methods and data-driven methods, our model achieves an overall accuracy of 9524\\% ',
 ' accuracy of 9524\\% on the test set of eight typical metal-catalyzed cross-coupling reaction types, including Suzuki-Miyaura, Stille, Sonogashira, Buchwald-Hartwig, Hiyama, Kumada, Negishi, and Heck reactions, outperforming other commonly used model types To i',
 'sed model types To integrate electronic effects with steric effects in regioselectivity prediction, we propose a quantitative method to measure the steric hindrance effect Our ',
 'indrance effect Our steric hindrance checker can successfully identify regioselectivity induced solely by steric hindrance Nota',
 'teric hindrance Notably under practical scenarios, our model outperforms 6 experimental organic chemists with an average working experience of over 10 years in the organic synthesis industry in terms of predicting major products in regioselective cases We h',
 'selective cases We have also exemplified the practical usage of our model by fixing routes designed by open-access synthesis planning software and improving reactions by identifying low-cost starting materials To a',
 'rting materials To assist general chemists in making prompt decisions about regioselectivity, we have developed a free web-based AI-empowered tool']
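
Overlap can also be combined with fixed-size chunking by sliding a window whose step is smaller than its width. A minimal sketch, assuming character-based sizes; the helper name `chunk_with_overlap` is hypothetical:

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunk_size-character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The final window can be shorter than `chunk_size`, and a larger overlap produces more chunks for the same text, which raises the number of LLM calls.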

3.4. Embeddings, vectors, RAG

If there are too many chunks to process every time a query is made, a retrieval-augmented generation (RAG) approach can be used. It is usually paired with a vector database, which runs a similarity search to find the relevant chunks before the LLM is queried.

We can use an embedding model to map each piece of text, whether a word or a sentence, to a vector that represents it. These vectors are then stored in a vector database for later retrieval.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = text.split(".")

text_embeddings = model.encode(sentences)
print(text_embeddings.shape)  # (number of chunks, embedding dimension)
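
Similarity search over embeddings comes down to comparing vectors. A minimal sketch in plain Python, assuming cosine similarity as the metric (a vector database may be configured with a different distance); the three-dimensional vectors are hand-written toy values, not real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for chunk embeddings (hypothetical values).
chunk_vectors = {
    "chunk about cross-coupling": [0.9, 0.1, 0.0],
    "chunk about steric hindrance": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]

# Pick the chunk whose vector points in the direction closest to the query.
best = max(chunk_vectors, key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best)  # chunk about cross-coupling
```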

ChromaDB is an open-source vector database that our RAG application can query before the request is sent to the LLM. By default, ChromaDB uses the same sentence-embedding model we used above, “all-MiniLM-L6-v2”.

import chromadb

client = chromadb.Client()
collection = client.create_collection(name="MySentenceStore")
collection.add(
    documents=chunked_sentences,
    ids=[str(i) for i in range(len(chunked_sentences))],  # avoid shadowing the built-in id()
)

Before we send our query to the LLM, we retrieve the most relevant chunk from the vector database. In this example, the query returns the sentence most relevant to our question.

query_results = collection.query(
    query_texts=["What has transformed medicinal chemistry?"], n_results=1
)
print(query_results["documents"])
[[' At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry, and thus become one of the most frequently encountered types of reactions in synthesis planning']]
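
The retrieved chunk can then be placed in the prompt together with the question. A minimal sketch, assuming the nested-list shape of `query_results["documents"]` shown above; the helper name `build_prompt` and the prompt wording are hypothetical, and the call to the LLM itself is left out:

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into a single prompt string."""
    context = "\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-in for query_results["documents"][0], the chunks returned for the first query.
retrieved = [
    " At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry"
]
prompt = build_prompt("What has transformed medicinal chemistry?", retrieved)
print(prompt)
```

Only the retrieved chunks are sent to the LLM, so the prompt stays within the context window no matter how many chunks the database holds.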