# Collecting data on the synthesis procedures of bio-based adsorbents

```{warning}
To run this notebook, you will need access to at least one GPU. The results that are printed were obtained using a single A100 graphic card. Running the notebook with this configuration takes around 60 minutes. The GPU is recommended especially because of the embeddings generation.
```

````{admonition} Motivation 
:class: note

In this notebook, we collect structured data on the **synthesis procedures of bio-based adsorbents**, including their CO<sub>2</sub> adsorption capacity. With this type of dataset, we could build a model to predict the CO<sub>2</sub> adsorption capacity of an adsorbent from the biomass precursor and synthesis conditions.

Here, we want to show different prompting approaches for structured data extraction from papers. To illustrate this process, we extract information from an open-source article by {cite:t}`Shao2020`. We use zero-shot and few-shot prompting, and also some advanced prompting techniques, such as Chain of Thought (CoT) or CoT with self-consistency.
````

## Scientific background

Carbon-based materials, such as porous activated carbons, are promising adsorbents for removing CO<sub>2</sub> from industrial off-gases due to their high surface area, selective adsorption of CO<sub>2</sub>, hydrophobicity, temperature stability, and ease of regeneration. Recently, bio-based adsorbents have received considerable attention as sustainable and cost-effective materials for CO<sub>2</sub> capture, as they can be produced from renewable sources that are available worldwide at lower cost through relatively simple treatment processes.

One of the main advantages associated with biomass-derived adsorbents is the high potential to modify their pore structure and functionalize their surface. The production of carbon adsorbents from biomass precursors involves physical or chemical activation to develop porosity through the reaction of the precursor with an activating agent. Each carbon precursor may require specific activation conditions, resulting in different textural characteristics. The CO<sub>2</sub> adsorption capacity of an activated carbon mainly depends on its pore structure.
The availability of models to predict their textural properties and CO<sub>2</sub> adsorption capacity could accelerate the development of adsorption processes on bio-derived adsorbents by helping to synthesize more efficient adsorbents for CO<sub>2</sub> capture. 


## First steps

We begin by importing all the required packages. 

In [2]:
import matextract  # noqa: F401

from statistics import mean
import json
import os
import re

from sentence_transformers import SentenceTransformer, util
import pandas as pd

from pydantic import BaseModel, Field
from typing import Optional, Union, Dict, Any, List

from groq import Groq
from litellm import completion

import instructor

```{admonition} Download and parse the pdf file into markdown
:class: tip, dropdown

The download of the article was done manually from the publisher's website. At the time this book was created, this and other publishers did not allow text mining, hopefully this will change soon to simplify things.

Additionally, to parse the pdf file into a more manageable format such as it is markdown, [`marker`](https://github.com/VikParuchuri/marker) was used. We use this package because it returns good results for tables. However, some other packages with more flexible license can be used such as [`doctr`](https://github.com/mindee/doctr).
```

In [3]:
output_path = "./parsed_article"
output_md_path = os.path.join(output_path, "article/article.md")
with open(output_md_path, "r") as file:
    text = file.read()

In [4]:
print(text)



![0_image_1.png](0_image_1.png)

# Selectable Microporous Carbons Derived From Poplar Wood By Three Preparation Routes For Co2 Capture Lishu Shao,* Yafei Sang, Na Liu, Jun Liu, Peng Zhan, Jianhan Huang, And Jienan Chen*

Cite This: ACS Omega 2020, 5, 17450−17462 Read Online

ACCESS Metrics & More Article Recommendations *sı Supporting Information
ABSTRACT: Biomass-derived porous carbons are one kind of sustainable, extensive, and flexible carbon material for CO2 capture.

Here, we prepared several microporous carbons from poplar wood by three preparation routes. Especially, the residues of the poplar wood after the bioethanol process were explored as precursors to prepare activated carbon by KOH and ZnCl2 activation. By the adjustment of the preparation routes and the optimization of the activation conditions, these porous carbons exhibited diversified morphology (sponge, nanosheets, and honeycomb structure),
tunable porosity (specific surface areas: 511−2153 m2/g), and narrow microp

We can see that the text was extracted correctly. However, some additional cleaning can be done, specially to remove the images since we are not going to need them.

In [5]:
pattern = r"!\[\d+_image_\d+\.png\]\(\d+_image_\d+\.png\)"
clean_string = re.sub(pattern, "", text)
clean_string = re.sub("\n+", "\n", clean_string)

chunks = clean_string.split("\n")

## Chunking

The next step is to create smaller chunks of text, so the text fits within the context length of the model. In addition, we will further clean up the document by removing some sections that do not contain data, such as the *References* section.

In [6]:
num_chunk = 0
while num_chunk < len(chunks) - 1:
    chunk = chunks[num_chunk]

    if len(chunk) == 0 or chunk is None:
        del chunks[num_chunk]
        continue

    if "ASSOCIATED CONTENT" in chunk:
        chunks = chunks[:num_chunk]
        break  # Break the loop entirely if the end or the article is found

    # In case the OCR extraction fails in correctly recognize the paragraphs.
    # We merge chunks when the previous chunk does not end with a dot.
    if chunk[-1].strip() != ".":
        chunks[num_chunk] = chunk + "\n" + chunks[num_chunk + 1]
        del chunks[num_chunk + 1]
    else:
        num_chunk += 1

To finalize with the cleaning, we are going to isolate the tables, i.e., leave each one in a chunk without additional text. This will help to improve the results when extracting the data from them.

In [7]:
new_chunks = []
num_chunk = 0
while num_chunk < len(chunks):
    chunk = chunks[num_chunk]
    num_chunk += 1
    # Find all "Table \d+." matches
    table_matches = re.findall(r"Table \d+\.", chunk)
    if len(table_matches) > 1:
        split_chunks = re.split(r"(Table \d+\.)", chunk)
        merged_chunks = []
        i = 0
        while i < len(split_chunks):
            split_chunk = split_chunks[i]
            match = re.search(r"Table \d+\.", split_chunk)
            if match:
                tmp = split_chunk + split_chunks[i + 1]

                merged_chunks.append(tmp)
                i += 2
            else:
                merged_chunks.append(split_chunk)
                i += 1
        new_chunks.extend(merged_chunks)
    else:
        new_chunks.append(chunk)

In [8]:
for i, chunk in enumerate(new_chunks):
    print(f"Chunk {i}")
    print(chunk)
    print("\n")

Chunk 0
# Selectable Microporous Carbons Derived From Poplar Wood By Three Preparation Routes For Co2 Capture Lishu Shao,* Yafei Sang, Na Liu, Jun Liu, Peng Zhan, Jianhan Huang, And Jienan Chen*
Cite This: ACS Omega 2020, 5, 17450−17462 Read Online
ACCESS Metrics & More Article Recommendations *sı Supporting Information
ABSTRACT: Biomass-derived porous carbons are one kind of sustainable, extensive, and flexible carbon material for CO2 capture.


Chunk 1
Here, we prepared several microporous carbons from poplar wood by three preparation routes. Especially, the residues of the poplar wood after the bioethanol process were explored as precursors to prepare activated carbon by KOH and ZnCl2 activation. By the adjustment of the preparation routes and the optimization of the activation conditions, these porous carbons exhibited diversified morphology (sponge, nanosheets, and honeycomb structure),
tunable porosity (specific surface areas: 511−2153 m2/g), and narrow micropore distribution (0.

The next step is to classify the different chunks, as they contain information about the data that we want to extract.

We perform that by using embeddings. To produce the embeddings, we use the best model from the [HuggingFace Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

The way to proceed is to compare the embeddings of each chunk with those from a query containing the variables we want to extract. To measure the similarity between the two, we use [the cosine similarity measure](https://microsoft.github.io/kernel-memory/concepts/cosine-similarity).

```{margin}
Note that for this text-classification task, you can easily switch to your preferred HuggingFace model by substituting the model variable.
```

In [9]:
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct")

simple_prompt = "What are the porous carbons material, activation pretreatment method, activation chemical agent, activation temperature and CO2 uptake at 1 bar"

query = model.encode(simple_prompt, convert_to_tensor=True)

cosine_similarities = []
for chunk in new_chunks:
    text_embeddings = model.encode(chunk, convert_to_tensor=True)
    cosine_similarities.append(util.pytorch_cos_sim(query, text_embeddings).item())

cos_mean = mean(cosine_similarities)

classified_chunks = []
for i, value in enumerate(cosine_similarities):
    if value >= cos_mean:
        classified_chunks.append(new_chunks[i])

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


To evaluate the output of the model, we manually extracted the data from the article and saved it in a JSON file. 

In [10]:
# Load the JSON data
with open("ground_truth_data.json", "r") as file:
    ground_data = json.load(file)

In [11]:
df = pd.DataFrame(ground_data)
df

Unnamed: 0,name,pretreatment_process_method,pretreatment_activation_chemical_agent,activation_temperature,activation_temperature_units,co2_uptake_amount,co2_uptake_units
0,DZC-600-2,,ZnCl2,600,ºC,104.7,mg/g
1,BKC-600-2,bio-pretreatment,ZnCl3,600,ºC,80.0,mg/g
2,HKC-600-2,hydrothermal,ZnCl4,600,ºC,90.3,mg/g
3,DKC-600-2,,KOH,600,ºC,88.5,mg/g
4,BKC-600-2,bio-pretreatment,KOH,600,ºC,116.0,mg/g
5,HKC-600-2,hydrothermal,KOH,600,ºC,161.1,mg/g
6,HKC-700-2,hydrothermal,KOH,700,ºC,124.5,mg/g
7,HKC-800-2,hydrothermal,KOH,800,ºC,151.6,mg/g
8,HKC-600-1,hydrothermal,KOH,600,ºC,146.5,mg/g
9,HKC-800-1,hydrothermal,KOH,800,ºC,217.0,mg/g


## Prompting

Once we have only the chunks that apparently contain useful information, we can extract the data from them and compare it with the ground truth.

For data extraction, we use the Llama-3-70B-Instruct model, which we accessed through the [Groq](https://groq.com) API. We start by using a simple zero-shot prompt, to then escalate to some advanced prompting techniques for comparison.

In [12]:
base_model = "groq/llama3-70b-8192"

First, we define the system prompt that we will use for all the cases. This system prompt is quite simple, only presenting a role and a task to the model.

In [13]:
system_prompt = (
    "You are a scientific assistant and your task is to extract certain information from text. "
    "We are in a scientific environment. You MUST be critical of the units of the variables. "
    "Do not leave information behind. "
    "Only extract the variables that were developed in this study. You must omit the ones extracted from the bibliography"
)

### Naive Zero-Shot

To start, we will use a simple zero-shot prompt only asking the model to extract a list of variables and to join them with the ones extracted from previous chunks.

In [14]:
simple_prompt = """Extract only the variables detailed below from the provided text. Then join them with the data from previous chunks.
Only extract data if you know the corresponding sample or carbon adsorbent.
To provide the answer with the data, do it using a schema similar to the following:

- Name of the sample.
- Pretreatment process used.
- Pretreatment activation chemical agent used.
- Activation temperature.
- Units for the activation temperature.
- Amount of CO2 uptake.
- CO2 uptake units.

Text to extract from:

{chunk}

Add the newly extracted data to the one from previous chunks that is the following:

{memory}

Never leave the information from previous chunks behind.
Begin extracting!
"""

In [15]:
summary = ""
for chunk in classified_chunks:
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {"role": "user", "content": simple_prompt.format(chunk=chunk, memory=summary)},
    ]
    response = completion(
        model=base_model,
        messages=messages,
        temperature=0,
    )

    summary = response.choices[0].message.content

In [16]:
print(summary)

After carefully reading the provided text, I did not extract any new variables that meet the specified criteria. The text only provides information about the gas adsorption measurement method and the degassing process, but it does not provide any data about the samples, pretreatment processes, activation temperatures, or CO2 uptake.

Therefore, the extracted data remains the same as the previous chunks:

- Name of the sample: DKC-600-2
- Pretreatment process used: Not mentioned
- Pretreatment activation chemical agent used: KOH
- Activation temperature: 600
- Units for the activation temperature: °C
- Amount of CO2 uptake: 88.5
- CO2 uptake units: mg/g

- Name of the sample: DZC-600-2
- Pretreatment process used: Not mentioned
- Pretreatment activation chemical agent used: ZnCl2
- Activation temperature: 600
- Units for the activation temperature: °C
- Amount of CO2 uptake: Not mentioned
- CO2 uptake units: Not applicable

- Name of the sample: BKC-600-2
- Pretreatment process used: No

By comparing the results with the ground data, we see that they are not very good as the model only extracts some samples names and some of them are repeated. For the other variables, we see that some activation temperatures, temperature units and activation agent are correctly extracted, but this is not good enough. Hopefully, we can improve the results by building a more elaborate prompt.

### Zero-Shot with detailed schema

To try to improve the results, we will provide within the prompt a detailed schema of the variables that we want to extract and which we want the model to follow in its completions.

In [14]:
json_schema = {
    "sample_name": {"type": str},
    "pretreatment_process_method": {"type": str},
    "pretreatment_activation_chemical_agent": {"type": str},
    "activation_temperature": {"type": int},
    "activation_temperature_units": {"type": str},
    "co2_uptake_amount": {"type": float},
    "co2_uptake_units": {"type": str},
}

In the prompt we only substitute the list with the variables with the JSON schema defined above.

In [18]:
simple_prompt = """Extract the variables detailed bellow from the provided text, and then add them to the data from previous chunks.
To answer follow the next JSON for each of the samples:

{json_schema}

Text to extract from:

{chunk}

Finally, add the new extracted data, if there are new samples, to the data from previous chunks that is the following:

{memory}

Begin extracting!
"""

In [19]:
summary = ""
for chunk in classified_chunks:
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": simple_prompt.format(
                json_schema=str(json_schema), chunk=chunk, memory=summary
            ),
        },
    ]
    response = completion(
        model=base_model,
        messages=messages,
        temperature=0,
    )

    summary = response.choices[0].message.content

In [20]:
print(summary)

There is no new sample data to extract from the provided text. The text does not provide specific information about the samples, such as sample name, pretreatment process method, pretreatment activation chemical agent, activation temperature, CO2 uptake amount, and units. It appears to be a general discussion about the results of CO2 capture performance and porous texture studies.

The data from previous chunks remains the same:

{'HKC-800-1': {'co2_uptake_amount': 217.0, 'co2_uptake_units': 'mg/g'}},
{'c-CBAP-1N': {'co2_uptake_amount': 223.5, 'co2_uptake_units': 'mg/g'}},
{'H150-800': {'co2_uptake_amount': 228.1, 'co2_uptake_units': 'mg/g'}},
{'NPC500': {'co2_uptake_amount': 235.8, 'co2_uptake_units': 'mg/g'}},
{'Bamboo-1-973': {'co2_uptake_amount': 233.2, 'co2_uptake_units': 'mg/g'}},
{'AC-K-W-2-700': {'co2_uptake_amount': 237.6, 'co2_uptake_units': 'mg/g'}},
{'NHPCT-4-7': {'co2_uptake_amount': 243.3, 'co2_uptake_units': 'mg/g'}},
{'HCP2a-K700': {'co2_uptake_amount': 251.0, 'co2_upta

The results are even worse than with the previous prompt since in this case, only one of the sample names is extracted correctly. All the other samples are taken from the literature mentioned in the paper for comparison of the results with those of previous studies. Also, only two of the variables are detailed, meaning that the model can not even correctly follow the schema provided.

### Constrained Zero-Shot prompt

To improve the results, and encourage the model to follow a schema that allows us to easily read and evaluate the results, we will slightly improve our system. 

To do this we will constrain the model to follow a `pydantic` schema using `Instructor`. In addition, this will allow us to add new data through code in an easier way that is more robust that prompting the model to do it.

First we will define the `pydantic` Base Model that we want the model to follow.

````{margin}
Note that for this constraining, we move away from using `litellm`. This is because the interaction between `litellm` and `instructor` is not always correct, returning errors for some cases.
````

In [15]:
client = instructor.patch(Groq(), mode=instructor.Mode.MD_JSON)


class Sample(BaseModel):
    name: str = Field(
        ..., description="The name or acronym of the porous carbon material"
    )
    pretreatment_process_method: Optional[str] = None
    pretreatment_activation_chemical_agent: Optional[str]
    activation_temperature: Optional[int]
    activation_temperature_units: Optional[str]
    co2_uptake_amount: Optional[float] = Field(
        ..., description="The amount of CO2 uptake at 1 bar and 273K"
    )
    co2_uptake_units: Optional[str]


class Samples(BaseModel):
    sample: List[Sample]

And we define the function to add new samples and the new data to the existing samples.

In [16]:
def add_summary_to_schema(summary: Union[Samples, str], new_info: Samples) -> Samples:
    # Initialize summary as a new Samples instance if it's an empty string
    if summary == "":
        summary = Samples(sample=[])
    # Convert summary to Samples instance if it's a string
    elif isinstance(summary, str):
        summary = Samples.parse_raw(summary)
    elif not isinstance(summary, (Samples, str)):
        raise ValueError(
            "Summary must be an instance of Samples or a JSON string representing a Samples instance."
        )

    # Iterate over each Sample in new_info
    for new_sample in new_info.sample:
        # Check if there's an existing sample with the same name
        existing_sample = None
        for sample in summary.sample:
            if sample.name == new_sample.name:
                existing_sample = sample
                break

        # If no existing sample with the same name, add the new sample to summary
        if not existing_sample:
            summary.sample.append(new_sample)
        else:
            # If there's an existing sample, update its fields with non-None values from new_sample
            for field in new_sample.model_fields.keys():
                new_value = getattr(new_sample, field)
                if new_value is not None:
                    setattr(existing_sample, field, new_value)

    return summary

Finally, it is good to define our evaluation criteria and an evaluation function that we will use to evaluate the results.

To evaluate the extraction process, in this case we are going to follow the next convention that is the same as described in the text:
- *True positive (TP)* is a value correctly extracted (exact match) for one key.
- *False positive (FP)* is a value extracted from the paper, but it does not match what we expected.
- *False negative (FN)* is a value that is in the ground truth, but that has not been extracted by the model.
- *True negative (TN)*, as pointed in the main text, is not applicable.
  
And as a remembering of the typical metrics used to evaluate the data extraction task:

$\mathrm{Precision} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$

$\mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$

$F_1 \mathrm{Score}= 2 * (\mathrm{Precision} \cdot \mathrm{Recall}) / (\mathrm{Precision} + \mathrm{Recall})$

In [17]:
def metrics(summary: Samples, ground_data: List[Dict[str, Any]]) -> Dict[str, float]:
    tp = 0
    fp = 0
    fn = 0
    # Assuming 'samples_instance.sample' is a list of Sample objects
    for sample in summary.sample:
        # Convert the sample object to a dictionary
        sample_dict = vars(sample)
        # Find the corresponding ground truth entry
        ground_truth = next(
            (item for item in ground_data if item["name"] == sample_dict["name"]), None
        )
        if ground_truth:
            # Compare values
            for key, value in ground_truth.items():
                if key in sample_dict and sample_dict[key] == value:
                    tp += 1
                # When the ground truth is equal to None, it is captured in the above statement
                elif sample_dict[key] is None:
                    fn += 1
                else:
                    fp += 1

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)

    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

Now, we define the prompt that we want to use. Note that this prompt looks even simpler than the previous one. However, this is a bit more nuanced since the `pydantic` schema is passed to the models, e.g., for OpenAI models use [function calling](https://hamel.dev/blog/posts/prompt/#instructor).

In [24]:
simple_prompt = """Extract from the provided text the variables about porous carbon materials.

Text to extract from:

{chunk}

Begin extracting!
"""

And do the completion for each of the chunks.

In [25]:
summary = ""
for chunk in classified_chunks:
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": simple_prompt.format(chunk=chunk),
        },
    ]
    response: Samples = client.chat.completions.create(
        messages=messages,
        temperature=0,
        model="llama3-70b-8192",
        max_retries=3,
        response_model=Samples,
    )
    summary = add_summary_to_schema(summary, response)

In [26]:
data = []

for sample in summary.sample:
    sample_dict = {
        "Name": sample.name,
        "Pretreatment Process Method": sample.pretreatment_process_method,
        "Pretreatment Activation Chemical Agent": sample.pretreatment_activation_chemical_agent,
        "Activation Temperature": sample.activation_temperature,
        "Activation Temperature Units": sample.activation_temperature_units,
        "CO2 Uptake Amount": sample.co2_uptake_amount,
        "CO2 Uptake Units": sample.co2_uptake_units,
    }
    data.append(sample_dict)

df = pd.DataFrame(data)

In [27]:
df

Unnamed: 0,Name,Pretreatment Process Method,Pretreatment Activation Chemical Agent,Activation Temperature,Activation Temperature Units,CO2 Uptake Amount,CO2 Uptake Units
0,DZC-600-2,,ZnCl2,600.0,K,104.7,mg/g
1,DKC-600-2,,KOH,600.0,K,88.5,mg/g
2,BKC-600-2,,KOH,600.0,K,116.0,mg/g
3,HKC-600-2,,KOH,600.0,K,161.1,mg/g
4,HKC-700-2,,,700.0,K,124.5,mg/g
5,HKC-800-2,,,800.0,K,425.3,mg/g
6,Porous Carbon Material,,,,,,
7,,,,,,,
8,HKC-600-1,,KOH,600.0,K,146.5,mg/g
9,HKC-800-1,,KOH,800.0,K,450.7,mg/g


By manually inspecting the results, we can see that the extraction is partially correct, since the samples of our interest are extracted with the corresponding variables. On the other hand, the model is not able to correctly differentiate the adsorbents that were prepared in this work from those that were taken from the literature to compare. 

But we will obtain a better measure of the performance if we correctly evaluate the results. For that, we use the function previously defined.

In [28]:
results_zero_shot = metrics(summary, ground_data)

In [29]:
print(f"True positives: {results_zero_shot['true_positives']}")
print(f"False positives: {results_zero_shot['false_positives']}")
print(f"False negatives: {results_zero_shot['false_negatives']}")
print("*" * 25)
print(f"Precision: {round(results_zero_shot['precision'], 2)}")
print(f"Recall: {round(results_zero_shot['recall'], 2)}")
print(f"F1-Score: {round(results_zero_shot['f1'], 2)}")

True positives: 34
False positives: 14
False negatives: 8
*************************
Precision: 0.71
Recall: 0.81
F1-Score: 0.76


The results are good. However, it is possible to use more advanced prompting techniques in which we give the model some more context to try to improve the results even further.

### Two-Shot prompt

These few-shot prompts take advantage of the well-known in-context learning that Large Language Models possess and provide them with additional information within the prompt. {cite}`brown2020language`

In this prompt, we will give the model two examples on similar cases.

In [30]:
two_shot_prompt = """Two examples are given to you to help you better understand the task.
Example 1:

Text to extract from: {text1}
Answer: {answer1}

Example 2:

Text to extract from: {text2}
Answer: {answer2}

Now extract from the next text the variables about porous carbon materials.

Text to extract from:

{chunk}

Begin extracting!
"""

```` {margin}
Note that the second paragraph is not exactly as it is in the article's original text. This is because it was reduced on purpose to lower the amount of tokens.
````

In [31]:
text1 = "The bamboo was first added into a tubular furnace (KSY-6-16A, Tianjin Zhonghuan Co. Ltd, China) and heated to 773 K at an increasing rate of 5 Kmin1; then the temperature was kept for 1.5 h. In the activation process, the carbonized materials were impregnated by the KOH solution at the predetermined KOH/C mass ratios, and the mixture was dried at 378 K for 12 h. The resulting dry material was placed in a tubular furnace, followed by heating to the predetermined activation temperature at a ramp of 10 Kmin-1, which was held for 1.5 h. The heating process was conducted under N2 flow protection. Finally, the activated carbon particles were washed by using aq. HCl (1 mol L-1), followed by washing with deionized water until the pH value of the wash water was less than 8.0. The bamboo-derived activated carbon is denoted as Bamboo-X-Y, where X represents the KOH/C mass ratio, and Y denotes the activation temperature in K."
answer1 = "sample=[Sample(name='Bamboo-3-873', pretreatment_process_method=None, pretreatment_activation_chemical_agent='KOH', activation_temperature=773, activation_temperature_units='K', co2_uptake_amount=7.0, co2_uptake_units='mmol g-1'), Sample(name='Bamboo-1-973', pretreatment_process_method=None, pretreatment_activation_chemical_agent='KOH', activation_temperature=773, activation_temperature_units='K', co2_uptake_amount=5.3, co2_uptake_units='mmol g-1')]"

text2 = "| Table 1. Comparison of CO2 adsorption on activated carbons prepared from different precursors reported in the literature. Precursors Activating Adsorption CO2 uptake S(CO2/N2) [b] Ref. agents temperature[a] [K] [mmol g1 ] sawdust KOH 273/298 6.1/4.8 5.4 [25] polypyrrole KOH 273 6.2 5.3 [22] polypyrrole KOH 298 4.3 15.9 [23] polyfurfuryl KOH 298 3.2 6.5 [24] Bamboo-3-873 KOH 273/298 7.0/4.5 8.6 this study Bamboo-1-973 KOH 273/298 5.3/4.0 11.1 this study [a] Pressure: 1 bar. [b] Data was measured at 298 K and 1 bar; NA=not available.   |"
answer2 = "sample=[Sample(name='Bamboo-3-873', pretreatment_process_method=None, pretreatment_activation_chemical_agent='KOH', activation_temperature=None, activation_temperature_units=None, co2_uptake_amount=7.0, co2_uptake_units='mmol g-1'), Sample(name='Bamboo-1-973', pretreatment_process_method=None, pretreatment_activation_chemical_agent='KOH', activation_temperature=None, activation_temperature_units=None, co2_uptake_amount=5.3, co2_uptake_units='mmol g-1')]"

Both text fragments are from {cite:t}`Wei2012biomass`.

Ideally, the few-shots should contain at least once each of the variables. This is not followed for the shots presented above, but we are going to give it a chance.

In [32]:
summary = ""
for chunk in classified_chunks:
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": two_shot_prompt.format(
                text1=text1, answer1=answer1, text2=text2, answer2=answer2, chunk=chunk
            ),
        },
    ]
    response: Samples = client.chat.completions.create(
        messages=messages,
        temperature=0,
        model="llama3-70b-8192",
        max_retries=3,
        response_model=Samples,
    )
    summary = add_summary_to_schema(summary, response)

In [33]:
results_two_shot = metrics(summary, ground_data)

In [34]:
print(f"True positives: {results_two_shot['true_positives']}")
print(f"False positives: {results_two_shot['false_positives']}")
print(f"False negatives: {results_two_shot['false_negatives']}")
print("*" * 25)
print(f"Precision: {round(results_two_shot['precision'], 2)}")
print(f"Recall: {round(results_two_shot['recall'], 2)}")
print(f"F1-Score: {round(results_two_shot['f1'], 2)}")

True positives: 29
False positives: 19
False negatives: 8
*************************
Precision: 0.6
Recall: 0.78
F1-Score: 0.68


We can see that the results with the two-shot prompt do not improve those obtained when using the zero-shot prompt. This can mean that the shots chosen are not meaningful enough.

### Four-Shot prompt

Since the results when using only two-shot prompting are not better than the results from zero-shot, we are going to increase the number of shots to four examples and see if we can improve the zero-shot results.

The procedure is the same as for the previous case, we take the two examples from the previous case, we add two more, and we do the completion for all the chunks.

In [35]:
four_shot_prompt = """Four examples are given to you to help you better understand the task.
Example 1:

Text to extract from: {text1}
Answer: {answer1}

Example 2:

Text to extract from: {text2}
Answer: {answer2}

Example 3:

Text to extract from: {text3}
Answer: {answer3}

Example 4:

Text to extract from: {text4}
Answer: {answer4}

Now extract from the next text the variables about porous carbon materials.

Text to extract from:

{chunk}

Begin extracting!
"""

In [36]:
text3 = "K2CO3 activation: The precursor was impregnated in K2CO3 solution with an impregnation ratio (gK2CO3/g precursor) of 1 and the mixture was kept under refluxed and boiling for 4 h. Then, the filtered material was carbonized at 900 C for 2 h under N2 (flow rate 100 ml min1; heating rate 5 C min1). The resultant AC was repeatedly washed with 0.1 M HCl and hot distilled water and then dried. The carbon sample is labeled as AC_K2CO3. The carbonization step of the two samples was carried out on a tubular quartz tube kept inside a horizontal furnace."
answer3 = "sample=[Sample(name='AC_K2CO3', pretreatment_process_method='Carbonization', pretreatment_activation_chemical_agent='K2CO3', activation_temperature=900, activation_temperature_units='C', co2_uptake_amount=None, co2_uptake_units=None)]"

text4 = "Table 4 | CO2 uptake at 1 bar and 0 C of various carbon materials in comparison with AC_KOH and AC_K2CO3. Materials Precursor Activation SBET (m2 g1 ) | CO2 uptake (mmol g1 ) | Reference | | | |\n|-------------------------|-------------------------------------------|-------------|------|------|------------|\n| Activated carbon | Empty fruit bunch (EFB) of oil palm trees | KOH | 2510 | 5.2  | [66] |\n| Activated carbon | Fungi | KOH | 1479 | 5.5 | [34] |\n| Activated carbon | Olive stones | KOH | - | 5.6 | This study |\n| Activated carbon | Olive stones | K2CO3 | - | 3.8 | This study |"
answer4 = "sample=[Sample(name='Activated carbon from olive stones 1', pretreatment_process_method=None, pretreatment_activation_chemical_agent=KOH, activation_temperature=None, activation_temperature_units=None, co2_uptake_amount=5.6, co2_uptake_units='mmol g-1'), Sample(name='Activated carbon from olive stones 2', pretreatment_process_method=None, pretreatment_activation_chemical_agent=K2CO3, activation_temperature=None, activation_temperature_units=None, co2_uptake_amount=3.8, co2_uptake_units='mmol g-1')]"

````{margin}
These new examples were taken from the article by {cite:t}`Moussa2017`. Again, the second example was reduced to avoid high number of tokens.
````

In [37]:
summary = ""
for chunk in classified_chunks:
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": four_shot_prompt.format(
                text1=text1,
                answer1=answer1,
                text2=text2,
                answer2=answer2,
                text3=text3,
                answer3=answer3,
                text4=text4,
                answer4=answer4,
                chunk=chunk,
            ),
        },
    ]
    response: Samples = client.chat.completions.create(
        messages=messages,
        temperature=0,
        model="llama3-70b-8192",
        max_retries=3,
        response_model=Samples,
    )
    summary = add_summary_to_schema(summary, response)

In [38]:
results_four_shot = metrics(summary, ground_data)

In [39]:
print(f"True positives: {results_four_shot['true_positives']}")
print(f"False positives: {results_four_shot['false_positives']}")
print(f"False negatives: {results_four_shot['false_negatives']}")
print("*" * 25)
print(f"Precision: {round(results_four_shot['precision'], 2)}")
print(f"Recall: {round(results_four_shot['recall'], 2)}")
print(f"F1-Score: {round(results_four_shot['f1'], 2)}")

True positives: 34
False positives: 14
False negatives: 8
*************************
Precision: 0.71
Recall: 0.81
F1-Score: 0.76


The results obtained using the four-shot prompt only show improvement with respect the two-shot prompt. However, the metrics are exactly the same as for the zero-shot prompting.

### Chain of Thought (CoT)

This is the first real advanced prompt technique in this book.

The CoT prompt {cite}`wei2023chainofthought` encourages the model to think the task step by step, thus, activating the reasoning capabilities of the model, often leading to better results.

The problem with CoT and similar reasoning prompts is that the model reasons through completion, i.e., by producing tokens. Therefore, it is not possible to constrain the LLM output when using these type of prompts so as not to break the reasoning.

In [19]:
cot_prompt = """Extract the variables detailed below from the provided text. Then add them to the data from previous chunks.
To answer follow the next JSON format for each of the samples:

{json_schema}

Think step by step about what variables are present in the text and what the values are by studying and reasoning about the following text:

{chunk}

Begin extracting!
"""

Since we can not constrain the CoT completion, we are going to parse the output from the CoT prompt using another LLM that is constrained. Thus, during the first completion, the model reasons to extract the data while during the second the output is constrained. This constraining allow us to use the same functions used above for joining the samples and calculating the metrics.

For the parsing, we are going to define a very simple prompt.

In [20]:
prompt = """The original text is the following one:

{original_text}

Now extract the data that is contained as a JSON object
"""

In [43]:
responses = []
for i, chunk in enumerate(classified_chunks):
    client = Groq()
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": cot_prompt.format(json_schema=json_schema, chunk=chunk),
        },
    ]
    # First completion with CoT prompt
    response1 = (
        completion(
            model=base_model,
            messages=messages,
            temperature=0,
        )
        .choices[0]
        .message.content
    )

    responses.append(response1)

    client = instructor.patch(Groq(), mode=instructor.Mode.MD_JSON)

    messages = [
        {
            "role": "system",
            "content": "You are a text extractor and parser. Your task is to take a text and extract the information you are asked for.",
        },
        {
            "role": "user",
            "content": prompt.format(original_text=response1),
        },
    ]

    # Second completion: parsing and constraining
    response: Samples = client.chat.completions.create(
        messages=messages,
        temperature=0,
        model="llama3-70b-8192",
        max_retries=3,
        response_model=Samples,
    )
    summary = add_summary_to_schema(summary, response)

To have more insights of what happen during all the process and see the reasoning of the model, we should check each completion made by the model.

In [44]:
df_cot = pd.DataFrame(
    {
        "reasoning": responses,
    }
)
pd.set_option("max_colwidth", None)
df_cot

Unnamed: 0,reasoning
0,"Based on the provided text, I can extract the following variables:\n\n{'sample_name': {'type': <class 'str'>, 'value': 'DZC-600-2'},\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'carbonization'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'ZnCl2'},\n 'activation_temperature': {'type': <class 'int'>, 'value': None}, # No temperature value is mentioned\n 'activation_temperature_units': {'type': <class 'str'>, 'value': None}, # No temperature units are mentioned\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None}, # No CO2 uptake amount is mentioned\n 'co2_uptake_units': {'type': <class 'str'>, 'value': None}} # No CO2 uptake units are mentioned\n\nNote that I couldn't extract values for 'activation_temperature', 'activation_temperature_units', 'co2_uptake_amount', and 'co2_uptake_units' as they are not mentioned in the provided text."
1,"Here are the extracted variables in the specified JSON format:\n\n```\n{\n 'DKC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'two-step route'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': ''},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': ''}\n },\n 'BKC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'two-step route'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': ''},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': ''}\n },\n 'HKC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'two-step route'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': ''},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': ''}\n },\n 'HKC-700-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'hydrothermally combined activation route'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': ''},\n 'activation_temperature': {'type': <class 'int'>, 'value': 700},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': ''}\n },\n 'HKC-800-1': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'hydrothermally combined activation route'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': ''},\n 'activation_temperature': {'type': <class 'int'>, 'value': 800},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': ''}\n },\n 'HKC-800-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'hydrothermally combined activation route'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': ''},\n 'activation_temperature': {'type': <class 'int'>, 'value': 800},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': None},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': ''}\n }\n}\n```\n\nNote that `co2_uptake_amount` and `co2_uptake_units` are not mentioned in the text, so their values are set to `None` and an empty string, respectively."
2,"There are no variables that match the specified format in the provided text. The text only mentions the pore size of the mesopores, which is not one of the variables we are looking for.\n\nSince there are no matching variables, the output will be an empty dictionary:\n\n{}\n\nLet me know when you're ready to move on to the next chunk of text!"
3,"From the provided text, I can extract the following variables:\n\n{'sample_name': {'type': <class 'str'>, 'value': None}, \n'pretreatment_process_method': {'type': <class 'str'>, 'value': None}, \n'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': None}, \n'activation_temperature': {'type': <class 'int'>, 'value': None}, \n'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'}, \n'co2_uptake_amount': {'type': <class 'float'>, 'value': None}, \n'co2_uptake_units': {'type': <class 'str'>, 'value': None}}\n\nNote that I couldn't find values for 'sample_name', 'pretreatment_process_method', 'pretreatment_activation_chemical_agent', 'activation_temperature', and 'co2_uptake_amount' in the provided text. Also, I extracted the unit of 'activation_temperature' as 'K' (Kelvin) which is a unit of temperature.\n\nPlease let me know if I should proceed with the next chunk of text or if you need any further clarification."
4,"Here are the extracted variables in the specified JSON format:\n\n{'BKC-600-2': {'type': <class 'str'>}, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'bioethanol'}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'KOH'}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 600}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 116.0}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'DKC-600-2': {'type': <class 'str'>}, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': ''}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'KOH'}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 600}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 88.5}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'HKC-600-2': {'type': <class 'str'>}, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'hydrothermal'}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'KOH'}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 600}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 54.6}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'HKC-600-2': {'type': <class 'str'>}, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'hydrothermal'}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'KOH'}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 600}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 161.1}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}"
5,"Here are the extracted variables in the specified JSON format:\n\n{'HKC-600-2': \n {'type': <class 'str'>, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': None}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': None}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 600}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 161.1}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'} \n}, \n\n'HKC-700-2': \n {'type': <class 'str'>, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': None}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': None}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 700}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 124.5}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'} \n}, \n\n'HKC-800-2': \n {'type': <class 'str'>, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': None}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': None}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 800}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 151.6}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'} \n}, \n\n'HKC-600-1': \n {'type': <class 'str'>, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': None}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': None}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 600}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 146.5}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'} \n}, \n\n'HKC-800-1': \n {'type': <class 'str'>, \n 'pretreatment_process_method': {'type': <class 'str'>, 'value': None}, \n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': None}, \n 'activation_temperature': {'type': <class 'int'>, 'value': 800}, \n 'activation_temperature_units': {'type': <class 'str'>, 'value': '°C'}, \n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 217}, \n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'} \n}}"
6,"Based on the provided text, I will extract the variables as follows:\n\n**Sample 1**\n{'sample_name': {'type': <class 'str'>, 'value': 'ZnCl2-activated porous carbons'},\n'pretreatment_process_method': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'ZnCl2'},\n'activation_temperature': {'type': <class 'int'>, 'value': 273},\n'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n'co2_uptake_amount': {'type': <class 'float'>, 'value': 90.3},\n'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n\n**Sample 2**\n{'sample_name': {'type': <class 'str'>, 'value': 'ZnCl2-activated porous carbons'},\n'pretreatment_process_method': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'ZnCl2'},\n'activation_temperature': {'type': <class 'int'>, 'value': 273},\n'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n'co2_uptake_amount': {'type': <class 'float'>, 'value': 120.2},\n'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n\n**Sample 3**\n{'sample_name': {'type': <class 'str'>, 'value': 'ZnCl2-activated porous carbons'},\n'pretreatment_process_method': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'ZnCl2'},\n'activation_temperature': {'type': <class 'int'>, 'value': 273},\n'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n'co2_uptake_amount': {'type': <class 'float'>, 'value': 113.8},\n'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n\n**Sample 4**\n{'sample_name': {'type': <class 'str'>, 'value': 'porous carbons'},\n'pretreatment_process_method': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'activation_temperature': {'type': <class 'int'>, 'value': 298},\n'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n'co2_uptake_amount': {'type': <class 'float'>, 'value': 48.6},\n'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n\n**Sample 5**\n{'sample_name': {'type': <class 'str'>, 'value': 'porous carbons'},\n'pretreatment_process_method': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'Not mentioned'},\n'activation_temperature': {'type': <class 'int'>, 'value': 298},\n'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n'co2_uptake_amount': {'type': <class 'float'>, 'value': 126.1},\n'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n\nPlease let me know if this extraction is correct or if I need to make any changes."
7,"From the provided text, I can extract the following variables:\n\n{'HKC-800-1': \n {'type': <class 'str'>, \n 'pretreatment_process_method': {'type': None}, \n 'pretreatment_activation_chemical_agent': {'type': None}, \n 'activation_temperature': {'type': 800, 'activation_temperature_units': {'type': '°C'}}, \n 'co2_uptake_amount': {'type': 217.0}, \n 'co2_uptake_units': {'type': 'mg/g'}}\n}\n\nNote that I've assumed the unit of activation temperature to be °C based on the notation ""HKC-800-1"", which is a common notation in scientific literature. Also, I've assumed the unit of CO2 uptake to be mg/g based on the context of the sentence."
8,"Here are the extracted variables in the specified JSON format:\n\n```\n{\n 'DZC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 104.7},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'BZC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 80.0},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'HZC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 90.3},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'DKC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 88.5},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'BKC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 116.0},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'HKC-600-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 161.1},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'HKC-700-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 700},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 124.5},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'HKC-800-2': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 800},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 151.6},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'HKC-600-1': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 600},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 146.5},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n },\n 'HKC-800-1': {\n 'type': <class 'str'>,\n 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'},\n 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'},\n 'activation_temperature': {'type': <class 'int'>, 'value': 800},\n 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'},\n 'co2_uptake_amount': {'type': <class 'float'>, 'value': 217.0},\n 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}\n }\n}\n```\n\nNote that the `pretreatment_process_method` and `pretreatment_activation_chemical_agent` variables have 'NA' values, as they are not explicitly mentioned in the text."
9,"Here are the extracted variables in the specified JSON format:\n\n{'HKC-800-1': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 800}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'K'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 217.0}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'c-CBAP-1N': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 'NA'}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 223.5}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'H150-800': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 800}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 228.1}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'NPC500': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 500}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 235.8}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'Bamboo-1-973': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 973}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 233.2}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'AC-K-W-2-700': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 700}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 237.6}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'NHPCT-4-7': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 'NA'}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 243.3}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'HCP2a-K700': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 700}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 251.0}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'ACDS-800-2': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 800}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 264.0}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'CMS-K3': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 'NA'}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 286.4}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}\n\n{'CSC-650': {'type': <class 'str'>}, 'pretreatment_process_method': {'type': <class 'str'>, 'value': 'NA'}, 'pretreatment_activation_chemical_agent': {'type': <class 'str'>, 'value': 'NA'}, 'activation_temperature': {'type': <class 'int'>, 'value': 650}, 'activation_temperature_units': {'type': <class 'str'>, 'value': 'NA'}, 'co2_uptake_amount': {'type': <class 'float'>, 'value': 295.7}, 'co2_uptake_units': {'type': <class 'str'>, 'value': 'mg/g'}}"


By analyzing the completions, we see that they are not reasoning at all. Normally, to correct this, the CoT prompt is build in a few-shot configuration to encourage the model to reason.

Studying the final results and metrics will allow us to draw further conclusions.

In [45]:
print(summary)

sample=[Sample(name='DZC-600-2', pretreatment_process_method='ZnCl2-activated', pretreatment_activation_chemical_agent='ZnCl2', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=104.7, co2_uptake_units='mg/g'), Sample(name='DKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=88.5, co2_uptake_units='mg/g'), Sample(name='BKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=116.0, co2_uptake_units='mg/g'), Sample(name='HKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=161.1, co2_uptake_units='cm3/g'), Sample(name='HKC-700-2', pretreatment_process_method='hydrothermally com

And finally compute proper metrics.

In [46]:
results_cot = metrics(summary, ground_data)

In [47]:
print(f"True positives: {results_cot['true_positives']}")
print(f"False positives: {results_cot['false_positives']}")
print(f"False negatives: {results_cot['false_negatives']}")
print("*" * 25)
print(f"Precision: {round(results_cot['precision'], 2)}")
print(f"Recall: {round(results_cot['recall'], 2)}")
print(f"F1-Score: {round(results_cot['f1'], 2)}")

True positives: 30
False positives: 26
False negatives: 0
*************************
Precision: 0.54
Recall: 1.0
F1-Score: 0.7


The results are similar to the two-shot prompt, and worse than for the zero- and four-shot.

There are several potential reasons why this happened. For instance, as pointed out above, the model did not reason at all. One way of solving this could be preparing the CoT prompt as a few-shot prompt showing the model how we expect it to reason about the different problems.


### Chain of Thought + self-consistency

Finally, to try to improve the previous results, we are going to sample different outputs from the CoT prompt which is known as self-consistency. Self-consistency {cite}`wang2023selfconsistency` involves sampling different answers of different calls to the model using the same prompt, to then give all these answers to the model and let it decide which information is correct and provide a unique and improved final answer.

````{margin}
We set a different temperature than 0 to give the model room to vary between the different completions. Self-consistency at temperature equal to 0 makes no sense.
````

In [23]:
responses = {}
for j in range(3):
    summary = ""
    for i, chunk in enumerate(classified_chunks):
        client = Groq()

        messages = [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": cot_prompt.format(json_schema=json_schema, chunk=chunk),
            },
        ]
        response1 = (
            completion(
                model=base_model,
                messages=messages,
                temperature=0.25,
            )
            .choices[0]
            .message.content
        )

        client = instructor.patch(Groq(), mode=instructor.Mode.MD_JSON)

        messages = [
            {
                "role": "system",
                "content": "You are a text extractor and parser. Your task is to take a text and extract the information you are asked for.",
            },
            {
                "role": "user",
                "content": prompt.format(original_text=response1),
            },
        ]
        response: Samples = client.chat.completions.create(
            messages=messages,
            temperature=0,
            model="llama3-70b-8192",
            max_retries=3,
            response_model=Samples,
        )
        summary = add_summary_to_schema(summary, response)
    responses[f"response_{j+1}"] = summary

Then, to evaluate the three completions and only take the valuable information, we use another LLM prompted for that specific task.

In [24]:
decision_prompt_system = """ You are a scientific assistant.
Your task is to take three extraction results from different agents and provide a final answer with the information that you consider correct"""

In [25]:
decision_prompt_user = """ First response:

{response_1}

Second response:

{response_2}

Third response:

{response_3}

Now provide your final response by analysing the three responses.
"""

In [26]:
print(responses)

{'response_1': Samples(sample=[Sample(name='DZC-600-2', pretreatment_process_method='ZnCl2-activated', pretreatment_activation_chemical_agent='ZnCl2', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=104.7, co2_uptake_units='mg/g'), Sample(name='DKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=88.5, co2_uptake_units='mg/g'), Sample(name='BKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=116.0, co2_uptake_units='mg/g'), Sample(name='HKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=161.1, co2_uptake_units='mg/g'), Sample(name='HKC-700-2', pretreatment_process_meth

In [28]:
client = instructor.patch(Groq(), mode=instructor.Mode.MD_JSON)

messages = [
    {
        "role": "system",
        "content": decision_prompt_system,
    },
    {
        "role": "user",
        "content": decision_prompt_user.format(
            response_1=str(responses["response_1"]),
            response_2=str(responses["response_2"]),
            response_3=str(responses["response_3"]),
        ),
    },
]
response: Samples = client.chat.completions.create(
    messages=messages,
    temperature=0,
    model="llama3-70b-8192",
    max_retries=3,
    response_model=Samples,
)

In [29]:
print(response)

sample=[Sample(name='DZC-600-2', pretreatment_process_method='ZnCl2-activated', pretreatment_activation_chemical_agent='ZnCl2', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=104.7, co2_uptake_units='mg/g'), Sample(name='DKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=88.5, co2_uptake_units='mg/g'), Sample(name='BKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=116.0, co2_uptake_units='mg/g'), Sample(name='HKC-600-2', pretreatment_process_method='KOH-activated', pretreatment_activation_chemical_agent='KOH', activation_temperature=600, activation_temperature_units='K', co2_uptake_amount=161.1, co2_uptake_units='mg/g'), Sample(name='HKC-700-2', pretreatment_process_method='hydrothermally comb

In [30]:
results_self_consistency = metrics(response, ground_data)

In [31]:
print(f"True positives: {results_self_consistency['true_positives']}")
print(f"False positives: {results_self_consistency['false_positives']}")
print(f"False negatives: {results_self_consistency['false_negatives']}")
print("*" * 25)
print(f"Precision: {round(results_self_consistency['precision'], 2)}")
print(f"Recall: {round(results_self_consistency['recall'], 2)}")
print(f"F1-Score: {round(results_self_consistency['f1'], 2)}")

True positives: 33
False positives: 21
False negatives: 2
*************************
Precision: 0.61
Recall: 0.94
F1-Score: 0.74


The results are better than the simple CoT prompt, and very close to the four-shot prompt. One way of trying to improve these results would be to try different temperatures and see which value works better.

## Final conclusions

In this notebook, we tested several prompting techniques, from less to more complex techniques.

The results might not seem very logical, since the most advanced and complex technique is not the one that returns the best results. However, as pointed out in the main text and by {cite:t}`stechly2024chainthoughtlessnessanalysiscot` and  {cite:t}`ridnik2024codegenerationalphacodiumprompt`, the results are not always as expected when using these prompting techniques. Nevertheless, it is worth testing them because they are very easy to apply.

## References



```{bibliography}
:filter: docname in docnames
```