1.3. Data annotation#

For structured data extraction, it is essential to define a structure into which the extracted data will be stored. This process of defining and applying structure to data is often referred to as data annotation. The choice of data structure and format is crucial: it can significantly impact the efficiency and effectiveness of your data extraction process.

1.3.1. Considerations for choosing a data structure#

While there’s no one-size-fits-all answer, here are some key points to consider when deciding on a data structure:

  • Nested vs. flat: Formats like YAML or JSON are better at expressing dependencies and hierarchies than flat formats like CSV (see the example after this list).

  • Verbosity and human readability: Some formats (e.g., XML) can be more verbose, which may affect readability.

  • Type annotations and documentation: It’s best practice to include as much metadata as possible, including data types and descriptions.

  • Machine readability: Consider how easily the format can be parsed by various programming languages and tools.

  • Extensibility: Choose a format that allows for easy addition of new fields or structures as your data needs evolve.

  • Standardization: Consider if there are any industry-standard formats for your specific domain.
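
To make the nested-vs.-flat distinction concrete, here is a minimal sketch of the same (simplified) reaction record in both styles; the field names mirror the pydantic example later in this section:

# Nested YAML: the list of reactants hangs off the record
reaction_name: Buchwald-Hartwig reaction
reactants:
  - 5-Bromo-m-xylene
  - Benzylmethylamine
temperature: 65

In flat CSV, the same list has to be squeezed into a single delimited cell:

reaction_name,reactants,temperature
Buchwald-Hartwig reaction,"5-Bromo-m-xylene; Benzylmethylamine",65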

Old formats regain attention

Note also that models have different preferences for formats. For example, Claude was reported to perform very well with XML, whereas GPT seems to prefer JSON.

1.3.2. Additional aspects to consider#

1.3.2.1. Schema definition languages#

When working with complex data structures, it can be beneficial to use schema definition languages. These allow you to formally define your data structure, which can then be used for validation, documentation, and even code generation. Some examples include:

LinkML

LinkML (Linked data Modeling Language) is a framework for modeling and working with structured data. It can be useful for structured data extraction using LLMs in several ways (a minimal schema sketch follows the list):

  • Schema definition: LinkML allows you to define schemas for your data models, specifying classes, attributes, and relationships. This provides a structured foundation for extracting information.

  • Interoperability: It promotes interoperability between different data formats and systems, making it easier to integrate extracted data into various workflows.

  • Validation: LinkML schemas can be used to validate extracted data, ensuring it conforms to the defined structure and constraints.

  • Code generation: It can generate code in various languages (e.g., Python, Java) based on your schema, facilitating data manipulation and processing.

  • Semantic web integration: LinkML is compatible with semantic web technologies, allowing extracted data to be easily linked and integrated with existing knowledge graphs.
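
As a minimal sketch (the schema id and the selection of attributes below are illustrative, not an official example), a LinkML schema for the reaction record used later in this section could look like this:

id: https://example.org/reaction-schema
name: reaction_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Reaction:
    description: A single chemical reaction extracted from text
    attributes:
      reaction_name:
        description: The name of the reaction
      reactants:
        description: The reactants of the reaction
        multivalued: true
      temperature:
        description: The temperature of the reaction
        range: float

From such a schema, LinkML tooling can then generate, for example, Python classes similar to the pydantic model shown later in this section.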

1.3.2.2. Semantic annotations#

Beyond just structuring data, consider adding semantic annotations. This involves linking your data to standardized vocabularies or ontologies, which can greatly enhance the interoperability and machine-readability of your data. For example:

  • Using standard chemical identifiers (such as InChI or SMILES) for the chemical species you extract.

  • Linking temperature and pressure units to standard unit ontologies (e.g., the QUDT ontology) or materials science ontologies (EMMO, MatOnto), as sketched below.
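
As a sketch of what this can look like in practice (the field names here are hypothetical), machine-resolvable identifiers can be carried alongside the human-readable values:

# Hypothetical annotated record: machine-resolvable identifiers
# sit next to the human-readable values
annotated_record = {
    "product": "N-Benzyl-N-methyl(3,5-xylyl)amine",
    "product_identifier_type": "SMILES",
    "product_identifier": "CN(Cc1ccccc1)c1cc(C)cc(C)c1",  # SMILES for the product; verify before use
    "temperature": 65,
    "temperature_unit": "http://qudt.org/vocab/unit/DEG_C",  # QUDT IRI for degree Celsius
}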

1.3.2.3. Handling missing or uncertain data#

In real-world data extraction scenarios, you often encounter missing or uncertain data. Your annotation scheme should have a way to represent both (one possible convention is sketched after this list):

  • Missing data (e.g., null values, specific placeholders).

  • Uncertain data (e.g., confidence scores, possible value ranges).
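
One possible convention, shown here as a sketch with hypothetical field names, is to use null (None in Python) for missing values and to attach an explicit confidence score to uncertain ones:

# Hypothetical convention: None marks missing data, a companion
# field carries the extractor's confidence in an uncertain value
record = {
    "solvent": None,              # missing: not reported in the source text
    "rxn_yield": 88,
    "rxn_yield_confidence": 0.9,  # uncertain extraction
}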

1.3.2.3.1. Versioning#

If your data model is likely to evolve over time, consider how you’ll handle versioning. This might involve:

  • Including version numbers in your schema (a minimal sketch follows this list).

  • Maintaining backwards compatibility.

  • Documenting changes between versions.
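
A lightweight option, sketched below with a hypothetical field name, is to embed the schema version in every record so that downstream consumers can detect which revision of the data model produced it:

# Hypothetical versioned record: schema_version tells consumers
# which revision of the data model produced this record
record = {
    "schema_version": "1.1.0",
    "reaction_name": "Buchwald-Hartwig reaction",
}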

Data model, schema, format

Data Model: Represents the structure and organization of data, defining how data is stored, organized, and manipulated. In this case, it includes reactions, reactants, products, and conditions.

Schema: Defines the structure of data within a particular format, specifying the data types, constraints, and relationships. For example, an XML Schema (XSD) or JSON Schema can be used to validate the structure of XML or JSON data, respectively.

Format: Refers to the way data is encoded and represented for storage or transmission. CSV, YAML, JSON, and XML are different formats that encode data in different ways.

In practice, it is often most convenient to define a data schema in code. This has multiple advantages:

  • the data schema can be tracked with all other code using version control,

  • there are existing routines for export in various formats,

  • data can be conveniently accessed in code, e.g., via class attributes.

pydantic is a library that makes it easy to define and validate data in Python. It can also parse data from various formats and serialize it to various formats.

from pydantic import BaseModel, Field
from typing import List


class Reaction(BaseModel):
    """Structured record of a single chemical reaction."""

    reaction_name: str = Field(..., description="The name of the reaction")
    reactants: List[str] = Field(..., description="The reactants of the reaction")
    catalyst: List[str] = Field(..., description="The catalysts of the reaction")
    base: str = Field(..., description="The base of the reaction")
    solvent: str = Field(..., description="The solvent of the reaction")
    temperature: int = Field(..., description="The temperature of the reaction")
    temperature_unit: str = Field(..., description="The unit of the temperature")
    product: str = Field(..., description="The product of the reaction")
    rxn_yield: float = Field(..., description="The yield of the reaction")

We can now use the Reaction class to create an instance from a dict.

rxn_dict = {
    "reaction_name": "Buchwald-Hartwig reaction",
    "reactants": ["5-Bromo-m-xylene", "Benzylmethylamine"],
    "catalyst": ["Bis(dibenzylideneacetone)palladium(0)", "Tri(o-tolyl)phosphine"],
    "base": "Sodium tert-butoxide",
    "solvent": "Toluene",
    "temperature": 65,
    "temperature_unit": "°C",
    "product": "N-Benzyl-N-methyl(3,5-xylyl)amine",
    "rxn_yield": 88,
}
rxn = Reaction(**rxn_dict)
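
Because pydantic validates against the declared types, malformed input is rejected at construction time. For example (with a deliberately invalid value), a non-numeric temperature raises a ValidationError:

from pydantic import ValidationError

try:
    # "hot" cannot be coerced to int, so validation fails
    Reaction(**{**rxn_dict, "temperature": "hot"})
except ValidationError as err:
    print(err)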

We can also serialize the data to a JSON string.

rxn.model_dump_json()
'{"reaction_name":"Buchwald-Hartwig reaction","reactants":["5-Bromo-m-xylene","Benzylmethylamine"],"catalyst":["Bis(dibenzylideneacetone)palladium(0)","Tri(o-tolyl)phosphine"],"base":"Sodium tert-butoxide","solvent":"Toluene","temperature":65,"temperature_unit":"°C","product":"N-Benzyl-N-methyl(3,5-xylyl)amine","rxn_yield":88.0}'

Annotating data

The creation of a test and validation set is crucial and can be time-consuming. Therefore, using an annotation tool is recommended. TeamTat can be a convenient choice for this.

Labeling itself can be very difficult, and it can be worthwhile to consider the annotation guidelines used by other projects.