1.3. Data annotation#

For structured data extraction, it is essential to define a structure into which the extracted data will be stored. This process of defining and applying structure to data is often referred to as data annotation. The choice of data structure and format is crucial: it can significantly impact the efficiency and effectiveness of your data extraction process.

1.3.1. Considerations for choosing a data structure#

While there’s no one-size-fits-all answer, here are some key points to consider when deciding on a data structure:

  • Nested vs. flat: Formats like YAML or JSON are better at expressing dependencies and hierarchies than flat formats like CSV (see the example after this list).

  • Verbosity and human readability: Some formats (e.g., XML) can be more verbose, which may affect readability.

  • Type annotations and documentation: It’s best practice to include as much metadata as possible, including data types and descriptions.

  • Machine readability: Consider how easily the format can be parsed by various programming languages and tools.

  • Extensibility: Choose a format that allows for easy addition of new fields or structures as your data needs evolve.

  • Standardization: Consider if there are any industry-standard formats for your specific domain.
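
To make the nested-vs.-flat distinction concrete, here is a minimal sketch of the same (simplified) reaction record in both styles; the field names mirror the pydantic example later in this section:

# Nested YAML: the list of reactants hangs off the record
reaction_name: Buchwald-Hartwig reaction
reactants:
  - 5-Bromo-m-xylene
  - Benzylmethylamine
temperature: 65

In flat CSV, the same list has to be squeezed into a single delimited cell:

reaction_name,reactants,temperature
Buchwald-Hartwig reaction,"5-Bromo-m-xylene; Benzylmethylamine",65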

Old formats regain attention

Note also that models have different preferences for formats. For example, Claude was reported to perform very well with XML, whereas GPT seems to prefer JSON.

1.3.2. Additional aspects to consider#

1.3.2.1. Schema definition languages#

When working with complex data structures, it can be beneficial to use schema definition languages. These allow you to formally define your data structure, which can then be used for validation, documentation, and even code generation. Some examples include:

LinkML

LinkML (Linked data Modeling Language) is a framework for modeling and working with structured data. It can be useful for structured data extraction using LLMs in several ways (a minimal schema sketch follows the list):

  • Schema definition: LinkML allows you to define schemas for your data models, specifying classes, attributes, and relationships. This provides a structured foundation for extracting information.

  • Interoperability: It promotes interoperability between different data formats and systems, making it easier to integrate extracted data into various workflows.

  • Validation: LinkML schemas can be used to validate extracted data, ensuring it conforms to the defined structure and constraints.

  • Code generation: It can generate code in various languages (e.g., Python, Java) based on your schema, facilitating data manipulation and processing.

  • Semantic web integration: LinkML is compatible with semantic web technologies, allowing extracted data to be easily linked and integrated with existing knowledge graphs.
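
As a minimal sketch (the schema id and the selection of attributes below are illustrative, not an official example), a LinkML schema for the reaction record used later in this section could look like this:

id: https://example.org/reaction-schema
name: reaction_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Reaction:
    description: A single chemical reaction extracted from text
    attributes:
      reaction_name:
        description: The name of the reaction
      reactants:
        description: The reactants of the reaction
        multivalued: true
      temperature:
        description: The temperature of the reaction
        range: float

From such a schema, LinkML tooling can then generate, for example, Python classes similar to the pydantic model shown later in this section.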

1.3.2.2. Semantic annotations#

Beyond just structuring data, consider adding semantic annotations. This involves linking your data to standardized vocabularies or ontologies, which can greatly enhance the interoperability and machine-readability of your data. For example:

  • Using standard chemical identifiers (such as InChI or SMILES) for the chemical species you extract.

  • Linking temperature and pressure units to standard unit ontologies (e.g., the QUDT ontology) or materials science ontologies (EMMO, MatOnto), as sketched below.
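
As a sketch of what this can look like in practice (the field names here are hypothetical), machine-resolvable identifiers can be carried alongside the human-readable values:

# Hypothetical annotated record: machine-resolvable identifiers
# sit next to the human-readable values
annotated_record = {
    "product": "N-Benzyl-N-methyl(3,5-xylyl)amine",
    "product_identifier_type": "SMILES",
    "product_identifier": "CN(Cc1ccccc1)c1cc(C)cc(C)c1",  # SMILES for the product; verify before use
    "temperature": 65,
    "temperature_unit": "http://qudt.org/vocab/unit/DEG_C",  # QUDT IRI for degree Celsius
}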

1.3.2.3. Handling missing or uncertain data#

In real-world data extraction scenarios, you often encounter missing or uncertain data. Your annotation scheme should have a way to represent both (one possible convention is sketched after this list):

  • Missing data (e.g., null values, specific placeholders).

  • Uncertain data (e.g., confidence scores, possible value ranges).
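
One possible convention, shown here as a sketch with hypothetical field names, is to use null (None in Python) for missing values and to attach an explicit confidence score to uncertain ones:

# Hypothetical convention: None marks missing data, a companion
# field carries the extractor's confidence in an uncertain value
record = {
    "solvent": None,              # missing: not reported in the source text
    "rxn_yield": 88,
    "rxn_yield_confidence": 0.9,  # uncertain extraction
}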

1.3.2.3.1. Versioning#

If your data model is likely to evolve over time, consider how you’ll handle versioning. This might involve:

  • Including version numbers in your schema (a minimal sketch follows this list).

  • Maintaining backwards compatibility.

  • Documenting changes between versions.
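
A lightweight option, sketched below with a hypothetical field name, is to embed the schema version in every record so that downstream consumers can detect which revision of the data model produced it:

# Hypothetical versioned record: schema_version tells consumers
# which revision of the data model produced this record
record = {
    "schema_version": "1.1.0",
    "reaction_name": "Buchwald-Hartwig reaction",
}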

Data model, schema, format

Data Model: Represents the structure and organization of data, defining how data is stored, organized, and manipulated. In this case, it includes reactions, reactants, products, and conditions.

Schema: Defines the structure of data within a particular format, specifying the data types, constraints, and relationships. For example, an XML Schema (XSD) or JSON Schema can be used to validate the structure of XML or JSON data, respectively.

Format: Refers to the way data is encoded and represented for storage or transmission. CSV, YAML, JSON, and XML are different formats that encode data in different ways.

In practice, it is often most convenient to define a data schema in code. This has multiple advantages:

  • the data schema can be tracked with all other code using version control,

  • there are existing routines for export in various formats,

  • data can be conveniently accessed in code, e.g., via class attributes.

pydantic is a library that makes it easy to define and validate data in Python. It can also parse data from various formats and serialize it to various formats.

from pydantic import BaseModel, Field
from typing import List


class Reaction(BaseModel):
    """Structured record of a single chemical reaction."""

    reaction_name: str = Field(..., description="The name of the reaction")
    reactants: List[str] = Field(..., description="The reactants of the reaction")
    catalyst: List[str] = Field(..., description="The catalysts of the reaction")
    base: str = Field(..., description="The base of the reaction")
    solvent: str = Field(..., description="The solvent of the reaction")
    temperature: int = Field(..., description="The temperature of the reaction")
    temperature_unit: str = Field(..., description="The unit of the temperature")
    product: str = Field(..., description="The product of the reaction")
    rxn_yield: float = Field(..., description="The yield of the reaction")

We can now use the Reaction class to create an instance from a dict.

rxn_dict = {
    "reaction_name": "Buchwald-Hartwig reaction",
    "reactants": ["5-Bromo-m-xylene", "Benzylmethylamine"],
    "catalyst": ["Bis(dibenzylideneacetone)palladium(0)", "Tri(o-tolyl)phosphine"],
    "base": "Sodium tert-butoxide",
    "solvent": "Toluene",
    "temperature": 65,
    "temperature_unit": "°C",
    "product": "N-Benzyl-N-methyl(3,5-xylyl)amine",
    "rxn_yield": 88,
}
rxn = Reaction(**rxn_dict)
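
Because pydantic validates against the declared types, malformed input is rejected at construction time. For example (with a deliberately invalid value), a non-numeric temperature raises a ValidationError:

from pydantic import ValidationError

try:
    # "hot" cannot be coerced to int, so validation fails
    Reaction(**{**rxn_dict, "temperature": "hot"})
except ValidationError as err:
    print(err)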

We can also serialize the data to a JSON string.

rxn.model_dump_json()
'{"reaction_name":"Buchwald-Hartwig reaction","reactants":["5-Bromo-m-xylene","Benzylmethylamine"],"catalyst":["Bis(dibenzylideneacetone)palladium(0)","Tri(o-tolyl)phosphine"],"base":"Sodium tert-butoxide","solvent":"Toluene","temperature":65,"temperature_unit":"°C","product":"N-Benzyl-N-methyl(3,5-xylyl)amine","rxn_yield":88.0}'

Annotating data

The creation of a test and validation set is crucial and can be time-consuming. Therefore, using an annotation tool is recommended. TeamTat can be a convenient choice for this.

Labeling itself can be very difficult, and it can be worthwhile to consider the annotation guidelines used by other projects.