{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "93e1ab4b3d759339",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Data annotation\n",
    "\n",
    "Structured data extraction requires a predefined structure in which the extracted data will be stored. This process of defining and applying structure to data is often referred to as data annotation. The choice of data structure and format can significantly affect the efficiency and effectiveness of your data extraction process.\n",
    "\n",
    "### Considerations for choosing a data structure\n",
    "While there's no one-size-fits-all answer, here are some key points to consider when deciding on a data structure:\n",
    "\n",
    "- Nested vs. flat: Formats such as YAML or JSON express dependencies and hierarchies better than flat formats such as CSV.\n",
    "- Verbosity and human readability: Some formats (e.g., XML) are more verbose, which can make them harder to read.\n",
    "- Type annotations and documentation: It's best practice to include as much metadata as possible, including data types and field descriptions.\n",
    "- Machine readability: Consider how easily the format can be parsed by various programming languages and tools.\n",
    "- Extensibility: Choose a format that allows for easy addition of new fields or structures as your data needs evolve.\n",
    "- Standardization: Consider whether there are any industry-standard formats for your specific domain.\n",
    "\n",
    "```{admonition} Old formats regain attention\n",
    ":class: note\n",
    "\n",
    "> I'm switching from team JSON to team XML for LLM prompts. Escaping JSON and writing JSON is much more limiting. You can just yeet XML tags wherever and not have to worry about escaping/format/validity.\n",
    ">\n",
    "> Andrew White 🐦⬛/acc (@andrewwhite01), May 2, 2024\n",
    "\n",
    "Note also that models have different preferences for formats. [For example, Claude was reported to perform very well with XML, whereas GPT seems to prefer JSON](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/).\n",
    "\n",
    "```\n",
    "\n",
    "```{admonition} Examples of various serialization formats\n",
    ":class: dropdown\n",
    "\n",
    "\n",
    " To model reactions, we can serialize the same simple data model in different formats.\n",
    "\n",
    " Description of our data model:\n",
    "\n",
    " - A reaction has a name, reactants, products, and conditions.\n",
    " - Each reactant and product has a chemical formula and amount.\n",
    " - Conditions include temperature and pressure.\n",
    "\n",
    " #### CSV example\n",
    "\n",
    " ```\n",
    " reaction_name,reactant_formula,reactant_amount,product_formula,product_amount,temperature,pressure\n",
    " Combustion of Methane,CH4,1 mol,CO2,1 mol,298 K,1 atm\n",
    " Combustion of Methane,O2,2 mol,H2O,2 mol,298 K,1 atm\n",
    " Photosynthesis,CO2,6 mol,C6H12O6,1 mol,300 K,1 atm\n",
    " Photosynthesis,H2O,6 mol,O2,6 mol,300 K,1 atm\n",
    " ```\n",
    "\n",
    " #### YAML example\n",
    "\n",
    " ```\n",
    " reactions:\n",
    "   - name: Combustion of Methane\n",
    "     reactants:\n",
    "       - formula: CH4\n",
    "         amount: 1 mol\n",
    "       - formula: O2\n",
    "         amount: 2 mol\n",
    "     products:\n",
    "       - formula: CO2\n",
    "         amount: 1 mol\n",
    "       - formula: H2O\n",
    "         amount: 2 mol\n",
    "     conditions:\n",
    "       temperature: 298 K\n",
    "       pressure: 1 atm\n",
    "   - name: Photosynthesis\n",
    "     reactants:\n",
    "       - formula: CO2\n",
    "         amount: 6 mol\n",
    "       - formula: H2O\n",
    "         amount: 6 mol\n",
    "     products:\n",
    "       - formula: C6H12O6\n",
    "         amount: 1 mol\n",
    "       - formula: O2\n",
    "         amount: 6 mol\n",
    "     conditions:\n",
    "       temperature: 300 K\n",
    "       pressure: 1 atm\n",
    " ```\n",
    "\n",
    " #### JSON example\n",
    "\n",
    " ```\n",
    " {\n",
    "   \"reactions\": [\n",
    "     {\n",
    "       \"name\": \"Combustion of Methane\",\n",
    "       \"reactants\": [\n",
    "         { \"formula\": \"CH4\", \"amount\": \"1 mol\" },\n",
    "         { \"formula\": \"O2\", \"amount\": \"2 mol\" }\n",
    "       ],\n",
    "       \"products\": [\n",
    "         { \"formula\": \"CO2\", \"amount\": \"1 mol\" },\n",
    "         { \"formula\": \"H2O\", \"amount\": \"2 mol\" }\n",
    "       ],\n",
    "       \"conditions\": {\n",
    "         \"temperature\": \"298 K\",\n",
    "         \"pressure\": \"1 atm\"\n",
    "       }\n",
    "     },\n",
    "     {\n",
    "       \"name\": \"Photosynthesis\",\n",
    "       \"reactants\": [\n",
    "         { \"formula\": \"CO2\", \"amount\": \"6 mol\" },\n",
    "         { \"formula\": \"H2O\", \"amount\": \"6 mol\" }\n",
    "       ],\n",
    "       \"products\": [\n",
    "         { \"formula\": \"C6H12O6\", \"amount\": \"1 mol\" },\n",
    "         { \"formula\": \"O2\", \"amount\": \"6 mol\" }\n",
    "       ],\n",
    "       \"conditions\": {\n",
    "         \"temperature\": \"300 K\",\n",
    "         \"pressure\": \"1 atm\"\n",
    "       }\n",
    "     }\n",
    "   ]\n",
    " }\n",
    " ```\n",
    "\n",
    " #### XML example\n",
    "\n",
    " ```\n",
    "