## Document cleaning

Most downloaded files include sections that are irrelevant for data extraction, such as the bibliography and acknowledgments, so the next step is to remove them.
In addition, depending on the chosen document parsing technique, further cleaning steps might be required.



### Cleaning text using hand-coded rules

In [15]:
import matextract  # noqa: F401

with open("raw_text.txt", "r") as f:
    content = f.read()

#### Removing extraneous line breaks

In [21]:
print(content)

Linear Amine-Linked Oligo-BODIPYS: Convergent Access via
Sebastian H. Rôttger, [a] Lukas J. Patalag,o) Felix Hasenmaile,a Lukas Milbrandt,o) Burkhard

Buchwald-Hartwig Coupling
Butschke,cl Peter G. Jonesld] and Daniel B. Werz*la)
[a] S.H. Rôttger, Dr. F. Hasenmaile, Prof. Dr. D.B. Werz
Institute of Organic Chemistry
AlbertstraBe 21, 79104 Freiburg (Breisgau), Germany
E-mail: daniel. wer@chemeunltelbupde
[b] Dr. L. J. Patalag, L. Milbrandt
Technische Universitât Braunschweig
Institute of Organic Chemistry
Hagenring 30, 38106 Braunschweig, Germany
[c] Dr. B. Butschke
Abert.ludwgsUnkerstat Freiburg
Institute of Inorganic and Analytical Chemistry
AlbertstraBe 21, 79104 Freiburg (Breisgau), Germany
[d] Prof. Dr. P. G. Jones
Technische Universitât Braunschweig
Institute of Inorganic and. Analytical Chemistry
Hagenring 30, 38106 Braunschweig, Germany

DFG Cluster of Excellence livMats @FIT and Aber.uowgsUnversiat Freiburg

Abstract: A convergent route towards nitrogen-bridged BODIPY
oligomers

We see that this extraction contains many line breaks in places where there should be none.
Thus, a first step is to remove those extraneous line breaks.

In [17]:
from unstructured.cleaners.core import group_broken_paragraphs

In [20]:
print(group_broken_paragraphs(content.replace("\n\n", "\n")))

Linear Amine-Linked Oligo-BODIPYS: Convergent Access via Sebastian H. Rôttger, [a] Lukas J. Patalag,o) Felix Hasenmaile,a Lukas Milbrandt,o) Burkhard Buchwald-Hartwig Coupling Butschke,cl Peter G. Jonesld] and Daniel B. Werz*la) [a] S.H. Rôttger, Dr. F. Hasenmaile, Prof. Dr. D.B. Werz Institute of Organic Chemistry AlbertstraBe 21, 79104 Freiburg (Breisgau), Germany E-mail: daniel. wer@chemeunltelbupde [b] Dr. L. J. Patalag, L. Milbrandt Technische Universitât Braunschweig Institute of Organic Chemistry Hagenring 30, 38106 Braunschweig, Germany [c] Dr. B. Butschke Abert.ludwgsUnkerstat Freiburg Institute of Inorganic and Analytical Chemistry AlbertstraBe 21, 79104 Freiburg (Breisgau), Germany [d] Prof. Dr. P. G. Jones Technische Universitât Braunschweig Institute of Inorganic and. Analytical Chemistry Hagenring 30, 38106 Braunschweig, Germany DFG Cluster of Excellence livMats @FIT and Aber.uowgsUnversiat Freiburg Abstract: A convergent route towards nitrogen-bridged BODIPY oligomers ha
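To make the idea behind this step more concrete, here is a minimal plain-Python sketch of paragraph regrouping. It is not the implementation used by `unstructured`; the helper name `join_broken_lines` is hypothetical, and the sketch assumes that blank lines mark real paragraph boundaries (which does not hold for the OCR output above, hence the `replace("\n\n", "\n")` call in the cell).

```python
import re


def join_broken_lines(text: str) -> str:
    """Join lines that were split inside a paragraph.

    Blank lines are treated as paragraph boundaries; single line breaks
    inside a paragraph are replaced by a space.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = [
        " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        for paragraph in paragraphs
    ]
    return "\n\n".join(paragraph for paragraph in joined if paragraph)


# Hypothetical sample loosely modeled on the extraction above
sample = (
    "Abstract: A convergent route towards\n"
    "nitrogen-bridged BODIPY\n"
    "oligomers\n\n"
    "Introduction\n"
    "The family of BODIPY dyes"
)
print(join_broken_lines(sample))
```

A rule like this is cheap, but it cannot repair lines that the parser emitted in the wrong order, such as the title and author lines interleaved above.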

As a next step, we can remove the parts that we do not need.
In this case, a good approximation is to remove everything before the introduction and everything from the acknowledgments onward.

```{admonition} Regex
:class: tip

Regular expressions (regex) are a powerful tool for pattern matching, allowing for complex searches, substitutions, and data extraction based on specific string patterns. For instance, the regular expression `r"\[MISSING_PAGE_FAIL:\d+\]"`, which is used later in this section, matches the literal text `[MISSING_PAGE_FAIL:` followed by one or more digits and a closing bracket.

[This page](https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions) provides a good overview in the context of text-data cleaning.
```
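As a small illustration of the pattern described in the tip, the following sketch applies it to a made-up string. The sample text is hypothetical; only the pattern itself comes from this section.

```python
import re

# Pattern from the tip: a literal "[MISSING_PAGE_FAIL:", one or more digits, and "]"
missing_page_pattern = re.compile(r"\[MISSING_PAGE_FAIL:\d+\]")

# Hypothetical sample string for illustration
sample = "First page text. [MISSING_PAGE_FAIL:3] Text after the gap."

matches = missing_page_pattern.findall(sample)  # every occurrence of the marker
cleaned = missing_page_pattern.sub("", sample)  # the same text without the markers

print(matches)
print(cleaned)
```

The square brackets are escaped with backslashes because an unescaped `[` would start a character class instead of matching a literal bracket.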

In [35]:
import re

# Define the pattern to match the text between the introduction and acknowledgments sections
match_text_between_introduction_and_acknowledgment_sections = re.compile(
    r"Introduction.*?Acknowledgements", re.DOTALL
)

# Extract the text between the introduction and acknowledgments sections
filtered_text = re.findall(
    match_text_between_introduction_and_acknowledgment_sections, content
)[0].replace("Acknowledgements", "")

In [34]:
filtered_text

'Introduction\n\nhalide.\n\nrespective\n\nNot only does the selective\n\n- -  - EN\n\ncooe\n\nThe family of BODIPY dyes, first reported in 1968 by Treibs and\nKreuzer,"l has gained major interest in research in the past\ndecades because of their fairly simple preparative access, their\nflexibility in terms of possible modifications and their useful\nproperties such as outstanding attenuation coefficients and also\nhigh fluorescence quantum yields.2) Hence, they are already\nwidely applied for imaging, e.g. as biomarkers for medical\npurposes, and have also proven to be applicable in other fields,\nfor instance as various types of photosensitizers and organic light-\nemitting diodes (OLEDs).3) Various types of oligo-BODIPYS have\nalready shown the capability to enhance such desirable\nproperties and thus have been the focus of much recent\npreparative chemistry. Alkylene bridged or directly connected\n\nBThiswork\n\nsymmetric & unsymmetric dimers\nand\nfunctionalized examples\n\nBODIPYS

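This already looks much better because we removed a lot of the extraneous text before the introduction and after the main body. Note that the regular expression was applied to the raw `content`, so the output shown here still contains line breaks; these can be removed with `group_broken_paragraphs` as shown above.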
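Indexing the result of `re.findall` with `[0]` raises an `IndexError` when a document lacks one of the two headings, for example when a publisher writes "Acknowledgments" without the "e". A slightly more defensive variant could look like the sketch below; the helper name `extract_between` and the sample text are hypothetical.

```python
import re
from typing import Optional


def extract_between(text: str, start: str, end: str) -> Optional[str]:
    """Return the text from `start` up to (not including) `end`, or None if absent."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    match = pattern.search(text)
    if match is None:
        return None
    return start + match.group(1)


# Hypothetical sample document
sample = "Title\nIntroduction\nBody text.\nAcknowledgements\nThanks."

print(extract_between(sample, "Introduction", "Acknowledgements"))
print(extract_between(sample, "Introduction", "Funding"))  # heading not present
```

Returning `None` lets the calling code decide whether to skip the document, try an alternative heading, or keep the full text.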

### Cleaning text using LLMs

Yet, we still see artifacts where the text contains misplaced words or characters.
Those are difficult to remove using hard-coded rules. As an alternative or additional step, one can use LLMs to remove the remaining problems.

In [45]:
from litellm import OpenAI

client = OpenAI()

In [54]:
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "We extracted the following text using OCR from a PDF. Clean up the text, i.e. remove extraneous characters, fix other issues such as words that do not fit in or remove characters that obviously are not part of the text. Return only the cleaned text.",
        },
        {"role": "user", "content": filtered_text},
    ],
    temperature=0,
)

In [57]:
print(completion.choices[0].message.content)

Introduction

The family of BODIPY dyes, first reported in 1968 by Treibs and Kreuzer, has gained major interest in research in the past decades because of their fairly simple preparative access, their flexibility in terms of possible modifications, and their useful properties such as outstanding attenuation coefficients and high fluorescence quantum yields. Hence, they are already widely applied for imaging, e.g., as biomarkers for medical purposes, and have also proven to be applicable in other fields, for instance, as various types of photosensitizers and organic light-emitting diodes (OLEDs). Various types of oligo-BODIPYs have already shown the capability to enhance such desirable properties and thus have been the focus of much recent preparative chemistry. Alkylene-bridged or directly connected symmetric and unsymmetric dimers and functionalized examples of BODIPYs have been known for several years (Figure 1A, top).

Figure 1. A) Various C-C bridged (top) and heteroatom bridged (

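This LLM-cleaned text already looks much better.
However, LLM-based cleaning costs more than rule-based cleaning, and it introduces another possibility for errors to creep in because the model may silently drop or rewrite content. In the output above, for example, the garbled citation markers have been removed entirely.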
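One inexpensive way to catch some of these errors is to check how many words in the cleaned text never occurred in the original. The sketch below is only a heuristic under simple assumptions (lowercased alphanumeric tokens); the function name `fraction_of_new_words` and the sample strings are hypothetical, and a sensible threshold would have to be tuned per corpus.

```python
import re


def tokenize(text: str) -> list:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fraction_of_new_words(original: str, cleaned: str) -> float:
    """Fraction of words in `cleaned` that never occur in `original`."""
    original_words = set(tokenize(original))
    cleaned_words = tokenize(cleaned)
    if not cleaned_words:
        return 0.0
    new_words = [word for word in cleaned_words if word not in original_words]
    return len(new_words) / len(cleaned_words)


# Hypothetical OCR snippet and two candidate cleanings
ocr_text = 'Kreuzer,"l has gained major\ninterest - - - EN cooe in research'
faithful = "Kreuzer, has gained major interest in research"
rewritten = "Kreuzer has gained enormous interest in science"

print(fraction_of_new_words(ocr_text, faithful))
print(fraction_of_new_words(ocr_text, rewritten))
```

A high fraction suggests that the model rephrased or invented content and that the document deserves a manual look. Words that OCR split or misspelled will also count as new, so the score should be read as a warning signal rather than as proof of an error.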

### Removing sections in Markdown files

Markdown output, which one obtains, for example, from tools such as `nougat`, makes document cleaning simpler because section headings can be readily identified.

In this example, occurrences of the pattern `[MISSING_PAGE_FAIL:x]` and the sections 'Acknowledgements' and 'References' are detected and deleted.

In [1]:
import re


def clean_text(text):
    # Delete the pattern [MISSING_PAGE_FAIL:x]
    cleaned_text = re.sub(r"\[MISSING_PAGE_FAIL:\d+\]", "", text)

    # Delete the acknowledgements section
    cleaned_text = re.sub(
        r"## Acknowledgements.*?(?=##|$)", "", cleaned_text, flags=re.S
    )

    # delete the references section
    cleaned_text = re.sub(r"## References.*", "", cleaned_text, flags=re.S)

    return cleaned_text


input_file = "./markdown_files/10.26434_chemrxiv-2024-1l0sn.mmd"

with open(input_file, "r", encoding="utf-8") as f:
    content = f.read()

# clean the text
cleaned_text = clean_text(content)
print(cleaned_text)



These types of connectivity have also been converted to extended \(\pi\)-systems by oxidative follow-up reactions, allowing a higher level of conjugation and hence strong bathochromic shifts.[8] The installation of heteroatoms has however been a challenge for some time. In 2014, Shinokubo et al. presented linearly connect monomers through an azo-bridge at the \(\beta\)-position (Figure 1A (d)).[10] Linear connectivity at the \(\alpha\)-position using heteroatoms such as sulfur has been achieved through a similarly iterative process by the groups of Hao and Jiao (Figure 1A (e)).[7] Furthermore, cyclic amine-linked oligo-BODIPYs have already been synthesized in a one-pot reaction in 2022 by Song et al., utilizing Buchwald-Hartwig conditions (Figure 1A (f)).[10]

We present a novel type of BODIPY oligomers, connected via _N_-bridges in a linear fashion (Figure 1B). Utilizing both symmetric and unsymmetric BODIPY monomers as building blocks has paved the way to selectively synthesize oli

```{tip}
A more powerful cleaning script for Markdown files, e.g., as produced using `nougat`, was created for the ChemNLP project.
You can find it [here](https://github.com/OpenBioML/chemnlp/blob/main/data/natural/preprocess_nougat.py).
```
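The regex-based `clean_text` function above can also be expressed by splitting the Markdown at its level-2 headings and dropping unwanted sections by name. The following is a minimal sketch of that idea, assuming that sections start with `## ` at the beginning of a line; the function name `drop_sections` and the sample document are hypothetical.

```python
import re
from typing import Iterable


def drop_sections(markdown: str, unwanted: Iterable[str]) -> str:
    """Remove level-2 sections whose heading is listed in `unwanted` (case-insensitive)."""
    # Split right before every line that starts with "## ", keeping the heading
    parts = re.split(r"(?m)^(?=## )", markdown)
    unwanted_lower = {name.lower() for name in unwanted}
    kept = []
    for part in parts:
        heading = re.match(r"## (.*)", part)
        if heading and heading.group(1).strip().lower() in unwanted_lower:
            continue  # skip the whole section, including its subsections
        kept.append(part)
    return "".join(kept)


# Hypothetical Markdown document
markdown = (
    "# Title\n\nIntro text.\n\n"
    "## Results\n\nSome results.\n\n"
    "## Acknowledgements\n\nThanks.\n\n"
    "## References\n\n[1] A paper.\n"
)

print(drop_sections(markdown, ["Acknowledgements", "References"]))
```

Compared with one regular expression per section, this variant makes it easy to extend the list of unwanted headings, for example with both spellings of "Acknowledgments".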

### Harmonizing XML files

Many APIs return articles directly in a machine-readable XML format. However, the XML schemas differ considerably between publishers, which can make this kind of cleanup tedious. Packages such as [Pub2TEI](https://github.com/kermitt2/Pub2TEI) help to streamline this process by converting the publisher-specific formats into a common TEI format.

```{note}

Execute the following lines in a bash terminal.

    docker run --rm --gpus all --init --ulimit core=0 -p 8060:8060 grobid/pub2tei:0.2
    git clone https://github.com/kermitt2/Pub2TEI
    cd Pub2TEI/client
    pip install requests

This starts the Pub2TEI service with Docker.
```

In [13]:
import os
import requests
import time

# Define the input directory containing XML files and the output directory for TEI files
input_dir = "./XML_files"
output_dir = "./XML_files_cleaned"
os.makedirs(output_dir, exist_ok=True)

# Define the Pub2TEI server URL
server_url = "http://localhost:8060/service/processXML"


# Function to process a single XML file and write the TEI result to output_file
def process_xml_file(xml_file, output_file):
    for attempt in range(5):  # Retry up to 5 times
        try:
            # Reopen the input on every attempt so each upload starts at the first byte
            with open(xml_file, "rb") as xml_handle:
                files = {
                    "input": xml_handle,
                    # Optional, set to '1' for sentence segmentation
                    "segmentSentences": (None, "1"),
                    # Optional, set to '1' for refining with Grobid
                    "grobidRefine": (None, "1"),
                }
                response = requests.post(server_url, files=files)
        except requests.exceptions.ConnectionError as e:
            # requests raises its own ConnectionError, not the built-in one
            print(f"Connection error: {e}. Retrying in 5 seconds...")
            time.sleep(5)
            continue
        if response.status_code == 200:
            with open(output_file, "wb") as f:
                f.write(response.content)
            print(f"Processed {xml_file} successfully.")
            return output_file
        print(f"Failed to process {xml_file}. Status code: {response.status_code}")
        return None
    return None


# Process all XML files in the input directory
for filename in os.listdir(input_dir):
    if filename.endswith(".xml"):
        input_file = os.path.join(input_dir, filename)
        print(input_file)
        output_file = os.path.join(output_dir, filename.replace(".xml", ".tei.xml"))
        process_xml_file(input_file, output_file)

./XML_files/ao0c01342.xml
Processed ./XML_files/ao0c01342.xml successfully.


In [14]:
with open(output_file, "r", encoding="utf-8") as f:
    content = f.read()
    print(f"Content of {output_file}:\n")
    print(content)

Content of ./XML_files_cleaned/ao0c01342.tei.xml:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title level="a" type="main">Synthesis and Biological Evaluation of Three New Chitosan
Schiff Base Derivatives</title>
         </titleStmt>
         <publicationStmt>
            <publisher>American Chemical Society</publisher>
            <availability>
               <p>
                  <s>American Chemical Society</s>
               </p>
            </availability>
            <date type="e-published" when="2020-06-01">2020</date>
            <date when="2020-06-16">2020</date>
            <date type="Copyright" when="2020">2020</date>
         </publicationStmt>
         <notesStmt>
            <note type="cont

One could use this tool to first convert the files downloaded from different publishers into one common format and afterwards remove irrelevant sections of these articles automatically, as shown at the beginning of this section.
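Once all articles share the TEI format, irrelevant sections can be dropped with a standard XML parser. The sketch below uses Python's built-in `xml.etree.ElementTree` on a heavily shortened, hypothetical TEI document. It assumes that sections appear as `<div>` elements with a `<head>` child inside `<text><body>`, which is common in TEI but should be checked against the actual Pub2TEI output; the function name `extract_sections` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared in the root element of the output above
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily shortened TEI document for illustration
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title level="a" type="main">A hypothetical title</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p><s>First sentence.</s> <s>Second sentence.</s></p></div>
    <div><head>Acknowledgments</head><p><s>Thanks.</s></p></div>
  </body></text>
</TEI>"""


def extract_sections(tei_xml, skip=("acknowledgments", "acknowledgements", "references")):
    """Return the article title and a dict mapping section headings to their text."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    title = " ".join(title.split())  # collapse line breaks inside the title
    sections = {}
    for div in root.iterfind(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.findtext("tei:head", default="", namespaces=TEI_NS).strip()
        if head.lower() in skip:
            continue  # drop unwanted sections by heading
        paragraphs = ["".join(p.itertext()).strip() for p in div.iterfind("tei:p", TEI_NS)]
        sections[head] = " ".join(paragraphs)
    return title, sections


title, sections = extract_sections(sample_tei)
print(title)
print(sections)
```

Because the structure is explicit, no regular expressions over free text are needed; the same function can be applied to every file in the output directory.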