## Document parsing with OCR tools


Data extraction pipelines often start by converting inputs into a form that text-based processing steps can use.

In many cases, this means that image inputs (e.g., scans of a paper) must be converted into text.
This involves multiple steps:

- characters must be recognized, which is known as [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition) (Optical Character Recognition),
- the layout and reading order must be understood,
- relevant blocks of text must be extracted, cleaned and combined.
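
To make the second step more concrete, the following is a minimal sketch of recovering a reading order for a simple single-column page. The `TextBlock` class, the `reading_order` function, and the `line_tolerance` parameter are hypothetical names introduced only for this illustration; real layout tools handle multi-column pages, figures, and tables, which this sketch does not.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A recognized piece of text with the top-left corner of its bounding box."""

    text: str
    x: float  # horizontal position, 0 = left edge of the page
    y: float  # vertical position, 0 = top edge of the page


def reading_order(blocks, line_tolerance=0.01):
    """Sort blocks top-to-bottom, then left-to-right within a line.

    Blocks whose y coordinate differs from the first block of the current
    line by at most `line_tolerance` are treated as part of that line.
    """
    ordered = sorted(blocks, key=lambda b: b.y)
    lines, current = [], []
    for block in ordered:
        if current and abs(block.y - current[0].y) > line_tolerance:
            lines.append(current)
            current = []
        current.append(block)
    if current:
        lines.append(current)
    return [b for line in lines for b in sorted(line, key=lambda b: b.x)]


# Toy input with made-up coordinates, deliberately out of order
blocks = [
    TextBlock("world", x=0.5, y=0.101),
    TextBlock("Second line", x=0.1, y=0.2),
    TextBlock("Hello", x=0.1, y=0.1),
]
print(" ".join(b.text for b in reading_order(blocks)))
```

The tolerance absorbs small vertical jitter between boxes on the same line; a value that is too large would merge neighboring lines.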

In the past, this was often done using tools specialized for each of these steps (e.g., [`tesseract`](https://github.com/tesseract-ocr/tesseract), [`LayoutParser`](https://github.com/Layout-Parser/layout-parser)). New tools such as [`nougat`](https://github.com/facebookresearch/nougat) or [`marker`](https://github.com/VikParuchuri/marker), however, can perform the entire process end-to-end.


As an example, we will use the [`docTR`](https://github.com/mindee/doctr) tool to convert a PDF into plain text that can be sent to an LLM.

In [None]:
import matextract  # noqa: F401
import os
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

[`docTR`](https://mindee.github.io/doctr/latest/using_doctr/using_models.html) internally uses different modules for text detection (identifying sequences of characters) and then text recognition (converting the detected elements to text).

In [5]:
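def convert_pdf_with_doctr(pdf_path, det_arch="db_resnet50", reco_arch="crnn_vgg16_bn"):
    """Run docTR OCR on a PDF and return the recognized text as a string."""
    # Load the pretrained detection and recognition models
    model = ocr_predictor(det_arch=det_arch, reco_arch=reco_arch, pretrained=True)
    # Read the PDF pages as images
    doc = DocumentFile.from_pdf(pdf_path)
    # Run detection and recognition on all pages
    result = model(doc)

    # Render the structured result as plain text
    return result.render()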

As an example, one of the PDFs downloaded in the [data mining notebook](../obtaining_data/data_mining.ipynb) is converted into plain text.

In [7]:
pdf_dir = "../obtaining_data/PDFs"
specific_pdf_file = "10.26434_chemrxiv-2024-1l0sn.pdf"


# Build the full path to the PDF and run the OCR conversion
pdf_path = os.path.join(pdf_dir, specific_pdf_file)
text = convert_pdf_with_doctr(pdf_path)

In [8]:
print(text)

Linear Amine-Linked Oligo-BODIPYS: Convergent Access via
Sebastian H. Rôttger, [a] Lukas J. Patalag,o) Felix Hasenmaile,a Lukas Milbrandt,o) Burkhard

Buchwald-Hartwig Coupling
Butschke,cl Peter G. Jonesld] and Daniel B. Werz*la)
[a] S.H. Rôttger, Dr. F. Hasenmaile, Prof. Dr. D.B. Werz
Institute of Organic Chemistry
AlbertstraBe 21, 79104 Freiburg (Breisgau), Germany
E-mail: daniel. wer@chemeunltelbupde
[b] Dr. L. J. Patalag, L. Milbrandt
Technische Universitât Braunschweig
Institute of Organic Chemistry
Hagenring 30, 38106 Braunschweig, Germany
[c] Dr. B. Butschke
Abert.ludwgsUnkerstat Freiburg
Institute of Inorganic and Analytical Chemistry
AlbertstraBe 21, 79104 Freiburg (Breisgau), Germany
[d] Prof. Dr. P. G. Jones
Technische Universitât Braunschweig
Institute of Inorganic and. Analytical Chemistry
Hagenring 30, 38106 Braunschweig, Germany

DFG Cluster of Excellence livMats @FIT and Aber.uowgsUnversiat Freiburg

Abstract: A convergent route towards nitrogen-bridged BODIPY
oligomers

In [10]:
with open("raw_text.txt", "w") as f:
    f.write(text)

```{important}

It is crucial to review, at least partially, the quality and accuracy of the conversion afterward.
If the OCR tool cannot convert the relevant parts correctly, one should consider using a different method.
```
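
One way to review part of the output is to look at the recognition confidence that OCR tools typically report for each word. The sketch below assumes a nested dictionary of pages, blocks, lines, and words with `value` and `confidence` keys, which is our understanding of what `docTR` returns from `result.export()`; this layout should be verified against the installed version. The function name `low_confidence_words`, the threshold, and the example values are illustrative and not real measurements.

```python
def low_confidence_words(exported, threshold=0.5):
    """Return (page_index, word, confidence) for words below `threshold`."""
    flagged = []
    for page_idx, page in enumerate(exported.get("pages", [])):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    if word["confidence"] < threshold:
                        flagged.append((page_idx, word["value"], word["confidence"]))
    return flagged


# Hand-written stand-in for an exported OCR result (confidences are made up)
example = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Institute", "confidence": 0.98},
                                {"value": "chemeunltelbupde", "confidence": 0.31},
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

for page_idx, value, confidence in low_confidence_words(example):
    print(f"page {page_idx}: {value!r} (confidence {confidence:.2f})")
```

Words flagged this way are candidates for manual inspection; a high confidence does not guarantee that a word was recognized correctly.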

The obtained text contains some errors. The most obvious one is that the text still contains page numbers or other characters that are not relevant to the main text.
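
As a small illustration of how such residue could be filtered, the sketch below drops lines that consist only of a page number. The pattern and the function name `drop_page_number_lines` are assumptions made for this example, and the sample string is constructed by hand rather than taken from the OCR output.

```python
import re

# Matches lines such as "3" or "Page 12" (up to four digits)
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def drop_page_number_lines(text):
    """Remove lines that contain nothing but a page number."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


sample = "Abstract: A convergent route\n1\nHagenring 30, 38106 Braunschweig\nPage 2\n"
print(drop_page_number_lines(sample))
```

A heuristic like this can also remove legitimate lines that contain only a number, for example isolated table cells, so the result should be checked.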

More advanced approaches such as [`nougat`](https://facebookresearch.github.io/nougat/), [`llamaparser`](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/python) or [`marker`](https://github.com/VikParuchuri/marker) minimize those errors. In addition, [Deep Search by IBM](https://ds4sd.github.io/) allows converting from PDF into JSON files.

```{admonition} Conversion of PDF to Markdown using Nougat
:class: dropdown note

Converting PDFs to Markdown using `nougat` is not much more difficult. First, install `nougat` with `pip install nougat-ocr`.

After that, a conversion can be performed directly on the command line using 

    nougat path/to/file.pdf -o output_directory

In the output directory, the converted files will be stored as Markdown files with the extension `.mmd`.

```
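
To continue working with such output in Python, the converted files can be read back in. The following is a minimal sketch; the helper name `load_mmd_files` is hypothetical, and it assumes that the `.mmd` files sit directly in the output directory and are UTF-8 encoded.

```python
from pathlib import Path


def load_mmd_files(output_dir):
    """Map each `.mmd` file name (without extension) to its text content."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(output_dir).glob("*.mmd"))
    }
```

For example, `documents = load_mmd_files("output_directory")` would return a dictionary with one entry per converted file.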

However, even those more advanced techniques will still make mistakes and will struggle to handle very old tables.
To deal with those cases, one could use a [vision model](../beyond_text/beyond_images.ipynb) or an [agentic approach](../agents/agent.ipynb) to minimize those errors.

Afterward, the resulting files should be cleaned, as shown in the [document cleaning notebook](./cleaning.ipynb).