5. Beyond text
Text-only LLMs tend to have problems analyzing and understanding complex structures such as tables, plots, and images included in scientific articles. Especially in chemistry and materials science, much of the information about chemical compounds is reported in exactly these structures, so one should consider different approaches for them. One option is to use vision language models (VLMs), since they can analyze images alongside text. Several open- and closed-source VLMs are available, e.g., the vision-capable models from OpenAI, the Claude models, and DeepSeek-VL. As an example, we show the extraction of data from images with GPT-4o.
First, one has to convert the PDF file into images.
Note
The PDF file used here was obtained in the data mining notebook.
import matextract # noqa: F401
from pdf2image import convert_from_path
file_path = "../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf"
# convert the PDF file to a list of images, one per page
pdf_images = convert_from_path(file_path)
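By default, pdf2image renders each page at 200 dpi. If the text in the article is very small, rendering at a higher resolution can make it easier for the model to read; the value below is only an example, and larger images also mean more tokens per page.

# render the pages at a higher resolution (the dpi value is just an example)
pdf_images_hires = convert_from_path(file_path, dpi=300)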
After that, one should preprocess the obtained images, for instance, by rotating pages with vertical text, since many models have problems with such input.
Correcting text orientation
The algorithm we use here applies the following steps:
Convert the image to grayscale.
Detect edges using the Canny edge detection algorithm.
Use Hough Line Transform to detect lines in the image.
Calculate the angles of these lines.
Find the dominant angle by taking the median of all angles.
Based on the dominant angle, determine if the image needs to be rotated 90, 180, or 270 degrees.
Rotate the image accordingly.
Alternative implementation
You could also implement the text-orientation preprocessing with the popular tesseract package. tesseract's image_to_osd (Orientation and Script Detection) function is specifically designed to detect text orientation, including cases where the text is rotated by 90, 180, or 270 degrees. It can also identify the script (e.g., Latin, Cyrillic, Arabic) used in the document, which is useful for multi-language documents, and it can often handle documents with mixed orientations or complex layouts better than simpler edge-detection methods.
import pytesseract
from pytesseract import Output


# this variant relies on the imports and the pil_to_cv2 helper from the main example below
def correct_text_orientation(image, save_directory, file_path, i):
    if isinstance(image, Image.Image):
        image = pil_to_cv2(image)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # detect the page orientation (and script) with tesseract's OSD mode
    results = pytesseract.image_to_osd(rgb, output_type=Output.DICT)
    # rotate the page by the detected angle
    rotated = imutils.rotate_bound(image, angle=results["rotate"])
    base_filename = os.path.basename(file_path)
    name_without_ext, _ = os.path.splitext(base_filename)
    new_filename = os.path.join(
        save_directory, f"corrected_{name_without_ext}_page{i+1}.png"
    )
    cv2.imwrite(new_filename, rotated)
    print(f"[INFO] {file_path} - corrected image saved as {new_filename}")
    return rotated
import imutils
import cv2
import os
import base64
from PIL import Image
import numpy as np
# most VLMs struggle with rotated text, therefore rotated text gets detected and the pages get rotated accordingly
def correct_text_orientation(image, save_directory, file_path, i):
    if isinstance(image, Image.Image):
        image = pil_to_cv2(image)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # apply Canny edge detection to the grayscale image, with lower threshold 50, upper threshold 150, and an aperture size of 3
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    # find lines in the edge-detected image, which often correspond to text lines or document edges
    lines = cv2.HoughLinesP(
        edges, 1, np.pi / 180, 100, minLineLength=100, maxLineGap=10
    )
    angles = []
    # guard against pages where no lines are detected
    if lines is not None:
        for line in lines:
            x1, y1, x2, y2 = line[0]
            # the slope of a line is rise over run, (y2 - y1) / (x2 - x1); arctan2 converts this ratio into an angle
            angle = np.arctan2(y2 - y1, x2 - x1) * 180.0 / np.pi
            angles.append(angle)
    # find the dominant angle (fall back to 0 if no lines were found)
    dominant_angle = np.median(angles) if angles else 0.0
    # determine if the image needs to be rotated 90, 180, or 270 degrees
    if abs(dominant_angle) < 45:
        rotation_angle = 0
    elif 45 <= dominant_angle < 135:
        rotation_angle = 90
    elif -135 <= dominant_angle < -45:
        rotation_angle = -90
    else:
        rotation_angle = 180
    rotated = imutils.rotate_bound(image, angle=rotation_angle)
    base_filename = os.path.basename(file_path)
    name_without_ext, _ = os.path.splitext(base_filename)
    new_filename = os.path.join(
        save_directory, f"corrected_{name_without_ext}_page{i+1}.png"
    )
    cv2.imwrite(new_filename, rotated)
    print(f"[INFO] {file_path} - corrected image saved as {new_filename}")
    return rotated
# the images get converted into JPEG format
def convert_to_jpeg(cv2_image):
    retval, buffer = cv2.imencode(".jpg", cv2_image)
    if retval:
        return buffer
    raise ValueError("Could not encode the image as JPEG")


# conversion of the images from a PIL (Python Imaging Library) object to an OpenCV object
def pil_to_cv2(image):
    np_image = np.array(image)
    # grayscale PIL images have no channel axis, so they need a different conversion
    if np_image.ndim == 2:
        cv2_image = cv2.cvtColor(np_image, cv2.COLOR_GRAY2BGR)
    else:
        cv2_image = cv2.cvtColor(np_image, cv2.COLOR_RGB2BGR)
    return cv2_image
# the images get resized to a unified size with a maximum dimension
def resize_image(image, max_dimension):
    width, height = image.size
    # check if the image has a palette and convert it to true color mode
    if image.mode == "P":
        if "transparency" in image.info:
            image = image.convert("RGBA")
        else:
            image = image.convert("RGB")
    # convert to grayscale
    image = image.convert("L")
    if width > max_dimension or height > max_dimension:
        if width > height:
            new_width = max_dimension
            new_height = int(height * (max_dimension / width))
        else:
            new_height = max_dimension
            new_width = int(width * (max_dimension / height))
        image = image.resize((new_width, new_height), Image.LANCZOS)
    return image
Next, one has to convert the pictures into the machine-readable Base64 format.
# process the images into a unified format that is better suited for a VLM
def process_image(image, max_size, output_folder, file_path, i):
    width, height = image.size
    resized_image = resize_image(image, max_size)
    rotated_image = correct_text_orientation(resized_image, output_folder, file_path, i)
    jpeg_image = convert_to_jpeg(rotated_image)
    base64_encoded_image = base64.b64encode(jpeg_image).decode("utf-8")
    return (
        base64_encoded_image,
        max(width, height),
    )
output_folder_images = "./images"
# make sure the output directory for the corrected pages exists
os.makedirs(output_folder_images, exist_ok=True)

# all images get preprocessed
images_base64 = [
    process_image(image, 2048, output_folder_images, file_path, j)[0]
    for j, image in enumerate(pdf_images)
]
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page1.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page2.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page3.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page4.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page5.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page6.png
As a next step, one can call the OpenAI API. For this, one needs an API key, since the calls have to be paid for. Moreover, one needs to create the prompt, which combines the images with the text prompt.
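Both the OpenAI client and LiteLLM (used below) read the key from the OPENAI_API_KEY environment variable. A minimal sketch of setting it for the current session; the placeholder value is an assumption, and exporting the key in your shell is the cleaner option:

import os

# set the OpenAI key for this session; prefer exporting it in your shell instead of hard-coding it
os.environ["OPENAI_API_KEY"] = "<your-api-key>"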
Prompt
This is a very simple example prompt. One should optimize and engineer the prompt before using it. For that, one could use a tool like DSPy.
# the text prompt for the model call gets defined
prompt_text = "Extract all the relevant information about Buchwald-Hartwig reactions included in these images."
# the composite prompt is put together
def get_prompt_vision_model(images_base64, prompt_text):
    content = []
    # the images get added in Base64 format; the text prompt is appended at the end
    for data in images_base64:
        content.append(create_image_content(data))
    content.append({"type": "text", "text": prompt_text})
    return content


# a Base64-encoded image gets wrapped into the image-content block expected by the API
def create_image_content(image, detail="high"):
    return {
        "type": "image_url",
        # the level of detail is set to 'high' since the text on the images is mostly small
        "image_url": {"url": f"data:image/jpeg;base64,{image}", "detail": detail},
    }
# the composite prompt for the model call gets defined
prompt = get_prompt_vision_model(images_base64, prompt_text)
To call the actual model, one could use LiteLLM instead of directly using a provider API such as the OpenAI API. This way, one can easily switch between models from different providers.
from litellm import completion


# Define the function to call a model via LiteLLM
def call_litellm(prompt, model="gpt-4o", temperature: float = 0.0, **kwargs):
    """Call a model through LiteLLM.

    Args:
        prompt (str): Prompt to send to the model.
        model (str, optional): Name of the model. Defaults to "gpt-4o".
        temperature (float, optional): Inference temperature. Defaults to 0.

    Returns:
        tuple: The message content, the number of input tokens, and the number of output tokens.
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a scientific assistant, extracting important information about reaction conditions "
                "out of PDFs in valid JSON format. Extract just data which you are 100% confident about the "
                "accuracy. Keep the entries short without details. Be careful with numbers."
            ),
        },
        {"role": "user", "content": prompt},
    ]
    response = completion(
        model=model,
        messages=messages,
        temperature=temperature,
        **kwargs,
    )
    # Extract and return the message content and token usage
    message_content = response["choices"][0]["message"]["content"]
    input_tokens = response["usage"]["prompt_tokens"]
    output_tokens = response["usage"]["completion_tokens"]
    return message_content, input_tokens, output_tokens
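Since LiteLLM dispatches on the model string, the same helper can target a different provider simply by passing another model name. A minimal sketch, assuming the corresponding key (here ANTHROPIC_API_KEY) is set and that the model identifier below is available through LiteLLM:

# route the identical prompt to an Anthropic model instead of OpenAI (the model name is an assumption)
claude_output, claude_input_tokens, claude_output_tokens = call_litellm(
    prompt=prompt, model="claude-3-opus-20240229"
)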
# Call the LiteLLM API and print the output and token usage
output, input_tokens, output_tokens = call_litellm(prompt=prompt)
print("Output: ", output)
print("Input tokens used:", input_tokens, "Output tokens used:", output_tokens)
Output: Here is the extracted information about Buchwald-Hartwig reactions from the provided images:
```json
{
"Buchwald-Hartwig Reactions": {
"Key Step": "Cross-coupling reaction of an α-amino-BODIPY and the respective halide.",
"Conditions": [
{
"Reagents": [
"Pd(OAc)2",
"(±)-BINAP",
"Cs2CO3",
"PhMe"
],
"Temperature": "80 °C",
"Time": "1.5 h"
},
{
"Reagents": [
"Pd(OAc)2",
"(±)-BINAP",
"Cs2CO3",
"PhMe"
],
"Temperature": "80 °C",
"Time": "6-322 h"
}
],
"Yields": [
{
"Compound": "EDM-Ar-mono-NH2",
"Yield": "47%"
},
{
"Compound": "DM-Ar-mono-NH2",
"Yield": "58%"
},
{
"Compound": "Br-Ar-mono-NH2",
"Yield": "56%"
},
{
"Compound": "EDM-Ar-di",
"Yield": "20%"
},
{
"Compound": "DM-Ar-di",
"Yield": "30%"
},
{
"Compound": "Br-Ar-di",
"Yield": "44%"
},
{
"Compound": "EDM-Ar-di",
"Yield": "66%"
},
{
"Compound": "DM-Ar-di",
"Yield": "62%"
},
{
"Compound": "Br-Ar-di",
"Yield": "45%"
}
]
}
}
```
Input tokens used: 6704 Output tokens used: 389
Tip
To get only the JSON part of the output, one could use a regular expression to extract this content.
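A minimal sketch of such an extraction, assuming the answer wraps the JSON in a fenced ```json block as in the output above:

import json
import re

# grab the content of the fenced JSON block from the model answer and parse it
match = re.search(r"```json\s*(\{.*\})\s*```", output, re.DOTALL)
extracted_data = json.loads(match.group(1)) if match else {}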
Since the article does not provide an experimental section, the model only extracted general information about the reactions. It failed to extract the data provided in the reaction schemes. To extract this information, one should use the tools presented in the agentic section.
Now one could use this structured output to build up a database of Buchwald-Hartwig coupling reactions.
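A small sketch of what this could look like, using pandas and the extracted_data dictionary parsed in the tip above; the keys depend on the particular model output, and the file name is arbitrary:

import pandas as pd

reactions = extracted_data["Buchwald-Hartwig Reactions"]
# one row per reported compound/yield pair
yields_df = pd.DataFrame(reactions["Yields"])
yields_df.to_csv("buchwald_hartwig_yields.csv", index=False)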