Comparing OCR and VLM techniques in processing tabular data
Verbovskiy, Andrey (2025-06-09)
© 2025 Andrey Verbovskiy. Unless otherwise stated, reuse is permitted under the Creative Commons Attribution 4.0 International (CC-BY 4.0) licence (https://creativecommons.org/licenses/by/4.0/). Reuse is allowed provided that the source is cited appropriately and any changes are indicated. The use or reproduction of parts that are not the property of the author(s) may require permission directly from the relevant rights holders.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:oulu-202506094256
Abstract
The primary goal of this thesis is to identify the most effective methods—both in terms of accuracy and processing speed—for extracting content from various types of tabular data. The focus is on comparing Visual Language Models (VLMs) and Optical Character Recognition (OCR) techniques when applied to scanned image-based tables and handwritten tables. To this end, a dedicated dataset containing a mix of both table types is used to evaluate the strengths and weaknesses of each approach.
The experimental setup is built within the chunker component of a Retrieval-Augmented Generation (RAG) pipeline. The pipeline is composed of several Python-based Django microservices running on a Linux Ubuntu 24.04.2 LTS virtual machine. The two main microservices used in this study are the chunker, which preprocesses documents by splitting them into smaller, manageable chunks, and the multimodal microservice, which grants access to the VLM through vLLM, in this case the GPT-4o-mini model. Tesseract OCR is integrated into the chunker via the unstructured Python library for traditional text extraction.
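As a rough illustration of the two extraction paths described above (not the thesis's actual implementation), the sketch below shows how a table could be extracted with Tesseract-backed OCR through the unstructured library and, alternatively, by sending a table image to the VLM through an OpenAI-compatible endpoint such as the one exposed by the multimodal microservice. Function names, the prompt, and the endpoint parameters are illustrative assumptions.

```python
import base64
from unstructured.partition.pdf import partition_pdf  # Tesseract-backed OCR path
from openai import OpenAI                              # OpenAI-compatible client for the VLM path


def extract_tables_ocr(pdf_path: str) -> list[str]:
    """OCR path: partition the document and keep only the table elements."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",            # high-resolution strategy uses Tesseract OCR on image pages
        infer_table_structure=True,   # ask unstructured to reconstruct the table layout
    )
    return [el.metadata.text_as_html or el.text
            for el in elements if el.category == "Table"]


def extract_table_vlm(image_path: str, base_url: str, api_key: str) -> str:
    """VLM path: send a table image to the model and ask for its content."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI(base_url=base_url, api_key=api_key)  # base_url/api_key are assumed deployment details
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract this table as Markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```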
The experiments aim to evaluate both latency and content extraction quality. While the VLM is expected to outperform OCR on complex or handwritten tables, particularly those with less rigid formatting, the OCR approach is anticipated to demonstrate a clear advantage in speed. This may be especially evident when processing smaller or in-text tables, where the overhead of using a large language model may not be justified.
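A latency comparison of the kind described here could be set up as in the following minimal sketch, which simply times each extraction path with the helper functions assumed above; it is an illustration of the measurement idea, not the thesis's evaluation code.

```python
import time


def timed(fn, *args, **kwargs):
    """Run one extraction path and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


# Hypothetical comparison on a single table page/image:
# ocr_seconds, ocr_output = timed(extract_tables_ocr, "table_page.pdf")
# vlm_seconds, vlm_output = timed(extract_table_vlm, "table_page.png", BASE_URL, API_KEY)
```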
Ultimately, the findings of this thesis will inform decisions on when to employ advanced multimodal models versus traditional OCR in practical document processing pipelines, based on the trade-off between speed and accuracy.
Collections
- Open access [38618]