Authors: Mihir Chauhan, Abhishek Satbhai, Mohammad Abuzar Hashemi, Mir Basheer Ali, Bina Ramamurthy, Mingchen Gao, Siwei Lyu, Sargur Srihari
Abstract: Handwriting Verification is a critical in document forensics. Deep learning
based approaches often face skepticism from forensic document examiners due to
their lack of explainability and reliance on extensive training data and
handcrafted features. This paper explores using Vision Language Models (VLMs),
such as OpenAI’s GPT-4o and Google’s PaliGemma, to address these challenges. By
leveraging their Visual Question Answering capabilities and 0-shot
Chain-of-Thought (CoT) reasoning, our goal is to provide clear,
human-understandable explanations for model decisions. Our experiments on the
CEDAR handwriting dataset demonstrate that VLMs offer enhanced
interpretability, reduce the need for large training datasets, and adapt better
to diverse handwriting styles. However, results show that the CNN-based
ResNet-18 architecture outperforms the 0-shot CoT prompt engineering approach
with GPT-4o (Accuracy: 70%) and supervised fine-tuned PaliGemma (Accuracy:
71%), achieving an accuracy of 84% on the CEDAR AND dataset. These findings
highlight the potential of VLMs in generating human-interpretable decisions
while underscoring the need for further advancements to match the performance
of specialized deep learning models.
Source: http://arxiv.org/abs/2407.21788v1