ICLR Poster Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information

Poster

Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information

Yuke Zhu · Yue Zhang · Dongdong Liu · Chi Xie · Zihua Xiong · Bo Zheng · Sheng Guo

Hall 3 + Hall 2B #234

[ Abstract ]

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Recent advancements in document understanding have been dominated by leveraging large language models (LLMs) and multimodal large models. However, enabling LLMs to comprehend complex document layouts and structural information often necessitates intricate network modifications or costly pre-training, limiting their practical applicability. In this paper, we introduce Group Position Embedding (GPE), a novel and efficient technique to enhance the layout understanding capabilities of LLMs without architectural changes or additional pre-training. GPE achieves this by strategically grouping the attention heads and feeding each group with distinct positional embeddings, effectively encoding layout information relevant to document comprehension. This simple yet powerful method allows for effective integration of layout information within the existing LLM framework. We evaluate GPE against several competitive baselines across five mainstream document tasks. We also introduce a challenging benchmark called BLADE, specifically designed to assess layout comprehension. Extensive experiments on both established and BLADE benchmarks confirm the efficacy of GPE in significantly advancing the state-of-the-art in document understanding.

Live content is unavailable. Log in and register to view live content