Text summarization via global structure awareness
Abstract
With the explosive growth of information, the volume of long documents has surged and the cost of processing them continues to rise, making text summarization increasingly important. Existing studies focus primarily on model enhancements and on sentence-level pruning based on contextual dependencies and semantic patterns. Although some approaches leverage large language models (LLMs) for summarization and achieve higher accuracy, they incur substantial computational costs and often overlook global structural modeling. As a result, summaries may lose critical logical chains, disrupting coherence and weakening downstream task performance. To address these issues, we propose GloSA-Sum, a novel text summarization framework that performs global structural analysis via topological data analysis (TDA), enabling efficient summarization while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, on which persistent homology identifies core semantics and logical structures; these are preserved in a "protection pool" that serves as the backbone of the summary. We further design a topology-guided iterative strategy in which lightweight proxy metrics approximate sentence importance, avoiding repeated high-cost computations and thus preserving structural integrity while improving efficiency. To better handle long texts, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-Sum reduces redundancy while preserving semantic and logical integrity, strikes a balance between accuracy and efficiency, and further benefits downstream LLM tasks by shortening contexts while retaining essential reasoning chains.
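The persistent-homology step sketched in the abstract can be illustrated with 0-dimensional persistence (connected components under a distance filtration, computed with union-find). The sketch below is illustrative only: the function names, the bag-of-words stand-in for real sentence embeddings, and the cluster-cutting heuristic for selecting "protection pool" representatives are assumptions for demonstration, not the paper's actual implementation.

```python
import math
from itertools import combinations

def bag_of_words(sentences):
    """Toy stand-in for real sentence embeddings: word-count vectors."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return [[s.lower().split().count(w) for w in vocab] for s in sentences]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def h0_persistence(dist, n):
    """0-dimensional persistent homology of the distance filtration.

    Every vertex is born at 0; an edge that merges two components kills
    the younger one, so its weight is the bar's death time. Returns the
    n - 1 finite death times (the MST edge weights)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    bars = []
    for d, i, j in sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            bars.append(d)  # death time of the merged component
    return bars

def protection_pool(sentences, k=2):
    """Illustrative heuristic: cut the k-1 longest-lived bars to obtain k
    semantic clusters, then keep one representative sentence per cluster."""
    emb = bag_of_words(sentences)
    n = len(sentences)
    dist = [[cosine_distance(emb[i], emb[j]) for j in range(n)] for i in range(n)]
    bars = h0_persistence(dist, n)
    threshold = sorted(bars)[-(k - 1)] if k > 1 else float("inf")
    # Re-run union-find, merging only edges that die before the cut.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for d, i, j in sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2)):
        if d < threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri
    reps = {}
    for i in range(n):
        reps.setdefault(find(i), i)  # first sentence of each surviving component
    return [sentences[i] for i in sorted(reps.values())]
```

Long-lived H0 bars correspond to well-separated semantic clusters, so the surviving representatives approximate the semantic cores the framework protects; the full method additionally weights edges semantically and tracks logical structure, which this toy sketch omits.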