# AncientDoc
**Repository Path**: ByteDance/AncientDoc
## Basic Information
- **Project Name**: AncientDoc
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: CC0-1.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-09
- **Last Updated**: 2025-09-15
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# AncientDoc: Benchmarking Vision-Language Models on Chinese Ancient Documents
[Paper](link_to_paper) | [Dataset](https://huggingface.co/datasets/yuchuan123/AncientDoc) | [Models](link_to_models)
## Introduction
Chinese ancient documents are invaluable carriers of history and culture, but their **visual complexity**, **linguistic variety**, and **lack of benchmarks** make them challenging for modern Vision-Language Models (VLMs).
We introduce **AncientDoc**, the **first benchmark** designed for evaluating VLMs on **Chinese ancient documents**, covering the full pipeline from **OCR** to **knowledge reasoning**.
- **5 Tasks:** Page-level OCR, Vernacular Translation, Reasoning-based QA, Knowledge-based QA, Linguistic Variant QA
- **14 Categories:** 100+ books, ~3,000 pages across dynasties from Warring States to Qing
- **Rich Annotations:** OCR + semantic translation + multi-level QA pairs
- **Comprehensive Evaluation:** CER, Precision/Recall/F1, CHRF++, BERTScore, and human-aligned GPT-4o scoring
---
## Dataset Overview
- **Source:** Digitized ancient documents from Harvard Library and others
- **Dynasty Coverage:** From Warring States, Han, Tang, Song, Ming to Qing
- **Category Coverage:** 14 semantic categories (e.g., collected works, Chuci-style poetry, medicine, astronomy, literary criticism, art)
- **Total Size:** ~3,000 page images, with annotations across five tasks
---
## Task Definition
1. **Page-level OCR**: extract the complete text in correct reading order (vertical, right-to-left, with annotations).
2. **Vernacular Translation**: translate classical Chinese into modern vernacular Chinese.
3. **Reasoning-based QA**: infer implicit meanings, causality, and ideology.
4. **Knowledge-based QA**: answer factual and cultural questions grounded in the texts.
5. **Linguistic Variant QA**: recognize rhetorical devices, stylistic features, and literary styles.
---
## Evaluation Metrics
- **OCR Task:** CER, character-level Precision/Recall/F1 (see the sketch below)
- **Translation & QA Tasks:** CHRF++, BERTScore (BS-F1)
- **LLM-as-a-Judge:** GPT-4o scoring aligned with human ratings
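The OCR metrics can be checked with plain edit-distance counting. The snippet below is a minimal sketch, not the official evaluation script: CER is the character-level Levenshtein distance normalized by reference length, and the precision/recall/F1 shown here use a simple bag-of-characters overlap, which may differ in detail from the paper's exact definition.

```python
from collections import Counter

def cer(pred: str, ref: str) -> float:
    """Character error rate: Levenshtein distance between prediction and
    reference, normalized by the reference length."""
    m, n = len(ref), len(pred)
    dp = list(range(n + 1))  # distances against the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # drop a reference char
                        dp[j - 1] + 1,                          # insert a predicted char
                        prev + (ref[i - 1] != pred[j - 1]))     # substitution (or match)
            prev = cur
    return dp[n] / max(m, 1)

def char_prf(pred: str, ref: str) -> tuple[float, float, float]:
    """Bag-of-characters precision / recall / F1 (order-insensitive sketch)."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    p = overlap / max(len(pred), 1)
    r = overlap / max(len(ref), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(cer("床前明月光", "牀前明月光"))        # 0.2 (one substitution over five chars)
print(char_prf("床前明月光", "牀前明月光"))   # (0.8, 0.8, 0.8)
```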
---
## Baseline Results
We evaluate **open-source** (Qwen2.5-VL, InternVL, LLaVA, etc.) and **closed-source** (GPT-4o, Gemini2.5-Pro, Doubao-V2, etc.) VLMs; a minimal query sketch follows the results below.
- **OCR:** Gemini2.5-Pro achieves lowest CER (32.03)
- **Translation:** Gemini2.5-Pro leads with BS-F1 72.5
- **Reasoning QA:** Qwen2.5-VL-72B shows strongest implicit reasoning
- **Knowledge QA:** GPT-4o achieves best factual QA performance
- **Variant QA:** GPT-4o & Gemini2.5-Pro excel in stylistic recognition
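This repository does not describe the evaluation harness itself; purely as an illustration, one way to send a benchmark record to a chat-style VLM such as GPT-4o is through the OpenAI Python SDK, sketched below. The `ask_vlm` helper and the base64 data-URL encoding of the page image are assumptions here, not the paper's official pipeline.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_vlm(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one page image plus its task prompt to a chat VLM and return the answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Open-source models such as Qwen2.5-VL or InternVL would instead be queried through their own inference stacks; only the (image, question) → answer interface matters for scoring.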
---
## Data Format
Each line of the JSONL annotation files is a record of the form:
```json
{
  "image": "class/book/page_001.png",
  "task": "OCR",
  "question": "Please extract the text...",
  "answer": "..."
}
```
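A minimal loading sketch in Python, assuming the annotations ship as JSONL files (the file name `ancientdoc_ocr.jsonl` below is a placeholder) and that `image` paths are relative to the dataset root:

```python
import json
from pathlib import Path

# Hypothetical paths; adjust to wherever the dataset is extracted.
DATA_ROOT = Path("AncientDoc")
ANNOTATION_FILE = DATA_ROOT / "ancientdoc_ocr.jsonl"

def load_records(jsonl_path: Path):
    """Yield one annotation dict per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for record in load_records(ANNOTATION_FILE):
    image_path = DATA_ROOT / record["image"]   # e.g. class/book/page_001.png
    prompt = record["question"]                # task instruction for the VLM
    reference = record["answer"]               # gold answer used for scoring
    # ... pass (image_path, prompt) to a VLM and compare its output to `reference`
```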
---
## Citation
If you use AncientDoc in your research, please cite:
```bibtex
@article{ancientdoc2025,
  title={Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning},
  author={Haiyang Yu and Yuchuan Wu and Fan Shi and Lei Liao and Jinghui Lu and Xiaodong Ge and Han Wang and Minghan Zhuo and Xuecheng Wu and Xiang Fei and Hao Feng and Guozhi Tang and An-Lan Wang and Hanshen Zhu and Yangfan He and Quanhuan Liang and Liyuan Meng and Chao Feng and Can Huang and Jingqun Tang and Bin Li},
  journal={Under Review},
  year={2025}
}
```
---
## Resources
- Dataset: [HuggingFace Link](https://huggingface.co/datasets/yuchuan123/AncientDoc)
- Paper: [arXiv Link](link_to_paper)
- Baseline Models: [Weights & Logs](link_to_models)
## Data License
The AncientDoc dataset is released under the CC0 1.0 license.