# OCR

**Repository Path**: www-manian-com/ocr

## Basic Information

- **Project Name**: OCR
- **Description**: OCR
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2025-06-23
- **Last Updated**: 2025-06-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# OCR

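## 1. Technology
This project uses tess4j with the Chinese and English language models, plus the orientation model for text in arbitrary orientations.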

## 2. Dependency
```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.8.0</version>
</dependency>
```

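## 3. Language models
Tesseract can recognize more than one hundred languages.
From the [Traineddata files download page](https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files.md),
pick the `.traineddata` language model files for the languages you want to recognize and download them.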


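(1) Special models

To handle images that contain text in arbitrary orientations, download the [osd.traineddata](https://gitcode.com/mirrors/tesseract-ocr/tessdata/tree/main) model (orientation and script detection).

To handle images that contain mathematical formulas and equations, download the [equ.traineddata](https://gitcode.com/mirrors/tesseract-ocr/tessdata/tree/main) model.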

(2) Language models

Tesseract maintains three separate language model repositories on GitHub:
[tessdata](https://gitcode.com/tesseract-ocr/tessdata/tree/main),
[tessdata-best](https://gitcode.com/mirrors/tesseract-ocr/tessdata_best/overview?utm_source=csdn_github_accelerator&isLogin=1), and
[tessdata-fast](https://github.com/tesseract-ocr/tessdata_fast).
All three contain language models. They differ as follows:


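| Repository | How it was trained | Speed | Accuracy | Supports legacy engine | Supports retraining |
|--|--|--|--|--|--|
| tessdata | Legacy + LSTM (integerized version of tessdata-best) | Faster than tessdata-best | Slightly less accurate than tessdata-best | Yes | No |
| tessdata-best | LSTM only (based on langdata) | Slowest | Most accurate | No | Yes |
| tessdata-fast | Integerized LSTM of a smaller network than tessdata-best | Fastest | Least accurate | No | No |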
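
The trade-offs in the table can be encoded as a small lookup, which may help when a setup step has to decide which repository to fetch a model from. The following is only a sketch: the `TessdataRepo` enum and its methods are hypothetical names, the speed ranks simply restate the table, and the download URL assumes the `raw/main/<lang>.traineddata` layout of the official GitHub repositories.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

/** Hypothetical helper that mirrors the comparison table above. */
enum TessdataRepo {
    // repository name, speed rank (1 = fastest), legacy engine support, retraining support
    TESSDATA("tessdata", 2, true, false),
    TESSDATA_BEST("tessdata_best", 3, false, true),
    TESSDATA_FAST("tessdata_fast", 1, false, false);

    private final String repoName;
    private final int speedRank;
    private final boolean supportsLegacy;
    private final boolean supportsRetraining;

    TessdataRepo(String repoName, int speedRank, boolean supportsLegacy, boolean supportsRetraining) {
        this.repoName = repoName;
        this.speedRank = speedRank;
        this.supportsLegacy = supportsLegacy;
        this.supportsRetraining = supportsRetraining;
    }

    /** Builds a download URL for one model; the URL layout is an assumption. */
    String downloadUrl(String lang) {
        return "https://github.com/tesseract-ocr/" + repoName + "/raw/main/" + lang + ".traineddata";
    }

    /** Returns the fastest repository that meets both requirements, or empty if none does. */
    static Optional<TessdataRepo> pick(boolean needLegacy, boolean needRetraining) {
        return Arrays.stream(values())
                .filter(r -> !needLegacy || r.supportsLegacy)
                .filter(r -> !needRetraining || r.supportsRetraining)
                .min(Comparator.comparingInt((TessdataRepo r) -> r.speedRank));
    }
}
```

For example, `TessdataRepo.pick(true, false)` selects `TESSDATA`, the only repository in the table that supports the legacy engine, while asking for both legacy support and retraining yields an empty result because no single repository offers both.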

## 4. Usage
Create a `tessdata` folder under the project's `resources` folder, then copy the `.traineddata` language model files downloaded above into `tessdata`.
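
Before handing the folder to tess4j, it can be useful to confirm that every required model file is actually there. The sketch below uses only the Java standard library. The `TessdataCheck` class and its methods are hypothetical helpers, the path `src/main/resources/tessdata` assumes a standard Maven layout, and the model names `chi_sim`, `eng`, and `osd` are an assumption based on the Chinese, English, and orientation models mentioned in section 1.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: checks a tessdata folder before OCR is attempted. */
final class TessdataCheck {
    private TessdataCheck() {}

    /** Joins model names with "+", the form Tesseract uses for multiple languages. */
    static String languageString(List<String> models) {
        return String.join("+", models);
    }

    /** Returns the models whose .traineddata file is missing from the folder. */
    static List<String> missingModels(Path tessdataDir, List<String> models) {
        List<String> missing = new ArrayList<>();
        for (String model : models) {
            if (!Files.isRegularFile(tessdataDir.resolve(model + ".traineddata"))) {
                missing.add(model);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default location; pass another folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "src/main/resources/tessdata");
        List<String> missing = missingModels(dir, Arrays.asList("chi_sim", "eng", "osd"));
        if (missing.isEmpty()) {
            System.out.println("Language string: " + languageString(Arrays.asList("chi_sim", "eng")));
        } else {
            System.out.println("Missing models in " + dir + ": " + missing);
        }
    }
}
```

If nothing is missing, the folder path can be passed to tess4j's `Tesseract.setDatapath(...)` and the joined string to `setLanguage(...)` before calling `doOCR(...)`.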