
【速搜 Q&A】What Is Tesseract?

Q&A · admin · 2020-08-12


Tesseract is an optical character recognition (OCR) engine that runs on multiple operating systems. It is free software released under the Apache license, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines.

Tesseract was originally developed between 1985 and 1994 at HP Labs in Bristol and at Hewlett-Packard in Greeley, Colorado. It was modified in 1996 to port it to Windows, and parts of it were migrated to C++ in 1998. HP released Tesseract as open source in 2005, and Google has developed it since 2006.

The Tesseract OCR package contains an OCR engine, libtesseract, and a command-line program, tesseract. Tesseract 4 adds a new OCR engine based on a neural network (LSTM) that focuses on line recognition, while still supporting the legacy Tesseract 3 engine, which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by selecting the Legacy OCR Engine mode (--oem 0). The legacy engine also requires traineddata files that support it, such as those from the tessdata repository.
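As a minimal sketch of how the engine mode might be selected from a script, the Python helper below only assembles a tesseract command line; it does not run anything. The helper name, the file names, and the default language are assumptions for illustration. The --oem, -l, and --tessdata-dir options are the ones the tesseract command-line program accepts.

```python
# OCR engine modes accepted by the tesseract CLI's --oem option.
OEM_LEGACY = 0       # legacy engine only (Tesseract 3 compatible)
OEM_LSTM = 1         # neural-network (LSTM) engine only
OEM_LEGACY_LSTM = 2  # legacy and LSTM engines combined
OEM_DEFAULT = 3      # whatever the installed traineddata supports


def build_ocr_command(image_path, output_base, lang="eng",
                      oem=OEM_LSTM, tessdata_dir=None):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # The legacy engine needs traineddata files that include legacy
        # models, for example a local checkout of the tessdata repository.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Example: legacy mode with an assumed local tessdata directory.
print(build_ocr_command("page.png", "page_out",
                        oem=OEM_LEGACY, tessdata_dir="./tessdata"))
```

The resulting list can be passed to subprocess.run on a machine where tesseract is installed. Keeping command construction separate from execution makes the choice of engine mode easy to inspect and test.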

Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages “out of the box”.

Tesseract supports several output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV. The main branch also has experimental support for ALTO (XML) output.
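The output formats listed above are requested on the command line by appending config names after the output base. The sketch below builds such a command and lists the files it is expected to produce. The helper name and file names are hypothetical, and the mapping from config name to file extension reflects my understanding of the CLI rather than anything stated in this article.

```python
# Config name passed to the tesseract CLI -> extension of the file it writes.
# Assumption: "alto" writes an .xml file and needs a recent enough build.
OUTPUT_CONFIGS = {
    "txt": ".txt",
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "alto": ".xml",
}


def build_output_command(image_path, output_base, formats, text_only_pdf=False):
    """Return (command, expected_files) for the requested formats."""
    unknown = [f for f in formats if f not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError("unsupported output format(s): " + ", ".join(unknown))
    cmd = ["tesseract", image_path, output_base]
    if text_only_pdf:
        # textonly_pdf=1 asks the PDF renderer for the invisible text layer only.
        cmd += ["-c", "textonly_pdf=1"]
    cmd += list(formats)
    expected_files = [output_base + OUTPUT_CONFIGS[f] for f in formats]
    return cmd, expected_files


cmd, files = build_output_command("scan.png", "scan_out", ["txt", "hocr", "pdf"])
print(cmd)
print(files)
```

Requesting several formats in one run avoids repeating recognition for each output file.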

The Tesseract OCR engine first appeared in the 1980s. After many iterations it now includes a built-in deep learning model and has become a very robust OCR tool. Tesseract also pairs well with OpenCV's EAST text detector; interested readers can refer to the report by 机器之心.

Tesseract supports the Unicode (UTF-8) character set, recognizes more than 100 languages, and offers several output formats, such as plain text, PDF, and TSV. To get better OCR results, however, you still need to improve the quality of the images you give to Tesseract.

Note that Tesseract performs several image processing operations internally (using the Leptonica library) before it runs the actual OCR. This usually works well, but in some cases the result is not good enough, and accuracy drops significantly.
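When the internal processing falls short, one common remedy is to binarize the image yourself before handing it to Tesseract. The sketch below implements Otsu's thresholding method in plain Python so that it stays self-contained. The image representation (nested lists of 0-255 gray values) and the function names are assumptions for illustration; a real pipeline would more likely use an imaging library.

```python
def otsu_threshold(gray):
    """Return the threshold t maximizing between-class variance (Otsu).

    gray is a list of rows, each a list of integer gray values in 0..255.
    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_bright = total - w_dark
        if w_dark == 0:
            continue  # no dark pixels yet
        if w_bright == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        between_var = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray):
    """Map each pixel to 0 (dark) or 255 (bright) using Otsu's threshold."""
    t = otsu_threshold(gray)
    return [[0 if value <= t else 255 for value in row] for row in gray]


sample = [[10, 10, 200],
          [12, 200, 198]]
print(binarize(sample))
```

A global threshold like this works best on evenly lit scans; pages with shadows or gradients usually need an adaptive method instead.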

