Show HN: Parsr – A toolchain to transform documents in usable structured text

Category: IT Technology · Published: 6 years ago


Turn your documents into data!

Français | 中文

Parsr is a minimal-footprint document (image, PDF) cleaning, parsing, and extraction toolchain that generates readily available, organized, and usable data for data scientists and developers.

It provides users with a clean, structured, and label-enriched information set for ready-to-use applications such as data entry, document analysis automation, and archival, among many others.

Currently, Parsr can perform:

Document Hierarchy Regeneration - Words, Lines and Paragraphs
Headings Detection
Table Detection and Reconstruction
Lists Detection
Text Order Detection
Named Entity Recognition (Dates, Percentages, etc.)
Key-Value Pair Detection (for the extraction of specific form-based entries)
Page Number Detection
Header-Footer Detection
Link Detection
Whitespace Removal
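
To make the idea of entity detection concrete, here is a minimal sketch of spotting percentages and ISO-style dates in plain text. It is an illustration only: the patterns and the `find_entities` helper are hypothetical and are not taken from Parsr, whose own detection logic is not described here.

```python
import re

# Illustrative patterns only (assumptions, not Parsr's implementation).
PERCENT_RE = re.compile(r"\b\d+(?:\.\d+)?\s?%")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_entities(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in (("percentage", PERCENT_RE), ("date", ISO_DATE_RE)):
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sort by start position so results follow reading order.
    return [(label, value) for _, label, value in sorted(found)]
```

A real pipeline would need many more date formats and locale handling; the sketch only shows the shape of the output such a step could produce.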

Parsr takes as input an image (.JPG, .PNG, .TIFF, ...) or a PDF and generates the following output formats:

JSON
Markdown
Text
CSV (for tables), or pandas DataFrames (see here)
PDF
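
As a rough sketch of how a CSV table export might be consumed downstream, the snippet below reads one table into a list of dictionaries using only the Python standard library. The sample data is made up, the `table_rows` helper is hypothetical, and both the delimiter and the presence of a header row are assumptions to check against the files Parsr actually produces.

```python
import csv
import io


def table_rows(csv_text, delimiter=","):
    """Parse one exported table into a list of dicts keyed by the header row.

    Assumes the first row holds column names; adjust if your export differs.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample standing in for a table extracted from a document.
sample = "item,quantity\nwidget,3\ngadget,5\n"
rows = table_rows(sample)
print(rows[0]["item"], rows[1]["quantity"])
```

All values come back as strings, so numeric columns need an explicit conversion before any arithmetic.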


Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the Docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed on bare metal (without Docker containers); the procedure is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
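
Before sending documents, it can help to confirm that the container is actually accepting connections on the published port. The following is a minimal sketch using only the Python standard library; the `is_listening` helper is hypothetical, and it checks only that something accepts TCP connections on the port, not that the Parsr API itself is healthy.

```python
import socket


def is_listening(host="localhost", port=3001, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("API reachable:", is_listening())
```

The defaults mirror the `-p 3001:3001` mapping used above; pass a different port if you publish the container elsewhere.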

Consult the documentation for details on using the API.

To use the Jupyter Notebook and the Python interface to the Parsr API, see here.

To use the GUI tool (the API needs to already be running), issue:
```
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
```
Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Licenses of the third-party libraries that Parsr depends on:

QPDF : Apache http://qpdf.sourceforge.net
GraphicsMagick : MIT http://www.graphicsmagick.org/index.html
ImageMagick : Apache 2.0 https://imagemagick.org/script/license.php
Pdfminer.six : MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
PDF.js : Apache 2.0 https://github.com/mozilla/pdf.js
Tesseract : Apache 2.0 https://github.com/tesseract-ocr/tesseract
Camelot : MIT https://github.com/camelot-dev/camelot
MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Licensed under the Apache 2.0 license (see the LICENSE file).
