Turn your documents into data!
Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data for data scientists and developers.
It provides users with clean, structured and label-enriched information sets for ready-to-use applications, ranging from data entry and document analysis automation to archival, among many others.
Currently, Parsr can perform:
- Document Hierarchy Regeneration - Words, Lines and Paragraphs
- Headings Detection
- Table Detection and Reconstruction
- Lists Detection
- Text Order Detection
- Named Entity Recognition (Dates, Percentages, etc.)
- Key-Value Pair Detection (for the extraction of specific form-based entries)
- Page Number Detection
- Header-Footer Detection
- Link Detection
- Whitespace Removal
Parsr takes an image (.JPG, .PNG, .TIFF, ...) or a PDF as input and generates the following output formats:
- JSON
- Markdown
- Text
- CSV (for tables), or Pandas DataFrames (see here)
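As a rough illustration of consuming the CSV table output, the sketch below loads one exported table into a list of row dictionaries using only the Python standard library. The helper name `read_table`, the sample data, and the choice of delimiter are assumptions made for this example. It also assumes the first row of the table holds column headers, which may not be true for every document, so check an actual export before relying on it.

```python
import csv
import io
from typing import Dict, List


def read_table(csv_text: str, delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse CSV text into a list of dicts keyed by the first (header) row.

    Hypothetical helper: the delimiter and the presence of a header row
    are assumptions, not guarantees about Parsr's CSV export.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample standing in for a table exported as CSV.
sample = "item;amount\nA;10\nB;20\n"
rows = read_table(sample, delimiter=";")
for row in rows:
    print(row["item"], row["amount"])
```

All values come back as strings; converting numeric columns is left to the caller.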
Getting Started
Installation
-- The advanced installation guide is available here --
The quickest way to install and run the Parsr API is through the Docker image:
docker pull axarev/parsr
If you also wish to install the GUI for sending documents and visualising results:
docker pull axarev/parsr-ui-localhost
Note: Parsr can also be installed bare-metal (not via Docker containers); the procedure is documented in the installation guide.
Usage
-- The advanced usage guide is available here --
To run the API, issue:
docker run -p 3001:3001 axarev/parsr
which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
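Since the container may take a moment to start, a script that talks to the API might first wait until something accepts connections on the published port. The following is a minimal sketch of such a check; `wait_for_port` is a hypothetical helper written for this example, not part of Parsr. It only tests TCP reachability and makes no assumption about the API's endpoints.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False if no connection
    could be made before `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, calling `wait_for_port("localhost", 3001)` after the `docker run` command above would return `True` once the port is open. An open port does not by itself prove that the API is ready to process documents.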
- To use the Jupyter Notebook and the Python interface to the Parsr API, follow the guide here.
- To use the GUI tool (the API needs to already be running), issue:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
Then, access it through http://localhost:8080.
Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.
API-based usage and command-line usage are documented in the advanced usage guide.
Documentation
All documentation files can be found here.
Contribute
Please refer to the contribution guidelines.
Third Party Licenses
Licenses of the third-party libraries Parsr depends on:
- QPDF: Apache http://qpdf.sourceforge.net
- GraphicsMagick: MIT http://www.graphicsmagick.org/index.html
- ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
- Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
- PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
- Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
- Camelot: MIT https://github.com/camelot-dev/camelot
- MuPDF (optional dependency): AGPL https://mupdf.com/license.html
- Pandoc (optional dependency): GPL https://github.com/jgm/pandoc
License
Copyright 2019 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).