Show HN: Parsr – A toolchain to transform documents into usable structured text

Category: IT Technology · Published: 4 years ago


Turn your documents into data!


Parsr is a minimal-footprint document (image, PDF) cleaning, parsing, and extraction toolchain that generates organized, ready-to-use data for data scientists and developers.

It provides users with clean, structured, label-enriched information for applications such as data entry, document analysis automation, and archival.

Currently, Parsr can perform:

  • Document Hierarchy Regeneration - Words, Lines and Paragraphs
  • Headings Detection
  • Table Detection and Reconstruction
  • Lists Detection
  • Text Order Detection
  • Named Entity Recognition (Dates, Percentages, etc)
  • Key-Value Pair Detection (for the extraction of specific form-based entries)
  • Page Number Detection
  • Header-Footer Detection
  • Link Detection
  • Whitespace Removal
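
To illustrate what a regenerated hierarchy of paragraphs, lines, and words makes possible, here is a minimal Python sketch that walks a nested document tree and rebuilds the text of each top-level element. The structure and the key names (`pages`, `elements`, `type`, `content`) are assumptions made for illustration; this README does not specify the real output schema, so check the documentation before relying on them.

```python
from typing import Any, Iterator

# Hypothetical, simplified document tree. The key names are assumptions
# chosen for this sketch, not a schema confirmed by the README.
SAMPLE_DOC: dict[str, Any] = {
    "pages": [
        {
            "elements": [
                {"type": "heading", "content": [
                    {"type": "line", "content": [
                        {"type": "word", "content": "Invoice"},
                    ]},
                ]},
                {"type": "paragraph", "content": [
                    {"type": "line", "content": [
                        {"type": "word", "content": "Total"},
                        {"type": "word", "content": "due"},
                    ]},
                ]},
            ]
        }
    ]
}


def iter_words(element: dict[str, Any]) -> Iterator[str]:
    """Yield the text of every word nested under an element, in order."""
    content = element.get("content")
    if isinstance(content, str):
        # Leaf node: a word carries its text directly.
        yield content
    elif isinstance(content, list):
        # Inner node: recurse into child lines or words.
        for child in content:
            yield from iter_words(child)


def element_text(element: dict[str, Any]) -> str:
    """Join all words under an element into a single string."""
    return " ".join(iter_words(element))


for page in SAMPLE_DOC["pages"]:
    for element in page["elements"]:
        print(f'{element["type"]}: {element_text(element)}')
```

The recursive walk means the same function works for a paragraph, a single line, or any other element that nests words.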

Parsr takes an image (.JPG, .PNG, .TIFF, ...) or a PDF as input and generates the following output formats:

  • JSON
  • Markdown
  • Text
  • CSV (for tables), or pandas DataFrames (see here)
  • PDF
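
As a rough idea of how a CSV table export could be consumed downstream, the sketch below parses CSV text with Python's standard `csv` module. The sample data is hypothetical, and the comma delimiter is an assumption; the actual delimiter and column layout depend on the document and on how Parsr is configured.

```python
import csv
import io

# Hypothetical CSV text standing in for a table exported by Parsr.
# A comma delimiter is assumed here.
SAMPLE_TABLE_CSV = "Item,Quantity,Price\nWidget,2,9.50\nGadget,1,24.00\n"


def read_table(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per data row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = read_table(SAMPLE_TABLE_CSV)

# Every cell arrives as a string, so convert before doing arithmetic.
total = sum(int(row["Quantity"]) * float(row["Price"]) for row in rows)
print(rows[0]["Item"], total)
```

Keeping the cells as strings until they are needed avoids guessing column types for tables whose contents are not known in advance.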


Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the Docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers); the procedure is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.

Consult the API documentation for details on its usage.
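
Before sending documents, it can help to confirm that the container is actually listening. The following is a minimal Python sketch that only checks whether a TCP connection to the published port succeeds. It makes no assumptions about the API's endpoints; the helper name `is_port_open` is made up for this example.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # 3001 is the host port published by `docker run -p 3001:3001 axarev/parsr`.
    state = "reachable" if is_port_open("localhost", 3001) else "not reachable"
    print(f"Parsr API port 3001 is {state}")
```

An open port only shows that something is listening; it does not prove that the API has finished starting up.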

  1. To use the Jupyter Notebook and the Python interface to the Parsr API, follow the guide here.
  2. To use the GUI tool (the API must already be running), issue:
    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
    Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Licenses of the third-party libraries that Parsr depends on:

  1. QPDF : Apache http://qpdf.sourceforge.net
  2. GraphicsMagick : MIT http://www.graphicsmagick.org/index.html
  3. ImageMagick : Apache 2.0 https://imagemagick.org/script/license.php
  4. Pdfminer.six : MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  5. PDF.js : Apache 2.0 https://github.com/mozilla/pdf.js
  6. Tesseract : Apache 2.0 https://github.com/tesseract-ocr/tesseract
  7. Camelot : MIT https://github.com/camelot-dev/camelot
  8. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  9. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2019 AXA Group Operations S.A.

Licensed under the Apache 2.0 license (see the LICENSE file).

