jsoup 1.12.1 released: the best Java HTML parser, bar none

Category: HTML · Published: 5 years ago

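jsoup 1.12.1 has been released. This version brings many usability improvements, faster parsing and better memory efficiency, and a good number of bug fixes.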

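jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.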
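As a rough illustration of the three access styles mentioned above, here is a minimal sketch. It assumes jsoup 1.12.1 or later is on the classpath; the class name, the sample markup, and the example.com base URI are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuickStart {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div id=content><a href='/docs'>Docs</a> <a href='/news'>News</a></div>"
        + "</body></html>";

    /** Parses the HTML text; the base URI lets abs:href resolve relative links. */
    static Document parse() {
        // To fetch a live page instead (needs network access):
        // Document doc = Jsoup.connect("https://example.com/").get();
        return Jsoup.parse(HTML, "https://example.com/");
    }

    public static void main(String[] args) {
        Document doc = parse();

        // CSS selector lookup
        Elements links = doc.select("#content a[href]");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }

        // DOM-style access and traversal; nextElementSiblings() is one of
        // the methods listed as new in this release
        Element content = doc.getElementById("content");
        System.out.println("children: " + content.children().size());
        System.out.println("after first link: " + links.first().nextElementSiblings().size());

        // jQuery-like manipulation applied to every matched element
        links.addClass("external").attr("rel", "nofollow");
        System.out.println(content.html());
    }
}
```

Passing a base URI to Jsoup.parse is optional; it is included here only so that relative links can be read back as absolute URLs.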

Download link: Download

The full changelog follows:

Changes

Connection.validateTLSCertificates()

Improvements

  • Improvement: documents now remember their parser, so when later manipulating them, the correct HTML or XML tree builder is reused, as are the parser settings like case preservation.
  • Improvement: Jsoup now detects the character set of the input if specified in an XML Declaration, when using the HTML parser. Previously that only happened when the XML parser was specified.
  • Improvement: if the document's input character set does not support encoding, flip it to one that does.
  • Improvement: if a start tag is missing a > and a new tag is seen with a <, treat that as a new tag. (This differs from the HTML5 spec, which would make an attribute with a name beginning with <, but in practice this impacts too many pages.)
  • Improvement: performance tweaks when parsing start tags, data, tables.
  • Improvement: added Element.nextElementSiblings() and Element.previousElementSiblings().
  • Improvement: treat center tags as block tags.
  • Improvement: allow forms to be submitted with Content-Type=multipart/form-data without requiring a file upload; automatically set the mime boundary.
  • Improvement: Jsoup will now detect if an input file or URL is binary, and will refuse to attempt to parse it, with an IOException. This prevents runaway processing time and wasted effort creating meaningless parsed DOM trees.
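The charset fallback item in the list above can be pictured with the Java standard library alone. The following is a minimal sketch of the idea, not jsoup's actual code: the class and method names are hypothetical, and the choice of UTF-8 as the fallback is an assumption of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputCharsetFallback {
    /**
     * Returns the charset to use when writing a document back out:
     * the input charset if it supports encoding, otherwise UTF-8
     * (the fallback is this sketch's assumption).
     */
    static Charset outputCharsetFor(Charset input) {
        return input.canEncode() ? input : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // A charset that can encode is kept as-is
        System.out.println(outputCharsetFor(StandardCharsets.ISO_8859_1));

        // Some charsets can only decode; ISO-2022-CN is used as an example
        // here, and whether it is installed depends on the JRE.
        String name = "ISO-2022-CN";
        if (Charset.isSupported(name)) {
            System.out.println(outputCharsetFor(Charset.forName(name)));
        }
    }
}
```

Charset.canEncode() is the standard check for this; nearly all charsets return true, so the fallback only matters for the rare decode-only ones.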

Bug Fixes

  • Bugfix: when using the tag case preserving parsing settings, certain HTML tree building rules were not followed for upper case tags.
  • Bugfix: when converting a Jsoup document to a W3C DOM, if an element is namespaced but not in a defined namespace, set it to the global namespace.
  • Bugfix: attributes created with the Attribute constructor with just spaces for names would incorrectly pass validation.
  • Bugfix: some pseudo XML Declarations were incorrectly handled when using the XML Parser, leading to an IOOB exception when parsing.
  • Bugfix: when parsing URL parameter names in an attribute that is not correctly HTML encoded, and near the end of the current buffer, those parameters may be incorrectly dropped. (Improved CharacterReader mark/reset support.)
  • Bugfix: boolean attribute values would be returned as null, rather than an empty string, when accessed via the Attribute#getValue() method.
  • Bugfix: orphan Attribute objects (i.e. created outside of a parse or an Element) would throw an NPE on Attribute#setValue(val).
  • Bugfix: Element.shallowClone() was not making a clone of its attributes.
  • Bugfix: fixed an ArrayIndexOutOfBoundsException in HttpConnection.looksLikeUtf8() when testing small strings in specific character ranges.
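Two of the fixes above are easy to check by hand. The sketch below assumes jsoup 1.12.1 or later is on the classpath; the class name and the sample markup are hypothetical. It reads a boolean attribute through Attribute#getValue() and inspects the result of Element.shallowClone().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

public class FixedBehaviorCheck {
    /** Value of the only attribute on <input disabled>, read via Attribute#getValue(). */
    static String booleanAttributeValue() {
        Element input = Jsoup.parse("<input disabled>").selectFirst("input");
        Attribute attr = input.attributes().iterator().next();
        return attr.getValue();
    }

    /** Shallow clone of a <div> that carries an id attribute and one child element. */
    static Element shallowCloneOfDiv() {
        Element div = Jsoup.parse("<div id=main><p>child</p></div>").selectFirst("div");
        return div.shallowClone();
    }

    public static void main(String[] args) {
        // With the fix, a boolean attribute yields an empty string, not null
        System.out.println("[" + booleanAttributeValue() + "]");

        // With the fix, the clone keeps the attributes but not the children
        Element copy = shallowCloneOfDiv();
        System.out.println(copy.attr("id") + " / child nodes: " + copy.childNodeSize());
    }
}
```

On versions before 1.12.1 the same program would be expected to show the old behavior described in the changelog entries.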
