jsoup 1.12.1 发布,最好的 Java HTML 解析器,没有之一

栏目: 软件资讯 · 发布时间: 7年前

内容简介:jsoup 1.12.1 发布了,该版本包含众多可用性的提升,提升了解析速度和内存效率,修复了不少 bug 。 jsoup 是一款 Java 的HTML 解析器,可直接解析某个URL地址、HTML文本内容。它提供了一套非常省力的API,可通过D...

jsoup 1.12.1 发布了,该版本包含众多可用性的提升,提升了解析速度和内存效率,修复了不少 bug 。

jsoup 是一款 Java 的HTML 解析器,可直接解析某个URL地址、HTML文本内容。它提供了一套非常省力的API,可通过DOM,CSS以及类似于JQuery的操作方法来取出和操作数据。

下载地址:Download 

完整的改进记录如下:

Changes

  • Change: removed deprecated method to disable TLS cert checking in Connection.validateTLSCertificates().
  • Change: some internal methods have been rearranged; if you extended any of the Jsoup internals you may need to make updates.
  • Updated jetty-server (which is used for integration tests) to latest 9.2 series (9.2.28).

Improvements

  • Improvement: documents now remember their parser, so when later manipulating them, the correct HTML or XML tree builder is reused, as are the parser settings like case preservation.
  • Improvement: Jsoup now detects the character set of the input if specified in an XML Declaration, when using the HTML parser. Previously that only happened when the XML parser was specified.
  • Improvement: if the document's input character set does not support encoding, flip it to one that does.
  • Improvement: if a start tag is missing a > and a new tag is seen with a <, treat that as a new tag. (This differs from the HTML5 spec, which would make at attribute with a name beginning with <, but in practice this impacts too many pages.
  • Improvement: performance tweaks when parsing start tags, data, tables.
  • Improvement: added Element.nextElementSiblings() and Element.previousElementSiblings()
  • Improvement: treat center tags as block tags.
  • Improvement: allow forms to be submitted with Content-Type=multipart/form-data without requiring a file upload; automatically set the mime boundary.
  • Improvement: Jsoup will now detect if an input file or URL is binary, and will refuse to attempt to parse it, with an IO Exception. This prevents runaway processing time and wasted effort creating meaningless parsed DOM trees.

Bug Fixes

  • Bugfix: when using the tag case preserving parsing settings, certain HTML tree building rules where not followed for upper case tags.
  • Bugfix: when converting a Jsoup document to a W3C DOM, if an element is namespaced but not in a defined namespace, set it to the global namespace.
  • Bugfix: attributes created with the Attribute constructor with just spaces for names would incorrectly pass validation.
  • Bugfix: some pseudo XML Declarations were incorrectly handled when using the XML Parser, leading to an IOOB exception when parsing.
  • Bugfix: when parsing URL parameter names in an attribute that is not correctly HTML encoded, and near the end of the current buffer, those parameters may be incorrectly dropped. (Improved CharacterReader mark/reset support.)
  • Bugfix: boolean attribute values would be returned as null, vs an empty string, when accessed via the Attribute#getValue() method.
  • Bugix: orphan Attribute objects (i.e. created outside of a parse or an Element) would throw an NPE on Attribute#setValue(val)
  • Bugfix: Element.shallowClone() was not making a clone of its attributes.
  • Bugfix: fixed an ArrayIndexOutOfBoundsException in HttpConnection.looksLikeUtf8() when testing small strings in specific character ranges.

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

众包

众包

杰夫·豪 / 牛文静 / 中信出版社 / 2009-6 / 36.00元

本书是继《长尾理论》之后的重要商业书籍。本书回答了《长尾理论》遗留的一大悬念。在长尾中作者详细阐述了长尾之所以成为可能的一个基础,但是没有详细解读,本书就是对这一悬念的详细回答,是《长尾理论》作者强力推荐的图书,在国际上引起了不小的轰动,“众包”这一概念也成为一个标准术语被商界广泛重视。本书大致分为三个部分,介绍众包的现在、过去和未来,解释了它的缘起、普遍性、力量以及商业上的适用性,通俗易懂,精彩......一起来看看 《众包》 这本书的介绍吧!

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码

SHA 加密
SHA 加密

SHA 加密工具