Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] DOM HTML5 parsing and serialization support #12111

Merged
merged 53 commits into from
Nov 13, 2023
Merged
Changes from all commits
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
006e0f1
Split off and wrap cloning API
nielsdos Aug 25, 2023
d081407
Update ext/libxml APIs so that non-libxml users can hook into the err…
nielsdos Aug 25, 2023
c031fff
Add is_html5_class field to document data in libxml
nielsdos Aug 25, 2023
32747ec
Implement HTML5Document
nielsdos Aug 25, 2023
c166391
Create aliases for DOM constants
nielsdos Sep 17, 2023
a61ff75
Alias dom_import_simplexml
nielsdos Sep 17, 2023
13012dd
Create class aliases
nielsdos Sep 17, 2023
5776832
Introduce common base class
nielsdos Sep 17, 2023
feca60b
Register prop handlers in common base class
nielsdos Sep 17, 2023
f150f32
Implement the new class hierarchy instead of HTML5Document extends DO…
nielsdos Sep 17, 2023
73dfdd2
Add libxml2 bug workaround
nielsdos Sep 22, 2023
8181d9e
Update tree error reporting
nielsdos Sep 23, 2023
04bea28
Amends
nielsdos Sep 24, 2023
3d848e3
DOMException name
nielsdos Sep 26, 2023
b14e1a3
Add interaction test with getElementsByTagName(NS)
nielsdos Sep 28, 2023
a8a7e96
Wire up documentURI
nielsdos Sep 28, 2023
dc13d17
Adjustment for namespace reconciliation revert
nielsdos Sep 28, 2023
28f302b
Test with noscript
nielsdos Oct 4, 2023
efc4369
Implement override_encoding
nielsdos Oct 4, 2023
0cd2449
Make libxml stream context externally visible and use it in html_docu…
nielsdos Oct 4, 2023
9edb7eb
Nope: BC
nielsdos Oct 7, 2023
1633f02
Implement MIME sniff
nielsdos Oct 7, 2023
cd00da0
Fix crash if document is uninitialized
nielsdos Oct 12, 2023
1a9d74c
Fix test output due to class changes in this RFC
nielsdos Oct 18, 2023
cd037b0
rename tests
nielsdos Oct 19, 2023
1da1a6e
Cleanup: encoding is always set for the new HTMLDocument class
nielsdos Oct 19, 2023
f1fd156
More and improved tests
nielsdos Oct 19, 2023
9fb0b12
Comment and indent cleanup
nielsdos Oct 20, 2023
871ebbc
Use libxml context for saveHTMLFile
nielsdos Oct 20, 2023
4651d96
[ci skip] UPGRADING
nielsdos Oct 20, 2023
a30e5a6
Propagate last error back into libxml
nielsdos Oct 20, 2023
d161e54
Propagate file name in libxml error
nielsdos Oct 20, 2023
23612b0
Update doctype hint in ext/xsl
nielsdos Oct 22, 2023
1410238
Update error message wording of abstract class
nielsdos Oct 22, 2023
44b9663
Add test for incompatible override_encoding and charset
nielsdos Oct 22, 2023
0b10fdc
Test behaviour of XML-style namespaces in HTMLDocument
nielsdos Oct 22, 2023
965732d
Process review feedback
nielsdos Oct 22, 2023
bf89e80
Use canonical names in the return types and argument types
nielsdos Oct 28, 2023
7ba177e
Use consistently camelCase for overrideEncoding
nielsdos Oct 28, 2023
3dbbf19
Split long lines in html_document
nielsdos Oct 31, 2023
f798f45
Give magic values descriptive names
nielsdos Oct 31, 2023
cf07ba9
Codestyle nit
nielsdos Oct 31, 2023
ea7d604
Reduce repetition
nielsdos Oct 31, 2023
d906c0d
Apply strict properties
nielsdos Oct 31, 2023
92e9064
Code style nits
nielsdos Nov 1, 2023
92faa43
Review remarks
nielsdos Nov 11, 2023
c645f5d
Code style linebreaks
nielsdos Nov 11, 2023
308313c
Make encoding explicit in XMLDocument
nielsdos Nov 11, 2023
b7f065f
Mark XMLDocument::validate() no longer as tentative-return-type
nielsdos Nov 11, 2023
dbc9afa
Only explicitly set encoding if not provided by the XML parser
nielsdos Nov 11, 2023
c7b55c2
Code review fixes
nielsdos Nov 12, 2023
557aa43
Defensively set stream to NULL
nielsdos Nov 13, 2023
46db0d2
NEWS
nielsdos Nov 13, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions NEWS
Original file line number Diff line number Diff line change
@@ -10,6 +10,7 @@ DOM:
. Implement #53655 (Improve speed of DOMNode::C14N() on large XML documents).
(nielsdos)
. Fix cloning attribute with namespace disappearing namespace. (nielsdos)
. Implement DOM HTML5 parsing and serialization RFC. (nielsdos)

FTP:
. Removed the deprecated inet_ntoa call support. (David Carlier)
8 changes: 8 additions & 0 deletions UPGRADING
Original file line number Diff line number Diff line change
@@ -80,6 +80,14 @@ PHP 8.4 UPGRADE NOTES
. Added constant DOMNode::DOCUMENT_POSITION_CONTAINS.
. Added constant DOMNode::DOCUMENT_POSITION_CONTAINED_BY.
. Added constant DOMNode::DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC.
. Implemented DOM HTML5 parsing and serialization.
RFC: https://wiki.php.net/rfc/domdocument_html5_parser.
This RFC adds the new DOM namespace along with class and constant aliases.
There are two new classes to handle HTML and XML documents:
DOM\HTMLDocument and DOM\XMLDocument.
These classes provide a cleaner API to handle HTML and XML documents.
Furthermore, the DOM\HTMLDocument class implements spec-compliant HTML5
parsing and serialization.

- Phar:
. Added support for the unix timestamp extension for zip archives.
4 changes: 4 additions & 0 deletions UPGRADING.INTERNALS
Original file line number Diff line number Diff line change
@@ -52,6 +52,10 @@ PHP 8.4 INTERNALS UPGRADE NOTES
- The function php_xsl_create_object() was removed as it was not used
nor exported.

d. ext/libxml
- Added php_libxml_pretend_ctx_error_ex() to emit errors as if they had come
from libxml.

========================
4. OpCode changes
========================
19 changes: 17 additions & 2 deletions ext/dom/config.m4
Original file line number Diff line number Diff line change
@@ -12,7 +12,21 @@ if test "$PHP_DOM" != "no"; then

PHP_SETUP_LIBXML(DOM_SHARED_LIBADD, [
AC_DEFINE(HAVE_DOM,1,[ ])
PHP_LEXBOR_CFLAGS="-I@ext_srcdir@/lexbor -DLEXBOR_STATIC"
LEXBOR_DIR="lexbor/lexbor"
LEXBOR_SOURCES="$LEXBOR_DIR/ports/posix/lexbor/core/memory.c \
$LEXBOR_DIR/core/array_obj.c $LEXBOR_DIR/core/array.c $LEXBOR_DIR/core/avl.c $LEXBOR_DIR/core/bst.c $LEXBOR_DIR/core/diyfp.c $LEXBOR_DIR/core/conv.c $LEXBOR_DIR/core/dobject.c $LEXBOR_DIR/core/dtoa.c $LEXBOR_DIR/core/hash.c $LEXBOR_DIR/core/mem.c $LEXBOR_DIR/core/mraw.c $LEXBOR_DIR/core/print.c $LEXBOR_DIR/core/serialize.c $LEXBOR_DIR/core/shs.c $LEXBOR_DIR/core/str.c $LEXBOR_DIR/core/strtod.c \
$LEXBOR_DIR/dom/interface.c $LEXBOR_DIR/dom/interfaces/attr.c $LEXBOR_DIR/dom/interfaces/cdata_section.c $LEXBOR_DIR/dom/interfaces/character_data.c $LEXBOR_DIR/dom/interfaces/comment.c $LEXBOR_DIR/dom/interfaces/document.c $LEXBOR_DIR/dom/interfaces/document_fragment.c $LEXBOR_DIR/dom/interfaces/document_type.c $LEXBOR_DIR/dom/interfaces/element.c $LEXBOR_DIR/dom/interfaces/node.c $LEXBOR_DIR/dom/interfaces/processing_instruction.c $LEXBOR_DIR/dom/interfaces/shadow_root.c $LEXBOR_DIR/dom/interfaces/text.c \
$LEXBOR_DIR/html/tokenizer/error.c $LEXBOR_DIR/html/tokenizer/state_comment.c $LEXBOR_DIR/html/tokenizer/state_doctype.c $LEXBOR_DIR/html/tokenizer/state_rawtext.c $LEXBOR_DIR/html/tokenizer/state_rcdata.c $LEXBOR_DIR/html/tokenizer/state_script.c $LEXBOR_DIR/html/tokenizer/state.c \
$LEXBOR_DIR/html/tree/active_formatting.c $LEXBOR_DIR/html/tree/error.c $LEXBOR_DIR/html/tree/insertion_mode/after_after_body.c $LEXBOR_DIR/html/tree/insertion_mode/after_after_frameset.c $LEXBOR_DIR/html/tree/insertion_mode/after_body.c $LEXBOR_DIR/html/tree/insertion_mode/after_frameset.c $LEXBOR_DIR/html/tree/insertion_mode/after_head.c $LEXBOR_DIR/html/tree/insertion_mode/before_head.c $LEXBOR_DIR/html/tree/insertion_mode/before_html.c $LEXBOR_DIR/html/tree/insertion_mode/foreign_content.c $LEXBOR_DIR/html/tree/insertion_mode/in_body.c $LEXBOR_DIR/html/tree/insertion_mode/in_caption.c $LEXBOR_DIR/html/tree/insertion_mode/in_cell.c $LEXBOR_DIR/html/tree/insertion_mode/in_column_group.c $LEXBOR_DIR/html/tree/insertion_mode/in_frameset.c $LEXBOR_DIR/html/tree/insertion_mode/in_head.c $LEXBOR_DIR/html/tree/insertion_mode/in_head_noscript.c $LEXBOR_DIR/html/tree/insertion_mode/initial.c $LEXBOR_DIR/html/tree/insertion_mode/in_row.c $LEXBOR_DIR/html/tree/insertion_mode/in_select.c $LEXBOR_DIR/html/tree/insertion_mode/in_select_in_table.c $LEXBOR_DIR/html/tree/insertion_mode/in_table_body.c $LEXBOR_DIR/html/tree/insertion_mode/in_table.c $LEXBOR_DIR/html/tree/insertion_mode/in_table_text.c $LEXBOR_DIR/html/tree/insertion_mode/in_template.c $LEXBOR_DIR/html/tree/insertion_mode/text.c $LEXBOR_DIR/html/tree/open_elements.c \
$LEXBOR_DIR/encoding/big5.c $LEXBOR_DIR/encoding/decode.c $LEXBOR_DIR/encoding/encode.c $LEXBOR_DIR/encoding/encoding.c $LEXBOR_DIR/encoding/euc_kr.c $LEXBOR_DIR/encoding/gb18030.c $LEXBOR_DIR/encoding/iso_2022_jp_katakana.c $LEXBOR_DIR/encoding/jis0208.c $LEXBOR_DIR/encoding/jis0212.c $LEXBOR_DIR/encoding/range.c $LEXBOR_DIR/encoding/res.c $LEXBOR_DIR/encoding/single.c \
$LEXBOR_DIR/html/encoding.c $LEXBOR_DIR/html/interface.c $LEXBOR_DIR/html/parser.c $LEXBOR_DIR/html/token.c $LEXBOR_DIR/html/token_attr.c $LEXBOR_DIR/html/tokenizer.c $LEXBOR_DIR/html/tree.c \
$LEXBOR_DIR/html/interfaces/anchor_element.c $LEXBOR_DIR/html/interfaces/area_element.c $LEXBOR_DIR/html/interfaces/audio_element.c $LEXBOR_DIR/html/interfaces/base_element.c $LEXBOR_DIR/html/interfaces/body_element.c $LEXBOR_DIR/html/interfaces/br_element.c $LEXBOR_DIR/html/interfaces/button_element.c $LEXBOR_DIR/html/interfaces/canvas_element.c $LEXBOR_DIR/html/interfaces/data_element.c $LEXBOR_DIR/html/interfaces/data_list_element.c $LEXBOR_DIR/html/interfaces/details_element.c $LEXBOR_DIR/html/interfaces/dialog_element.c $LEXBOR_DIR/html/interfaces/directory_element.c $LEXBOR_DIR/html/interfaces/div_element.c $LEXBOR_DIR/html/interfaces/d_list_element.c $LEXBOR_DIR/html/interfaces/document.c $LEXBOR_DIR/html/interfaces/element.c $LEXBOR_DIR/html/interfaces/embed_element.c $LEXBOR_DIR/html/interfaces/field_set_element.c $LEXBOR_DIR/html/interfaces/font_element.c $LEXBOR_DIR/html/interfaces/form_element.c $LEXBOR_DIR/html/interfaces/frame_element.c $LEXBOR_DIR/html/interfaces/frame_set_element.c $LEXBOR_DIR/html/interfaces/head_element.c $LEXBOR_DIR/html/interfaces/heading_element.c $LEXBOR_DIR/html/interfaces/hr_element.c $LEXBOR_DIR/html/interfaces/html_element.c $LEXBOR_DIR/html/interfaces/iframe_element.c $LEXBOR_DIR/html/interfaces/image_element.c $LEXBOR_DIR/html/interfaces/input_element.c $LEXBOR_DIR/html/interfaces/label_element.c $LEXBOR_DIR/html/interfaces/legend_element.c $LEXBOR_DIR/html/interfaces/li_element.c $LEXBOR_DIR/html/interfaces/link_element.c $LEXBOR_DIR/html/interfaces/map_element.c $LEXBOR_DIR/html/interfaces/marquee_element.c $LEXBOR_DIR/html/interfaces/media_element.c $LEXBOR_DIR/html/interfaces/menu_element.c $LEXBOR_DIR/html/interfaces/meta_element.c $LEXBOR_DIR/html/interfaces/meter_element.c $LEXBOR_DIR/html/interfaces/mod_element.c $LEXBOR_DIR/html/interfaces/object_element.c $LEXBOR_DIR/html/interfaces/o_list_element.c $LEXBOR_DIR/html/interfaces/opt_group_element.c $LEXBOR_DIR/html/interfaces/option_element.c $LEXBOR_DIR/html/interfaces/output_element.c $LEXBOR_DIR/html/interfaces/paragraph_element.c $LEXBOR_DIR/html/interfaces/param_element.c $LEXBOR_DIR/html/interfaces/picture_element.c $LEXBOR_DIR/html/interfaces/pre_element.c $LEXBOR_DIR/html/interfaces/progress_element.c $LEXBOR_DIR/html/interfaces/quote_element.c $LEXBOR_DIR/html/interfaces/script_element.c $LEXBOR_DIR/html/interfaces/select_element.c $LEXBOR_DIR/html/interfaces/slot_element.c $LEXBOR_DIR/html/interfaces/source_element.c $LEXBOR_DIR/html/interfaces/span_element.c $LEXBOR_DIR/html/interfaces/style_element.c $LEXBOR_DIR/html/interfaces/table_caption_element.c $LEXBOR_DIR/html/interfaces/table_cell_element.c $LEXBOR_DIR/html/interfaces/table_col_element.c $LEXBOR_DIR/html/interfaces/table_element.c $LEXBOR_DIR/html/interfaces/table_row_element.c $LEXBOR_DIR/html/interfaces/table_section_element.c $LEXBOR_DIR/html/interfaces/template_element.c $LEXBOR_DIR/html/interfaces/text_area_element.c $LEXBOR_DIR/html/interfaces/time_element.c $LEXBOR_DIR/html/interfaces/title_element.c $LEXBOR_DIR/html/interfaces/track_element.c $LEXBOR_DIR/html/interfaces/u_list_element.c $LEXBOR_DIR/html/interfaces/unknown_element.c $LEXBOR_DIR/html/interfaces/video_element.c $LEXBOR_DIR/html/interfaces/window.c \
$LEXBOR_DIR/selectors/selectors.c \
$LEXBOR_DIR/ns/ns.c \
$LEXBOR_DIR/tag/tag.c"
PHP_NEW_EXTENSION(dom, [php_dom.c attr.c document.c \
xml_document.c html_document.c html5_serializer.c html5_parser.c namespace_compat.c \
domexception.c parentnode.c \
processinginstruction.c cdatasection.c \
documentfragment.c domimplementation.c \
@@ -21,8 +35,9 @@ if test "$PHP_DOM" != "no"; then
nodelist.c text.c comment.c \
entityreference.c \
notation.c xpath.c dom_iterators.c \
namednodemap.c],
$ext_shared)
namednodemap.c \
$LEXBOR_SOURCES],
$ext_shared,,$PHP_LEXBOR_CFLAGS)
PHP_SUBST(DOM_SHARED_LIBADD)
PHP_INSTALL_HEADERS([ext/dom/xml_common.h])
PHP_ADD_EXTENSION_DEP(dom, libxml)
18 changes: 17 additions & 1 deletion ext/dom/config.w32
Original file line number Diff line number Diff line change
@@ -8,13 +8,29 @@ if (PHP_DOM == "yes") {
CHECK_HEADER_ADD_INCLUDE("libxml/parser.h", "CFLAGS_DOM", PHP_PHP_BUILD + "\\include\\libxml2")
) {
EXTENSION("dom", "php_dom.c attr.c document.c \
xml_document.c html_document.c html5_serializer.c html5_parser.c namespace_compat.c \
domexception.c parentnode.c processinginstruction.c \
cdatasection.c documentfragment.c domimplementation.c element.c \
node.c characterdata.c documenttype.c \
entity.c nodelist.c text.c comment.c \
entityreference.c \
notation.c xpath.c dom_iterators.c \
namednodemap.c");
namednodemap.c", null, "-Iext/dom/lexbor");

ADD_SOURCES("ext/dom/lexbor/lexbor/ports/windows_nt/lexbor/core", "memory.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/core", "array_obj.c array.c avl.c bst.c diyfp.c conv.c dobject.c dtoa.c hash.c mem.c mraw.c print.c serialize.c shs.c str.c strtod.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/dom", "interface.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/dom/interfaces", "attr.c cdata_section.c character_data.c comment.c document.c document_fragment.c document_type.c element.c node.c processing_instruction.c shadow_root.c text.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/html/tokenizer", "error.c state_comment.c state_doctype.c state_rawtext.c state_rcdata.c state_script.c state.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/html/tree", "active_formatting.c open_elements.c error.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/html/tree/insertion_mode", "after_after_body.c after_after_frameset.c after_body.c after_frameset.c after_head.c before_head.c before_html.c foreign_content.c in_body.c in_caption.c in_cell.c in_column_group.c in_frameset.c in_head.c in_head_noscript.c initial.c in_row.c in_select.c in_select_in_table.c in_table_body.c in_table.c in_table_text.c in_template.c text.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/html", "encoding.c interface.c parser.c token.c token_attr.c tokenizer.c tree.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/encoding", "big5.c decode.c encode.c encoding.c euc_kr.c gb18030.c iso_2022_jp_katakana.c jis0208.c jis0212.c range.c res.c single.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/html/interfaces", "anchor_element.c area_element.c audio_element.c base_element.c body_element.c br_element.c button_element.c canvas_element.c data_element.c data_list_element.c details_element.c dialog_element.c directory_element.c div_element.c d_list_element.c document.c element.c embed_element.c field_set_element.c font_element.c form_element.c frame_element.c frame_set_element.c head_element.c heading_element.c hr_element.c html_element.c iframe_element.c image_element.c input_element.c label_element.c legend_element.c li_element.c link_element.c map_element.c marquee_element.c media_element.c menu_element.c meta_element.c meter_element.c mod_element.c object_element.c o_list_element.c opt_group_element.c option_element.c output_element.c paragraph_element.c param_element.c picture_element.c pre_element.c progress_element.c quote_element.c script_element.c select_element.c slot_element.c source_element.c span_element.c style_element.c table_caption_element.c table_cell_element.c table_col_element.c table_element.c table_row_element.c table_section_element.c template_element.c text_area_element.c time_element.c title_element.c track_element.c u_list_element.c unknown_element.c video_element.c window.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/selectors", "selectors.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/ns", "ns.c", "dom");
ADD_SOURCES("ext/dom/lexbor/lexbor/tag", "tag.c", "dom");
ADD_FLAG("CFLAGS_DOM", "/D LEXBOR_STATIC ");

AC_DEFINE("HAVE_DOM", 1, "DOM support");

Loading