HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison #12286

Open
sirreal wants to merge 22 commits into WordPress:trunk from sirreal:fix/html-decoder-legacy-follower-ascii

@sirreal sirreal commented Jun 23, 2026

Member

Part of decoding HTML named character references in attribute values may involve checking the codepoint immediately following the named character reference:

13.2.5.73 Named character reference state

If there is a [named character reference] match

  • If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.

The ASCII alphanumeric check was implemented using ctype_alnum(), whose behavior depends on the host system and the locale. On my system (macOS) it returns true for bytes outside of the desired ASCII alphanumeric range:

php -r 'echo ctype_alnum( "\xC2" ) ? "Affected" : "Unaffected";'
# Affected

This change compares the following byte with the well-defined ASCII alphanumeric ranges from the HTML specification.
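
A minimal sketch of such a byte comparison follows. The helper name `is_ascii_alnum_byte_at()` is hypothetical and is not taken from this PR; it only illustrates comparing one byte against the three ASCII alphanumeric ranges (0-9, A-Z, a-z) without consulting the locale.

```php
<?php
/**
 * Hypothetical helper (not the code from this PR): reports whether the byte
 * at $offset in $text is an ASCII alphanumeric, independent of the locale.
 */
function is_ascii_alnum_byte_at( string $text, int $offset ): bool {
	// No byte at this offset means there is nothing to match.
	if ( $offset < 0 || $offset >= strlen( $text ) ) {
		return false;
	}

	$byte = ord( $text[ $offset ] );

	return (
		( $byte >= 0x30 && $byte <= 0x39 ) || // 0-9
		( $byte >= 0x41 && $byte <= 0x5A ) || // A-Z
		( $byte >= 0x61 && $byte <= 0x7A )    // a-z
	);
}

var_dump( is_ascii_alnum_byte_at( '&notme', 4 ) );       // bool(true): "m" follows "&not"
var_dump( is_ascii_alnum_byte_at( "&not\xC2\xAC", 4 ) ); // bool(false): 0xC2 is not ASCII
```

Because only byte values are compared, every byte of a multibyte UTF-8 sequence (all of which are 0x80 or above) is rejected on every system.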

This change also lightly restructures the method so that it aligns more clearly with the specification, and adds an early return that avoids the byte comparison in the majority of cases.

Trac ticket: https://core.trac.wordpress.org/ticket/65372

Use of AI Tools


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions

Hi there! 👋

Thank you for your contribution to WordPress! 💖

It looks like this is your first pull request to wordpress-develop. Here are a few things to be aware of that may help you out!

No one monitors this repository for new pull requests. Pull requests must be attached to a Trac ticket to be considered for inclusion in WordPress Core. To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description.

Pull requests are never merged on GitHub. The WordPress codebase continues to be managed through the SVN repository that this GitHub repository mirrors. Please feel free to open pull requests to work on any contribution you are making.

More information about how GitHub pull requests can be used to contribute to WordPress can be found in the Core Handbook.

Please include automated tests. Including tests in your pull request is one way to help your patch be considered faster. To learn about WordPress' test suites, visit the Automated Testing page in the handbook.

If you have not had a chance, please review the Contribute with Code page in the WordPress Core Handbook.

The Developer Hub also documents the various coding standards that are followed:

Thank you,
The WordPress Project

@sirreal sirreal changed the title from "HTML Decoder: Replace system-dependent ctype check ASCII byte comparison" to "HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison" Jun 23, 2026
Comment on lines +35 to +42
$locale_candidates = array(
	'C.UTF-8',
	'C.utf8',
	'en_US.UTF-8',
	'en_US.utf8',
	'en_GB.UTF-8',
	'en_GB.utf8',
);

Member Author

I don't know whether it's worth checking multiple locales, or whether all of these locales are likely to behave the same way on a given system. For example, my system has the issue with "C.UTF-8", the other .UTF-8 locales listed here, and more.
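
One way to answer that question on a given machine is to probe each candidate directly. The following is only a sketch: `probe_ctype_alnum_by_locale()` is a hypothetical name, it assumes the ctype extension is loaded, and its results vary by system, so no expected output is shown.

```php
<?php
/**
 * Hypothetical probe: for each candidate locale, report whether it could be
 * set and, if so, whether ctype_alnum() accepts the non-ASCII byte 0xC2.
 *
 * @param string[] $locale_candidates Locale names to try.
 * @return array Map of locale name to null (unavailable), true (affected),
 *               or false (unaffected).
 */
function probe_ctype_alnum_by_locale( array $locale_candidates ): array {
	// Passing "0" reads the current setting without changing it.
	$previous = setlocale( LC_CTYPE, '0' );
	$results  = array();

	foreach ( $locale_candidates as $candidate ) {
		// setlocale() returns false when the locale is not available.
		if ( false === setlocale( LC_CTYPE, $candidate ) ) {
			$results[ $candidate ] = null;
			continue;
		}
		$results[ $candidate ] = ctype_alnum( "\xC2" );
	}

	// Restore the original locale so the probe has no lasting effect.
	if ( false !== $previous ) {
		setlocale( LC_CTYPE, $previous );
	}

	return $results;
}

$report = probe_ctype_alnum_by_locale( array( 'C.UTF-8', 'en_US.UTF-8', 'en_GB.UTF-8' ) );
foreach ( $report as $locale => $affected ) {
	$label = null === $affected ? 'unavailable' : ( $affected ? 'affected' : 'unaffected' );
	echo $locale, ': ', $label, "\n";
}
```

If every available candidate reports the same result, a single locale would probably be enough for the test setup; if they differ, keeping the list seems justified.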

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

*
* @ticket 65372
*/
public function test_semicolonless_legacy_reference_before_multibyte_attribute_follower( string $encoded_attribute_value, string $expected, string $expected_decode, int $expected_byte_length ): void {

Member Author

This is the test that fails on trunk depending on the system.

sirreal commented Jun 23, 2026

Member Author

I'm trying a revert of the ctype_alnum() change to see if there are any failures on CI.

On my system, I get these failures from one of the new tests:

1) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #0 ('&copy¯\_(ツ)_/¯', '©¯\_(ツ)_/¯', '©', 5)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'©¯\_(ツ)_/¯'
+'&copy¯\_(ツ)_/¯'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

2) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #1 ('&notಠ_ಠ', '¬ಠ_ಠ', '¬', 4)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'¬ಠ_ಠ'
+'&notಠ_ಠ'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

3) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #2 ('&nbsp£20', ' £20', ' ', 5)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-' £20'
+'&nbsp£20'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

4) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #3 ('&nbsp🎉', ' 🎉', ' ', 5)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-' 🎉'
+'&nbsp🎉'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

5) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #4 ('&reg™', '®™', '®', 4)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'®™'
+'&reg™'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

FAILURES!
Tests: 115, Assertions: 331, Failures: 5.

@sirreal sirreal marked this pull request as ready for review June 23, 2026 15:07
@sirreal sirreal requested a review from dmsnell June 23, 2026 15:07
@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

* - In _attribute context_, "&not" decodes to "¬". Condition 3 is not satisfied
* because there is no following code point to consider.
* - In _attribute context_, "&notme" decodes to "&notme" unchanged because it
* satisfies all three conditions above.

Member

Your expanded discussion is really helpful, but I find the inversion of logic really hard to follow. Before, we had “allowed under these circumstances” and now we have “not allowed when these circumstances are not met.”

The “ambiguous” language might be helpful both to match the spec and to explain the intent behind the rule. The intent is that we are determining whether the missing semicolon was likely a typo or whether the text was never intended to be a character reference: it’s ambiguous.

For example:

Condition 3 is not satisfied

The reference is not not-rendered because a condition is not satisfied.

Perhaps phrasing could be more affirmative in describing what does happen.

In attribute context, "&not己" decodes to "¬己" because the character in the place of the missing semicolon is distinctly separate from the name; it is neither an ASCII alphanumeric nor an equals sign.


If we are going to expand this so much, we might also consider explaining the other conditions, the ambiguous ones, to highlight why the rule is here. Specifically I see no mention of URL query arguments, which explains the equals sign.

  • Please notify all future &current students.
  • https://website.domain/search?q=html&not=regex

So these two cases I think capture the “error-handling” aspect and might clarify the complicated rules. I think the essence is that everything here is complicated to try and avoid these two cases.
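
To make the two ambiguous cases concrete, here is a rough sketch of the attribute-context rule. The function name `is_ambiguous_legacy_follower()` is hypothetical and this is not the decoder's actual code; it assumes a named reference has already been matched and that `$end` is a valid byte offset just past that match.

```php
<?php
/**
 * Hypothetical sketch of the attribute-context legacy rule.
 *
 * @param string $text Attribute value containing an already-matched reference.
 * @param int    $end  Byte offset just past the matched reference name.
 * @return bool True when the match is ambiguous and should stay as plain text.
 */
function is_ambiguous_legacy_follower( string $text, int $end ): bool {
	// Early return: a reference that ends in ";" is never ambiguous.
	if ( $end > 0 && ';' === $text[ $end - 1 ] ) {
		return false;
	}

	// Nothing follows the reference, so nothing can make it ambiguous.
	if ( $end >= strlen( $text ) ) {
		return false;
	}

	$follower = ord( $text[ $end ] );

	return (
		0x3D === $follower ||                         // =
		( $follower >= 0x30 && $follower <= 0x39 ) || // 0-9
		( $follower >= 0x41 && $follower <= 0x5A ) || // A-Z
		( $follower >= 0x61 && $follower <= 0x7A )    // a-z
	);
}

// "&curren" is a legacy reference, but the "t" suggests it was never meant as one.
$prose = 'Please notify all future &current students.';
var_dump( is_ambiguous_legacy_follower( $prose, strpos( $prose, '&curren' ) + 7 ) ); // bool(true)

// "&not" followed by "=" looks like a URL query argument.
$url = 'https://website.domain/search?q=html&not=regex';
var_dump( is_ambiguous_legacy_follower( $url, strpos( $url, '&not' ) + 4 ) ); // bool(true)

// A multibyte follower (here the UTF-8 bytes of U+5DF1) is distinctly separate.
var_dump( is_ambiguous_legacy_follower( "&not\xE5\xB7\xB1", 4 ) ); // bool(false)
```

When the function returns true the text is left unchanged; when it returns false the reference decodes, which is the affirmative framing suggested above.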

2 participants