HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison #12286
sirreal wants to merge 22 commits
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
$locale_candidates = array(
	'C.UTF-8',
	'C.utf8',
	'en_US.UTF-8',
	'en_US.utf8',
	'en_GB.UTF-8',
	'en_GB.utf8',
);
I don't know whether it's worth checking multiple locales, or whether all of these locales are likely to have the same behavior on the same system. For example, my system has the issue with "C.UTF-8", the other .UTF-8 locales listed here, and more.
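One way to answer this empirically is to probe each candidate and record what `ctype_alnum()` reports for a non-ASCII byte. The following is a minimal sketch, not code from this PR: `probe_ctype_alnum_by_locale()` is a hypothetical helper name, and the probe byte `"\xE5"` (the first UTF-8 byte of "己") is an assumption chosen for illustration.

```php
<?php
/**
 * Hypothetical helper: reports what ctype_alnum() returns for a probe byte
 * under each locale that the current system actually provides.
 *
 * @param string[] $locale_candidates Locale names to try.
 * @param string   $probe_byte        Single byte to test (assumed: 0xE5).
 * @return array<string, bool> Map of available locale name to ctype_alnum() result.
 */
function probe_ctype_alnum_by_locale( array $locale_candidates, string $probe_byte = "\xE5" ): array {
	// Passing '0' asks setlocale() for the current value without changing it.
	$original = setlocale( LC_CTYPE, '0' );
	$results  = array();

	foreach ( $locale_candidates as $candidate ) {
		// setlocale() returns false when the system does not provide the locale.
		if ( false === setlocale( LC_CTYPE, $candidate ) ) {
			continue;
		}
		$results[ $candidate ] = ctype_alnum( $probe_byte );
	}

	// Restore the locale that was active before probing.
	if ( false !== $original ) {
		setlocale( LC_CTYPE, $original );
	}

	return $results;
}

// Usage: the output depends on the host system, so none is shown here.
foreach ( probe_ctype_alnum_by_locale( array( 'C.UTF-8', 'en_US.UTF-8' ) ) as $locale => $is_alnum ) {
	printf( "%s: %s\n", $locale, $is_alnum ? 'true' : 'false' );
}
```

If every available locale reports the same value, testing a single one may be enough on that system, but that says nothing about other systems.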
 *
 * @ticket 65372
 */
public function test_semicolonless_legacy_reference_before_multibyte_attribute_follower( string $encoded_attribute_value, string $expected, string $expected_decode, int $expected_byte_length ): void {
This is the test that fails on trunk depending on the system.
This reverts commit e2ed016.
I'm trying a revert of the On my system, I get these failures from one of the new tests:
 * - In _attribute context_, "&not" decodes to "¬". Condition 3 is not satisfied
 *   because there is no following code point to consider.
 * - In _attribute context_, "&notme" decodes to "&notme" unchanged because it
 *   satisfies all three conditions above.
Your expanded discussion is really helpful, but I find the inversion of logic really hard to follow. Before, we had “allowed under these circumstances”; now we have “not allowed when these circumstances are not met.”
The “ambiguous” language might be helpful both to match the spec and to explain the intent behind the rule. The intent is to determine whether the missing semicolon was likely a typo or whether the text was never intended to be a character reference: it’s ambiguous.
For example:
“Condition 3 is not satisfied”
The reference is not not-rendered because a condition is not satisfied.
Perhaps phrasing could be more affirmative in describing what does happen.
In attribute context, "&not己" decodes to "¬己" because the character in the place of the missing semicolon is distinctly separate from the name; it is neither an ASCII alphanumeric nor an equals sign.
If we are going to expand this so much, we might also consider explaining the other conditions, the ambiguous ones, to highlight why the rule is here. Specifically I see no mention of URL query arguments, which explains the equals sign.
Please notify all future &current students.
https://website.domain/search?q=html&not=regex
So these two cases I think capture the “error-handling” aspect and might clarify the complicated rules. I think the essence is that everything here is complicated to try and avoid these two cases.
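To make the two cases concrete, here is a minimal sketch of the "ambiguous follower" decision in attribute context. It is an illustration under stated assumptions, not the decoder's actual code: `is_ambiguous_in_attribute()` is a hypothetical name, and the caller is assumed to have already matched a semicolonless named reference and to pass the byte offset just past the name.

```php
<?php
/**
 * Hypothetical helper: in attribute context, a semicolonless named reference
 * is left undecoded when the next byte is "=" or an ASCII alphanumeric.
 *
 * @param string $text       Raw attribute value.
 * @param int    $after_name Byte offset just past the matched reference name.
 * @return bool Whether the follower makes the reference ambiguous.
 */
function is_ambiguous_in_attribute( string $text, int $after_name ): bool {
	// Nothing follows the name, so there is no follower to consider.
	if ( $after_name >= strlen( $text ) ) {
		return false;
	}

	// strspn() compares bytes against this mask and ignores the locale.
	$ambiguous_followers = '=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

	return 1 === strspn( $text, $ambiguous_followers, $after_name, 1 );
}

// Each pair is an attribute value and the semicolonless name matched inside it.
$examples = array(
	array( 'Please notify all future &current students.', '&curren' ),
	array( 'https://website.domain/search?q=html&not=regex', '&not' ),
	array( '&not己', '&not' ),
	array( '&not', '&not' ),
);

foreach ( $examples as list( $text, $name ) ) {
	$after_name = strpos( $text, $name ) + strlen( $name );
	printf( "%s => %s\n", $text, is_ambiguous_in_attribute( $text, $after_name ) ? 'leave as-is' : 'decode' );
}
```

The first two values are left as-is because "t" and "=" follow the name; the last two decode because the follower is a non-ASCII byte or is absent.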
Part of decoding HTML named character references in attribute values may involve checking the codepoint immediately following the named character reference:
The ASCII alphanumeric check was implemented using `ctype_alnum()`. The behavior of this function depends on the host system and the locale. On my system (macOS) it returns `true` for characters outside of the desired ASCII alphanumeric range.

This change compares the following byte with the well-defined ASCII alphanumeric ranges from the HTML specification.
This change also does some minor restructuring of the method to make it align clearly with the specification and to include an early return and avoid the byte comparison in the majority of cases.
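As a rough illustration of the approach described above, the sketch below replaces a locale-dependent check with explicit byte ranges. The function name `is_ascii_alphanumeric_at()` is hypothetical and the structure is an assumption; the code in this PR may be organized differently.

```php
<?php
/**
 * Hypothetical helper: reports whether the byte at a given offset is an
 * ASCII alphanumeric (0-9, A-Z, a-z), independent of the system locale.
 *
 * @param string $text Text to inspect.
 * @param int    $at   Byte offset of the byte to check.
 * @return bool Whether the byte is an ASCII alphanumeric.
 */
function is_ascii_alphanumeric_at( string $text, int $at ): bool {
	// Early return: no byte exists at this offset.
	if ( $at < 0 || $at >= strlen( $text ) ) {
		return false;
	}

	$byte = ord( $text[ $at ] );

	return (
		( $byte >= 0x30 && $byte <= 0x39 ) || // 0-9
		( $byte >= 0x41 && $byte <= 0x5A ) || // A-Z
		( $byte >= 0x61 && $byte <= 0x7A )    // a-z
	);
}

var_dump( is_ascii_alphanumeric_at( '&notme', 4 ) ); // bool(true): "m" is in a-z.
var_dump( is_ascii_alphanumeric_at( '&not己', 4 ) ); // bool(false): 0xE5 is outside all three ranges.
```

Because the comparison works on raw byte values, every byte of a multibyte UTF-8 sequence (always 0x80 or above) falls outside the ranges regardless of locale.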
Trac ticket: https://core.trac.wordpress.org/ticket/65372
Use of AI Tools
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.