-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
| * > then let the script block's type string for this script element be "text/javascript". | ||
| */ | ||
| $type_attr = $this->get_attribute( 'type' ); | ||
| $language_attr = $this->get_attribute( 'language' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like this has the potential to be clearer if we did one of two things:
- early-abort when both the type and language attributes are missing.
- null-coalesce to some value like
''which would be semantically the same in these checks asnullbut allow us to treat the values as strings.
or even something like this…
$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: ''; // combine `true` and `null` into ''.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.
| /* | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for | ||
| * > the string "module", then set el's type to "module". | ||
| * | ||
| * A module is evaluated as JavaScript. | ||
| */ | ||
| case 'module': | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer this as-is, it allows quotes referencing different specifications to remain separate.
Co-authored-by: Weston Ruter <[email protected]>
|
I've created #10668 to align STYLE tag handling with this approach. |
| * Other types of script tags cannot be escaped safely because there is | ||
| * no general escaping mechanism for arbitrary types of content. | ||
| */ | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we think about changing the design of the
is_javascript_script_tag()methods, we could expose something like a MIME type and allow custom filtering of escapes to allow plugins to provide their own escaping routines rather than reject the updates.for now I see no reason to block progress for this, especially since we already rejected these contents before, but I think we can imagine a situation where someone can replace their SCRIPT contents and we can perform the
<scriptand</scriptchecks after the filter, and only then reject if nothing was able to escape
| return strtr( | ||
| $text, | ||
| array( '<' => '\\u003C' ) | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linking to conversation about this escaping choice:
this is certainly easy, but I wonder what the relative value is compared to performing the same escape as with the JavaScript elements.
my guess is that
<scriptand</scriptoccur far less frequently within JSON than<does, so if we perform the same escaping then we might expect considerably fewer modifications to the JSON content.from a performance standpoint I expect the picture to be complicated and hard to quantify, as we’re comparing a relatively simple pair of string searches against string allocations and string building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this as well. Ultimately, this seemed very reasonable, strtr is supposed to have good performance characteristics and there are no cases to worry about with <. This is close to the behavior of json_encode() with JSON_HEX_TAG, although that also escapes >.
Some other considerations:
- We could JSON decode/encode with the appropriate flags. That's likely expensive and can modify some types of values, I think it's best avoided.
- We could use the same escaping as JavaScript.
preg_replace_callbackis likely more expensive, but I don't have data on that. - I considered doing
strtrwith all the case variants, but the list ofstrtrcase variants would require something like 128 different cases for case-insensitive<scriptand</script, and that seems excessive.
There is a middle ground that is worth considering. This note on strtr() is relevant (bold mine):
If given two arguments, the second should be an array in the form
array('from' => 'to', ...)…The keys and the values may have any length… However, this function will be the most efficient when all the keys have the same size.
Every character added to the replacement becomes more targeted. We could use additional characters in the strtr pattern with something like the following:
function escape_json_script_contents_2( string $text ): string {
return strtr(
$text,
array(
'<s' => '\\u003Cs',
'<S' => '\\u003CS',
'</' => '\\u003C/',
)
);
}
function escape_json_script_contents_3( string $text ): string {
return strtr(
$text,
array(
'<sc' => '\\u003Csc',
'<Sc' => '\\u003CSc',
'<sC' => '\\u003CsC',
'<SC' => '\\u003CSC',
'</s' => '\\u003C/s',
'</S' => '\\u003C/S',
)
);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these seem better than how it stands now as I think the occurrence rate of <s, <S, and </ will also be much lower than of < by itself.
I wouldn’t want us to get caught up in the weeds of performance without any metrics to guide us. We can keep in mind that these operations are going to happen relatively infrequently during a page render. Maybe they happen ten times, vs. something like tag parsing which might happen thousands of times.
So the preg_replace_callback() seems just fine to me, even though personally I think both of these methods are probably non-ideal. I still prefer inching forward on stripos() followed by strspn().
The thing is, without data we just don’t know and so who cares if we are potentially slower on a minor part of the page render. If our intention is clear we can follow-up with improvements where we can make them soundly.
Ultimately here my intuition says that the difference between the basic strtr() and the stripos()+strspn() replacements is going to fall down to the relative improvement strtr() brings against the higher rate of modifying the string. I would expect < to appear in a reasonably high fraction of strings, whereas <script and variants are unlikely to regularly appear. Both approaches are scanning through the string to perform a search, and while strtr() is surely efficient, not allocating and not performing replacement is surely more efficient.
I would caution here against piling up verbose source code to eke out minor performance improvements when we don’t know if they matter and when they obscure intent. None of your suggestions obscure intent, but I think they may also not be essential.
| * | ||
| * † = Case insensitive 'script' followed by one of ' \t\f\r\n/>' | ||
| * | ||
| * The original source of this graph is included at the bottom of this file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be updated.
| } | ||
|
|
||
| /** | ||
| * Escape JavaScript script tag contents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to document that this is suitable for JavaScript and JSON escaping.
| return 'json'; | ||
|
|
||
| /** @todo Rely on a full MIME parser for determining JSON content. */ | ||
| case 'application/json': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also good to include here initially:
| case 'application/json': | |
| case 'application/json': | |
| case 'application/ld+json': | |
| case 'application/manifest+json': |
These and other such MIME types would be handled by the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for these “MIME type essence” matches it’s okay to add, but I wanted to somewhat limit how much further we go parsing them until we do actual MIME parsing, otherwise we’ll just end up duplicating a bunch of effort.
perhaps we try and merge #10640 before this one and then we can eliminate the TODO as well as the special cases
|
@sirreal after my changes the tests could be merged, and in fact, can/should be refactored a fair amount. I think he same might be true of the long comment in the JavaScript escaping function. for now, I moved the at one point I added an optional parameter which was if you look at the commits, you will realize I learned that |

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.
<scriptor</script(case-insensitive) is found.In JSON
<is replaced with\u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.This is proposed as a simple character replacement with
strtr. This should be highly performant. A less invasive replacement could be done to only replace<in<scriptor</scriptwhere it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).In JavaScript
<scriptand</script(followed by a necessary tag termination character\t\n\r\f/>) thesis replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).From the ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/64419
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.