Skip to content

Conversation

@sirreal
Copy link
Member

@sirreal sirreal commented Dec 15, 2025

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.

  • JSON and JavaScript script tags may be detected according to the HTML standard.
  • Script tag contents are escaped only when <script or </script (case-insensitive) is found.

In JSON

< is replaced with \u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.

This is proposed as a simple character replacement with strtr. This should be highly performant. A less invasive replacement could be done to only replace < in <script or </script where it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).

In JavaScript

<script and </script (followed by a necessary tag termination character \t\n\r\f/>) the s is replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).

From the ticket:

The HTML API prevents setting SCRIPT tag that could modify the tree either by closing the SCRIPT element prematurely, or by preventing the SCRIPT element from closing at the expected close tag.

This is handled by rejecting any script tag contents that are potentially dangerous and is safe. There are some improvements that could be made.

If the contents are found to be unsafe and the type of the script tag is JSON or JavaScript (this is well specified in the HTML standard), it should be possible to apply a syntactic transformation to the contents in such a way that the script contents become safe without semantically altering the script.

If the HTML API can safely and automatically escape the majority of SCRIPT tag contents, it can then be used to for SCRIPT tag creation and has the potential to eliminate the class of problem from #40737, #62797, and #63851. It also has the potential to address part of #51159 where SCRIPT tag escaping becomes less of an issue.

JSON

In JSON SCRIPT tags, the transformation is a simple replacement of < with its Unicode escape sequence \u003C. This can be applied to the entire contents of the script or specifically in case-insensitive matches for <script and </script.

JavaScript

JavaScript SCRIPT tags are more difficult because the language has vastly more syntax. Fortunately, there is prior art described in this 2022 blog post (external) from React team member Sophie Alpert. It's the same the JavaScript SCRIPT tag contents escaping strategy that React continues to employ today. In summary, the problematic text <script and </script syntactically appear in places where Unicode escape sequences can be used in the script part (Strings, Identifiers, and RegExp literals). React takes the approach of replacing the s character, resulting in <\u0073cript or </\u0073cript, completely safe in a Script tag.

There are a few notable exceptions where the transformed JavaScript has observably different runtime behavior. These are the only examples I'm aware of. They're more esoteric parts of the language and the likelihood of them being used in inline JavaScript with the problematic text sequences seems an acceptable tradeoff to me to enable cheap, automatic JavaScript escaping.

String.raw does not process escape sequences.

'<script>' === '<\u0073cript>'; // true
String.raw`<script>` === String.raw`<\u0073cript>`; // false

Tagged templates can also access the raw strings, again a form without processing escape sequences.

function taggedCooked( strings ) {
    return strings[0];
}
taggedCooked`<script>` === taggedCooked`<\u0073cript>`; // true

function taggedRaw( strings ) {
    return strings.raw[0];
}
taggedRaw`<script>` === taggedRaw`<\u0073cript>`; // false

The source property of RegExp contains a string representation of the pattern. JavaScript RegExp support Unicode escape sequences, but the Unicode escape sequence is not transformed in the source.

const rPlain = /<script>/;
const rEscaped = /<\u0073cript>/

rPlain.test('<script>'); // true
rEscaped.test('<script>'); // true

rPlain.source === rEscaped.source; // false
rPlain.source; // '<script>'
rEscaped.source; // '<\\u0073cript>'

Any better JavaScript escaping would likely require a complete JavaScript parser and much more invasive changes. It would be much more costly to perform. Even then, I'm not sure that the escaping could be done faithfully.

String.raw() could be split and joined:

String.raw`<script>` === String.raw`<s` + String.raw`cript>`; true 

Tagged template raw and RegExp source seem much more challenging.

Trac ticket: https://core.trac.wordpress.org/ticket/64419


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

* > then let the script block's type string for this script element be "text/javascript".
*/
$type_attr = $this->get_attribute( 'type' );
$language_attr = $this->get_attribute( 'language' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this has the potential to be clearer if we did one of two things:

  • early-abort when both the type and language attributes are missing.
  • null-coalesce to some value like '' which would be semantically the same in these checks as null but allow us to treat the values as strings.

or even something like this…

$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: '';        // combine `true` and `null` into ''.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.

Comment on lines 4042 to 4049
/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for
* > the string "module", then set el's type to "module".
*
* A module is evaluated as JavaScript.
*/
case 'module':
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PhpStorm complains about having a separate case here:

Image

It suggests moving the module case up above with the others.

This is purely stylistic, I acknowledge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this as-is, it allows quotes referencing different specifications to remain separate.

@sirreal sirreal requested a review from dmsnell December 22, 2025 18:28
sirreal added a commit to sirreal/wordpress-develop that referenced this pull request Dec 29, 2025
@sirreal
Copy link
Member Author

sirreal commented Dec 29, 2025

I've created #10668 to align STYLE tag handling with this approach.

Comment on lines 3833 to 3836
* Other types of script tags cannot be escaped safely because there is
* no general escaping mechanism for arbitrary types of content.
*/
return false;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to a comment.

if we think about changing the design of the is_javascript_script_tag() methods, we could expose something like a MIME type and allow custom filtering of escapes to allow plugins to provide their own escaping routines rather than reject the updates.

for now I see no reason to block progress for this, especially since we already rejected these contents before, but I think we can imagine a situation where someone can replace their SCRIPT contents and we can perform the <script and </script checks after the filter, and only then reject if nothing was able to escape

Comment on lines 4176 to 4179
return strtr(
$text,
array( '<' => '\\u003C' )
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to conversation about this escaping choice:

this is certainly easy, but I wonder what the relative value is compared to performing the same escape as with the JavaScript elements.

my guess is that <script and </script occur far less frequently within JSON than < does, so if we perform the same escaping then we might expect considerably fewer modifications to the JSON content.

from a performance standpoint I expect the picture to be complicated and hard to quantify, as we’re comparing a relatively simple pair of string searches against string allocations and string building.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this as well. Ultimately, this seemed very reasonable, strtr is supposed to have good performance characteristics and there are no cases to worry about with <. This is close to the behavior of json_encode() with JSON_HEX_TAG, although that also escapes >.

Some other considerations:

  • We could JSON decode/encode with the appropriate flags. That's likely expensive and can modify some types of values, I think it's best avoided.
  • We could use the same escaping as JavaScript. preg_replace_callback is likely more expensive, but I don't have data on that.
  • I considered doing strtr with all the case variants, but the list of strtr case variants would require something like 128 different cases for case-insensitive <script and </script, and that seems excessive.

There is a middle ground that is worth considering. This note on strtr() is relevant (bold mine):

If given two arguments, the second should be an array in the form array('from' => 'to', ...)

The keys and the values may have any length… However, this function will be the most efficient when all the keys have the same size.

Every character added to the replacement becomes more targeted. We could use additional characters in the strtr pattern with something like the following:

function escape_json_script_contents_2( string $text ): string {
	return strtr(
		$text,
		array(
            '<s' => '\\u003Cs',
            '<S' => '\\u003CS',
            '</' => '\\u003C/',
	    )
	);
}

function escape_json_script_contents_3( string $text ): string {
	return strtr(
		$text,
		array(
            '<sc' => '\\u003Csc',
            '<Sc' => '\\u003CSc',
            '<sC' => '\\u003CsC',
            '<SC' => '\\u003CSC',
            '</s' => '\\u003C/s',
            '</S' => '\\u003C/S',
	    )
	);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these seem better than how it stands now as I think the occurrence rate of <s, <S, and </ will also be much lower than of < by itself.

I wouldn’t want us to get caught up in the weeds of performance without any metrics to guide us. We can keep in mind that these operations are going to happen relatively infrequently during a page render. Maybe they happen ten times, vs. something like tag parsing which might happen thousands of times.

So the preg_replace_callback() seems just fine to me, even though personally I think both of these methods are probably non-ideal. I still prefer inching forward on stripos() followed by strspn().

The thing is, without data we just don’t know and so who cares if we are potentially slower on a minor part of the page render. If our intention is clear we can follow-up with improvements where we can make them soundly.

Ultimately here my intuition says that the difference between the basic strtr() and the stripos()+strspn() replacements is going to fall down to the relative improvement strtr() brings against the higher rate of modifying the string. I would expect < to appear in a reasonably high fraction of strings, whereas <script and variants are unlikely to regularly appear. Both approaches are scanning through the string to perform a search, and while strtr() is surely efficient, not allocating and not performing replacement is surely more efficient.

I would caution here against piling up verbose source code to eke out minor performance improvements when we don’t know if they matter and when they obscure intent. None of your suggestions obscure intent, but I think they may also not be essential.

*
* † = Case insensitive 'script' followed by one of ' \t\f\r\n/>'
*
* The original source of this graph is included at the bottom of this file.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

}

/**
* Escape JavaScript script tag contents.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to document that this is suitable for JavaScript and JSON escaping.

return 'json';

/** @todo Rely on a full MIME parser for determining JSON content. */
case 'application/json':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also good to include here initially:

Suggested change
case 'application/json':
case 'application/json':
case 'application/ld+json':
case 'application/manifest+json':

These and other such MIME types would be handled by the parser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for these “MIME type essence” matches it’s okay to add, but I wanted to somewhat limit how much further we go parsing them until we do actual MIME parsing, otherwise we’ll just end up duplicating a bunch of effort.

perhaps we try and merge #10640 before this one and then we can eliminate the TODO as well as the special cases

@dmsnell
Copy link
Member

dmsnell commented Dec 31, 2025

@sirreal after my changes the tests could be merged, and in fact, can/should be refactored a fair amount. I think he same might be true of the long comment in the JavaScript escaping function. for now, I moved the graphviz source code into a separate file in the test directory to get it out of the class. at least I didn’t like it at the bottom since it was so disconnected from the code, and I think you didn’t like it inline because of how noisy it was already. I think in the end we could link to your blog post or to the SPEC and just let it be a little reference instead of a full explanation.

at one point I added an optional parameter which was $script_type and when passed, would be set to things like classic, module, etc… but then I took it out because we didn’t have a clear need for it, plus I figured any determination of it could be done quickly after calling get_script_content_type().

if you look at the commits, you will realize I learned that script is six letters and not five 🤦‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants