Skip to content
Open
Show file tree
Hide file tree
Changes from 45 commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
4a8ca98
Auto-escape JavaScript and JSON script tags when necessary
sirreal Dec 15, 2025
4a28ef2
Update ticket number
sirreal Dec 15, 2025
0bef687
Fix those lints
sirreal Dec 15, 2025
cdba027
No trailing function commas
sirreal Dec 15, 2025
ba54ae4
Remove JSON_THROW_ON_ERROR constant
sirreal Dec 15, 2025
a697e9e
fixup! Remove JSON_THROW_ON_ERROR constant
sirreal Dec 15, 2025
2b3d0d0
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 16, 2025
8246439
Remove JSON +json subtype handling
sirreal Dec 16, 2025
3f7f227
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 19, 2025
1362451
Add JS and JSON script tag tests
Copilot Dec 19, 2025
8d30680
Fix typo in comment
sirreal Dec 19, 2025
3b5ef4e
Improve comment
sirreal Dec 19, 2025
f8cfdf9
Clean up and fix tests
sirreal Dec 19, 2025
501d201
Clean up and improve type/language logic
sirreal Dec 19, 2025
bcc02ae
Fix svg SCRIPT tag tests
sirreal Dec 19, 2025
ea03441
Add todo on MIME type parsing
sirreal Dec 19, 2025
c7d1827
Update since tags 🤞
sirreal Dec 19, 2025
a134e82
Make is_{javascript,json}_script_tag methods private
sirreal Dec 19, 2025
d4bd4b3
Add language whitespace test
sirreal Dec 19, 2025
edae8d5
How'd that extra space get there, remove it!
sirreal Dec 19, 2025
02ca3c0
Name search parts
sirreal Dec 19, 2025
d058a78
Clean up escaping tests
sirreal Dec 19, 2025
dfb63af
Add ignore and todo tags to new private is_*_script_tag functions
sirreal Dec 19, 2025
11f51c9
Improve example comment
sirreal Dec 19, 2025
a3e0e27
Update regex comment with named groups
sirreal Dec 19, 2025
288b952
Add details to "other" failure to escape comment
sirreal Dec 19, 2025
a7495dd
Improve consistency of comment quoting and differentiate types of dou…
sirreal Dec 19, 2025
d8c320c
Describe JavaScript escaping strategy in detail
sirreal Dec 19, 2025
369eefc
Use the updated_html value in round-trip test
sirreal Dec 19, 2025
55d4e47
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 22, 2025
2ef0bf0
Fix demonstration comment
sirreal Dec 22, 2025
cb6b990
Extract and document escaping functions
sirreal Dec 22, 2025
e399bf6
Tweak workaround documentation
sirreal Dec 22, 2025
83c1fab
Add ASCII chart about HTML script tags with original source
sirreal Dec 22, 2025
1872681
Improve linking between escapes
sirreal Dec 22, 2025
83ff62f
Fix comments, typos, lints
sirreal Dec 22, 2025
5979782
fixup! Fix comments, typos, lints
sirreal Dec 22, 2025
d469ae4
fixup! fixup! Fix comments, typos, lints
sirreal Dec 22, 2025
402ae9f
Fix \c -> \r (carriage return) typo
sirreal Dec 22, 2025
6b2c9ba
Add note about not parsing MIME types for JS script tags
sirreal Dec 22, 2025
eef0ccb
Add todo comment to is_json_script_tag
sirreal Dec 22, 2025
71d2686
Re-order tag name termination chars to match elsewhere
sirreal Dec 22, 2025
d4693a2
Fix typo
sirreal Dec 22, 2025
eb4e091
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 29, 2025
4c3b0b2
Update comments on tag prefixes matching search pattern
sirreal Dec 29, 2025
bfa60cf
Use content-type identification and escape without PCRE
dmsnell Dec 30, 2025
3252160
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 30, 2025
0b9b2bd
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 30, 2025
b2533e1
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
ddeb301
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
b36fc32
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
1249af0
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
361 changes: 343 additions & 18 deletions src/wp-includes/html-api/class-wp-html-tag-processor.php
Original file line number Diff line number Diff line change
Expand Up @@ -3812,28 +3812,29 @@ public function set_modifiable_text( string $plaintext_content ): bool {
switch ( $this->get_tag() ) {
case 'SCRIPT':
/**
* This is over-protective, but ensures the update doesn't break
* the HTML structure of the SCRIPT element.
* Identify risky script contents to escape when possible or reject otherwise:
*
* More thorough analysis could track the HTML tokenizer states
* and to ensure that the SCRIPT element closes at the expected
* SCRIPT close tag as is done in {@see ::skip_script_data()}.
* - "</script" could close the SCRIPT element prematurely.
* - "<script" could enter the “script data double escaped state” and prevent the
* SCRIPT element from closing as expected.
*
* A SCRIPT element could be closed prematurely by contents
* like `</script>`. A SCRIPT element could be prevented from
* closing by contents like `<!--<script>`.
*
* The following strings are essential for dangerous content,
* although they are insufficient on their own. This trade-off
* prevents dangerous scripts from being sent to the browser.
* It is also unlikely to produce HTML that may confuse more
* basic HTML tooling.
* @see WP_HTML_Tag_Processor::escape_javascript_script_contents()
*/
if (
$needs_escaping =
false !== stripos( $plaintext_content, '</script' ) ||
false !== stripos( $plaintext_content, '<script' )
) {
return false;
false !== stripos( $plaintext_content, '<script' );
if ( $needs_escaping ) {
if ( $this->is_javascript_script_tag() ) {
$plaintext_content = $this->escape_javascript_script_contents( $plaintext_content );
} elseif ( $this->is_json_script_tag() ) {
$plaintext_content = $this->escape_json_script_contents( $plaintext_content );
} else {
/*
* Other types of script tags cannot be escaped safely because there is
* no general escaping mechanism for arbitrary types of content.
*/
return false;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to a comment.

if we think about changing the design of the is_javascript_script_tag() methods, we could expose something like a MIME type and allow custom filtering of escapes to allow plugins to provide their own escaping routines rather than reject the updates.

for now I see no reason to block progress for this, especially since we already rejected these contents before, but I think we can imagine a situation where someone can replace their SCRIPT contents and we can perform the <script and </script checks after the filter, and only then reject if nothing was able to escape

}
}

$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
Expand Down Expand Up @@ -3891,6 +3892,293 @@ static function ( $tag_match ) {
return false;
}

/**
* Indicates if the currently matched tag is a JavaScript script tag.
*
* Note that this does not parse a MIME type. This behavior is well-documented in
* in the HTML standard and uses string comparisons, *not* actual MIME Types.
*
* @see https://html.spec.whatwg.org/multipage/scripting.html#prepare-the-script-element
*
* @ignore
* @todo Consider a public API that is clear and general.
*
* @since 7.0.0
*
* @return bool True if the script tag will be evaluated as JavaScript.
*/
private function is_javascript_script_tag(): bool {
if ( 'SCRIPT' !== $this->get_tag() || $this->get_namespace() !== 'html' ) {
return false;
}

/*
* > If any of the following are true:
* > - el has a type attribute whose value is the empty string;
* > - el has no type attribute but it has a language attribute and that attribute's
* > value is the empty string; or
* > - el has neither a type attribute nor a language attribute,
* > then let the script block's type string for this script element be "text/javascript".
*/
$type_attr = $this->get_attribute( 'type' );
$language_attr = $this->get_attribute( 'language' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this has the potential to be clearer if we did one of two things:

  • early-abort when both the type and language attributes are missing.
  • null-coalesce to some value like '' which would be semantically the same in these checks as null but allow us to treat the values as strings.

or even something like this…

$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: '';        // combine `true` and `null` into ''.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.

if ( true === $type_attr || '' === $type_attr ) {
return true;
}
if (
null === $type_attr
&& ( null === $language_attr || true === $language_attr || '' === $language_attr )
) {
return true;
}

/*
* > Otherwise, if el has a type attribute, then let the script block's type string be
* > the value of that attribute with leading and trailing ASCII whitespace stripped.
* > Otherwise, el has a non-empty language attribute; let the script block's type string
* > be the concatenation of "text/" and the value of el's language attribute.
*/
$type_string = null !== $type_attr ? trim( $type_attr, " \t\f\r\n" ) : "text/{$language_attr}";

/*
* > If the script block's type string is a JavaScript MIME type essence match, then
* > set el's type to "classic".
*
* > A string is a JavaScript MIME type essence match if it is an ASCII case-insensitive
* > match for one of the JavaScript MIME type essence strings.
*
* > A JavaScript MIME type is any MIME type whose essence is one of the following:
* >
* > - application/ecmascript
* > - application/javascript
* > - application/x-ecmascript
* > - application/x-javascript
* > - text/ecmascript
* > - text/javascript
* > - text/javascript1.0
* > - text/javascript1.1
* > - text/javascript1.2
* > - text/javascript1.3
* > - text/javascript1.4
* > - text/javascript1.5
* > - text/jscript
* > - text/livescript
* > - text/x-ecmascript
* > - text/x-javascript
*
* @see https://mimesniff.spec.whatwg.org/#javascript-mime-type-essence-match
* @see https://mimesniff.spec.whatwg.org/#javascript-mime-type
*/
switch ( strtolower( $type_string ) ) {
case 'application/ecmascript':
case 'application/javascript':
case 'application/x-ecmascript':
case 'application/x-javascript':
case 'text/ecmascript':
case 'text/javascript':
case 'text/javascript1.0':
case 'text/javascript1.1':
case 'text/javascript1.2':
case 'text/javascript1.3':
case 'text/javascript1.4':
case 'text/javascript1.5':
case 'text/jscript':
case 'text/livescript':
case 'text/x-ecmascript':
case 'text/x-javascript':
return true;

/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for
* > the string "module", then set el's type to "module".
*
* A module is evaluated as JavaScript.
*/
case 'module':
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PhpStorm complains about having a separate case here:

Image

It suggests moving the module case up above with the others.

This is purely stylistic, I acknowledge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this as-is, it allows quotes referencing different specifications to remain separate.

}

/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "importmap", then set el's type to "importmap".
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "speculationrules", then set el's type to "speculationrules".
*
* These conditions indicate JSON content.
*/

/*
* > Otherwise, return. (No script is executed, and el's type is left as null.)
*/
return false;
}

/**
* Indicates if the currently matched tag is a JSON script tag.
*
* @ignore
* @todo Consider a public API that is clear and general.
* @todo Use a MIME type parser when available.
*
* @since 7.0.0
*
* @return bool True if the script tag should be treated as JSON.
*/
private function is_json_script_tag(): bool {
if ( 'SCRIPT' !== $this->get_tag() || $this->get_namespace() !== 'html' ) {
return false;
}

$type = $this->get_attribute( 'type' );
if ( null === $type || true === $type || '' === $type ) {
return false;
}
$type = strtolower( trim( $type, " \t\f\r\n" ) );

/*
* > …
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "importmap", then set el's type to "importmap".
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "speculationrules", then set el's type to "speculationrules".
* @see https://html.spec.whatwg.org/#script-processing-model
*
* > A JSON MIME type is any MIME type whose subtype ends in "+json" or whose essence
* > is "application/json" or "text/json".
*
* @todo The JSON MIME type handling handles some common cases but when MIME type parsing is available it should be leveraged here.
*
* @see https://mimesniff.spec.whatwg.org/#json-mime-type
*/
if (
'importmap' === $type ||
'speculationrules' === $type ||
'application/json' === $type ||
'text/json' === $type
) {
return true;
}

return false;
}

/**
* Escape JavaScript script tag contents.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to document that this is suitable for JavaScript and JSON escaping.

*
* Prevent JavaScript text from modifying the HTML structure of a document and
* ensure that it's contained within its enclosing SCRIPT tag as intended.
*
* JavaScript can be safely escaped with a few exceptions. This is achieved by
* replacing dangerous sequences like "<script" and "</script" with a form
* using a Unicode escape sequence "<\u0073cript>" and "</\u0073cript>".
*
* This text may appear in the JavaScript in limited ways, all of which support
* the use of Unicode escape sequences on the "s" character. The escaping is safe
* to perform in all JavaScript and the modified JavaScript maintains identical
* behavior with a few exceptions:
*
* - Comments.
* - Tagged templates like `String.raw()` that access “raw” strings.
* - The `source` property of a RegExp object.
*
* For example, this input JavaScript:
*
* // A comment: "</script>"
*
* console.log( String.raw`</script>` );
*
* const regex = /<script>/;
* console.log( regex.source );
*
* Is transformed to:
*
* // A comment: "</\u0073cript>"
*
* console.log( String.raw`</\u0073cript>` );
*
* const regex = /<\u0073cript>/;
* console.log( regex.source );
*
* Note that the RegExp's matching behavior is equivalent, meaning that
* `regex.test( '<script>' ) === true` in both the unescaped and
* escaped versions.
*
* JavaScript that relies on behavior affected by this escaping must provide
* safe script contents in order to avoid this escaping. For example, a raw string
* may be split up to make its contents safe or avoided altogether:
*
* console.log( String.raw`</script>` ); // !!UNSAFE!! Will be escaped.
* console.log( String.raw`</\u0073cript>` ); // "</\u0073cript>"
* console.log( String.raw`</scr` + String.raw`ipt>` ); // "</script>"
* console.log( String.raw`</${"script"}>` ); // "</script>"
* console.log( "\x3C/script>" ); // "</script>"
* console.log( "<\/script>" ); // "</script>"
*
* The following graph is a simplified interpretation of how HTML interprets the contents
* of a SCRIPT tag and identifies the closing tag. It is useful to understand what text
* is dangerous inside of a SCRIPT tag and why different approaches to escaping work.
*
* Open script
* │
* │
* ▼
* ╔═════════════════════════════════════════╗ <!--(…)>
* ║ ║ (all dashes)
* ║ script ║ ───────────────┐
* ║ data ║ │
* ┌────────── ║ ║ ◀──────────────┘
* │ ╚═════════════════════════════════════════╝
* │ │ ▲ ▲
* │ │ <!-- │ --> └─────┐
* │ ▼ │ │
* │ ┌─────────────────────────────────────────┐ │
* │ </script† │ escaped │ │
* │ └─────────────────────────────────────────┘ │
* │ │ ▲ │ │ -->
* │ │ </script† │ </script† │ <script† │
* │ ▼ │ ▼ │
* │ ╔══════════════╗ │ ┌───────────┐ │
* │ ║ Close script ║ │ │ double │ │
* └─────────▶ ║ ║ └────────── │ escaped │ ─┘
* ╚══════════════╝ └───────────┘
*
* † = Case insensitive 'script' followed by one of ' \t\f\r\n/>'
*
* The original source of this graph is included at the bottom of this file.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

*
* @see https://html.spec.whatwg.org/#restrictions-for-contents-of-script-elements
*/
private function escape_javascript_script_contents( string $text ): string {
return preg_replace_callback(
'~(?P<HEAD></?)(?P<S_CHAR>s)(?P<TAIL>cript[ \\t\\f\\r\\n/>])~i',
static function ( $matches ) {
$escaped_s_char = 's' === $matches['S_CHAR']
? '\\u0073'
: '\\u0053';
return "{$matches['HEAD']}{$escaped_s_char}{$matches['TAIL']}";
},
$text
);
}

/**
* Escape JSON script tag contents.
*
* Prevent JSON text from modifying the HTML structure of a document and
* ensure that it's contained within its enclosing SCRIPT tag as intended.
*
* JSON can be escaped simply by replacing "<" with its Unicode escape
* sequence "\u003C". "<" is not part of the JSON syntax and only appears
* in JSON strings, so it's always safe to escape. Furthermore, JSON does
* not allow backslash escaping of "<", so there's no need to consider
* whether the "<" is preceded by an escaping backslash.
*
* For more details, see {@see WP_HTML_Tag_Processor::escape_javascript_script_contents()}.
* @see https://www.json.org/json-en.html
*/
private function escape_json_script_contents( string $text ): string {
return strtr(
$text,
array( '<' => '\\u003C' )
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to conversation about this escaping choice:

this is certainly easy, but I wonder what the relative value is compared to performing the same escape as with the JavaScript elements.

my guess is that <script and </script occur far less frequently within JSON than < does, so if we perform the same escaping then we might expect considerably fewer modifications to the JSON content.

from a performance standpoint I expect the picture to be complicated and hard to quantify, as we’re comparing a relatively simple pair of string searches against string allocations and string building.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this as well. Ultimately, this seemed very reasonable, strtr is supposed to have good performance characteristics and there are no cases to worry about with <. This is close to the behavior of json_encode() with JSON_HEX_TAG, although that also escapes >.

Some other considerations:

  • We could JSON decode/encode with the appropriate flags. That's likely expensive and can modify some types of values, I think it's best avoided.
  • We could use the same escaping as JavaScript. preg_replace_callback is likely more expensive, but I don't have data on that.
  • I considered doing strtr with all the case variants, but the list of strtr case variants would require something like 128 different cases for case-insensitive <script and </script, and that seems excessive.

There is a middle ground that is worth considering. This note on strtr() is relevant (bold mine):

If given two arguments, the second should be an array in the form array('from' => 'to', ...)

The keys and the values may have any length… However, this function will be the most efficient when all the keys have the same size.

Every character added to the replacement becomes more targeted. We could use additional characters in the strtr pattern with something like the following:

function escape_json_script_contents_2( string $text ): string {
	return strtr(
		$text,
		array(
            '<s' => '\\u003Cs',
            '<S' => '\\u003CS',
            '</' => '\\u003C/',
	    )
	);
}

function escape_json_script_contents_3( string $text ): string {
	return strtr(
		$text,
		array(
            '<sc' => '\\u003Csc',
            '<Sc' => '\\u003CSc',
            '<sC' => '\\u003CsC',
            '<SC' => '\\u003CSC',
            '</s' => '\\u003C/s',
            '</S' => '\\u003C/S',
	    )
	);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these seem better than how it stands now as I think the occurrence rate of <s, <S, and </ will also be much lower than of < by itself.

I wouldn’t want us to get caught up in the weeds of performance without any metrics to guide us. We can keep in mind that these operations are going to happen relatively infrequently during a page render. Maybe they happen ten times, vs. something like tag parsing which might happen thousands of times.

So the preg_replace_callback() seems just fine to me, even though personally I think both of these methods are probably non-ideal. I still prefer inching forward on stripos() followed by strspn().

The thing is, without data we just don’t know and so who cares if we are potentially slower on a minor part of the page render. If our intention is clear we can follow-up with improvements where we can make them soundly.

Ultimately here my intuition says that the difference between the basic strtr() and the stripos()+strspn() replacements is going to fall down to the relative improvement strtr() brings against the higher rate of modifying the string. I would expect < to appear in a reasonably high fraction of strings, whereas <script and variants are unlikely to regularly appear. Both approaches are scanning through the string to perform a search, and while strtr() is surely efficient, not allocating and not performing replacement is surely more efficient.

I would caution here against piling up verbose source code to eke out minor performance improvements when we don’t know if they matter and when they obscure intent. None of your suggestions obscure intent, but I think they may also not be essential.

}

/**
* Updates or creates a new attribute on the currently matched tag with the passed value.
*
Expand Down Expand Up @@ -4681,3 +4969,40 @@ public function get_doctype_info(): ?WP_HTML_Doctype_Info {
*/
const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE';
}

/*
# This is the original Graphviz source for the SCRIPT tag
# parsing behavior. It's used in the documentation for
# `WP_HTML_Tag_Processor::escape_javascript_script_contents()`.
# ====
digraph {
rankdir=TB;

// Entry point
entry [shape=plaintext label="Open script"];
entry -> script_data;

// Double-circle states arranged more compactly
data [shape=doublecircle label="Close script"];
script_data [shape=doublecircle color=blue label="script\ndata"];
script_data_escaped [shape=circle color=orange label="escaped"];
script_data_double_escaped [shape=circle color=red label="double\nescaped"];

// Group related nodes on same ranks where possible
{rank=same; script_data script_data_escaped script_data_double_escaped}

script_data -> script_data [label="<!--(…)>\n(all dashes)"];
script_data -> script_data_escaped [label="<!--"];
script_data -> data [label="</script†"];

script_data_escaped -> script_data [label="-->"];
script_data_escaped -> script_data_double_escaped [label="<script†"];
script_data_escaped -> data [label="</script†"];

script_data_double_escaped -> script_data [label="-->"];
script_data_double_escaped -> script_data_escaped [label="</script†"];

label="† = Case insensitive 'script' followed by one of ' \\t\\f\\r\\n/>'";
labelloc=b;
}
*/
Loading
Loading