Terence Eden’s Blog
<p><strong>An opinionated HTML Serializer for PHP 8.4</strong></p>
<p><a href="https://shkspr.mobi/blog/2025/04/an-opinionated-html-serializer-for-php-8-4/" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">shkspr.mobi/blog/2025/04/an-op</span><span class="invisible">inionated-html-serializer-for-php-8-4/</span></a></p>
<p>A few days ago, <a href="https://shkspr.mobi/blog/2025/03/pretty-print-html-using-php-8-4s-new-html-dom/" rel="nofollow noopener noreferrer" target="_blank">I wrote a shitty pretty-printer</a> for PHP 8.4's new <a href="https://www.php.net/manual/en/class.dom-htmldocument.php" rel="nofollow noopener noreferrer" target="_blank">Dom\HTMLDocument class</a>.</p>
<p>I've since re-written it to be faster and more stylistically correct.</p>
<p>It turns this:</p>
<pre><code>&lt;html lang="en-GB"&gt;&lt;head&gt;&lt;title id="something"&gt;Test&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h1 class="top upper"&gt;Testing&lt;/h1&gt;&lt;main&gt;&lt;p&gt;Some &lt;em&gt;HTML&lt;/em&gt; and an &lt;img src="example.png" alt="Alternate Text"&gt;&lt;/p&gt;Text not in an element&lt;ol&gt;&lt;li&gt;List&lt;/li&gt;&lt;li&gt;Another list&lt;/li&gt;&lt;/ol&gt;&lt;/main&gt;&lt;/body&gt;&lt;/html&gt;</code></pre>
<p>Into this:</p>
<pre><code>&lt;!doctype html&gt;
&lt;html lang=en-GB&gt;
	&lt;head&gt;
		&lt;title id=something&gt;Test&lt;/title&gt;
	&lt;/head&gt;
	&lt;body&gt;
		&lt;h1 class="top upper"&gt;Testing&lt;/h1&gt;
		&lt;main&gt;
			&lt;p&gt;
				Some &lt;em&gt;HTML&lt;/em&gt; and an &lt;img src=example.png alt="Alternate Text"&gt;
			&lt;/p&gt;
			Text not in an element
			&lt;ol&gt;
				&lt;li&gt;List&lt;/li&gt;
				&lt;li&gt;Another list&lt;/li&gt;
			&lt;/ol&gt;
		&lt;/main&gt;
	&lt;/body&gt;
&lt;/html&gt;</code></pre>
<p>I say it is "opinionated" because it does the following:</p>
<ul>
<li>Attributes are unquoted unless necessary.</li>
<li>Every element is logically indented.</li>
<li>Text content of CSS and JS is unaltered. No pretty-printing, minification, or checking for correctness.</li>
<li>Text content of elements <em>may</em> have extra newlines and tabs. Browsers tend to collapse runs of whitespace unless the CSS tells them otherwise.<ul><li>This fucks up <code>&lt;pre&gt;</code> blocks which contain markup.</li></ul></li>
</ul>
<p>It is primarily designed to make the <em>markup</em> easy to read. Because <a href="https://libraries.mit.edu/150books/2011/05/11/1985/" rel="nofollow noopener noreferrer" target="_blank">according to the experts</a>:</p>
<blockquote><p>A computer language is not just a way of getting a computer to perform operations but rather … it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.</p></blockquote>
<p>I'm <em>fairly</em> sure this all works properly. But feel free to argue in the comments or <a href="https://gitlab.com/edent/pretty-print-html-using-php/" rel="nofollow noopener noreferrer" target="_blank">send me a pull request</a>.</p>
<p>Here's how it works.</p>
<p><strong>When is an element not an element? When it is a void!</strong></p>
<p>Modern HTML has the concept of "<a href="https://developer.mozilla.org/en-US/docs/Glossary/Void_element" rel="nofollow noopener noreferrer" target="_blank">Void Elements</a>". Normally, something like <code>&lt;a&gt;</code> <em>must</em> eventually be followed by a closing <code>&lt;/a&gt;</code>. But Void Elements don't need closing.</p>
<p>The serializer keeps a list of elements which must not be explicitly closed.</p>
<pre><code>$void_elements = [
	"area",
	"base",
	"br",
	"col",
	"embed",
	"hr",
	"img",
	"input",
	"link",
	"meta",
	"param",
	"source",
	"track",
	"wbr",
];</code></pre>
<p><strong>Tabs 🆚 Space</strong></p>
<p>Tabs, obviously. 
Users can set their tab width to their personal preference and it won't get confused with semantically significant whitespace.</p>
<pre><code>$indent_character = "\t";</code></pre>
<p><strong>Setting up the DOM</strong></p>
<p>The new HTMLDocument should be broadly familiar to anyone who has used the previous one.</p>
<pre><code>$html = '&lt;html lang="en-GB"&gt;&lt;head&gt;&lt;title id="something"&gt;Test&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h1 class="top upper"&gt;Testing&lt;/h1&gt;&lt;main&gt;&lt;p&gt;Some &lt;em&gt;HTML&lt;/em&gt; and an &lt;img src="example.png" alt="Alternate Text"&gt;&lt;/p&gt;Text not in an element&lt;ol&gt;&lt;li&gt;List&lt;/li&gt;&lt;li&gt;Another list&lt;/li&gt;&lt;/ol&gt;&lt;/main&gt;&lt;/body&gt;&lt;/html&gt;';
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );</code></pre>
<p>This automatically adds <code>&lt;head&gt;</code> and <code>&lt;body&gt;</code> elements. If you don't want that, use the <a href="https://www.php.net/manual/en/libxml.constants.php#constant.libxml-html-noimplied" rel="nofollow noopener noreferrer" target="_blank"><code>LIBXML_HTML_NOIMPLIED</code> flag</a>:</p>
<pre><code>$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );</code></pre>
<p><strong>To Quote or Not To Quote?</strong></p>
<p>Traditionally, HTML attributes needed quotes:</p>
<pre><code>&lt;img src="example.png" class="avatar no-border" id="user-123"&gt;</code></pre>
<p>Modern HTML allows those attributes to be <em>un</em>quoted as long as they don't contain <a href="https://infra.spec.whatwg.org/#ascii-whitespace" rel="nofollow noopener noreferrer" target="_blank">ASCII Whitespace</a> or <a href="https://html.spec.whatwg.org/multipage/syntax.html#unquoted" rel="nofollow noopener noreferrer" target="_blank">certain other characters</a>.</p>
<p>For example, the above becomes:</p>
<pre><code>&lt;img src=example.png class="avatar no-border" id=user-123&gt;</code></pre>
<p>This function looks for the presence of those characters:</p>
<pre><code>function value_unquoted( $haystack )
{
	// Must not contain specific characters
	$needles = [
		// https://infra.spec.whatwg.org/#ascii-whitespace
		// (tab, line feed, form feed, carriage return, space)
		"\t", "\n", "\f", "\r", " ",
		// https://html.spec.whatwg.org/multipage/syntax.html#unquoted
		"\"", "'", "=", "&lt;", "&gt;", "`"
	];
	foreach ( $needles as $needle ) {
		if ( str_contains( $haystack, $needle ) ) {
			return false;
		}
	}

	// Must not be null (the loose comparison also catches the empty string)
	if ( $haystack == null ) {
		return false;
	}
	return true;
}</code></pre>
<p><strong>Re-re-re-recursion</strong></p>
<p>I've tried to document this as best I can.</p>
<p>It traverses the DOM tree, printing out correctly indented opening elements and their attributes. If there's text content, that's printed. If an element needs closing, that's printed with the appropriate indentation.</p>
<pre><code>function serializeHTML( $node, $treeIndex = 0, $output = "")
{
	global $indent_character, $preserve_internal_whitespace, $void_elements;

	// Manually add the doctype to start.
	if ( $output == "" ) {
		$output .= "&lt;!doctype html&gt;\n";
	}

	if( property_exists( $node, "localName" ) ) {
		// This is an Element.

		// Get all the Attributes (id, class, src, &c.).
		$attributes = "";
		if ( property_exists($node, "attributes")) {
			foreach( $node->attributes as $attribute ) {
				$value = $attribute->nodeValue;
				// Only add " if the value contains specific characters.
				$quote = value_unquoted( $value ) ? "" : "\"";
				$attributes .= " {$attribute->nodeName}={$quote}{$value}{$quote}";
			}
		}
		// Print the opening element and all attributes.
		$output .= "&lt;{$node->localName}{$attributes}&gt;";
	} else if( property_exists( $node, "nodeName" ) && $node->nodeName == "#comment" ) {
		// Comment
		$output .= "&lt;!-- {$node->textContent} --&gt;";
	}

	// Increase indent.
	$treeIndex++;
	$tabStart = "\n" . str_repeat( $indent_character, $treeIndex );
	$tabEnd = "\n" . str_repeat( $indent_character, $treeIndex - 1);

	// Does this node have children?
	if( property_exists( $node, "childElementCount" ) && $node->childElementCount > 0 ) {
		// Loop through the children.
		$i=0;
		while( $childNode = $node->childNodes->item( $i++ ) ) {
			// Is this a text node?
			if ($childNode->nodeType == 3 ) {
				// Only print output if there's no HTML inside the content.
				// Ignore Void Elements.
				if ( !str_contains( $childNode->textContent, "&lt;" ) && property_exists( $childNode, "localName" ) && !in_array( $childNode->localName, $void_elements ) ) {
					$output .= $tabStart . $childNode->textContent;
				}
			} else {
				$output .= $tabStart;
			}
			// Recursively indent all children.
			$output = serializeHTML( $childNode, $treeIndex, $output );
		};
		// Suffix with a "\n" and a suitable number of "\t"s.
		$output .= "{$tabEnd}";
	} else if ( property_exists( $node, "childElementCount" ) && property_exists( $node, "innerHTML" ) ) {
		// If there are no children and the node contains content, print the contents.
		$output .= $node->innerHTML;
	}

	// Close the element, unless it is a void.
	if( property_exists( $node, "localName" ) && !in_array( $node->localName, $void_elements ) ) {
		$output .= "&lt;/{$node->localName}&gt;";
	}

	// Return a string of fully indented HTML.
	return $output;
}</code></pre>
<p><strong>Print it out</strong></p>
<p>The serialized string hardcodes the <code>&lt;!doctype html&gt;</code>, which is probably fine. The full HTML is shown with:</p>
<pre><code>echo serializeHTML( $dom->documentElement );</code></pre>
<p><strong>Next Steps</strong></p>
<p>Please <a href="https://gitlab.com/edent/pretty-print-html-using-php/" rel="nofollow noopener noreferrer" target="_blank">raise any issues on GitLab</a> or leave a comment.</p>
<p><a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/howto/" target="_blank">#HowTo</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/html5/" target="_blank">#HTML5</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/php/" target="_blank">#php</a></p>
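<p>As a footnote to the quoting rule in "To Quote or Not To Quote?", here is a minimal sketch of the same check written with PHP's built-in <code>strpbrk()</code> instead of a loop. The function name <code>needs_quotes()</code> is hypothetical and not part of the serializer. It answers the opposite question to <code>value_unquoted()</code>, and it assumes the value is always a string.</p>
<pre><code>// Hypothetical helper, not part of the original serializer.
// Returns true when an attribute value has to be wrapped in quotes.
function needs_quotes( string $value ): bool
{
	// An empty value cannot be written unquoted.
	if ( $value === "" ) {
		return true;
	}
	// ASCII whitespace (tab, LF, FF, CR, space) plus the characters
	// which the HTML syntax forbids in unquoted attribute values.
	return strpbrk( $value, "\t\n\f\r \"'=&lt;&gt;`" ) !== false;
}

var_dump( needs_quotes( "example.png" ) );      // bool(false)
var_dump( needs_quotes( "avatar no-border" ) ); // bool(true)
var_dump( needs_quotes( "" ) );                 // bool(true)</code></pre>
<p>Whether one <code>strpbrk()</code> call reads better than an explicit list of needles is a matter of taste. The explicit list keeps the two spec links next to the characters they justify.</p>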