XMLUtils
| Repository | https://github.com/BPBecker/XMLUtils |
| Copyright | Copyright © 2025 Brian P Becker |
Overview
XMLUtils is a Dyalog APL namespace containing utilities designed to:
- Convert HTML to XHTML. XHTML is essentially HTML that is rewritten to follow the stricter rules of XML. XHTML can be processed by Dyalog APL's ⎕XML system function to produce a matrix form of the content. This matrix form lends itself to easier manipulation in APL.
- Manipulate matrix form XML data. There are utilities to search, select, insert, replace, and delete elements in the XML matrix data.
Note
While XMLUtils itself is ⎕IO and ⎕ML insensitive, the examples in this documentation assume an environment of (⎕IO ⎕ML)←1.
Obtaining XMLUtils
XMLUtils is a single namespace contained in a single APL source file. There are several ways to obtain XMLUtils.
- Released versions are available on GitHub.
- XMLUtils is also available as a Tatin package: bpbecker-XMLUtils.
- The latest "pre-release" version, if any, can be found in the master branch of https://github.com/BPBecker/XMLUtils
- You can also use the ]get user command to load XMLUtils into your active workspace.
]get https://github.com/bpbecker/XMLUtils/raw/refs/heads/master/Source/XMLUtils.apln
Motivation
XMLUtils started as a simple project in a namespace named xhtml. I occasionally found data in web pages that I wanted to import into my workspace for further processing. Because HTML has a "looser" rule set than XHTML, a lot of times I couldn't simply run the HTML through ⎕XML. And thus was born HTMLtoXHTML - a utility that attempts to convert HTML, even malformed HTML, into XHTML and then use ⎕XML to return the data in matrix form.
Once I've got the XML matrix, what can I do with it? That's where the other part of XMLUtils comes in: utilities to search and select elements in the XML matrix.
For example, let's suppose I wanted to find all the stylesheets that are linked to the Dyalog.com home page. (Note that the examples below reflect the content of the Dyalog home page at the time of this writing - it may change.)
First, I'll use HttpCommand to grab the HTML for the page
]load HttpCommand
html←(HttpCommand.Get 'dyalog.com').Data
50↑html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans
Next, I'll convert it to XML using HTMLtoXHTML
≢xml←XMLUtils.HTMLtoXHTML html
308
Now I'll use Xfind to find all of the <link> elements that have a rel=stylesheet attribute
+/matches←xml XMLUtils.Xfind '//link//rel/stylesheet'
4
I can use Xselect to select those elements.
≢elements←xml XMLUtils.Xselect matches
4
Finally, I can get the URLs for all of the stylesheets using the href attribute.
⍪urls←elements XMLUtils.Xattr 'href'
https://fonts.googleapis.com/css2?family=Exo+2:wght@400;700&display=swap
//cdn.dyalog.com/css/apl385.css
/uploads/css/jquery.fancybox.css
https://www.dyalog.com/tmp/cache/stylesheet_combined_46f9f0f9723b06cff2425d65f29f50ab.css
Combining it all...
urls←(xml XMLUtils.Xselect xml XMLUtils.Xfind'//link//rel/stylesheet')XMLUtils.Xattr 'href'
HTML to XHTML Conversion
HTML follows a much "looser" set of rules than XHTML. For instance:
- Optional Tags: You can omit closing tags like </li>, </p>, or even </html> and </body> and browsers will still render the page correctly.
- Unquoted Attributes: HTML allows attributes without quotes if the value doesn’t contain spaces or special characters.
- Case Insensitivity: Tags and attributes aren’t case-sensitive: <BODY> and <body> are treated the same.
- "Tag Soup" Tolerance: Browsers use error-correcting algorithms to interpret malformed markup, often guessing what the author intended.
HTML to XHTML conversion may not "round trip". This means that if you apply ⎕XML to the XML result of the conversion, the resulting XHTML might not map identically to the original HTML.
Note
Be mindful that badly malformed HTML may not be parseable by either HTMLtoXHTML or HTMLtoXHTML2.
For instance, '<a> <b </a>' will cause an error with either function.
HTMLtoXHTML
HTMLtoXHTML uses ⎕XML to parse HTML. When ⎕XML trips over a non-compliant element, HTMLtoXHTML parses the error message, takes corrective measures, and tries again. This continues until either the conversion completes or it encounters a situation it cannot handle. If you run into the latter, please log a GitHub issue.
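To see why a repair loop is needed at all, the following sketch feeds ordinary HTML straight to ⎕XML and traps the failure. It does not reproduce HTMLtoXHTML's corrective logic. TryXML is a hypothetical helper name used only for this illustration, and the text of the error message is not shown because it may differ between Dyalog versions.

```apl
⍝ TryXML is a hypothetical helper, not part of XMLUtils.
⍝ It returns the ⎕XML matrix on success, or the trapped error message on failure.
TryXML←{0::'⎕XML failed: ',⎕DMX.Message ⋄ ⎕XML ⍵}

TryXML '<p>one<br/>two</p>'     ⍝ well-formed XHTML: returns a 5-column matrix
TryXML '<p>one<br>two</p>'      ⍝ acceptable HTML, but <br> is never closed: returns the error text
```

The second call is the kind of input that HTMLtoXHTML is meant to repair before handing it back to ⎕XML.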
| Syntax | xml ← {verbose} HTMLtoXHTML html |
| html | A character vector representing the HTML to be converted |
| verbose | (optional) A Boolean where a value of 1 displays the error messages that ⎕XML generates during processing |
| xml | The ⎕XML 5-column matrix result representing the converted HTML |
| Examples | HTMLtoXHTML '<h2 id="foo">Hello</h2>' |
HTMLtoXHTML2
HTMLtoXHTML2 uses HTMLRenderer and the JavaScript XMLSerializer object to generate its XHTML result by:
- creating an invisible HTMLRenderer with the supplied HTML argument
- injecting a WebSocket to communicate back to APL
- executing the JavaScript to parse the HTML
- sending the result back to APL over the WebSocket
- finally closing the HTMLRenderer
| Syntax | xml ← HTMLtoXHTML2 html |
| html | A character vector representing the HTML to be converted |
| xml | The ⎕XML 5-column matrix result representing the converted HTML |
| Examples | HTMLtoXHTML2 '<h2 id="foo">Hello</h2>' |
Differences between HTMLtoXHTML and HTMLtoXHTML2
As you can see from the examples above, HTMLtoXHTML and HTMLtoXHTML2 return somewhat different results. In general, the core HTML that is converted will be structurally the same.
- HTMLtoXHTML2 requires HTMLRenderer, which might not be installed or enabled.
- HTMLtoXHTML2 adds <html>, <head> and <body> elements if they don't exist in the source HTML. This may produce different level numbers from HTMLtoXHTML.
- HTMLtoXHTML and HTMLtoXHTML2 may return slightly different results when encountering "sloppy" or malformed HTML.
XMLUtils.HTMLtoXHTML '<p><span>hello</p>there</span>world'
┌─┬────┬──────────┬───┬─┐
│0│p │ │ │7│
├─┼────┼──────────┼───┼─┤
│1│span│hello │ │5│
├─┼────┼──────────┼───┼─┤
│1│ │thereworld│ │4│
└─┴────┴──────────┴───┴─┘
XMLUtils.HTMLtoXHTML2 '<p><span>hello</p>there</span>world'
┌─┬────┬──────────┬────────────────────────────────────┬─┐
│0│html│ │┌─────┬────────────────────────────┐│3│
│ │ │ ││xmlns│http://www.w3.org/1999/xhtml││ │
│ │ │ │└─────┴────────────────────────────┘│ │
├─┼────┼──────────┼────────────────────────────────────┼─┤
│1│head│ │ │1│
├─┼────┼──────────┼────────────────────────────────────┼─┤
│1│body│ │ │7│
├─┼────┼──────────┼────────────────────────────────────┼─┤
│2│p │ │ │3│
├─┼────┼──────────┼────────────────────────────────────┼─┤
│3│span│hello │ │5│
├─┼────┼──────────┼────────────────────────────────────┼─┤
│2│ │thereworld│ │4│
└─┴────┴──────────┴────────────────────────────────────┴─┘
XML Utilities
In addition to HTML conversion, XMLUtils has a number of utilities to search and manipulate ⎕XML's 5-column matrix format. This page presents only descriptions of the utilities. Examples are found on the examples page.
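As a quick orientation, the sketch below builds a small matrix with ⎕XML itself and labels its five columns, using the same headings as the examples elsewhere in this documentation. The XML string is made up for illustration.

```apl
m←⎕XML '<h2 id="foo">Hello</h2>'    ⍝ well-formed XML, so ⎕XML can parse it directly
⍴m                                   ⍝ one row per element, five columns
'Level' 'Element' 'Content' 'Attributes' 'Type'⍪m
⎕XML m                               ⍝ applied to a matrix, ⎕XML converts it back to text
```

Every utility described here takes or returns data in this shape.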
Xfind
Xfind finds XML elements matching criteria you specify. The criteria map closely to the columns of a ⎕XML matrix.
| Syntax | bv ← xml Xfind criteria |
| criteria | A delimited character vector where each segment is a filter to be applied to a column of xml. The first character in criteria is used as the delimiter. Empty segments are not used in the search. All textual searches are case insensitive. The segments are given in this order: specs ← '/level/element/content/attribute/attributeValue/type' |
| xml | A 5-column ⎕XML matrix |
| bv | A Boolean vector marking rows in xml that match the search criteria in criteria |
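The sketch below illustrates the two ideas in the criteria description: the first character acts as the delimiter, and each non-empty segment filters one column. It is not the Xfind source, it assumes (⎕IO ⎕ML)←1, and unlike Xfind the hand-written filter is case sensitive. The names criteria, segs and m, and the sample XML, are made up for the illustration.

```apl
⍝ Illustration only; this is not how Xfind is implemented. Assumes (⎕IO ⎕ML)←1.
criteria←'//link//rel/stylesheet'
segs←1↓¨(criteria=⊃criteria)⊂criteria   ⍝ split at each delimiter, then drop the delimiter
≢segs                                    ⍝ five segments: level, element, content, attribute, attributeValue
segs[2 4 5]                              ⍝ the non-empty ones: link, rel, stylesheet

⍝ A single element filter applied by hand to column 2 of a ⎕XML matrix
m←⎕XML '<doc><item>x</item><note>y</note><item>z</item></doc>'
m[;2]∊⊂'item'                            ⍝ Boolean vector marking the <item> rows
```

The last line has the same form as the bv result described above: one Boolean per row of the matrix.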
Xselect
Having found XML elements using Xfind, you can use Xselect to select them and optionally their immediate descendants.
| Syntax | elements ← {kids} (xml Xselect) bv |
| bv | A Boolean vector with as many elements as rows in xml. 1's indicate an element to retrieve. |
| xml | A 5-column ⎕XML matrix |
| kids | An optional Boolean scalar indicating whether to include descendant elements. In the examples in this documentation, descendants are included when kids is omitted and excluded when kids is 0. |
| elements | A vector containing +/bv elements, each of which contains the xml element corresponding to a 1 in bv and, if kids is 1, its immediately following descendant elements. |
Xattr
Xattr retrieves the values of all attributes in xml that match the attribute name(s) you supply.
| Syntax | values ← {ic} (xml Xattr) attr |
| attr | A vector of space-separated attribute names, or a vector of such vectors. |
| xml | A single 5-column ⎕XML-style matrix or vector of such matrices |
| ic | An optional Boolean indicating whether to perform a case-insensitive comparison on the attribute names. |
| values | The shape of values depends on the shapes of xml and attr. |
| Notes | XML and XHTML attribute names are case sensitive, while HTML attribute names are not. |
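For orientation, this is roughly what an attribute lookup involves when done by hand on a single row. It is a sketch rather than Xattr's implementation, it assumes (⎕IO ⎕ML)←1, and the sample element is made up. The comparison here is case sensitive, which is the situation the ic argument addresses.

```apl
⍝ Illustration only; Xattr itself handles several names and several matrices.
m←⎕XML '<img src="logo.png" alt="Logo"/>'
attrs←⊃m[1;4]                   ⍝ column 4 holds a 2-column matrix of names and values
⍴attrs                          ⍝ one row per attribute
⊃attrs[attrs[;1]⍳⊂'alt';2]      ⍝ look up the value stored against the name alt
```

Looking up a name that is not present would signal an INDEX ERROR in this sketch, so real code needs a guard for missing attributes.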
Xattrval
Xattrval is a shortcut function that essentially does an Xfind, Xselect, and Xattr.
| Syntax | values ← {ic} (xml Xattrval) criteria |
| criteria | The same as criteria described in Xfind above: a delimited character vector where each segment is a filter to be applied to a column of xml. Empty segments are not used in the search. All textual searches are case insensitive. The segments are given in this order: specs ← '/level/element/content/attribute/attributeValue/type' |
| xml | A single 5-column ⎕XML matrix |
| ic | An optional Boolean indicating whether to perform a case-insensitive comparison on the attribute names. |
| values | The shape of values depends on the shapes of xml and attr. |
| Notes | XML and XHTML attribute names are case sensitive, while HTML attribute names are not. |
Xdelete
Xdelete deletes elements, and optionally their descendants, from an XML structure.
| Syntax | r ← {kids} (xml Xdelete) criteria |
| criteria | Either the same as criteria described in Xfind above or a Boolean vector marking the elements to be deleted |
| xml | Either a single 5-column ⎕XML matrix, or a string containing valid XML |
| kids | An optional Boolean indicating whether to delete just the elements specified in criteria or their descendants as well. In the examples in this documentation, descendants are deleted when kids is omitted and kept when kids is 0. |
| r | The resulting 5-column ⎕XML matrix after the elements have been deleted |
Xinsert
Xinsert inserts elements after selected elements.
| Syntax | r ← ins (xml Xinsert) elements |
| elements | Either the same as criteria described in Xfind above or a Boolean vector marking the elements after which the new content will be inserted |
| xml | Either a single 5-column ⎕XML matrix, or a string containing valid XML |
| ins | Either a single 5-column ⎕XML matrix, or a string containing valid XML to be inserted |
| r | The resulting 5-column ⎕XML matrix after the elements have been inserted |
Xreplace
Xreplace replaces XML elements and their descendants.
| Syntax | r ← repl (xml Xreplace) elements |
| elements | Either the same as criteria described in Xfind above or a Boolean vector marking the elements to be replaced |
| xml | Either a single 5-column ⎕XML matrix, or a string containing valid XML |
| repl | Either a single 5-column ⎕XML matrix, or a string containing valid XML, to be used as the replacement |
| r | The resulting 5-column ⎕XML matrix after the elements have been replaced |
Examples
Manipulating XML
The examples below all use the following XML:
x ← '<a class="top">Hello<b id="foo">World<c class="middle blue">!<b/></c><d/></b><b class="blue">How</b>Are You?</a>'
xml ← XMLUtils.HTMLtoXHTML x
⎕XML xml
<a class="top">
Hello
<b id="foo">
World
<c class="middle blue">
!
<b></b>
</c>
<d></d>
</b>
<b class="blue">How</b>
Are You?
</a>
'Level' 'Element' 'Content' 'Attributes' 'Type'⍪xml
┌─────┬───────┬────────┬───────────────────┬────┐
│Level│Element│Content │Attributes │Type│
├─────┼───────┼────────┼───────────────────┼────┤
│0 │a │ │┌─────┬───┐ │7 │
│ │ │ ││class│top│ │ │
│ │ │ │└─────┴───┘ │ │
├─────┼───────┼────────┼───────────────────┼────┤
│1 │ │Hello │ │4 │
├─────┼───────┼────────┼───────────────────┼────┤
│1 │b │ │┌──┬───┐ │7 │
│ │ │ ││id│foo│ │ │
│ │ │ │└──┴───┘ │ │
├─────┼───────┼────────┼───────────────────┼────┤
│2 │ │World │ │4 │
├─────┼───────┼────────┼───────────────────┼────┤
│2 │c │ │┌─────┬───────────┐│7 │
│ │ │ ││class│middle blue││ │
│ │ │ │└─────┴───────────┘│ │
├─────┼───────┼────────┼───────────────────┼────┤
│3 │ │! │ │4 │
├─────┼───────┼────────┼───────────────────┼────┤
│3 │b │ │ │1 │
├─────┼───────┼────────┼───────────────────┼────┤
│2 │d │ │ │1 │
├─────┼───────┼────────┼───────────────────┼────┤
│1 │b │How │┌─────┬────┐ │5 │
│ │ │ ││class│blue│ │ │
│ │ │ │└─────┴────┘ │ │
├─────┼───────┼────────┼───────────────────┼────┤
│1 │ │Are You?│ │4 │
└─────┴───────┴────────┴───────────────────┴────┘
Xfind, Xselect and Xattr
- Find all the <b> elements (and their descendants)
⊢bv ← xml XMLUtils.Xfind '//b'
0 0 1 0 0 0 1 0 1 0
xml XMLUtils.Xselect bv
┌─────────────────────────────────┬────────────┬────────────────────────┐
│┌─┬─┬─────┬───────────────────┬─┐│┌─┬─┬┬───┬─┐│┌─┬─┬───┬────────────┬─┐│
││1│b│ │┌──┬───┐ │7│││3│b││ │1│││1│b│How│┌─────┬────┐│5││
││ │ │ ││id│foo│ │ ││└─┴─┴┴───┴─┘││ │ │ ││class│blue││ ││
││ │ │ │└──┴───┘ │ ││ ││ │ │ │└─────┴────┘│ ││
│├─┼─┼─────┼───────────────────┼─┤│ │└─┴─┴───┴────────────┴─┘│
││2│ │World│ │4││ │ │
│├─┼─┼─────┼───────────────────┼─┤│ │ │
││2│c│ │┌─────┬───────────┐│7││ │ │
││ │ │ ││class│middle blue││ ││ │ │
││ │ │ │└─────┴───────────┘│ ││ │ │
│├─┼─┼─────┼───────────────────┼─┤│ │ │
││3│ │! │ │4││ │ │
│└─┴─┴─────┴───────────────────┴─┘│ │ │
└─────────────────────────────────┴────────────┴────────────────────────┘
- Find all the <b> elements (but not including their descendants)
0 (xml XMLUtils.Xselect) bv
┌─────────────────┬────────────┬────────────────────────┐
│┌─┬─┬┬────────┬─┐│┌─┬─┬┬───┬─┐│┌─┬─┬───┬────────────┬─┐│
││1│b││┌──┬───┐│7│││3│b││ │1│││1│b│How│┌─────┬────┐│5││
││ │ │││id│foo││ ││└─┴─┴┴───┴─┘││ │ │ ││class│blue││ ││
││ │ ││└──┴───┘│ ││ ││ │ │ │└─────┴────┘│ ││
│└─┴─┴┴────────┴─┘│ │└─┴─┴───┴────────────┴─┘│
└─────────────────┴────────────┴────────────────────────┘
- Find the "class" and "id" attributes for all <b> elements, but not their descendants
(0 (xml XMLUtils.Xselect) bv) XMLUtils.Xattr 'class id'
┌────────┬───┬─────────┐
│┌┬─────┐│┌┬┐│┌──────┬┐│
│││┌───┐│││││││┌────┐│││
││││foo│││││││││blue││││
│││└───┘│││││││└────┘│││
│└┴─────┘│└┴┘│└──────┴┘│
└────────┴───┴─────────┘
HTMLtoXHTML and Xattrval
Find all of the linked resources on a web page. Linked resources and pages use the href attribute while images use the src attribute. HTMLtoXHTML2 could also be used instead of HTMLtoXHTML.
⍝ get the HTML content of the page
]load HttpCommand
html ← (HttpCommand.Get 'dyalog.com').Data
⍝ convert to XML (XHTML); you could also use HTMLtoXHTML2
⍴ xml ← XMLUtils.HTMLtoXHTML html
308 5
⍝ find all href and src attribute values
≢¨ linked ← xml XMLUtils.Xattrval '////href src'
95 18
There are 95 href and 18 src values.
Xinsert
Insert a <b> element after the <a> element.
'<b/>'('<a><c/></a>'XMLUtils.Xinsert)'//a'
┌─┬─┬┬───┬─┐
│0│a││ │3│
├─┼─┼┼───┼─┤
│1│b││ │1│
├─┼─┼┼───┼─┤
│1│c││ │1│
└─┴─┴┴───┴─┘
Xdelete
Using xml from above, delete all <c> and <d> elements and their descendants:
xml XMLUtils.Xdelete xml XMLUtils.Xfind '//c d'
┌─┬─┬────────┬────────────┬─┐
│0│a│ │┌─────┬───┐ │7│
│ │ │ ││class│top│ │ │
│ │ │ │└─────┴───┘ │ │
├─┼─┼────────┼────────────┼─┤
│1│ │Hello │ │4│
├─┼─┼────────┼────────────┼─┤
│1│b│ │┌──┬───┐ │7│
│ │ │ ││id│foo│ │ │
│ │ │ │└──┴───┘ │ │
├─┼─┼────────┼────────────┼─┤
│2│ │World │ │4│
├─┼─┼────────┼────────────┼─┤
│1│b│How │┌─────┬────┐│5│
│ │ │ ││class│blue││ │
│ │ │ │└─────┴────┘│ │
├─┼─┼────────┼────────────┼─┤
│1│ │Are You?│ │4│
└─┴─┴────────┴────────────┴─┘
Now, do the same, but do not delete the descendants:
0 (xml XMLUtils.Xdelete) xml XMLUtils.Xfind '//c d'
┌─┬─┬────────┬────────────┬─┐
│0│a│ │┌─────┬───┐ │7│
│ │ │ ││class│top│ │ │
│ │ │ │└─────┴───┘ │ │
├─┼─┼────────┼────────────┼─┤
│1│ │Hello │ │4│
├─┼─┼────────┼────────────┼─┤
│1│b│ │┌──┬───┐ │7│
│ │ │ ││id│foo│ │ │
│ │ │ │└──┴───┘ │ │
├─┼─┼────────┼────────────┼─┤
│2│ │World │ │4│
├─┼─┼────────┼────────────┼─┤
│3│ │! │ │4│
├─┼─┼────────┼────────────┼─┤
│3│b│ │ │1│
├─┼─┼────────┼────────────┼─┤
│1│b│How │┌─────┬────┐│5│
│ │ │ ││class│blue││ │
│ │ │ │└─────┴────┘│ │
├─┼─┼────────┼────────────┼─┤
│1│ │Are You?│ │4│
└─┴─┴────────┴────────────┴─┘
Xreplace
Replace the element with id="foo" (and its descendants) with new elements.
'<new>This is new</new>' (xml XMLUtils.Xreplace) xml XMLUtils.Xfind '////id/foo'
┌─┬───┬───────────┬────────────┬─┐
│0│a │ │┌─────┬───┐ │7│
│ │ │ ││class│top│ │ │
│ │ │ │└─────┴───┘ │ │
├─┼───┼───────────┼────────────┼─┤
│1│ │Hello │ │4│
├─┼───┼───────────┼────────────┼─┤
│1│new│This is new│ │5│
├─┼───┼───────────┼────────────┼─┤
│1│b │How │┌─────┬────┐│5│
│ │ │ ││class│blue││ │
│ │ │ │└─────┴────┘│ │
├─┼───┼───────────┼────────────┼─┤
│1│ │Are You? │ │4│
└─┴───┴───────────┴────────────┴─┘
About
MIT License
Copyright (c) 2021 Dyalog
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Release Notes
Version 1.0
- Initial release on BPBecker (moved and updated from Dyalog/bb)
Version 1.1
- Added "lastChance" which attempts to repair a damaged tag when all else has failed.