XML/XSL Portal



Saxon diaries tag:dev.saxonica.com,2012-01-16:/blog/mike//3 2018-04-13T09:50:55Z Michael Kay's blog Movable Type 5.13-en Navigating XML trees using Java Streams tag:dev.saxonica.com,2018:/blog/mike//3.224 2018-04-13T08:31:44Z 2018-04-13T09:50:55Z Navigating XML trees using Java Streams For the next major Saxon release I am planning an extension to the s9api interface to exploit the facilities of Java 8 streams to allow powerful navigation of XDM trees: the idea is... Michael Kay <h2><font style="font-size: 1.25em;"><b>Navigating XML trees using Java Streams</b></font></h2> <p>For the next major Saxon release I am planning an extension to the s9api interface to exploit the facilities of Java 8 streams to allow powerful navigation of XDM trees: the idea is that navigation should be as easy as using XPath, but without the need to drop out of Java into a different programming language. To give a flavour, here is how you might select the elements within a document that have @class='hidden':</p> <p><code>doc.select(descendant(isElement())</code><br/><code>&nbsp; &nbsp;.where(attribute("class").eq("hidden")))</code></p> <p>We'll see how that works in due course.</p> <h2><font style="font-size: 1.25em;"><b>Why do we need it?</b></font></h2> <p>The combination of Java and XML is as powerful and ubiquitous today as it has been for nearly twenty years. Java has moved on considerably (notably, as far as this article is concerned, with the Java 8 Streams API), and the world of XML processing has also made great strides (we now have XSLT 3.0, XPath 3.1, and XQuery 3.1), but for some reason the two have not moved together. The bulk of Java programmers manipulating XML, if we can judge from the questions they ask on forums such as StackOverflow, are still using DOM interfaces, perhaps with a bit of XPath 1.0 thrown in.</p> <p>DOM shows its age. It was originally designed for HTML, with XML added as an afterthought, and XML namespaces thrown in as a subsequent bolt-on. 
Its data model predates the XML Infoset and the (XPath-2.0-defined) XDM model. It was designed as a cross-language API and so the designers deliberately eschewed the usual Java conventions and interfaces in areas such as the handling of collections and iterators, not to mention exceptions. It does everything its own way. As a navigational API it carries a lot of baggage because the underlying tree is assumed to be mutable. Many programmers only discover far too late that it's not even thread-safe (even when you confine yourself to retrieval-only operations).</p><p>There are better APIs than DOM available (for example JDOM2 and XOM) but they're all ten years old and haven't caught up with the times. There's nothing in the Java world that compares with Linq for C# users, or ElementTree in Python.</p> <p>The alternative of calling out from Java to execute XPath or XQuery expressions has its own disadvantages. Any crossing of boundaries from one programming language to another involves data conversions and a loss of type safety. Embedding a sublanguage in the form of character strings within a host language (as with SQL and regular expressions) means that the host language compiler can't do any static syntax checking or type checking of the expressions in the sublanguage. Unless users go to some effort to avoid it, it's easy to find that the cost of compiling XPath expressions is incurred on each execution, rather than being incurred once and amortized. And the API for passing context from the host language to the sublanguage can be very messy. It doesn't have to be quite as messy as the JAXP interface used for invoking XPath from Java, but it still has to involve a fair bit of complexity.</p> <p>Of course, there's the alternative of not using Java (or other general-purpose programming languages) at all: you can write the whole application in XSLT or XQuery. 
Given the capability that XSLT 3.0 and XQuery 3.1 have acquired, that's a real possibility far more often than most users realise. But it remains true that if only 10% of your application is concerned with processing XML input, and the rest is doing something more interesting, then writing the whole application in XQuery would probably be a poor choice.</p> <p>Other programming languages have developed better APIs. Javascript has JQuery, C# programmers have Linq, Scala programmers have something very similar, and PHP users have SimpleXML. These APIs all have the characteristic that they are much more deeply integrated into the host language, and in particular they exploit the host language primitives for manipulation of sequences through functional programming constructs, with a reasonable level of type safety given that the actual structure of the XML document is not statically known.</p> <p>That leads to the question of data binding interfaces: broadly, APIs that exploit static knowledge of the schema of the source document. Such APIs have their place, but I'm not going to consider them any further in this article. In my experience they can work well if the XML schema is very simple and very stable. If the schema is complex or changing, data binding can be a disaster.</p> <section> <h2><font style="font-size: 1.25em;"><b>The Java 8 Streams API</b></font></h2> <p>This is not the place for an extended tutorial on the new Streams API introduced in Java 8. If you haven't come across it, I suggest you find a good tutorial on the web and read it before you go any further.</p> <aside>Java Streams are quite unrelated to XSLT 3.0 streaming. Well, almost unrelated: they share the same high-level objectives of processing large collections of data in a declarative way, making maximum use of lazy evaluation to reduce memory use, and permitting parallel execution. But that's where the similarity ends. 
Perhaps the biggest difference is that Java 8 streams are designed to process linear data structures (sequences), whereas XSLT 3.0 streaming is designed to process trees.</aside><aside><br /></aside> <p>But just to summarise:</p> <ul> <li>Java 8 introduces a new interface, <code>Stream&lt;X&gt;</code>, representing a linear sequence of items of type <code>X</code></li> <li>Like iterators, streams are designed to be used once. Unlike iterators, they are manipulated using functional operations, most notably maps and filters, rather than being processed one item at a time. This makes for less error-prone programming, and allows parallel execution.</li> </ul> <p>The functional nature of the Java 8 Streams API means it has much in common with the processing model of XPath. The basic thrust of the API design presented in this article is therefore to reproduce the primitives of the XPath processing model, re-expressing them in terms of the constructs provided by the Java 8 Streams API.</p> <p>If the design appears to borrow concepts from other APIs such as LINQ and Scala and SimpleXML, that's not actually because I have a deep familiarity with those APIs: in fact, I have never used them in anger, and I haven't attempted to copy anything across literally. Rather, any similarity is because the functional concepts of XPath processing map so cleanly to this approach.</p> </section> <section> <h2><font style="font-size: 1.25em;"><b>The Basics of the Saxon s9api API</b></font></h2> <p>The Saxon product primarily exists to enable XSLT, XQuery, XPath, and XML Schema processing. Some years ago I decided that the standard APIs (JAXP and XQJ) for invoking such functionality were becoming unfit for purpose. They had grown haphazardly over the years, the various APIs didn't work well together, and they weren't being updated to exploit the newer versions of the W3C specifications. 
Some appalling design mistakes had been unleashed on the world, and the strict backwards compatibility policy of the JDK meant these could never be corrected. </p> <aside>To take one horrid example: the <code>NamespaceContext</code> interface is used to pass a set of namespace bindings from a Java application to an XPath processor. To implement this interface, you need to implement three methods, of which the XPath processor will only ever use one (<code>getNamespaceURI(prefix)</code>). Yet at the same time, there is no way the XPath processor can extract the full set of bindings defined in the <code>NamespaceContext</code> and copy them into its own data structures.</aside><aside><br /></aside> <p>So I decided some years ago to introduce a proprietary alternative called <b>s9api</b> into the Saxon product (retaining JAXP support alongside), and it has been a considerable success, in that it has withstood the test of time rather well. The changes to XSLT transformation in 3.0 were sufficiently radical that I forked the <code>XsltTransformer</code> interface to create a 3.0 version, but apart from that it has been largely possible to add new features incrementally. That's partly because of a slightly less obsessive attitude to backwards compatibility: if I decide that something was a bad mistake, I'm prepared to change it.</p> <p>Although s9api is primarily about invoking XSLT, XQuery, and XPath processing, it does include classes that represent objects in the XDM data model, and I will introduce these briefly because the new navigation API relies on these objects as its foundation. The table below lists the main classes.</p> <table> <thead> <tr> <th>Class</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td valign="top"><code>XdmValue</code></td> <td>Every value in the XDM model is a sequence of items. The <code>XdmValue</code> class is therefore the top of the class hierarchy. 
Because it's a sequence, it implements <code>Iterable&lt;XdmItem&gt;</code>, so you can use a Java <code>foreach</code> loop to process the items sequentially. In the latest version I have used Java generics to add a type parameter, so <code>XdmValue&lt;XdmNode&gt;</code> is a sequence of nodes, and <code>XdmValue&lt;XdmAtomicValue&gt;</code> is a sequence of atomic values. As well as an <code>iterator()</code> method, it has an <code>itemAt()</code> method to get the <i>N</i>th item, and a <code>size()</code> method to count the items. Internally an <code>XdmValue</code> might exist as an actual sequence in memory, or as a "promise": sufficient data to enable the items to be materialized when they are needed.</td> </tr> <tr> <td valign="top"><code>XdmItem</code></td> <td>This class represents an Item in the XDM model. As such it is both a component of an <code>XdmValue</code>, and also an <code>XdmValue</code> (of length one) in its own right. It's an abstract class, because every item is actually something more specific (a node, an atomic value, a function). Some of the methods inherited from <code>XdmValue</code> become trivial (for example <code>size()</code> always returns 1). </td> </tr> <tr> <td valign="top"><code>XdmNode</code></td> <td>This is a subclass of <code>XdmItem</code> used to represent nodes. Unlike many models of XML, we don't subclass this for different kinds of node: that's mainly because XDM has deliberately aimed at uniformity, with the same accessors available for all node kinds. Many of the methods on <code>XdmNode</code>, such as <code>getNodeName()</code>, <code>getStringValue()</code>, <code>getTypedValue()</code>, and <code>getNodeKind()</code>, are directly equivalent to accessors defined in the W3C XDM specification. 
But in addition, <code>XdmNode</code> has a method <code>axisIterator</code> to navigate the tree using any of the XPath axes, the result being returned as an iterator over the selected nodes.</td> </tr> <tr> <td valign="top"><code>XdmAtomicValue</code></td> <td>Another subclass of <code>XdmItem</code>, this is used to represent atomic values in the XDM model. As with <code>XdmNode</code>, we don't define further subclasses for different atomic types. There are convenience methods to convert <code>XdmAtomicValue</code> instances to and from equivalent (or near-equivalent) Java classes such as <code>String</code>, <code>Double</code>, <code>BigInteger</code>, and <code>Date</code>.</td> </tr> <tr> <td valign="top"><code>XdmFunctionItem</code></td> <td>From XPath 3.0, functions are first-class values alongside nodes and atomic values. These are represented in s9api as instances of <code>XdmFunctionItem</code>. Two specific subclasses of function, with their own behaviours, are represented using the subclasses <code>XdmMap</code> and <code>XdmArray</code>. I won't be saying much about these in this article, because I'm primarily concerned with navigating XML trees.</td> </tr> </tbody> </table> </section> <section> <h2><br /></h2><h2><font style="font-size: 1.25em;"><b>The new API: Steps and Predicates</b></font></h2> <p>The basic concept behind the new extensions to the s9api API is navigation using steps and predicates. I'll introduce these concepts briefly in this section, and then go on to give a more detailed exposition.</p> <p>The class <code>XdmValue&lt;T&gt;</code> acquires a new method:</p> <p><code>XdmStream select(Step step)</code></p> <p>The <code>Step</code> here is a function that takes an item of class <code>T</code> as its input, and returns a stream of items. If we consider a very simple <code>Step</code>, namely <code>child()</code>, this takes a node as input and returns a stream of nodes as its result. 
We can apply this step to an <code>XdmValue</code> consisting entirely of nodes, and it returns the concatenation of the streams of nodes obtained by applying the step to each node in the input value. This operation is equivalent to the "!" operator in XPath 3.0, or to the <code>flatMap()</code> method in many functional programming languages. It's not quite the same as the familiar "/" operator in XPath, because it doesn't eliminate duplicates or sort the result into document order. But for most purposes it does the same job.</p> <p>There's a class <code>net.sf.saxon.s9api.streams.Steps</code> containing static methods which provide commonly-used steps such as <code>child()</code>. In my examples, I'll assume that the Java application has <code>import static net.sf.saxon.s9api.streams.Steps.*;</code> in its header, so it can use these fields and methods without further qualification.</p> <p>One of the steps defined by this class is <code>net.sf.saxon.s9api.streams.Steps.child()</code>: this step is a function which, given a node, returns its children. There are other similar steps for the other XPath axes. So you can find the children of a node <code>N</code> by writing <code>N.select(child())</code>.</p> <p>Any two steps <code>S</code> and <code>T</code> can be combined into a single composite step by writing <code>S.then(T)</code>: for example <code>Step grandchildren = child().then(child())</code> gives you a step which can be used in the expression <code>N.select(grandchildren)</code> to select all the grandchildren.</p> <p>The class <code>Step</code> implements the standard Java interface <code>Function</code>, so it can be used more generally in any Java context where a <code>Function</code> is required.</p> <p><code>Predicate&lt;T&gt;</code> is a standard Java 8 interface: it defines a function that can be applied to an object of type <code>T</code> to return true or false. 
The class <code>net.sf.saxon.s9api.streams.Predicates</code>&nbsp;defines some standard predicates that are useful when processing XML. For example <code>isElement()</code>&nbsp;gives you a predicate that can be applied to any <code>XdmItem</code> to determine if it is an element node.</p> <p>Given a <code>Step</code> <code>A</code> and a <code>Predicate</code> <code>P</code>, the expression <code>A.where(P)</code> returns a new <code>Step</code> that filters the results of <code>A</code> to include only those items that satisfy the predicate <code>P</code>. So, for example, <code>child().where(isElement())</code> is a step that selects the element children of a node, so that <code>N.select(child().where(isElement()))</code> selects the element children of <code>N</code>. This is sufficiently common that we provide a shorthand: it can also be written <code>N.select(child(isElement()))</code>.</p> <p>The predicate <code>hasLocalName("foo")</code> matches nodes having a local name of "foo": so <code>N.select(child().where(hasLocalName("foo")))</code> selects the relevant children. Again this is so common that we provide a shorthand: <code>N.select(child("foo"))</code>. There is also a two-argument version <code>child(ns, "foo")</code> which selects children with a given namespace URI and local name.</p> <p>Another useful predicate is <code>exists(step)</code>, which tests whether applying a given step yields at least one item. So, for example <code>N.select(child().where(exists(attribute("id"))))</code> returns those children of <code>N</code> that have an attribute named "id".</p> <p>The result of the <code>select()</code> method is always a stream of items, so you can use methods from the Java <code>Stream</code> interface such as <code>filter()</code> and <code>flatMap()</code> to process the result. 
Here are some of the standard things you can do with a stream of items in Java:</p> <ul> <li>You can get the results as an array: <code>N.select(child()).toArray()</code></li> <li>Or as a list: <code>N.select(child()).collect(Collectors.toList())</code></li> <li>You can apply a function to each item in the stream: <code>N.select(child()).forEach(System.err::println)</code></li> <li>You can get the first item in the stream: <code>N.select(child()).findFirst().get()</code></li> </ul> <p>However, Saxon methods such as <code>select()</code> always return a subclass of <code>Stream</code> called <code>XdmStream</code>, and this offers additional methods. For example:</p> <ul> <li>You can get the results as an <code>XdmValue</code>: <code>N.select(child()).asXdmValue()</code></li> <li>A more convenient way to get the results as a Java <code>List</code>: <code>N.select(child()).asList()</code></li> <li>If you know that the stream contains a single node (or nothing), you can get this using the methods <code>asNode()</code> or <code>asOptionalNode()</code></li> <li>Similarly, if you know that the stream contains a single atomic value (or nothing), you can get this using the methods <code>asAtomic()</code> or <code>asOptionalAtomic()</code></li> <li>You can get the last item in the stream: <code>N.select(child("para")).last()</code></li> </ul> <section> <h2><font style="font-size: 1.25em;"><b>More about Steps</b></font></h2> <p>The actual definition of the <code>Step</code> class is:</p> <p><code>public abstract class Step&lt;T extends XdmItem&gt; implements Function&lt;XdmItem, Stream&lt;? extends T&gt;&gt; </code></p> <p>What that means is that it's a function that takes any <code>XdmItem</code> as input, and delivers a stream of <code>T</code> items as its result (where <code>T</code> is <code>XdmItem</code> or some subclass of it). 
(I experimented with also parameterizing the class on the type of items accepted, but that didn't work out well.)</p> <p>Because the types are defined, Java can make type inferences: for example it knows that <code>N.select(child())</code> will return nodes (because <code>child()</code> is a step that returns nodes).</p> <p>As a user of this API, you can define your own kinds of <code>Step</code> if you want to, but most of the time you will be able to do everything you need with the standard Steps available from the class <code>net.sf.saxon.s9api.streams.Steps</code>. The standard steps include:</p> <ul> <li>The axis steps <code>ancestor()</code>, <code>ancestorOrSelf()</code>, <code>attribute()</code>, <code>child()</code>, <code>descendant()</code>, <code>descendantOrSelf()</code>, <code>following()</code>, <code>followingSibling()</code>, <code>namespace()</code>, <code>parent()</code>, <code>preceding()</code>, <code>precedingSibling()</code>, <code>self()</code>.</li> <li>For each axis, three filtered versions: for example <code>child("foo")</code> filters the axis to select elements by local name (ignoring the namespace if any); <code>child(ns, local)</code> filters the axis to select elements by namespace URI and local name, and <code>child(predicate)</code> filters the axis using an arbitrary predicate: this is a shorthand for <code>child().where(predicate)</code>.</li> <li>A composite step can be constructed using the method <code>step1.then(step2)</code>. This applies <code>step2</code> to every item in the result of <code>step1</code>, retaining the order of results and flattening them into a single stream.</li> <li>A filtered step can be constructed using the method <code>step1.where(predicate1)</code>. This selects those items in the result of <code>step1</code> for which <code>predicate1</code> returns true.</li> <li>A path with several steps can be constructed using a call such as <code>path(child(isElement()), attribute("id"))</code>. 
This returns a step whose effect is to return the <code>id</code> attributes of all the element children of the target node.</li> <li>If the steps are sufficiently simple, a path can also be written by means of a simple micro-syntax similar to XPath abbreviated steps. The previous example could also be written <code>path("*", "@id")</code>. Again, this returns a step that can be used like any other step. (In my own applications, I have found myself using this approach very extensively.)</li> <li>The step <code>atomize()</code> extracts the typed values of nodes in the input, following the rules in the XPath specification. The result is a stream of atomic values.</li> <li>The step <code>toString()</code> likewise extracts the string values, while <code>toNumber()</code> has the same effect as the XPath <code>number()</code> function.</li> </ul> <p>Last but not least, <code>xpath(path)</code> returns a <code>Step</code> that evaluates an XPath expression. For example, <code>doc.select(xpath("//foo"))</code> has the same effect as <code>doc.select(descendant("foo"))</code>. A second argument to the <code>xpath()</code> method may be used to supply a static context for the evaluation. Note that compilation of the XPath expression occurs while the step is being created, not while it is being evaluated; so if you bind the result of <code>xpath("//foo")</code> to a variable, then the expression can be evaluated repeatedly without recompilation.</p> </section> <section> <h2><font style="font-size: 1.25em;"><b>More about Predicates</b></font></h2> <p><code>Predicate</code> is a standard Java 8 interface: it is a function that takes any object as input, and returns a boolean. You can use any predicates you like with this API, but the class <code>net.sf.saxon.s9api.streams.Predicates</code> provides some implementations of <code>Predicate</code> that are particularly useful when navigating XML documents. 
These include the following:</p> <ul> <li><code>isElement()</code>, <code>isAttribute()</code>, <code>isText()</code>, <code>isComment()</code>, <code>isDocument()</code>, <code>isProcessingInstruction()</code>, <code>isNamespace()</code> test that the item is a node of a particular kind</li> <li><code>hasName("ns", "local")</code>, <code>hasLocalName("n")</code>, and <code>hasNamespaceUri("ns")</code> make tests against the name of the node</li> <li><code>hasType(t)</code> tests the type of the item: for example <code>hasType(ItemType.DATE)</code> tests for atomic values of type <code>xs:date</code></li> <li><code>exists(step)</code> tests whether the result of applying the given step is a sequence containing at least one item; conversely <code>empty(step)</code> tests whether the result of the step is empty. For example, <code>exists(CHILD)</code> is true for a node that has children.</li> <li><code>some(step, predicate)</code> tests whether at least one item selected by the step satisfies the given predicate. For example, <code>some(CHILD, IS_ELEMENT)</code> tests whether the item is a node with at least one element child. Similarly <code>every(step, predicate)</code> tests whether the predicate is true for every item selected by the step.</li> <li><code>eq(string)</code> tests whether the string value of the item is equal to the given string; while <code>eq(double)</code> does a numeric comparison. A two-argument version <code>eq(step, string)</code> is shorthand for <code>some(step, eq(string))</code>. For example, <code>descendant(eq(attribute("id"), "ABC"))</code> finds all descendant elements having an "id" attribute equal to "ABC".</li> <li>Java provides standard methods for combining predicates using <code>and</code>, <code>or</code>, and <code>not</code>. 
For example <code>isElement().and(eq("foo"))</code> is a predicate that tests whether an item is an element with string-value "foo".</li> </ul> </section> <section> <h2><font style="font-size: 1.25em;"><b>The XdmStream class</b></font></h2><div>The fact that all this machinery is built on Java 8 streams and functions is something that many users can safely ignore; they are essential foundations, but they are hidden below the surface. At the same time, a user who understands that steps and predicates are Java Functions, and that the result of the select() method is a Java Stream, can take advantage of this knowledge.</div> <div><br /></div> <div>One of the key ideas that made this possible was the idea of subclassing <code>Stream</code> with <code>XdmStream</code>. This idea was shamelessly stolen from the open-source <strong>StreamEx</strong> library by Tagir Valeev (though no StreamEx code is actually used). Subclassing <code>Stream</code> enables additional methods to be provided to handle the results of the stream, avoiding the need for clumsy calls on the generic <code>collect()</code> method. Another motivating factor here is to allow for early exit (short-circuit evaluation) when a result can be delivered without reading the whole stream. Saxon handles this by registering <code>onClose()</code> handlers with the stream pipeline, so that when the consumer of the stream calls the <code>XdmStream.close()</code> method, the underlying supplier of data to the stream is notified that no more data is needed.</div><div><br /></div> <h2><font style="font-size: 1.25em;"><b>Examples</b></font></h2> <p>This section provides some examples extracted from an actual program that uses s9api interfaces and does a mixture of Java navigation and XPath and XQuery processing to extract data from an input document.</p> <p>First, some very simple examples. 
Constructs like this are not uncommon:</p> <p><code>XdmNode testInput = (XdmNode) xpath.evaluateSingle("test", testCase);</code></p> <p>This can be replaced with the much simpler and more efficient:</p> <p><code>XdmNode testInput = testCase.selectFirst(child("test"));</code></p> <p>Similarly, the slightly more complex expression:</p> <p><code>XdmNode principalPackage = (XdmNode) xpath.evaluateSingle("package[@role='principal']", testInput);</code></p> <p>becomes:</p> <p><code>XdmNode principalPackage = testInput.selectFirst(child("package").where(eq(attribute("role"), "principal")));</code></p> <p>A more complex example from the same application is this one:</p> <p><code>boolean definesError = xpath.evaluate("result//error[starts-with(@code, 'XTSE')]", testCase).size() &gt; 0; </code></p> <p>Note here how the processing is split between XPath code and Java code. This is also using an XPath function for which we haven't provided a built-in predicate in s9api. But that's no problem, because we can invoke Java methods as predicates. So this becomes:</p> <pre><code>boolean definesError = testCase.selectFirst(child("result"), descendant("error").where( some(attribute("code"), (XdmNode n) -&gt; n.getStringValue().startsWith("XTSE")))) != null;</code></pre> </section> </section> Capturing Accumulators tag:dev.saxonica.com,2018:/blog/mike//3.223 2018-03-28T08:09:03Z 2018-03-28T15:41:40Z A recent post on StackOverflow made me realise that streaming accumulators in XSLT 3.0 are much harder to use than they need to be. A reminder about what accumulators do. The idea is that as you stream your way through... Michael Kay A <a href="https://stackoverflow.com/questions/48983320/conditional-streaming-accumulator-in-xslt-3/48985112">recent post on StackOverflow</a> made me realise that streaming accumulators in XSLT 3.0 are much harder to use than they need to be.<div><br /></div> <div>A reminder about what accumulators do. 
The idea is that as you stream your way through a large document, you can have a number of tasks running in the background (called accumulators) which observe the document as it goes past, and accumulate information which is then available to the "main" line of processing in the foreground. For example, you might have an accumulator that simply keeps a note of the most recent section heading in a document; that's useful because the foreground processing can't simply navigate around the document to find the current section heading when it finds that it's needed.</div><div><br /></div><div>Accumulator rules can fire either on start tags or end tags or both, or they can be associated with text nodes or attributes. But there's a severe limitation: a streaming accumulator must be motionless: that's XSLT 3.0 streaming jargon to say that it can only see what's on the parser's stack at the time the accumulator triggers. This affects both the pattern that controls when the accumulator is triggered, and the action that it can take when the rule fires.</div><div><br /></div> <div>For example, you can't fire a rule with the pattern <code>match="section[title='introduction']"</code> because navigation to child elements (title) is not allowed in a motionless pattern. Similarly, if the rule fires on &nbsp;<code style="font-size: 13px;">match="section"</code>, then you can't access the title in the rule action (<code>select="title"</code>) because the action too must be motionless. In some cases a workaround is to have an accumulator that matches the text nodes (<code>match="section/title/text()[.='introduction']"</code>) but that doesn't work if section titles can have mixed content.</div><div><br /></div> <div>It turns out there's a simple fix, which I call a <i>capturing accumulator rule</i>. 
A capturing accumulator rule is indicated by the extension attribute <code>&lt;xsl:accumulator-rule saxon:capture="yes" phase="end"&gt;</code>, which will always be a rule that fires on an end-element tag. For a capturing rule, the background process listens to all the parser events that occur between the start tag and the end tag, and uses these to build a snapshot copy of the node. A snapshot copy is like the result of the fn:snapshot function - it's a deep copy of the matched node, with ancestor elements and their attributes tagged on for good measure. This snapshot copy is then available to the action part of the rule processing the end tag. The match patterns that trigger the accumulator rule still need to be motionless, but the action part now has access to a complete copy of the element (plus its ancestor elements and their attributes).</div><div><br /></div><div>Here's an example. Suppose you've got a large document like the XSLT specification, and you want to produce a sorted glossary at the end, and you want to do it all in streamed mode. 
Scattered throughout the document are term definitions like this:</div><div><br /></div> <div><p style="margin-bottom: 0px; font-stretch: normal; font-size: 12px; line-height: normal; font-family: Helvetica;"><span style="color: #021da7">&lt;termdef</span><span style="color: #f9975e"> id</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"dt-stylesheet"</span><span style="color: #f9975e"> term</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"stylesheet"</span><span style="color: #021da7">&gt;</span>A&nbsp;&nbsp;<span style="color: #021da7">&lt;term&gt;</span>stylesheet<span style="color: #021da7">&lt;/term&gt;</span> consists of one or more packages: specifically, one<br /> &nbsp;&nbsp; &nbsp;<span style="color: #021da7">&lt;termref</span><span style="color: #f9975e"> def</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"dt-top-level-package"</span><span style="color: #021da7">&gt;</span>top-level package<span style="color: #021da7">&lt;/termref&gt;</span> and zero or<br /> &nbsp; &nbsp;more <span style="color: #021da7">&lt;termref</span><span style="color: #f9975e"> def</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"dt-library-package"</span><span style="color: #021da7">&gt;</span>library packages<span style="color: #021da7">&lt;/termref&gt;</span>.<span style="color: #021da7">&lt;/termdef&gt;</span></p></div><div><span style="color: #021da7"><br /></span></div><div><br /></div> Now we can write an accumulator which simply accumulates these term definitions as they are encountered: <div><br /></div><div><div><p style="margin-bottom: 0px; font-stretch: normal; font-size: 12px; line-height: normal; font-family: Helvetica; color: rgb(0, 112, 193);">&lt;xsl:accumulator<span style="color: #f9975e"> name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"terms"</span><span style="color: #f9975e"> streamable</span><span style="color: #ff9450">=</span><span 
style="color: #ab4500">"yes"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span>&lt;xsl:accumulator-rule<span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"termdef"</span><span style="color: #f9975e"> phase</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"end"</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"($value, .)"</span><span style="color: #f9975e"> saxon:capture</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"yes"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> </span>&lt;/xsl:accumulator&gt;</p></div><div><br /></div><div>(the <code>select</code> expression here takes the existing value of the accumulator, <code>$value</code>, and appends the snapshot of the current termdef element, which is available as the context item ".")</div><div><br /></div><div>And now, at the end of the processing, we can output the glossary like this:</div></div><div><br /></div><div><p style="margin-bottom: 0px; font-stretch: normal; font-size: 12px; line-height: normal; font-family: Helvetica;"><span style="color: #0070c1">&lt;xsl:template</span><span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"/"</span><span style="color: #f9975e"> mode</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"streamable-mode"</span><span style="color: #021da7">&gt;</span><br /> &nbsp; &nbsp; <span style="color: #021da7">&lt;html&gt;</span>&nbsp; <br /> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007400">&lt;!-- main foreground processing goes here --&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0070c1">&lt;xsl:apply-templates</span><span style="color: #f9975e"> mode</span><span style="color: #ff9450">=</span><span style="color: 
#ab4500">"#current"</span><span style="color: #021da7">/&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #007400">&lt;!-- now output the glossary --&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #021da7">&lt;div</span><span style="color: #f9975e"> id</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #f9975e"> class</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #021da7">&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0070c1">&lt;xsl:apply-templates</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"accumulator-after('terms')"</span><span style="color: #f9975e"> mode</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #021da7">&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0070c1">&lt;xsl:sort</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"@term"</span><span style="color: #f9975e"> lang</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"en"</span><span style="color: #021da7">/&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0070c1">&lt;/xsl:apply-templates&gt;</span><br /> &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #021da7">&lt;/div&gt;</span><br /> &nbsp; &nbsp; <span style="color: #021da7">&lt;/html&gt;</span><br /> <span style="color: #0070c1">&lt;/xsl:template&gt;</span></p></div><div><span style="color: #0070c1"><br /></span></div><div><br /></div><div>The value of the accumulator is a list of snapshots of termdef elements, and because these are snapshots, the processing at this point does not need to be streamable (snapshots are ordinary trees held in memory).</div><div><br 
/></div><div>The amount of memory needed to accomplish this is whatever is needed to hold the glossary entries. This follows the design principle behind XSLT 3.0 streaming, which was not to do just those things that required zero working memory, but to enable the programmer to do things that weren't purely streamable, while having control over the amount of memory needed.</div><div><br /></div><div>I think it's hard to find an easy way to tackle this particular problem without the new feature of capturing accumulator rules, so I hope it will prove a useful extension.</div><div><br /></div><div>I've implemented this for Saxon 9.9. Interestingly, it only took about 25 lines of code: half a dozen to enable the new extension attribute, half a dozen to allow it to be exported to SEF files and re-imported, two or three to change the streamability analysis, and a few more to invoke the existing streaming implementation of the snapshot function from the accumulator watch code. Testing and documenting the feature was a lot more work than implementing it.</div><div><br /></div><div>Here's a complete stylesheet that fleshes out the creation of a (skeletal) glossary:</div><div><br /></div><div><p style="margin-bottom: 0px; font-stretch: normal; font-size: 12px; line-height: normal; font-family: Helvetica; color: rgb(0, 116, 0);"><span style="color: #9e44d3">&lt;?xml version="1.0" encoding="UTF-8"?&gt;</span><span style="color: #000000"><br /> </span><span style="color: #0070c1">&lt;xsl:package</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"http://www.w3.org/xslt30-test/accumulator/capture-203"</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; package-version</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"1.0"</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; 
declared-modes</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"no"</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; </span><span style="color: #00aad6">xmlns:xsl</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"http://www.w3.org/1999/XSL/Transform"</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; </span><span style="color: #00aad6">xmlns:xs</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"http://www.w3.org/2001/XMLSchema"</span><span style="color: #f9975e"> </span><span style="color: #00aad6">xmlns:f</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"http://accum001/"</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; </span><span style="color: #00aad6">xmlns:saxon</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"http://saxon.sf.net/"</span><span style="color: #000000"><br /> </span><span style="color: #f9975e">&nbsp; exclude-result-prefixes</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"#all"</span><span style="color: #f9975e"> version</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"3.0"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> <br /> &nbsp; </span>&lt;!-- Stylesheet to produce a glossary using capturing accumulators --&gt;<span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span>&lt;!-- The source document is a W3C specification in xmlspec format, containing<span style="color: #000000"><br /> </span>&nbsp; &nbsp; term definitions in the form &lt;termdef term="banana"&gt;A soft &lt;termref def="fruit"/&gt;&lt;/termdef&gt; --&gt;<span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span>&lt;!-- This test case shows the essential principles of how to render such a document<span style="color: #000000"><br /> </span>&nbsp; &nbsp; in streaming 
mode, with an alphabetical glossary of defined terms at the end --&gt;<span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:param</span><span style="color: #f9975e"> name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"streamable"</span><span style="color: #f9975e"> static</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"yes"</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"'yes'"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:accumulator</span><span style="color: #f9975e"> name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #f9975e"> as</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"element(termdef)*"</span><span style="color: #f9975e"> initial-value</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"()"</span><span style="color: #f9975e"> streamable</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"yes"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:accumulator-rule</span><span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"termdef"</span><span style="color: #f9975e"> phase</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"end"</span><span style="color: #f9975e"> saxon:capture</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"yes"</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"($value, .)"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; </span><span style="color: 
#0070c1">&lt;/xsl:accumulator&gt;</span><span style="color: #000000"><br /> <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:mode</span><span style="color: #f9975e"> streamable</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"yes"</span><span style="color: #f9975e"> on-no-match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"shallow-skip"</span><span style="color: #f9975e"> use-accumulators</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:template</span><span style="color: #f9975e"> name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"main"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:source-document</span><span style="color: #f9975e"> href</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"xslt.xml"</span><span style="color: #f9975e"> streamable</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"yes"</span><span style="color: #f9975e"> use-accumulators</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:apply-templates</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"."</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;/xsl:source-document&gt;</span><span style="color: #000000"><br /> &nbsp; </span><span style="color: #0070c1">&lt;/xsl:template&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp;</span><span style="color: 
#0070c1">&lt;xsl:template</span><span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"/"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #021da7">&lt;out&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span>&lt;!-- First render the body of the document --&gt;<span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:apply-templates/&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span>&lt;!-- Now generate the glossary --&gt;<span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;table&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;tbody&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:apply-templates</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"accumulator-after('glossary')"</span><span style="color: #f9975e"> mode</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:sort</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"@term"</span><span style="color: #f9975e"> lang</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"en"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;/xsl:apply-templates&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: 
#021da7">&lt;/tbody&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;/table&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #021da7">&lt;/out&gt;</span><span style="color: #000000"><br /> &nbsp; </span><span style="color: #0070c1">&lt;/xsl:template&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:template</span><span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"div1|inform-div1"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #021da7">&lt;div</span><span style="color: #f9975e"> id</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"{@id}"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:apply-templates/&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #021da7">&lt;/div&gt;</span><span style="color: #000000"><br /> &nbsp; </span><span style="color: #0070c1">&lt;/xsl:template&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span>&lt;!-- Main document processing: just output the headings --&gt;<span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:template</span><span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"div1/head | inform-div1/head"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:attribute</span><span style="color: #f9975e"> name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"title"</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span 
style="color: #ab4500">"."</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; </span><span style="color: #0070c1">&lt;/xsl:template&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span>&lt;!-- Glossary processing --&gt;<span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:mode</span><span style="color: #f9975e"> name</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #f9975e"> streamable</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"no"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; <br /> &nbsp; </span><span style="color: #0070c1">&lt;xsl:template</span><span style="color: #f9975e"> match</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"termdef"</span><span style="color: #f9975e"> mode</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"glossary"</span><span style="color: #021da7">&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #021da7">&lt;tr&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;td&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:value-of</span><span style="color: #f9975e"> select</span><span style="color: #ff9450">=</span><span style="color: #ab4500">"@term"</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;/td&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;td&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; &nbsp; </span><span style="color: #0070c1">&lt;xsl:value-of</span><span style="color: #f9975e"> select</span><span style="color: 
#ff9450">=</span><span style="color: #ab4500">"."</span><span style="color: #021da7">/&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; &nbsp; </span><span style="color: #021da7">&lt;/td&gt;</span><span style="color: #000000"><br /> &nbsp; &nbsp; </span><span style="color: #021da7">&lt;/tr&gt;</span><span style="color: #000000"><br /> &nbsp; </span><span style="color: #0070c1">&lt;/xsl:template&gt;</span><span style="color: #000000"><br /> <br /> </span><span style="color: #0070c1">&lt;/xsl:package&gt;</span></p></div><div><span style="color: #0070c1"><br /></span></div><div>&nbsp;&nbsp;</div> Diagnostics on Type Errors tag:dev.saxonica.com,2018:/blog/mike//3.222 2018-03-16T15:50:27Z 2018-03-16T16:55:59Z Providing good diagnostics for programming errors has always been a high priority in Saxon, second only to conformance with the W3C specifications. One important area of diagnostics is reporting on type errors: that is, cases where a particular context requires... Michael Kay Providing good diagnostics for programming errors has always been a high priority in Saxon, second only to conformance with the W3C specifications. One important area of diagnostics is reporting on type errors: that is, cases where a particular context requires a value of a given type, and the supplied value is the wrong type. A classic example would be providing a string as the first argument to format-date(), which requires an xs:date to be supplied.<div><br /></div><div>Of course, the more programmers follow the discipline of declaring the expected types of function parameters and variables, the more helpful the compiler can be in diagnosing programming errors caused by supplying the wrong type of value.<br /><div><br /></div><div>Type errors can be detected statically or dynamically. 
Saxon uses "optimistic type checking".&nbsp;</div><div><br /></div><div>At compile time, if a value of type R is required in a particular context, and the expression appearing in that context is E, then the compiler attempts to infer the static type of expression E: call this S. Sometimes this is straightforward, for example if E is a call on the node-name() function, then it knows that S is xs:QName. In other cases the compiler has to be smarter: for example it knows that the static type of a call on remove() is the same as the static type of the first argument, with an adjustment to the occurrence indicator.</div></div><div><br /></div><div>Optimistic type checking reports an error at compile time only if there is nothing in common between the required type R and the inferred static type of E: that is, if there is no overlap between the sets of instances of the two types. That would mean that a run-time failure is inevitable (assuming the code actually gets executed), and the W3C specifications allow early reporting of such an error.</div><div><br /></div><div>There's another interesting case where the types overlap only to the extent that both allow an empty sequence: for example if the required type is (xs:string*) and the supplied type is (xs:integer*). That's almost certainly an error, but W3C doesn't allow an error to be reported here because there is a faint chance that execution could succeed. So Saxon reports this as a warning. With maps and arrays, incidentally, there are analogous situations where the only overlap is an empty map or array, but Saxon isn't yet handling that case specially.</div><div><br /></div><div>If the types aren't completely disjoint, there are two other possibilities: the required type R might subsume the supplied type S, meaning that no run-time type checking is needed because the call will always succeed. 
The other possibility is that the types overlap: evaluating the supplied expression E might or might not produce a value that matches the required type R. In this case Saxon generates code to perform run-time type checking. (This is one reason why declaring the types of parameters and variables is such good practice: the code runs faster because there is no unnecessary run-time checking.)</div><div><br /></div><div>Until recently, the error message for a type error took the form:</div><div><br /></div><div>Required item type of CCC is RRR; supplied value has item type SSS</div><div><br /></div><div>For example:</div><div><br /></div><div><b>Required item type of first argument to format-date() is xs:date; supplied value has item type xs:string</b></div><div><br /></div><div>which works pretty well in most cases. However, I'm finding that as I write more complex code involving maps and arrays, it's no longer good enough. The problem is that as the types become more complex, simply giving the required and actual types isn't enough to make it clear why they are incompatible. You end up with messages like this one:</div><div><br /></div><div><div><b>Required item type of first argument of local:x() is map(xs:integer, xs:date);&nbsp;</b><b>supplied value has item type map(xs:anyAtomicType, xs:date).</b></div></div><div><br /></div><div>where an expert user can probably work out that the problem is that the supplied map contains an entry whose key is not an integer; but it doesn't exactly point clearly to the source of the problem.</div><div><br /></div><div>The problem comes to a head particularly when tuple types are used (see <a href="http://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html">here</a>). 
If the required type is a tuple type, reporting the supplied type as a map type is particularly unhelpful.</div><div><br /></div><div>I'm therefore changing the approach: instead of reporting on the supplied type of the value (or the inferred type of the expression, in the case of static errors), I'm reporting an explanation of why it doesn't match. Here's the new version of the message:</div><div><br /></div><div><div><b>The required item type of the first argument of local:x() is map(xs:integer,&nbsp;</b><b>xs:date); the supplied value map{xs:date("2018-03-16Z"):5, "x":3} does not match. The map&nbsp;</b><b>contains a key (xs:date("2018-03-16Z")) of type xs:date that is not an instance of the&nbsp;</b><b>required type xs:integer.</b></div></div><div><br /></div><div>So firstly, I'm outputting the actual value, or an abbreviated form of it, rather than just its type (that only works, of course, for run-time errors). And secondly, I'm highlighting how the type-checker worked out that the value doesn't match the required type: it's saying explicitly which rule was broken.</div><div><br /></div><div>(Another minor change you can see here is that I'm making more effort to write complete English sentences.)</div><div><br /></div><div>This doesn't just benefit the new map and array types, you can also see the effect with node types. For example, if the required type is document-node(element(foo)), you might see the message:</div><div><br /></div><div><div><b>The required item type of the first argument of local:x() is&nbsp;</b><b>document-node(element(Q{}foo)); the supplied value doc() does not match. The supplied&nbsp;</b><b>document node has an element child (&lt;bar&gt;) that does not satisfy the element test. The&nbsp;</b><b>node has the wrong name.</b></div></div><div><br /></div><div>Another change I'm making is to distribute type-checking into a sequence constructor. 
At present, if a function is defined to return (say) a list of element nodes, and the function body contains a sequence of a dozen instructions, one of which returns a text node, you get a message saying that the type of the function result is wrong, but it doesn't pinpoint exactly why. By distributing the type checking (applying the principle that if the function must return element nodes, then each of the instructions must return element nodes) we can (a) identify the instruction in error much more precisely, and (b) avoid the run-time cost of checking the results of those instructions that we know statically are OK.</div><div><br /></div><div>Interestingly, all these changes were stimulated by my own recent experience in writing a complex stylesheet. I described the plans for this <a href="http://dev.saxonica.com/blog/mike/2018/02/could-we-write-an-xsd-schema-processor-in-xslt.html">here</a> and the coding has now been completed (I'll report on the outcome later). It's a classic case of dogfood: if you use your own products in anger, you find ways of improving them that you wouldn't have thought of otherwise, and that users wouldn't have suggested because they don't know what's possible.</div><div><br /></div> Could we write an XSD Schema Processor in XSLT? tag:dev.saxonica.com,2018:/blog/mike//3.221 2018-02-10T18:58:46Z 2018-02-10T19:07:38Z Many computing platforms are not well-served by up to date XML technology, and in consequence Saxonica has been slowly increasing its coverage of the major platforms: extending from Java to .NET, C++, PHP, Javascript using a variety of technical... Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">Many computing platforms are not well-served by up to date XML technology, and in consequence Saxonica has been slowly increasing its coverage of the major platforms: extending from Java to .NET, C++, PHP, Javascript using a variety of technical approaches. 
This makes it desirable to implement as much as possible using portable languages, and if we want to minimize our dependence on third-party technologies (IKVMC, for example, is now effectively unsupported) we should be writing in our own languages, notably XSLT.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This note therefore asks the question, could one write an XSD Schema 1.1 processor in XSLT?</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">In fact a schema processor has two parts, compile time (compiling schema documents into the schema component model and SCM) and run-time (validating an instance document using the SCM). </p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The first part, compiling, seems to pose no intrinsic difficulty. Some of the rules and constraints that need to be enforced are fairly convoluted, but the only really tricky part is compiling grammars into finite-state-machines, and checking grammars (or the resulting finite-state-machine) for conformance with rules such as the Unique Particle Attribution constraint. But since we already have a tool (written in Java) for compiling schemas into an XML-based SCM file, and since it wouldn't really inconvenience users too much for this tool to be invoked via an HTTP interface, the priority for a portable implementation is really the run-time part of the processor rather than the compile-time part. (Note that this means ignoring xsi:schemaLocation, since that effectively causes the run-time validator to invoke the schema compiler.)</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">There are two ways one could envisage implementing the run-time part in XSLT: either with a universal stylesheet that takes the SCM and the instance document as inputs, or by generating a custom XSLT stylesheet from the SCM, rather as is done with Schematron. 
For the moment I'll keep an open mind which of these two approaches is preferable.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Ideally, the XSLT stylesheet would use streaming so the instance document being validated does not need to fit in memory. We'll bear this requirement in mind as we look at the detail.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The XSLT code, of course, cannot rely on any services from a schema processor, so it cannot be schema-aware.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Let's look at the main jobs the validator has to do.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b><font style="font-size: 1.25em;">Validating strings against simple types</font></b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Validating against a primitive type can be done simply using the XPath castable operator.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Validating against a simple type derived by restriction involves checking the various facets. For the most part, the logic of each facet is easily expressed in XPath. There are a few exceptions:</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">Patterns (regular expressions). The XPath regular expression syntax is a superset of the XSD syntax. To evaluate XSD regular expressions, we either need some kind of extension to the XPath matches() function, or we need to translate XSD regular expressions into XPath regular expressions. This translation is probably not too difficult. It mainly involves rejecting some disallowed constructs (such as back-references, non-capturing groups, and reluctant quantifiers), and escaping "^" and "$" with a backslash.</p> </li><li><p style="margin-bottom: 0cm">Length facets for hexBinary and base64Binary. 
A base64Binary value can be cast to hexBinary, and the length of the value in octets can be computed by converting to string and dividing the string length by 2.</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Validating against a list type can be achieved by tokenizing, and testing each token against the item type.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Validating against a union type can be achieved by validating against each member type (and also validating against any constraining facets defined at the level of the union itself).</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b><font style="font-size: 1.25em;">Validating elements against complex types</font></b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The only difficult case here is complex content. It should be possible to achieve this by iterating over the child nodes using xsl:iterate, keeping the current state (in the FSM) as the value of the iteration parameter. On completion the element is valid if the state is a final state. As each element is processed, it needs to be checked against the state of its parent element's FSM, and in addition a new validator is established for validating its children. This is all streamable.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b><font style="font-size: 1.25em;">Assertions and Conditional Type Assignment</font></b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Evaluating XPath expressions can be achieved using xsl:evaluate. The main difficulty is setting up the node-tree to which xsl:evaluate is applied. This needs to be a copy of the original source subtree, to ensure that the assertion cannot stray outside the relevant subtree. 
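</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">As a rough sketch of that copy-then-evaluate step, something like the following might do. The function name <code>v:assertion-holds</code> and its namespace are invented purely for illustration, and the sketch assumes the assertion is available as a plain string; it ignores the namespace bindings of the schema document, the <code>$value</code> variable, and type annotations, all of which a real validator would have to handle.</p>

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:v="http://example.com/validator"
    exclude-result-prefixes="#all">

  <!-- Hypothetical helper (not part of Saxon or of any specification):
       returns true if the XPath assertion $test holds for $element. -->
  <xsl:function name="v:assertion-holds" as="xs:boolean">
    <xsl:param name="element" as="element()"/>
    <xsl:param name="test" as="xs:string"/>

    <!-- xsl:copy-of yields a parentless copy, so the parent, ancestor and
         sibling axes cannot reach anything outside the subtree. -->
    <xsl:variable name="copy" as="element()">
      <xsl:copy-of select="$element"/>
    </xsl:variable>

    <!-- Evaluate the assertion with the copy as the context item.
         Assumption: by default the expression sees the namespace bindings
         of this stylesheet, not those of the schema document. -->
    <xsl:variable name="result" as="item()*">
      <xsl:evaluate xpath="$test" context-item="$copy"/>
    </xsl:variable>

    <!-- An assertion succeeds if its effective boolean value is true. -->
    <xsl:sequence select="boolean($result)"/>
  </xsl:function>

</xsl:stylesheet>
```

<p style="margin-bottom: 0cm">A call might then look like <code>v:assertion-holds(., 'xs:integer(@min) le xs:integer(@max)')</code>. Note that some processors allow xsl:evaluate to be disabled, in which case this approach would not be available.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">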
Making this copy consumes the source subtree, which makes streaming tricky: however, the ordinary complex type validation can also happen on the copy, so I think streaming is possible.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b><font style="font-size: 1.25em;">Identity constraints (unique, key, keyref)</font></b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This is where streaming really gets quite tricky - especially given the complexity of the specification for those rare keyref cases where the key is defined on a different element from the corresponding keyref.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The obvious XSLT mechanism here is accumulators. But accumulator rules are triggered by patterns, and defining the patterns that correspond to the elements involved in a key definition is tricky. For example if sections nest recursively, a uniqueness constraint might say that for every section, its child section elements must have unique @section-number attributes. A corresponding accumulator would have to maintain a stack of sections, with a map of section numbers at each level of the stack, and the accumulator rule for a section would need to check the section number of that section at the current level, and start a new level.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">A further complication is that there may be multiple (global and/or local) element declarations with the same name, with different unique / key / keyref constraints. 
Deciding which of these apply by means of XSLT pattern matching is certainly difficult and may be impossible.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The multiple xs:field elements within a constraint do not have to match components of the key in document order, but a streamed implementation would still be possible using the map constructor, which allows multiple downward selections - provided that the xs:field selector expressions are themselves streamable, which I think is probably always the case.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The problem of streamability could possibly be solved with some kind of dynamic pipelining. The "main" validation process, when it encounters a start tag, is able to establish which element declaration it belongs to, and could in principle spawn another transformation (processing the same input stream) for each key / unique constraint defined in that element declaration: a kind of dynamic xsl:fork.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I think as a first cut it would probably be wise not to attempt streaming in the case of a schema that uses unique / key / keyref constraints. More specifically, if any element has such constraints, it can be deep-copied, and validation can then switch to the in-memory subtree rather than the original stream. After all, we have no immediate plans to implement streaming other than in the Java product, and that will inevitably make an XSLT-based schema processor on other platforms unstreamed anyway.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b><font style="font-size: 1.25em;">Outcome of validation</font></b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">There are two main scenarios we should support: validity checking, and type annotation. 
With validity checking we want to report many invalidities in a single validation episode, and the main output is the validation report. With type annotation, the main output is a validated version of the instance document, and a single invalidity can cause the process to terminate with a dynamic error.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">It is not possible for a non-schema-aware stylesheet to add type annotations to the result tree without some kind of extensions. The XSLT language only allows type annotations to be created as the result of schema validation. So we will need an extension for this purpose: perhaps a saxon:type-annotation="QName" attribute on instructions such as xsl:element, xsl:copy, xsl:attribute.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For reporting validation errors, it's important to report the location of the invalidity. This also requires extensions, such as saxon:line-number().</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b><font style="font-size: 1.25em;">Conclusion</font></b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I don't think there are any serious obstacles to writing a validation engine in XSLT. Making it streamable is harder, especially for integrity constraints. A couple of extensions are needed: the ability to add type annotations to the result tree, and the ability to get line numbers of nodes in the source.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I still have an open mind about whether a universal stylesheet should be used, or a generated stylesheet for a particular schema.</p> Transforming JSON tag:dev.saxonica.com,2017:/blog/mike//3.220 2017-11-13T13:02:13Z 2017-11-13T14:03:07Z In my conference paper at XML Prague in 2016 I examined a couple of use cases for transforming JSON structures using XSLT 3.0. 
The overall conclusion was not particularly encouraging: the easiest way to achieve the desired results was to... Michael Kay <p>In my <a href="https://www.saxonica.com/papers/xmlprague-2016mhk.pdf">conference paper at XML Prague</a> in 2016 I examined a couple of use cases for transforming JSON structures using XSLT 3.0. The overall conclusion was not particularly encouraging: the easiest way to achieve the desired results was to convert the JSON to XML, transform the XML, and then convert it back to JSON.</p> <p>Unfortunately this study came too late to get any new features into XSLT 3.0. However, I've been taking another look at the use cases to see whether we could design language extensions to handle them, and this is looking quite encouraging.</p> <h2><strong>Use case 1: bulk update</strong></h2> <p>We start with the JSON document</p> <pre><code>[
  { "id": 3,
    "name": "A blue mouse",
    "price": 25.50,
    "dimensions": {"length": 3.1, "width": 1.0, "height": 1.0},
    "warehouseLocation": {"latitude": 54.4, "longitude": -32.7 }},
  { "id": 2,
    "name": "An ice sculpture",
    "price": 12.50,
    "tags": ["cold", "ice"],
    "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5 },
    "warehouseLocation": {"latitude": -78.75, "longitude": 20.4 } }
]
</code></pre> <p>and the requirement: for all products having the tag "ice", increase the price by 10%, leaving all other data unchanged. I've prototyped a new XSLT instruction that allows this to be done as follows:</p> <pre><code>&lt;saxon:deep-update
   root="json-doc('input.json')"
   select="?*[?tags?* = 'ice']"
   action="map:put(., 'price', ?price * 1.1)"/&gt;
</code></pre> <p>How does this work?</p> <p>First the instruction evaluates the <code>root</code> expression, which in this case returns the map/array representation of the input JSON document. 
With this root item as context item, it then evaluates the <code>select</code> expression to obtain a sequence of contained maps or arrays to be updated: these can appear at any depth under the root item. With each of these selected maps or arrays as the context item, it then evaluates the action expression, and uses the returned value as a replacement for the selected map or array. This update then percolates back up to the root item, and the result of the instruction is a map or array that is the same as the original except for the replacement of the selected items.</p> <p>The magic here is in the way that the update is percolated back up to the root. Because maps and arrays are immutable and have no persistent identity, the only way to do this is to keep track of the maps and arrays selected en-route from the root item to the items selected for modification as we do the downward selection, and then modify these maps and arrays in reverse order on the way back up. Moreover we need to keep track of the cases where multiple updates are made to the same containing map or array. All this magic, however, is largely hidden from the user. The only thing the user needs to be aware of is that the select expression is constrained to use a limited set of constructs when making downward selections.</p> <p>The select expression <code>select="?*[?tags?* = 'ice']"</code> perhaps needs a little bit of explanation. The root of the JSON tree is an array of maps, and the initial <code>?*</code> turns this into a sequence of maps. We then want to filter this sequence of maps to include only those where the value of the "tags" field is an array containing the string "ice" as one of its members. 
The easiest way to test this predicate is to convert the value from an array of strings to a sequence of strings (so <code>?tags?*</code>) and then use the XPath existential "=" operator to compare with the string "ice".</p> <p>The action expression <code>map:put(., 'price', ?price * 1.1)</code> takes as input the selected map, and replaces it with a map in which the <code>price</code> entry is replaced with a new entry having the key "price" and the associated value computed as the old price multiplied by 1.1.</p> <h2><strong>Use case 2: Hierarchic Inversion</strong></h2> <p>The second use case in the XML Prague 2016 paper was a hierarchic inversion (aka grouping) problem. Specifically: we'll look at a structural transformation changing a JSON structure with information about the students enrolled for each course to its inverse, a structure with information about the courses for which each student is enrolled.</p> <p>Here is the input dataset:</p> <pre><code>[{
  "faculty": "humanities",
  "courses": [
    { "course": "English",
      "students": [
        { "first": "Mary", "last": "Smith", "email": "mary_smith@gmail.com"},
        { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com"}
      ]
    },
    { "course": "History",
      "students": [
        { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com" },
        { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"}
      ]
    }
  ]
},
{
  "faculty": "science",
  "courses": [
    { "course": "Physics",
      "students": [
        { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"},
        { "first": "Amisha", "last": "Patel", "email": "amisha_patel@gmail.com"}
      ]
    },
    { "course": "Chemistry",
      "students": [
        { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"},
        { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"}
      ]
    }
  ]
}]
</code></pre> <p>The goal is to produce a list of students, sorted by last name then first name, each containing a list of courses taken by that student, like this:</p> <pre><code>[
  { "email": "anil_singh@gmail.com", "courses": ["Physics", "Chemistry" ]},
  { "email": "john_taylor@gmail.com", "courses": ["History", "Chemistry" ]},
  ....
]
</code></pre> <p>The classic way of handling this is in two phases: first reduce the hierarchic input to a flat sequence in which all the required information is contained at one level, and then apply grouping to this flat sequence.</p> <p>To achieve the flattening we introduce another new XSLT instruction:</p> <pre><code>&lt;saxon:tabulate-maps
   root="json-doc('input.json')"
   select="?* ! map:find(., 'students')?*"/&gt;
</code></pre> <p>Again the <code>root</code> expression delivers a representation of the JSON document as an array of maps. The <code>select</code> expression first selects these maps ("?*"), then for each one it calls map:find() to get an array of maps each representing a student. The result of the instruction is a sequence of maps corresponding to these student maps in the input, where each output map contains not only the fields present in the input (first, last, email), but also fields inherited from parents and ancestors (faculty, course). 
For good measure it also contains a field _keys containing an array of keys representing the path from root to leaf, but we don't actually use that in this example.</p> <p>Once we have this flat structure, we can construct a new hierarchy using XSLT grouping:</p> <pre><code>&lt;xsl:for-each-group select="$students" group-by="?email"&gt;
  &lt;xsl:map&gt;
    &lt;xsl:map-entry key="'email'" select="?email"/&gt;
    &lt;xsl:map-entry key="'first'" select="?first"/&gt;
    &lt;xsl:map-entry key="'last'" select="?last"/&gt;
    &lt;xsl:map-entry key="'courses'"&gt;
      &lt;saxon:array&gt;
        &lt;xsl:for-each select="current-group()"&gt;
          &lt;saxon:array-member select="?course"/&gt;
        &lt;/xsl:for-each&gt;
      &lt;/saxon:array&gt;
    &lt;/xsl:map-entry&gt;
  &lt;/xsl:map&gt;
&lt;/xsl:for-each-group&gt;
</code></pre> <p>This can then be serialized using the JSON output method to produce the required output.</p> <p>Note: the <code>saxon:array</code> and <code>saxon:array-member</code> instructions already exist in Saxon 9.8. They fill an obvious gap in the XSLT 3.0 facilities for handling arrays - a gap that exists largely because the XSL WG was unwilling to create a dependency on XPath 3.1.</p> <h2><strong>Use Case 3: conversion to HTML</strong></h2> <p>This use case isn't in the XML Prague paper, but is included here for completeness.</p> <p>The aim here is to construct an HTML page containing the information from a JSON document, without significant structural alteration. This is a classic use case for the recursive application of template rules, so the aim is to make it easy to traverse the JSON structure using templates with appropriate match patterns.</p> <p>Unfortunately, although the XSLT 3.0 facilities allow patterns that match maps and arrays, they are cumbersome to use. Firstly, the syntax is awkward:</p> <pre><code>match=".[. 
instance of map(...)]" </code></pre> <p>We can solve this with a Saxon extension allowing the syntax</p> <pre><code>match="map()" </code></pre> <p>Secondly, the type of a map isn't enough to distinguish one map from another. To identify a map representing a student, for example, we aren't really interested in knowing that it is a <code>map(xs:string, item()*)</code>. What we need to know is that it has fields (email, first, last). Fortunately another Saxon extension comes to our aid: tuple types (described at http://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html). With tuple types we can change the match pattern to</p> <pre><code>match="tuple(email, first, last)" </code></pre> <p>Even better, we can use type aliases:</p> <pre><code>&lt;saxon:type-alias name="student" type="tuple(email, first, last)"/&gt;
&lt;xsl:template match="~student"&gt;...&lt;/xsl:template&gt;
</code></pre> <p>With this extension we can now render this input JSON into HTML using the stylesheet:</p> <pre><code>&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="#all"
  expand-text="yes"&gt;

  &lt;saxon:type-alias name="faculty" type="tuple(faculty, courses)"/&gt;
  &lt;saxon:type-alias name="course" type="tuple(course, students)"/&gt;
  &lt;saxon:type-alias name="student" type="tuple(first, last, email)"/&gt;

  &lt;xsl:template match="~faculty"&gt;
    &lt;h1&gt;{?faculty} Faculty&lt;/h1&gt;
    &lt;xsl:apply-templates select="?courses?*"/&gt;
  &lt;/xsl:template&gt;

  &lt;xsl:template match="~course"&gt;
    &lt;h2&gt;{?course} Course&lt;/h2&gt;
    &lt;p&gt;List of students:&lt;/p&gt;
    &lt;table&gt;
      &lt;thead&gt;
        &lt;tr&gt;
          &lt;th&gt;Name&lt;/th&gt;
          &lt;th&gt;Email&lt;/th&gt;
        &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
        &lt;xsl:apply-templates select="?students?*"&gt;
          &lt;xsl:sort select="?last"/&gt;
          &lt;xsl:sort select="?first"/&gt;
        &lt;/xsl:apply-templates&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/xsl:template&gt;

  &lt;xsl:template match="~student"&gt;
    &lt;tr&gt;
      &lt;td&gt;{?first} {?last}&lt;/td&gt;
      &lt;td&gt;{?email}&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/xsl:template&gt;

  &lt;xsl:template name="xsl:initial-template"&gt;
    &lt;xsl:apply-templates select="json-doc('courses.json')"/&gt;
  &lt;/xsl:template&gt;

&lt;/xsl:stylesheet&gt;
</code></pre> <h2><strong>Conclusions</strong></h2> <p>With only the facilities of the published XSLT 3.0 recommendation, the easiest way to transform JSON is often to convert it first to XML node trees, and then use the traditional XSLT techniques to transform the XML, before converting it back to JSON.</p> <p>With a few judiciously chosen extensions to the language, however, a wide range of JSON transformations can be achieved natively.</p> Bugs: How well are we doing? tag:dev.saxonica.com,2017:/blog/mike//3.218 2017-02-05T17:36:01Z 2017-02-05T19:13:01Z We're about to ship another Saxon 9.7 maintenance release, with another 50 or so bug clearances. The total number of patches we've issued since 9.7 was released in November 2015 has now reached almost 450. The number seems frightening and... Michael Kay We're about to ship another Saxon 9.7 maintenance release, with another 50 or so bug clearances. The total number of patches we've issued since 9.7 was released in November 2015 has now reached almost 450. The number seems frightening and the pace is relentless. But are we getting it right, or are we getting it badly wrong?<div><br /></div><div>There are frequently-quoted but poorly-sourced numbers you can find on the internet suggesting a norm of 10-25 bugs per thousand lines of code. Saxon is 300,000 lines of (non-comment) code, so that would suggest we can expect a release to have 3000 to 7500 bugs in it. On one measure, that suggests we're doing a lot better than the norm. 
Or it could also mean that most of the bugs haven't been found yet.</div><div><br /></div><div>I'm very sceptical of such numbers. I remember a mature product in ICL that was being maintained by a sole part-time worker, handling half a dozen bugs a month. When she went on maternity leave, the flow of bugs magically stopped. No-one else could answer the questions, so users stopped sending them in. The same happens with Oracle and Microsoft. I submitted a Java bug once, and got a response 6 years later saying it was being closed with no action. When that happens, you stop sending in bug reports. So in many ways, a high number of bug reports doesn't mean you have a buggy product, it means you have a responsive process for dealing with them. I would hate the number of bug reports we get to drop because people don't think there's any point in submitting them.</div><div><br /></div><div>And of course the definition of what is a bug is completely slippery. Very few of the bug reports we get are completely without merit, in the sense that the product is doing exactly what it says on the tin; at the same time, rather few are incontrovertible bugs either. If diagnostics are unhelpful, is that a bug?</div><div><br /></div><div>The only important test really is whether our users are satisfied with the reliability of the product. We don't really get enough feedback on that at a high level. Perhaps we should make more effort to find out; but I so intensely hate completing customer satisfaction questionnaires myself that I'm very reluctant to inflict it on our users. 
Given that open source users outnumber commercial users by probably ten-to-one, and that the satisfaction of our open source users is just as important to us as the satisfaction of our commercial customers (because it's satisfied open source users who do all the sales work for us); and given that we don't actually have any way of "reaching out" to our open source users (how I hate the marketing jargon); and given that we really wouldn't know what to do differently if we discovered that 60% of our users were "satisfied or very satisfied": I don't really see very much value in the exercise. But I guess putting a survey form on the web site wouldn't be difficult; some people might interpret it as a signal that we actually care.</div><div><br /></div><div>With 9.7 there was a bit of a shift in policy towards fixing bugs pro-actively (more marketing speak). In particular, we've been in a phase where the XSLT and XQuery specs were becoming very stable but more test cases were becoming available all the time (many of them, I might add, contributed by Saxonica - often in reaction to queries from our users). So we've continuously been applying new tests to the existing release, which is probably a first. Where a test showed that we were handling edge cases incorrectly, and indeed when the spec was changed in little ways under our feet, we've raised bugs and fixes to keep the conformance level as high as possible (while also maintaining compatibility). So we've shifted the boundary a little between feature changes (which traditionally only come in the next release), and bug fixes, which come in a maintenance release. 
That shift also helps to explain why the gap between releases is becoming longer - though the biggest factor holding us back, I think, is the ever-increasing amount of testing that we do before a release.</div><div><br /></div><div>Fixing bugs pro-actively (that is before any user has hit the bug) has the potential to improve user satisfaction if it means that they never do hit the bug. I think it's always as well to remember also that for every user who reports a bug there may be a dozen users who hit it and don't report it. One reason we monitor StackOverflow is that a lot of users feel more confident about reporting a problem there, rather than reporting it directly to us. Users know that their knowledge is limited and they don't want to make fools of themselves, and you need a high level of confidence to tell your software vendor that you think the product is wrong.&nbsp;</div><div><br /></div><div>On the other hand, destabilisation is a risk. A fix in one place will often expose a bug somewhere else, or re-awaken an old bug that had been laid to rest. As a release becomes more mature, we try to balance the benefits of fixing problems with the risk of de-stabilisation.</div><div><br /></div><div>So, what about testing? Can we say that because we've fixed 450 bugs, we didn't run enough tests in the first place?</div><div><br /></div><div>Yes, in a sense that's true, but how many more tests would we have had to write in order to catch them? We probably run about a million test cases (say, 100K tests in an average of ten product configurations each) and these days the last couple of months before a major release are devoted exclusively to testing. (I know that means we don't do enough continuous testing. But sorry, it doesn't work for me. If we're doing something radical to the internals of the product then things are going to break in the process, and my style is to get the new design working while it's still fresh in my head, then pick up the broken pieces later. 
If everything had to work in every nightly build, we would never get the radical things done. That's a personal take, and of course what works with a 3-4 person team doesn't necessarily work with a larger project. We're probably pretty unusual in developing a 300Kloc software package with 3-4 people, so lots of our experience might not extrapolate.)</div><div><br /></div><div>We've had a significant number of bug reports this time on performance regression. (This is of course another area where it's arguable whether it's a bug or not. Sometimes we will change the design in a way that we know benefits some workloads at the expense of others.) Probably most of these are extreme scenarios, for example compilation time for stylesheets where a single template declares 500 local variables. Should we have run tests to prevent that? Well, perhaps we should have more extreme cases in our test suite: the vast majority of our test cases are trivially small. But the problem is, there will always be users who do things that we would never have imagined. Like the user running an XSD 1.1 schema validation in which tens of thousands of assertions are expected to "fail", because they've written it in such a way that assertion failures aren't really errors, they are just a source of statistics for reporting on the data.</div><div><br /></div><div>The bugs we hate most (and therefore should do most to prevent) are bugs in bytecode generation, streaming, and multi-threading. The reason we hate them is that they can be a pig to debug, especially when the user-written application is large and complex.&nbsp;</div><div><br /></div><div><ul><li>For bytecode generation I think we've actually got pretty good test coverage, because we not only run every test in the QT3 and XSLT3 test suites with bytecode generation enabled, we also artificially complicate the tests to stop queries like 2+5 being evaluated by the compiler before bytecode generation kicks in. 
We've also got an internal recovery mechanism so if we detect that we've generated bad code, we fall back to interpreted mode and the user never notices (problem with that is of course that we never find out).</li><li>Streaming is tricky because the code is so convoluted (writing everything as inverted event-based code can be mind-blowing) and because the effects of getting it wrong often give very little clue as to the cause. But at least the failure is "in your face" for the user, who will therefore report the problem, and it's likely to be reproducible. Another difficulty with streaming is that because not all code is streamable, tests for streaming needed to be written from scratch.</li><li>Multi-threading bugs are horrible because they occur unpredictably. If there's a low probability of the problem happening then it can require a great deal of detective work to isolate the circumstances, and this often falls on the user rather than on ourselves. Fortunately we only get a couple of these a year, but they are a nightmare when they come. In 9.7 we changed our Java baseline to Java 6 and were able therefore to replace much of the hand-built multithreading code in Saxon with standard Java libraries, which I think has helped reliability a lot. But there are essentially no tools or techniques to protect you from making simple thread-safety blunders, like setting a property in a shared object without synchronization. Could we do more testing to prevent these bugs? I'm not optimistic, because the bugs we get are so few, and so particular to a specific workload, that searching the haystack just in case it contains a needle is unlikely to be effective.</li></ul><div>Summary: Having the product perceived as reliable by our users is more important to us than the actual bug count. Fixing bugs quickly before they affect more users is probably the best way of achieving that. 
If the bug count is high because we're raising bugs ourselves as a result of our own testing, then that's no bad thing. It hasn't yet got to the level where we can't cope with the volumes, or where we have to filter things through staff who are only employed to do support. If we can do things better, let us know.</div></div><div><br /></div><div><br /></div> Guaranteed Streamability tag:dev.saxonica.com,2016:/blog/mike//3.217 2016-12-09T21:56:27Z 2016-12-11T13:21:43Z The XSLT 3.0 specification in its current form provides a set of rules (that can be evaluated statically, purely by inspecting the stylesheet) for determining whether the code is (or is not) guaranteed streamable. If the code is guaranteed streamable then... Michael Kay The XSLT 3.0 specification in its current form provides a set of rules (that can be evaluated statically, purely by inspecting the stylesheet) for determining whether the code is (or is not) guaranteed streamable.<div><br /></div><div>If the code is guaranteed streamable then every processor (if it claims to support streaming at all) must use streaming to evaluate the stylesheet; if it is not guaranteed streamable then the processor can choose whether to use streaming or not.</div><div><br /></div><div>The tricky bit is that there's a requirement in the spec that if the code isn't guaranteed streamable, then a streaming processor (on request) has to detect this and report it. The status section of the spec says that this requirement is "at risk", meaning it might be removed if it proves too difficult to implement. 
There are people on the working group who believe passionately that this requirement is really important for interoperability; there are others (including me) who fully understand why users would like to have this, but have been arguing that it is extremely difficult to deliver.</div><div><br /></div><div>In this article I'm going to try to explain why it's so difficult to achieve this requirement, and to explore possibilities for overcoming these difficulties.</div><div><br /></div><div>Streamability analysis can't be performed until various other stages of static analysis are complete. It generally requires that names have been resolved (for example, names of modes and names of streamable functions). It also relies on rudimentary type analysis (determining the static type of constructs). For Saxon, this means that streamability analysis is done after parsing, name fixup, type analysis, and rewrite optimization.</div><div><br /></div><div>When Saxon performs these various stages of analysis, it modifies the expression tree as it goes: not just to record the information obtained from the analysis, but to make use of the information at execution time. It goes without saying that in modifying the expression tree, it's not permitted to replace a streamable construct with a non-streamable one, and that isn't too hard to achieve (though these things are relative...). But the requirement to report departures from guaranteed streamability imposes a second requirement, which is proving much harder. 
If we are to report any deviations from guaranteed streamability, then up to the point where we do the streamability analysis, we must never replace a non-streamable construct with a streamable one.</div><div><br /></div><div>There are various points at which we currently replace a non-streamable construct with a streamable one.</div><div><br /></div><div><ul><li>Very early in the process, the expression tree that is output by the parsing phase uses the same data structure on the expression tree to represent equivalent constructs in the source. For example, the expression tree produced by &lt;xsl:if test="$a=2"&gt;&lt;xsl:sequence select="3"/&gt;&lt;/xsl:if&gt; will be identical to the expression tree produced by &lt;xsl:sequence select="if ($a=2) then 3 else ()"/&gt;. But streamability analysis makes a distinction between these two constructs. It's not a big distinction (in fact, the only thing it affects is exactly where you are allowed to call the accumulator-after() function) but it's big enough to count.</li><li>At any stage in the process, if we spot a constant expression then we're likely to replace it with its value. For example if we see the expression $v+3, and $v is a global variable whose value is 5, we will replace the expression with the literal 8. This won't usually affect streamability one way or the other. However, there are a few cases where it does. The most obvious is where we work out that an expression is void (meaning it always returns an empty sequence). For example, according to the spec, the expression (author[0], author[1]) is not streamable because it makes two downward selections. But Saxon spots that author[0] is void and rewrites the expression as (author[1]), which is streamable. Void expressions often imply some kind of user error, so we often output a warning when this happens, but just because we think the user has written nonsense doesn't absolve us from the conformance requirement to report on guaranteed streamability. 
Void expressions are particularly likely to be found with schema-aware analysis.</li><li>Inlining of calls to user-defined functions will often make a non-streamable expression streamable.</li><li>Many other rewrites performed by the optimizer have a similar effect, for example replacing (X|Y) by *[self::X|self::Y].</li></ul><div>My first attempt to meet the requirement is therefore (a) to add information to the expression tree where it's needed to maintain a distinction that affects streamability, and (b) to try to avoid those rewrites that turn non-streamable expressions into streamable ones. As a first cut, skipping the optimization phase completely seems an easy way to achieve (b). But it turns out it's not sufficient, firstly because some rewrites are done during the type-checking phase, and secondly because it turns out that without an optimization pass, we actually end up finding that some expressions that should be streamable are not. The most common case for this is sorting into document order. Given the expression A/B, Saxon actually builds an expression in the form sort(A!B) relying on the sort operation to sort nodes into document order and eliminate duplicates. This relies on the subsequent optimization phase to eliminate the sort() operation when it can. If we skip the optimization phase, we are left with an unstreamable expression.</div></div><div><br /></div><div>The other issue is that the streamability rules rely on type inferencing rules that are much simpler than the rules Saxon uses. It's only in rare cases that this will make a difference, of course: in fact, it requires considerable ingenuity to come up with such cases. The most obvious case where types make a difference to streamability is with a construct like &lt;xsl:value-of select="$v"/&gt;: this is motionless if $v is a text or attribute node, but consuming if it is a document or element node. 
If a global variable with private visibility is initialized with select="@price", but has no "as" attribute, Saxon will infer a type of attribute(price) for the variable, but the rules in the spec will infer a type of item()*. So to get the same streamability answer as the spec gives, we need to downgrade the static type inferencing in Saxon.</div><div><br /></div><div>So I think the changes needed to replicate exactly the streamability rules of the XSLT 3.0 spec are fairly disruptive; moreover, implementing the changes by searching for all the cases that need to change is going to be very difficult to get right (and is very difficult to test unless there is another trustworthy implementation of the rules to test against).</div><div><br /></div><div>This brings us to Plan B. Plan B is to meet the requirement by writing a completely free-standing tool for streamability analysis that's completely separate from the current static analysis code. One way to do this would be to build on the tool written by John Lumley and demonstrated at Balisage a couple of years ago. Unfortunately that's incomplete and out of date, so it would be a significant effort to finish it. Meeting the requirement in the spec is different here from doing something useful for users: what the spec demands is a yes/no answer as to whether the code is streamable; what users want to know is why, and what they need to change to make the code streamable. The challenge is to do this without users having to understand the difficult abstractions in the spec (posture, sweep, and the rest). John's tool produces an annotated expression tree revealing all the properties: that's great for a user who understands the methodology but probably rather bewildering to the typical end user. Doing the minimum for conformance, a tool that just says yes or no without saying why, involves a lot of work to get a "tick in the box" with a piece of software that no-one will ever use, but would be a lot easier to produce. 
Conformance has always been a very high priority for Saxonica, but I can't see anyone being happy with this particular solution.</div><div><br /></div><div>So, assuming the WG maintains its insistence on having this feature (and it seems to me likely that it will), what should we do about it?</div><div><br /></div><div>One option is simply to declare a non-conformance. Once upon a time, standards conformance was very important to Saxon's reputation in the market, but I doubt that this particular non-conformance would affect our sales.</div><div><br /></div><div>Another option is to declare conformance, do our best to achieve it using the current analysis technology, and simply log bugs if anyone reports use cases where we get the answer wrong. That seems sloppy and dishonest, and could leave us with a continuing stream of bugs to be fixed or ignored.</div><div><br /></div><div>Another option is the "minimal Plan B" analyser - a separate tool for streamability analysis that simply reports a yes/no answer (without explanation). It would be a significant piece of work to create this and test it, and it's unclear that anyone would use it, but it's probably the cheapest way of getting the conformance tick-in-the-box.</div><div><br /></div><div>A final option is to go for a "fully featured" but free-standing streamability analysis tool, one which aims not only to answer the conformance question about guaranteed streamability, but also to provide genuinely useful feedback and advice helping users to create streamable stylesheets. Of course ideally such a tool would be integrated into an IDE rather than being free-standing. I've always argued that there's only a need for one such tool: it's not something that every XSLT 3.0 processor needs to provide. 
Doing this well would be a large project and involves different skills from those we currently have available.</div><div><br /></div><div>In the short term, I think the only honest and affordable approach would be the first option: declare a non-conformance. Unfortunately that could threaten the viability of the spec, because we can only get a spec to Recommendation status if all features have been shown to be implementable.</div><div><br /></div><div>No easy answers.</div><div><br /></div><div><b>LATER</b></div><div><b><br /></b></div><div>I've been thinking about a Plan C which might fly...</div><div><br /></div><div>The idea here is to try and do the streamability analysis using the current expression tree structure and the current streamability logic, but applying the streamability rules to an expression tree that faithfully represents the stylesheet as parsed, with no modifications from type checking or optimization.</div><div><br /></div><div>To do this, we need to:</div><div><br /></div><div>* Define a configuration flag --strictStreamability which invokes the following logic.</div><div><br /></div><div>* Fix places where the initial expression tree loses information that's needed for streamability analysis. The two that come to mind are (a) losing the information that something is an instruction rather than an expression (e.g. we lose the distinction between xsl:map-entry and a singleton map expression) - this distinction is needed to assess calls on accumulator-after(); (b) turning path expressions A/B into docSort(A!B). There may be other cases that we will discover along the road (or fail to discover, since we may not have a complete set of test cases...)</div><div><br /></div><div>* Write a new type checker that attaches type information to this tree according to the rules in the XSLT 3.0 spec. 
This will be much simpler than the existing type checker, partly because the rules are much simpler, but more particularly because the only thing it will do is to assign static types: it will never report any type errors, and it will never inject any code to do run-time type checking or conversion.</div><div><br /></div><div>* Immediately after this type-checking phase, run the existing streamability rules against the expression tree. As far as I'm aware, the streamability rules in Saxon are equivalent to the W3C rules (at any rate, most of the original differences have now been eliminated).</div><div><br /></div><div>There are then two options. We could stop here: if the user sets the --strictStreamability flag, they get the report on streamability, but they don't get an executable that can actually be run. The alternative would be, if the streamability analysis succeeds, attempt to convert the expression tree into a form that we can actually use, by running the existing simplify / typecheck / optimize phases. The distinctions introduced to the expression tree by the changes described above would be eliminated by the simplify() phase, and we would then proceed along the current lines, probably including a rerun of the streamability analysis against the optimised expression tree (because the posture+sweep annotations are occasionally needed at run-time).</div><div><br /></div><div>I will do some further exploration to see whether this all looks feasible. It will be very hard to prove that we've got it 100% right. 
But in a sense that doesn't matter: so long as the design is sound and we're passing known tests, we can report honestly that, to the best of our knowledge, the requirement is satisfied, which is not the case with the current approach.</div><div><br /></div> Tuple types, and type aliases tag:dev.saxonica.com,2016:/blog/mike//3.216 2016-09-08T11:44:15Z 2016-09-08T12:28:21Z Michael Kay I've been experimenting with some promising Saxon extensions.<div><br /></div><div>Maps and arrays greatly increase the flexibility and power of the XPath / XSLT / XQuery type system. But one drawback is that the type declarations can be very cumbersome, and very uninformative.</div><div><br /></div><div>Suppose you want to write a library to handle arithmetic on complex numbers. How are you going to represent a complex number? There are several possibilities: as a sequence of two doubles (<b>xs:double*</b>); as an array of two doubles (<b>array(xs:double)</b>); or as a map, for example <b>map{"r": 0.0e0, "i": 0.0e0}</b> (which has type <b>map(xs:string, xs:double)</b>).</div><div><br /></div><div>Note that whichever of these choices you make, (a) your choice is exposed to the user of your library by the way you declare the type in your function signatures, (b) the type allows many values that aren't legitimate representations of complex numbers, and (c) there's nothing in the type declaration that tells the reader of your code that this has anything to do with complex numbers.</div><div><br /></div><div>I think we can tackle these problems with two fairly simple extensions to the language.</div><div><br /></div><div>First, we can define type aliases. 
For XSLT, I have implemented an extension that allows you to declare (as a top-level element anywhere in the stylesheet):</div><div><br /></div> <pre><b>&lt;saxon:type-alias name="complex"&nbsp;</b></pre><pre><b> type="map(xs:string, xs:double)"/&gt;</b></pre> <div>and then you can use this type alias (prefixed by a tilde) anywhere an item type is allowed, for example</div><div><br /></div> <pre style="font-size: 13px;"><b>&lt;xsl:variable name="i" as="~complex"&nbsp;</b></pre><pre style="font-size: 13px;"><b> select="cx:complex(0.0, 1.0)"/&gt;</b></pre> <div>Secondly, we can define tuple types. So we can instead define our complex numbers as:</div><div><br /></div> <pre style="font-size: 13px;"><b>&lt;saxon:type-alias name="complex"&nbsp;</b></pre><pre style="font-size: 13px;"><b> type="tuple(r: xs:double, i: xs:double)"/&gt;</b></pre> <div>We're not actually introducing tuples here as a fundamental new type with their own set of functions and operators. Rather, a tuple declaration defines constraints on a map. It lists the keys that must be present in the map, and the type of the value to be associated with each key. The keys here are the strings "r" and "i", and in both cases the value must be an xs:double. The keys are always NCNames, which plays well with the map lookup notation M?K; if $c is a complex number, then the real and imaginary parts can be referenced as $c?r and $c?i respectively.</div><div><br /></div><div>For this kind of data structure, tuple types provide a much more precise constraint over the contents of the map than the current map type does. 
They also provide much better static type checking: an expression such as $c?i can be statically checked to ensure (a) that "i" is actually a defined field in the tuple declaration, and (b) that the expression is used in a context where an xs:double value is expected.</div><div><br /></div><div>I've been a little wary in the past of putting syntax extensions into Saxon; conformance to standards has always been a primary goal. But the standards process seems to be running out of steam, and I'm beginning to feel that it's time to push a few innovative ideas out in the product to keep things moving forward. For those who would prefer to stick entirely to stuff defined by W3C, rest assured that these features will only be available if you explicitly enable extensions.</div> Improving Compile-Time Performance tag:dev.saxonica.com,2016:/blog/mike//3.215 2016-06-22T10:04:28Z 2016-06-22T11:51:14Z Michael Kay For years we've been putting more and more effort into optimizing queries and stylesheets so that they would execute as fast as possible. For many workloads, in particular high-throughput server-side transformations, that's a good strategy. But over the last year or two we've become aware that for some other workloads, it's the wrong thing to do.<div><br /></div><div>For example, if you're running a DocBook or DITA transformation from the command line, and the source document is only a couple of KB in size, then the time taken to compile the stylesheet greatly exceeds the actual transformation time. It might take 5 seconds to compile the stylesheet, and 50 milliseconds to execute it. (Both DocBook and DITA stylesheets are vast.) 
For many users, that's not an untypical scenario.</div><div><br /></div><div>If we look at the XMark benchmarks, specifically a query such as Q9, which is a fairly complex three-way join, the query executes against a 10Mb source document in just 9ms. But to achieve that, we spend 185ms compiling and optimizing the query. We also spend 380ms parsing the source document. So in an ad-hoc processing workflow, where you're compiling the query, loading a source document, and then running the query, the actual query execution cost is about 2% of the total. But it's that 2% that we've been measuring, and trying to reduce.</div><div><br /></div><div>We haven't entirely neglected the other parts of the process. For example, one of the most under-used features of the product is document projection, which enables you, during parsing, to filter out the parts of the document that the query isn't interested in. For query Q9 that cuts down the size of the source document by 65%, and reduces the execution time of the query to below 8ms. Unfortunately, although the memory saving is very useful, it actually increases the parsing time to 540ms. Some cases are even more dramatic: with Q2, the size of the source document is reduced by 97%; but parsing is still slowed down by the extra work of deciding which parts of the document to retain, and since the query only takes 2ms to execute anyway, there's no benefit other than the memory saving.</div><div><br /></div><div>For the DocBook and DITA scenarios (unlike XMark) it's the stylesheet compilation time that hurts, rather than the source document parsing time. For a typical DocBook transformation of a small document, I'm seeing a stylesheet compile time of around 3 seconds, source document parsing time of around 0.9ms, and transformation time also around 0.9ms. 
Clearly, compile time here is far more important than anything else.</div><div><br /></div><div>The traditional answer to this has always been to compile the stylesheet once and then use it repeatedly. That works if you're running hundreds of transformations using the same stylesheet, but there are many workflows where this is impractical.</div><div><br /></div><div>Saxon 9.7 makes a big step forward by allowing the compiled form of a stylesheet to be saved to disk. This work was done as part of the implementation of XSLT 3.0 packages, but it doesn't depend on packages in any way and works just as well with 1.0 and 2.0 stylesheets. If we export the DocBook stylesheets as a compiled package, and then run from this version rather than from source, the time taken for loading the compiled stylesheet is around 550ms rather than the original 3 seconds. That's a very useful saving, especially if you're processing lots of source documents using a pipeline written, say, using a shell script or Ant build, where the tools constrain you to run one transformation at a time. (To ensure that exported stylesheet packages work with tools such as Ant, we've implemented it so that in any API where a source XSLT stylesheet is accepted, we also accept an exported stylesheet package.)</div><div><br /></div><div>But the best performance improvements are those where you don't have to do anything different to get the benefits (cynically, only about 2% of users will ever read the release notes). So we've got a couple of further projects in the pipeline.</div><div><br /></div><div>The first is simply raw performance tuning of the optimizer. There's vast potential for this once we turn our minds to it. What we have today has grown organically, and the focus has always been on getting the last ounce of run-time performance regardless of how long it takes to achieve it. 
One approach is to optimize a bit less thoroughly: we've done a bit of that recently in response to a user bug report showing pathological compilation times on an extremely large (20Mb) automatically generated stylesheet. But a better approach is to think harder about the data structures and algorithms we are using.</div><div><br /></div><div>Over the last few days I've been looking at how we do loop-lifting: that is, identifying subexpressions that can be moved out of a loop because each evaluation will deliver the same result. The current approach is that the optimizer does a recursive walk of the expression tree, and at each node in the tree, the implementation of that particular kind of expression looks around to see what opportunities there are for local optimization. Many of the looping constructs (xsl:for-each, xsl:iterate, for expressions, filter expressions, path expressions) at this point initiate a search of the subtree for expressions that can be lifted out of the loop. This means that with nested loops (a) we're examining the same subtrees once for each level of loop nesting, and (b) we're hoisting the relevant expressions up the tree one loop at a time, rather than moving them straight to where they belong. This is not only a performance problem; the code is incredibly complex, it's hard to debug, and it's hard to be sure that it's doing as effective a job as it should (for example, I only found during this exercise that we aren't loop-lifting subexpressions out of xsl:for-each-group.)</div><div><br /></div><div>In 9.7, as reported in previous blog posts, we made some improvements to the data structures used for the expression tree, but so far we've been making rather little use of this. One improvement was to add parent pointers, which enables optimizations to work bottom-up rather than top-down. 
Another improvement was a generic structure for holding the links from a parent node to its children, using an Operand object that (a) holds properties of the relationship (e.g. it tells you when the child expression is evaluated with a different focus from the parent), and (b) is updatable, so a child expression can replace itself by some different expression without needing the parent expression to get involved. These two improvements have enabled a complete overhaul of the way we do loop-lifting. Without knowing anything about the semantics of different kinds of expressions, we can now do a two-phase process: first we do a scan over the expression tree for a function or template to identify, for each node in the tree, what its "innermost scoping node" is: for example an expression such as "$i + @x" is scoped both by the declaration of $i and by the instruction (e.g. xsl:for-each) that sets the focus, and the innermost scoping expression is the inner one of these two. Then, in a second pass, we hoist every expression that's not at the same looping level as its innermost scoping expression to be evaluated (lazily) outside that loop. The whole process is dramatically simpler and faster than what we were doing before, and at least as effective - possibly in some cases more so.</div><div><br /></div><div>The other project we're just starting on is to look at just-in-time compilation. The thing about stylesheets like DocBook is that they contain zillions of template rules for processing elements which typically don't appear in your average source document. So why waste time compiling template rules that are never used? All we really need to do is make a note of the match patterns, build the data structures we use to identify which rule is the best match for a node, and then do the work of compiling that rule the first time it is used. 
Indeed, the optimization and byte-code generation work can be deferred until we know that the rule is going to be used often enough to make it worthwhile. We're starting this project (as one should start all performance projects) by collecting instrumentation, so we can work out exactly how much time we are spending in each phase of compilation; that will tell us how much we should be doing eagerly and how much we should defer. There's a trade-off with usability here: do users want to be told about errors found while type-checking parts of the stylesheet that aren't actually exercised by a particular run?</div><div><br /></div><div>Plenty of ideas to keep us busy for a while to come.</div> Introducing Saxon-JS tag:dev.saxonica.com,2016:/blog/mike//3.214 2016-02-13T14:15:04Z 2016-02-13T14:17:07Z Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">At XML Prague yesterday we got a spontaneous round of applause when we showed the animated Knight's tour application, reimplemented to use XSLT 3.0 maps and arrays, running in the browser using a new product called Saxon-JS.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So, people will be asking, what exactly is Saxon-JS?</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Saxon-EE 9.7 introduces a new option -export which allows you to export a compiled stylesheet, in XML format, to a file: rather like producing a .so file from a C compiler, or a JAR file from a Java compiler. The compiled stylesheet isn't executable code; it's a decorated abstract syntax tree containing, in effect, the optimized stylesheet execution plan. 
There are two immediate benefits: loading a compiled stylesheet is much faster than loading the original source code, so if you are executing the same stylesheet repeatedly the cost of compilation is amortized; and in addition, it enables you to distribute XSLT code to your users with a degree of intellectual property protection analogous to that obtained from compiled code in other languages. (As with Java, it's not strong encryption - it wouldn't be too hard to write a fairly decent decompiler - but it's strong enough that most people won't attempt it.)</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Saxon-JS is an interpreter, written in pure JavaScript, that takes these compiled stylesheet files and executes them in a JavaScript environment - typically in the browser, or on Node.js. Most of our development and testing is actually being done using Nashorn, a JavaScript engine bundled with Java 8, but that's not a serious target environment for Saxon-JS because if you've got Nashorn then you've got Java, and if you've got Java then you don't need Saxon-JS.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Saxon-JS can also be seen as a rewrite of Saxon-CE. Saxon-CE was our first attempt at doing XSLT 2.0 in the browser. It was developed by producing a cut-down version of the Java product, and then cross-compiling this to JavaScript using Google's GWT cross-compiler. The main drawbacks of Saxon-CE, at a technical level, were the size of the download (800Kb or so), and the dependency on GWT, which made testing and debugging extremely difficult - for example, there was no way of testing our code outside a browser environment, which made running of automated test scripts very time-consuming and labour-intensive. 
There were also commercial factors: Saxon-CE was based on a fork of the Saxon 9.3 Java code base and re-basing to a later Saxon version would have involved a great deal of work; and there was no revenue stream to fund this work, since we found a strong expectation in the market that this kind of product should be free. As a result we effectively allowed the product to become dormant.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We'll have to see whether Saxon-JS can overcome these difficulties, but we think it has a better chance. Because it depends on Saxon-EE for the front-end (that is, there's a cost to developers but the run-time will be free) we're hoping that there'll be a revenue stream to finance support and ongoing development; and although the JS code is not just a fork but a complete rewrite of the run-time code, the fact that it shares the same compiler front end means that it should be easier to keep in sync.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Development has been incredibly rapid - we only started coding at the beginning of January, and we already have about 80% of the XSLT 2.0 tests running - partly because JavaScript is a powerful language, but mainly because there's little new design involved. We know how an XSLT engine works; we only have to decide which refinements to leave out. We've also done client-side XSLT before, so we can take the language extensions of Saxon-CE (how to invoke templates in response to mouse events, for example), the design of its JavaScript APIs, and also some of its internal design (like the way event bubbling works) and reimplement these for Saxon-JS.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">One of the areas where we have to make design trade-offs is deciding how much standards conformance, performance, and error diagnostics to sacrifice in the interests of keeping the code small. 
There are some areas where achieving 100% conformance with the W3C specs will be extremely difficult, at least until ES6 is available everywhere: an example is support for Unicode in regular expressions. For performance, memory usage (and therefore expression pipelining) is important, but getting the last ounce of processor efficiency less so. An important factor (which we never got quite right for Saxon-CE) is asynchronous access to the server for the doc() and document() functions - I have ideas on how to do this, but it ain't easy.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">It will be a few weeks before the code is robust enough for an alpha release, but we hope to get this out as soon as possible. There will probably then be a fairly extended period of testing and polishing - experience suggests that when the code is 90% working, you're less than half way there.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I haven't yet decided on the licensing model. JavaScript by its nature has no technical protection, but that doesn't mean we have to give it an open source license (which would allow anyone to make changes, or to take parts of the code for reuse in other projects).</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">All feedback is welcome, especially on opportunities for exploiting the technology in ways that we might not have thought of.</p> Parent pointers in the Saxon expression tree tag:dev.saxonica.com,2015:/blog/mike//3.213 2015-09-11T19:38:54Z 2015-09-11T20:21:38Z 
Michael Kay A while ago (<a href="http://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html">http://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html</a>) I wrote about my plans for the Saxon expression tree. This note is an update.<div><br /></div><div>We've made a number of changes to the expression tree for 9.7.</div><div><br /></div><div><ul><li><span style="font-size: 1em;">Every node in the tree (every expression) now references a Location object, providing location information for diagnostics (line number, column number, etc). Previously the expression node implemented the SourceLocator interface, which meant it provided this information directly. The benefit is that we can now have different kinds of Location object. In XQuery we will typically hold the line and column and module URI. In XSLT, for a subexpression within an XPath expression, we can now hold both the offset within the XPath expression, and the path to the containing node within the XSLT stylesheet. Hopefully debuggers and editing tools such as oXygen and Stylus Studio will be able to take advantage of the improved location information to lead users straight to the error location in the editor. Where an expression has the same location information as its parent or sibling expressions, the Location object is shared.</span></li></ul></div><div><span style="font-size: 1em;"><br /></span></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><div>Another reason for changing the way we hold location information is connected with the move to separately-compiled packages in XSLT 3.0. 
This means that the system we previously used, of globally unique integer "location identifiers" which are translated into real location information by reference to a central "location provider" service, is no longer viable.&nbsp;</div></div></blockquote><div><br /><ul><li>Every node in the tree now points to a RetainedStaticContext object which holds that part of the static context which can vary from one expression to another, and which can be required at run-time. Previously we only attempted to retain the parts of the static context that each kind of expression actually used. The parts of the static context that this covers include the static base URI, in-scope namespaces, the default collation, and the XPath 1.0 compatibility flag. Retaining the whole static context might seem extravagant. But in fact, it very rarely changes, so a child expression will nearly always point to the same RetainedStaticContext object as its parent and sibling expressions.</li></ul><div><br /></div></div><div><ul><li>Every node in the tree now points to its parent node. This choice has proved tricky. It gives many advantages: it means that the code for every expression can easily find details of the containing package, the configuration options, and a host of details about the query or stylesheet as a whole. The fact that we have a parent node eliminates the need for the "container" object (typically the containing function or template) which we held in previous releases. It also reduces the need to pass additional information to methods on the Expression class, for example methods to determine the item type and cardinality of the expression. There is a significant downside to holding this information, which is the need to keep it consistent. Some of the tree rewrite operations performed by the optimizer are complex enough without having to worry about keeping all the parent pointers correct. 
And it turns out to be quite difficult to enforce consistency through the normal "private data, public methods" encapsulation techniques: those work when you have to keep the data in a single object consistent, but they aren't much use for maintaining mutual consistency between two different objects. In any case it seems to be unavoidable that to achieve the kind of tree rewrites we want to perform, the tree has to be temporarily inconsistent at various stages.</li></ul><span style="font-size: 1em;">&nbsp;</span></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><span style="font-size: 1em;">Using parent pointers means that you can't share subtrees. It means that when you perform operations like inlining a function, you can't just reference the subtree that formed the body of the function, you have to copy it. This might seem a great nuisance. But actually, this is not a new constraint. It never was safe to share subtrees, because the&nbsp;</span>optimiser would happily make changes to a subtree without knowing that there were other interested parties. The bugs this caused have been an irritation for years. The introduction of parent pointers makes the constraint more explicit, and makes it possible to perform integrity checking on the tree to discover when we have inadvertently violated the constraints.</div><div><br /></div><div>During development we've had diagnostic code switched on that checks the integrity of the tree and outputs warnings if problems are found. We've gradually been examining these and eliminating them. The problems can be very hard to diagnose, because the detection of a problem in the data may indicate an error that occurred in a much earlier phase of processing. We've developed some diagnostic tools for tracing the changes made to a particular part of the tree and correlating these with the problems detected later. Most of the problems, as one might expect, are connected with optimization rewrites. 
A particular class of problem occurs with rewrites that are started but then not completed (because problems are found), or with "temporary" rewrites that are designed to create an equivalent expression suitable for analysis (say for streamability analysis or for schema-aware static type-checking) but which are not actually intended to affect the run-time interpreted tree. The discipline in all such cases is to copy the part of the tree you want to work on, rather than making changes in situ.</div><div><br /></div><div>For some non-local rewrites, such as loop-lifting optimizations, the best strategy seems to be to ignore the parent pointers until the rewrite is finished, and then restore them during a top-down tree-walk.</div><div><br /></div><div>The fact that we now have parent pointers makes context-dependent optimizations much easier. Checking, for example, whether a variable reference occurs within a loop (a "higher-order expression" as the XSLT 3.0 spec calls it) is now much easier: it can be done by searching upwards from the variable reference rather than retaining context information in an expression visitor as you walk downwards. Similarly, if there is a need to replace one expression by another (a variable reference by a literal constant, say), the fact that the variable reference knows its own parent makes the substitution much easier.</div><div><br /></div><div>So although the journey has had a few bumps, I'm reasonably confident that we will see long-term benefits.</div></blockquote><div><div><br /></div></div><div><br /></div><div><br /></div><div><br /></div> Lazy Evaluation tag:dev.saxonica.com,2015:/blog/mike//3.212 2015-06-28T23:15:13Z 2015-06-28T23:27:43Z We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using &lt;xsl:for-each&gt; with the saxon:threads attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the... 
Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using </span><font face="American Typewriter, monospace" style="font-size: 1em;">&lt;xsl:for-each&gt;</font><span style="font-size: 1em;"> with the </span><font face="American Typewriter, monospace" style="font-size: 1em;">saxon:threads</font><span style="font-size: 1em;"> attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the lazy evaluation code in Saxon to try and understand all the permutations of how it works. A blog posting is a good way to try and capture that understanding before I forget it all again. But I won't go into the extra complexities of parallel execution just yet: I'll come back to that at the end.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Lazy evaluation applies when a variable binding, for example "<font face="American Typewriter, monospace">let $v := //x[@y=3]</font>" isn't evaluated immediately when the variable declaration is encountered, but only when the variable is actually referenced. This is possible in functional languages because evaluating an expression has no side-effects, so it doesn't matter when (or how often) it is done. In some functional languages such as Scheme, lazy evaluation happens only if you explicitly request it. In others, such as Haskell, lazy evaluation is mandated by the language specification (which means that a variable can hold an infinite sequence, so long as you don't try to process its entire value). 
In XSLT and XQuery, lazy evaluation is entirely at the discretion of the compiler, and in this post I shall try to summarize how Saxon makes use of this freedom.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Internally, when a local variable is evaluated lazily, Saxon does not put the variable's value in the relevant slot on the stack; instead it puts there a data structure that contains all the information needed to evaluate the variable: that is, the expression itself, and any part of the evaluation context on which it depends. In Saxon this data structure is called a Closure. The terminology isn't quite right, because it's not quite the same thing as the closure of an inline function, but the concepts are closely related: in some languages, lazy evaluation is implemented by storing, as the value of the variable, not the variable's actual value, but a function which delivers that value when invoked, and the data needed by this function to achieve that task is correctly called a closure. (If higher-order functions had been available in Saxon a few years earlier, we might well have implemented lazy evaluation this way.) </p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We can distinguish two levels of lazy evaluation. We might use the term "deferred evaluation" to indicate that a variable is not evaluated until it is first referenced, and "incremental evaluation" to indicate that when it is referenced, it is only evaluated to the extent necessary. 
For example, if the first reference is the function call <font face="American Typewriter, monospace">head($v)</font>, only the first item in the sequence $v will be evaluated; remaining items will only be evaluated if a subsequent reference to the variable requires them.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Lazy evaluation can apply to global variables, local variables, parameters of templates and functions, and return values from templates and functions. Saxon handles each case slightly differently.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We should mention some static optimizations which are not directly related to lazy evaluation, but are often confused with it. First, a variable that is never referenced is eliminated at compile-time, so its initializing expression is never evaluated at all. Secondly, a variable that is only referenced once, and where the reference is not in any kind of loop, is inlined: that is, the variable reference is replaced by the expression used to initialize the variable, and the variable itself is then eliminated. So when someone writes "<font face="American Typewriter, monospace">let $x := /a/b/c return $x[d=3]</font>", Saxon turns this into the expression "<font face="American Typewriter, monospace">(/a/b/c)[d=3]</font>". (Achieving this of course requires careful attention to the static and dynamic context, but we won't go into the details here.)</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Another static optimization that interacts with variable evaluation is loop-lifting. If an expression within a looping construct (for example the content of xsl:for-each, or of a predicate, or the right-hand-side of the "/" operator) will have the same value for every iteration of the loop, then a new local variable bound to this expression is created outside the loop, and the original expression is replaced by a reference to the variable. 
In this situation we need to take care that the expression is not evaluated unless the loop is executed at least once (both to avoid wasted evaluation cost, and to give the right behaviour in the event that evaluating the expression fails with a dynamic error.) So lazy evaluation of such a variable becomes mandatory.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The combined effect of these static optimizations, together with lazy evaluation, is that the order of evaluation of expressions can be quite unintuitive. To enable users to understand what is going on when debugging, it is therefore normal for some of these rewrites to be suppressed if debugging or tracing are enabled.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For global variables, Saxon uses deferred evaluation but not incremental evaluation. A global variable is not evaluated until it is first referenced, but at that point it is completely evaluated, and the sequence representing its value is held in memory in its entirety.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For local variables, evaluation is generally both deferred and incremental. However, the rules are quite complex.</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">If the static type shows that the value will be a singleton, then it will be evaluated eagerly. [It's not at all clear that this rule makes sense. Certainly, incremental evaluation makes no sense for singletons. 
But deferred evaluation could still be very useful, for example if the evaluation is expensive and the variable is only referenced within a branch of a conditional, so the value is not always needed.]</p> </li><li><p style="margin-bottom: 0cm">Eager evaluation is used when the binding expression is very simple: in particular when it is a literal or a reference to another variable.</p> </li><li><p style="margin-bottom: 0cm">Eager evaluation is used for binding expressions that depend on <font face="American Typewriter, monospace">position()</font> or <font face="American Typewriter, monospace">last()</font>, to avoid the complexities of saving these values in the Closure.</p> </li><li><p style="margin-bottom: 0cm">There are some optimizations which take precedence over lazy evaluation. For example if there are variable references using predicates, such as <font face="American Typewriter, monospace">$v[@x=3]</font>, then the variable will not only be evaluated eagerly, but will also be indexed on the value of the attribute <font face="American Typewriter, monospace">@x</font>. Another example: if a variable is initialized to an expression such as <font face="American Typewriter, monospace">($v, x)</font> - that is, a sequence that appends an item to another variable - then we use a "shared append expression" which is a data structure that allows a sequence to be constructed by appending to an existing sequence without copying the entire sequence, which is a common pattern in algorithms using head-tail recursion.</p> </li><li><p style="margin-bottom: 0cm">Lazy evaluation (and inlining) need special care if the variable is declared outside a try/catch block, but is referenced within it. In such a case a dynamic error that occurs while evaluating the initialization expression must not be caught by the try/catch; it is logically outside its scope. 
(Writing this has made me realise that this is not yet implemented in Saxon; I have written a test case and it currently fails.)</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">If none of these special circumstances apply, lazy evaluation is chosen. There is one more choice to be made: between a <font face="American Typewriter, monospace">Closure</font> and a <font face="American Typewriter, monospace">MemoClosure</font>. The common case is a <font face="American Typewriter, monospace">MemoClosure</font>, and in this case, as the variable is incrementally evaluated, the value is saved for use when evaluating subsequent variable references. A (non-memo) closure is used when it is known that the value will only be needed once. Because most such cases have been handled by variable inlining, the main case where a non-memo closure is used is for the return value of a function. Functions, like variables, are lazily evaluated, so that the value returned to the caller is not actually a sequence in memory, but a closure containing all the information needed to materialize the sequence. (Like most rules in this story, there is an important exception: tail-call optimization, where the last thing a function does is to call itself, takes precedence over lazy evaluation).</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So let's look more closely at the <font face="American Typewriter, monospace">MemoClosure</font>. A <font face="American Typewriter, monospace">MemoClosure</font> is a data structure that holds the following information:</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">The Expression itself (a pointer to a node in the expression tree). 
The Expression object also holds any information from the static context that is needed during evaluation, for example namespace bindings.</p> </li><li><p style="margin-bottom: 0cm">A copy of the dynamic context at the point where the variable is bound. This includes the context item, and values of any local variables referenced by the expression.</p> </li><li><p style="margin-bottom: 0cm">The current evaluation state: one of UNREAD (no access to the variable has yet been made), MAYBE_MORE (some items in the value of the variable are available, but there may be more to come), ALL_READ (the value of the variable is fully available), BUSY (the variable is being evaluated), or EMPTY (special case of ALL_READ in which the value is known to be an empty sequence).</p> </li><li><p style="margin-bottom: 0cm">An <font face="American Typewriter, monospace">InputIterator</font>: an iterator over the results of the expression, relevant when evaluation has started but has not finished</p> </li><li><p style="margin-bottom: 0cm">A reservoir: a list containing the items delivered by the InputIterator so far.</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"> Many variable references, for example <font face="American Typewriter, monospace">count($v)</font>, or <font face="American Typewriter, monospace">index-of($v, 'z')</font> result in the variable being evaluated in full. 
If this is the first reference to the variable, that is if the state is UNREAD, the logic is essentially</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><font face="American Typewriter, monospace">inputIterator = expression.iterate(savedContext);</font></p> <p style="margin-bottom: 0cm"><font face="American Typewriter, monospace">for item in inputIterator {</font></p> <p style="margin-bottom: 0cm"> <font face="American Typewriter, monospace">&nbsp; &nbsp;reservoir.add(item);</font></p> <p style="margin-bottom: 0cm"><font face="American Typewriter, monospace">}</font></p> <p style="margin-bottom: 0cm"><font face="American Typewriter, monospace">state = ALL_READ;</font></p> <p style="margin-bottom: 0cm"><font face="American Typewriter, monospace">return new SequenceExtent(reservoir);</font></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">(However, Saxon doesn't optimize this case, and it occurs to me on writing this that it could.)</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Other variable references, such as <font face="American Typewriter, monospace">head($v)</font>, or <font face="American Typewriter, monospace">$v[1]</font>, or <font face="American Typewriter, monospace">subsequence($v, 1, 5)</font>, require only partial evaluation of the expression. In such cases Saxon creates and returns a <font face="American Typewriter, monospace">ProgressiveIterator</font>, and the requesting expression reads as many items from the <font face="American Typewriter, monospace">ProgressiveIterator</font> as it needs. Requests to get items from the <font face="American Typewriter, monospace">ProgressiveIterator</font> fetch items from the reservoir to the extent they are available; on exhaustion of the reservoir, they then attempt to fetch items from the InputIterator until either enough items are available, or the InputIterator is exhausted. 
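The interplay of reservoir and input iterator can be sketched in a few lines of Java. This is a single-threaded illustration under stated assumptions, not Saxon's implementation: the class name MemoSequence, the use of a Supplier to stand in for "the expression plus its saved context", and the reduced set of states (BUSY and EMPTY are omitted) are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical, single-threaded sketch of a memoizing lazy sequence; names are illustrative only.
final class MemoSequence<T> {
    enum State { UNREAD, MAYBE_MORE, ALL_READ }

    private final Supplier<Iterator<T>> expression;   // stands in for expression + saved context
    private final List<T> reservoir = new ArrayList<>();
    private Iterator<T> inputIterator;                // created on first access only
    private State state = State.UNREAD;

    MemoSequence(Supplier<Iterator<T>> expression) {
        this.expression = expression;
    }

    State state() {
        return state;
    }

    /** Each reference to the variable gets its own iterator; all of them share one reservoir. */
    Iterator<T> iterate() {
        return new Iterator<T>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                // Serve from the reservoir if possible, otherwise try to pull one more item.
                return position < reservoir.size() || pullOne();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return reservoir.get(position++);
            }
        };
    }

    /** Moves one item from the input into the reservoir; returns false once the input is exhausted. */
    private boolean pullOne() {
        if (state == State.ALL_READ) {
            return false;
        }
        if (state == State.UNREAD) {
            inputIterator = expression.get();
            state = State.MAYBE_MORE;
        }
        if (inputIterator.hasNext()) {
            reservoir.add(inputIterator.next());
            return true;
        }
        state = State.ALL_READ;
        inputIterator = null;   // nothing more to read, so release it
        return false;
    }
}
```

A reference that needs only the first item pulls exactly one item from the underlying input, and a later reference that reads further reuses what is already in the reservoir before pulling more.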
Items delivered from the <font face="American Typewriter, monospace">InputIterator</font> are copied to the reservoir as they are found.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So far so good. This has all been in place for years, and works well. We have no evidence that it is in any way optimal, but it has been carefully tweaked over the years to deal with particular cases where it was performing badly. What has changed recently is that local variables can be referenced from multiple threads. There are two particular cases where this happens today: when <font face="American Typewriter, monospace">xsl:result-document</font> is used in Saxon-EE, it executes by default asynchronously in a new thread; and when the extension attribute <font face="American Typewriter, monospace">saxon:threads</font> is used on <font face="American Typewriter, monospace">xsl:for-each</font>, the items selected by the <font face="American Typewriter, monospace">xsl:for-each</font> are processed in parallel rather than sequentially.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The effect of this is that the MemoClosure object needs to be thread-safe: multiple requests to access the variable can come simultaneously from different threads. To achieve this a number of methods are synchronized. One of these is the <font face="American Typewriter, monospace">next()</font> method of the <font face="American Typewriter, monospace">ProgressiveIterator</font>: if two threads reference the variable at the same time, each gets its own <font face="American Typewriter, monospace">ProgressiveIterator</font>, and the <font face="American Typewriter, monospace">next()</font> method on one of these iterators is forced to wait until the other has finished.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This works, but it is risky. 
Brian Goetz in his excellent book <i>Java Concurrency in Practice</i> recommends that a method should not be synchronized unless (a) its execution time is short, and (b) as the author of the method, you know exactly what code will execute while it is active. In this case neither condition is satisfied. The <font face="American Typewriter, monospace">next()</font> method of <font face="American Typewriter, monospace">ProgressiveIterator</font> calls the <font face="American Typewriter, monospace">next()</font> method of the <font face="American Typewriter, monospace">InputIterator</font>, and this may perform expensive computation, for example retrieving and parsing a document using the <font face="American Typewriter, monospace">doc()</font> function. Further, we have no way of analyzing exactly what code is executed: in the worst case, it may include user-written code (for example, an extension function or a <font face="American Typewriter, monospace">URIResolver</font>). The mechanism can't deadlock with itself (because there cannot be a cycle of variable references) but it is practically impossible to prove that it can't deadlock with other subsystems that use synchronization, and in the face of maliciously-written user code, it's probably safe to assume that deadlock <b>can</b> occur. We haven't seen deadlock happen in practice, but it's unsatisfactory that we can't prove its impossibility.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So what should we do about it?</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I think the answer is, add yet another exception to the list of cases where lazy evaluation is used: specifically, don't use it for a variable that can be referenced from a different thread. I'm pretty sure it's possible to detect such cases statically, and they won't be very common. 
In such cases, use eager evaluation instead.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We must be careful not to do this in the case of a loop-lifted variable, where the correct error semantics depend on lazy evaluation. So another tweak to the rules is, don't loop-lift code out of a multithreaded execution block.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This investigation also suggests a few other refinements we might make.</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">It seems worth optimizing for the case where the entire value of a variable is needed, since this case is so common. The problem is, it's not easy to detect this case: a calling expression such as <font face="American Typewriter, monospace">count($v)</font> will ask for an iterator over the variable value, without giving any indication that it intends to read the iterator to completion.</p> </li><li><p style="margin-bottom: 0cm">We need to reassess the rule that singleton local variables are evaluated eagerly.</p> </li><li><p style="margin-bottom: 0cm">We currently avoid using lazy evaluation for expressions with certain dependencies on the dynamic context (for example, <font face="American Typewriter, monospace">position()</font> and <font face="American Typewriter, monospace">last()</font>). But in the course of implementing higher-order functions, we have acquired the capability to hold such values in a saved copy of the dynamic context.</p> </li><li><p style="margin-bottom: 0cm">We could look at a complete redesign that takes advantage of higher-order functions and their closures. 
This might be much simpler than the current design; but it would discard the benefits of years of fine-tuning of the current design.</p> </li><li><p style="margin-bottom: 0cm">I'm not convinced that it makes sense for a <font face="American Typewriter, monospace">MemoClosure</font> to defer creation of the <font face="American Typewriter, monospace">InputIterator</font> until the first request for the variable value. It would be a lot simpler to call <font face="American Typewriter, monospace">inputIterator = Expression.iterate(context)</font> at the point of variable declaration; in most cases the implementation will defer evaluation to the extent that this makes sense, and this approach saves the cost of the elaborate code to save the necessary parts of the dynamic context. It's worth trying the other approach and making some performance measurements. </p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><br /> </p> A redesign of the NamePool tag:dev.saxonica.com,2015:/blog/mike//3.211 2015-06-24T14:39:33Z 2015-06-24T15:19:25Z As explained in my previous post, the NamePool in Saxon is a potential problem for scalability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum... Michael Kay As explained in my previous post, the NamePool in Saxon is a potential problem for scalability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum of one million QNames, and performance starts getting seriously bad long before this limit is reached.<div><br /></div><div>Essentially, the old NamePool is a home-grown hash table. It uses a fixed number of buckets (1024), and when hash collisions occur, the chains of hash duplicates are searched serially. 
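For readers who want to picture that structure, here is a minimal sketch in Java of a fixed-bucket, append-only name table. It is an illustration only, not the NamePool source: the class FixedBucketNameTable, the use of plain strings instead of QNames, the sentinel entry at the head of each chain, and the encoding of a code as (chain position &lt;&lt; 10) | bucket are all assumptions made for this example.

```java
// Hypothetical sketch of a fixed-bucket, append-only name table; not Saxon's NamePool.
final class FixedBucketNameTable {
    private static final int BUCKETS = 1024;   // fixed for the life of the table

    private static final class Entry {
        final String name;
        volatile Entry next;   // only ever changes from null to a newly appended entry

        Entry(String name) {
            this.name = name;
        }
    }

    private final Entry[] buckets = new Entry[BUCKETS];

    FixedBucketNameTable() {
        // One sentinel per bucket, so the array itself is never written after construction.
        for (int i = 0; i < BUCKETS; i++) {
            buckets[i] = new Entry(null);
        }
    }

    private static int bucketOf(String name) {
        return (name.hashCode() & 0x7fffffff) % BUCKETS;
    }

    /** Unlocked read: returns the code allocated to a name, or -1 if there is none. */
    int lookup(String name) {
        int bucket = bucketOf(name);
        int depth = 0;
        for (Entry e = buckets[bucket].next; e != null; e = e.next) {
            if (e.name.equals(name)) {
                return (depth << 10) | bucket;
            }
            depth++;
        }
        return -1;
    }

    /** Locked write: searches the chain serially and appends at the end if the name is new. */
    synchronized int allocate(String name) {
        int bucket = bucketOf(name);
        int depth = 0;
        Entry last = buckets[bucket];
        for (Entry e = last.next; e != null; e = e.next) {
            if (e.name.equals(name)) {
                return (depth << 10) | bucket;
            }
            last = e;
            depth++;
        }
        if (depth >= 1024) {
            throw new IllegalStateException("chain full for bucket " + bucket);
        }
        last.next = new Entry(name);
        return (depth << 10) | bucket;
    }

    /** Recovers a name from its code by walking back into the table; null if the code is unknown. */
    String nameOf(int code) {
        Entry e = buckets[code & 0x3ff].next;
        for (int i = 0; i < (code >> 10) && e != null; i++) {
            e = e.next;
        }
        return e == null ? null : e.name;
    }
}
```

Because existing entries never move and the bucket count never changes, a reader walking a chain sees either the old end of the chain or the new one. The price is visible in lookup and allocate: every search walks its chain from the start, so cost grows with chain length.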
The fact that the number of buckets is fixed, and entries are only added to the end of a chain, is what makes it (reasonably) safe for read access to the pool to occur without locking.</div><div><br /></div><div>One thing I have been doing over a period of time is to reduce the amount of unnecessary use of the NamePool. Most recently I've changed the implementation of the schema component model so that references from one schema component to another are no longer implemented using NamePool fingerprints. But this is peripheral: the core usage of the NamePool for comparing names in a query against names in a source document will always remain the dominant usage, and we need to make this scalable as parallelism increases.</div><div><br /></div><div>Today I've been exploring an alternative design for the NamePool (and some variations on the implementation of the design). The new design has at its core two Java ConcurrentHashMaps, one from QNames to fingerprints, and one from fingerprints to QNames. The ConcurrentHashMap, which was introduced in Java 5, doesn't just offer safe multi-threaded access, it also offers very low contention: it uses fine-grained locking to ensure that multiple writers, and any number of readers, can access the data structure simultaneously.</div><div><br /></div><div>Using two maps, one of which is the inverse of the other, at first seemed a problem. How can we ensure that the two maps are consistent with each other, without updating both under an exclusive lock, which would negate all the benefits? 
The answer is that we can't completely, but we can get close enough.</div><div><br /></div><div>The logic is like this:</div><div><br /></div><div><pre><pre style="color: rgb(0, 0, 0); font-size: 12pt;"><font style="font-size: 0.8em;"><span style="color: rgb(0, 0, 128); font-weight: bold;">private final </span>ConcurrentHashMap&lt;StructuredQName, Integer&gt; <span style="color: rgb(102, 14, 122); font-weight: bold;">qNameToInteger </span>= <span style="color: rgb(0, 0, 128); font-weight: bold;">new </span>ConcurrentHashMap&lt;StructuredQName, Integer&gt;(<span style="color: rgb(0, 0, 255);">1000</span>);<br /><span style="color: rgb(0, 0, 128); font-weight: bold;">private final </span>ConcurrentHashMap&lt;Integer, StructuredQName&gt; <span style="color: rgb(102, 14, 122); font-weight: bold;">integerToQName </span>= <span style="color: rgb(0, 0, 128); font-weight: bold;">new </span>ConcurrentHashMap&lt;Integer, StructuredQName&gt;(<span style="color: rgb(0, 0, 255);">1000</span>);<br /><span style="color: rgb(0, 0, 128); font-weight: bold;">private </span>AtomicInteger <span style="color: rgb(102, 14, 122); font-weight: bold;">unique </span>= <span style="color: rgb(0, 0, 128); font-weight: bold;">new </span>AtomicInteger();</font></pre><pre style="color: rgb(0, 0, 0);">// Allocate fingerprint to QName</pre><pre style="color: rgb(0, 0, 0); font-size: 12pt;"><font style="font-size: 0.8em;">Integer existing = <span style="color: rgb(102, 14, 122); font-weight: bold;">qNameToInteger</span>.get(qName);<br /><span style="color: rgb(0, 0, 128); font-weight: bold;">if </span>(existing != <span style="color: rgb(0, 0, 128); font-weight: bold;">null</span>) {<br /> <span style="color: rgb(0, 0, 128); font-weight: bold;">return </span>existing;<br />}<br />Integer next = <span style="color: rgb(102, 14, 122); font-weight: bold;">unique</span>.getAndIncrement();<br />existing = <span style="color: rgb(102, 14, 122); font-weight: bold;">qNameToInteger</span>.putIfAbsent(qName, 
next);<br /><span style="color: rgb(0, 0, 128); font-weight: bold;">if </span>(existing == <span style="color: rgb(0, 0, 128); font-weight: bold;">null</span>) {<br /> <span style="color: rgb(102, 14, 122); font-weight: bold;">integerToQName</span>.put(next, qName);<br /> <span style="color: rgb(0, 0, 128); font-weight: bold;">return </span>next;<br />} <span style="color: rgb(0, 0, 128); font-weight: bold;">else </span>{<br /> <span style="color: rgb(0, 0, 128); font-weight: bold;">return </span>existing;<br />}</font></pre><pre style="color: rgb(0, 0, 0); font-size: 12pt;"><span style="color: rgb(51, 51, 51); font-family: arial, helvetica, hirakakupro-w3, osaka, 'ms pgothic', sans-serif; font-size: 13px; white-space: normal;">Now, there are several things slightly unsafe about this. We might find that the QName doesn't exist in the map on our first look, but by the time we get to the "putIfAbsent" call, someone else has added it. The worst that happens here is that we've used up an integer from the "unique" sequence unnecessarily. Also, someone else doing concurrent read access might see the NamePool in a state where one map has been updated and the other hasn't. But I believe this doesn't matter: clients aren't going to look for a fingerprint in the map unless they have good reason to believe that fingerprint exists, and it's highly implausible that this knowledge comes from a different thread that has only just added the fingerprint to the map.</span></pre><pre><font face="arial, helvetica, hirakakupro-w3, osaka, ms pgothic, sans-serif"><span style="white-space: normal;">There's another ConcurrentHashMap involved as well, which is a map from URIs to lists of prefixes used in conjunction with that URI. I won't go into that detail.</span></font></pre><pre><font face="arial, helvetica, hirakakupro-w3, osaka, ms pgothic, sans-serif"><span style="white-space: normal;">The external interface to the NamePool doesn't change at all by this redesign. 
We still use 20-bit fingerprints plus 10-bit prefix codes, so we still have the limit of a million distinct names. But performance no longer degrades when we get close to that limit; and the limit is no longer quite so hard-coded.</span></font></pre><pre><font face="arial, helvetica, hirakakupro-w3, osaka, ms pgothic, sans-serif"><span style="white-space: normal;">My first attempt at measuring the performance of this found the expected benefits in scalability as the concurrency increases and as the size of the vocabulary increases, but the performance under more normal conditions was worse than the existing design: execution time of 5s versus 3s for executing 100,000 cycles each of which performed an addition (from a pool of 10,000 distinct names so 90% of the additions were already present) followed by 20 retrievals.</span></font></pre><pre><font face="arial, helvetica, hirakakupro-w3, osaka, ms pgothic, sans-serif"><span style="white-space: normal;">I suspected that the performance degradation was caused by the need to update two maps, whereas the existing design only uses one (it's cleverly done so that the fingerprint generated for a QName is closely related to its hash key, which enables us to use the fingerprint to navigate back into the hash table to reconstruct the original QName).</span></font></pre><pre><font face="arial, helvetica, hirakakupro-w3, osaka, ms pgothic, sans-serif"><span style="white-space: normal;">But it turned out that the cause was somewhere else. The old NamePool design was hashing QNames by considering only the local part of the name and ignoring the namespace URI, whereas the new design was computing a hash based on both the local name and the URI. Because URIs are often rather long, computing the hash code is expensive, and in this case it adds very little value: it's unusual for the same local name to be associated with more than one URI, and when it happens, the hash table is perfectly able to cope with the collision. 
By changing the hashing on QName objects to consider only the local name, the costs for the new design came down slightly below the current implementation (about 10% better, not enough to be noticeable).</span></font></pre><pre><font face="arial, helvetica, hirakakupro-w3, osaka, ms pgothic, sans-serif"><span style="white-space: normal;">So I feel comfortable putting this into production. There are a dozen test cases failing (out of 20,000) which I need to sort out first, but it all looks very promising.</span></font></pre></pre><pre style="color: rgb(0, 0, 0); font-family: Menlo; font-size: 12pt;"><br /></pre></div> Another look in the NamePool tag:dev.saxonica.com,2015:/blog/mike//3.210 2015-06-22T08:59:45Z 2015-06-22T09:09:59Z I've been looking again at the implementation of some of the parallel processing features in Saxon (see my XMLPrague 2015 paper) and at how best to make use of the facilities in the Java platform to support them. In... Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">I've been looking again at the implementation of some of the parallel processing features in Saxon (see my <a href="http://www.saxonica.com/papers/xmlprague-2015mhk.pdf">XMLPrague 2015 paper</a></span><span style="font-size: 1em;">) and at how best to make use of the facilities in the Java platform to support them. In the course of this I've been studying Brian Goetz's excellent book Java Concurrency in Practice, which although dated, is still probably the best text available on the subject; and the fact that it only takes you up to Java 6 is an advantage in our case, because we still want Saxon to run on Java 6.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Reading the book has made me think again about the venerable design of Saxon's NamePool, which is the oldest thing in the product where multithreading is relevant. 
The NamePool is basically a shared data structure holding QNames.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The design of the NamePool hasn't changed much over the years. On the whole it works well, but there are a number of limitations:</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">Updating the NamePool requires an exclusive lock, and on occasions there has been heavy contention on that lock, which reduces the effectiveness of running more threads.</p> </li><li><p style="margin-bottom: 0cm">We read the NamePool without acquiring a lock. All the textbooks say that's bad practice because of the risk of subtle bugs. We've been using this design for over a dozen years without a single one of these subtle bugs coming to the surface, but that doesn't mean they aren't there. It's very hard to prove that the design is thread-safe, and it might only take an unusual JVM, or an unusual hardware architecture, or a massively parallel application, or pathological data (such as a local name that appears in hundreds of namespaces) for a bug to suddenly appear: which would be a serious embarrassment.</p> </li><li><p style="margin-bottom: 0cm">The fact that we read the NamePool with no lock means that the data structure itself is very conservative. We use a fixed number of hash buckets (1024), and a chain of names within each bucket. We only ever append to the end of such a chain. If the vocabulary is large, the chains can become long, and searching then takes time proportional to the length of the chains. Any attempt to change the number of buckets on the fly is out of the question so long as we have non-locking readers. So performance degrades with large vocabularies.</p> </li><li><p style="margin-bottom: 0cm">We've got a problem coming up with XSLT 3.0 packages. We want packages to be independently compiled, and we want them to be distributed. 
That means we can't bind names to fingerprints during package construction, because the package will have to run with different namepools at run-time. We can probably solve this problem by doing the binding of names at package "load" or "link" time; but it's a change of approach and that invites a rethink about how the NamePool works.</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Although the NamePool design hasn't changed much over the years, we've changed the way it is used: which essentially means, we use it less. Some while ago we stopped using the NamePool for things such as variable names and function names: it is now used only for element names, attribute names, and type names. Around Saxon 9.4 we changed the Receiver interface (Saxon's ubiquitous interface for passing push-mode events down a pipeline) so that element and attribute names were represented by a NodeName object instead of an integer fingerprint. The NodeName can hold a name either in string form or as an integer code, or both, so this change meant that we didn't have to allocate a NamePool fingerprint just to pass events down this pipeline, which in turn meant that we didn't have to allocate fingerprints to constructed elements and attributes that were going to be immediately serialized. We also stopped using the NamePool to allocate codes to (prefix, uri) pairs representing namespace bindings.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">These changes have been pretty effective and it's a while since we have seen a workload suffer from NamePool contention. However, we want to increase the level of parallelism that Saxon can support, and the NamePool remains a potential pinch point.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">There are a number of things we can do that would make a difference. 
We could for example use a Java <a href="http://tutorials.jenkov.com/java-util-concurrent/readwritelock.html">ReadWriteLock</a>&nbsp;to allow either a single writer or multiple readers; this would allow us to introduce operations such as reconfiguring the hash table as the size of the vocabulary increases, without increasing contention because of the high frequency of read access.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">But let's first try and remind ourselves what the NamePool is actually for. It is there, first and foremost, to allow fast run-time testing of whether a particular node satisfies a NameTest. Because we use the same NamePool when constructing a source tree and when compiling a stylesheet or query, the node on the tree and the NameTest in the compiled code both know the integer fingerprint of the name, and testing the node against the NameTest therefore reduces to a fast integer comparison. This is undoubtedly far faster than doing a string comparison, especially one involving long URIs.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">If that was the only thing we used the NamePool for, then we would only need a single method, <b>allocateFingerprint(namespace, localName)</b>. What are all the other methods there for?</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Well, firstly, the NamePool knows the mapping of fingerprints to names, so we have read-only get methods to get the fingerprint corresponding to a name, or the name corresponding to a fingerprint. These are a convenience, but it seems that they are not essential. 
The client code that calls the NamePool to allocate a fingerprint to a name could retain the mapping somewhere else, so that there is no need to go back to a shared NamePool to rediscover what is already known.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The most obvious case where this happens is with the TinyTree. The TinyTree holds the names of elements and attributes as fingerprints, not as strings, so operations like the XPath <b>local-name()</b> and <b>namespace-uri()</b> functions get the fingerprint from the TinyTree and then call on the NamePool to translate this back to a string. We could avoid this by keeping a map from integers to strings within the TinyTree itself. This could potentially have other benefits: we could make fewer calls on the NamePool to allocate fingerprints during tree construction; and retargeting a TinyTree to work with a different NamePool would be easier.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Secondly, there's a lot of code in the NamePool to manage prefixes. This isn't needed for the core function of matching a node against a NameTest, since that operation ignores namespace prefixes. The detail here is that when we call <b>NamePool.allocate()</b>, we actually supply prefix, uri, and local-name, and we get back a 32-bit <b>nameCode</b> which uniquely represents this triple; the bottom 20 bits uniquely represent the local-name/uri pair, and it is these 20 bits (called the <b>fingerprint</b>) that are used in QName comparisons. The purpose of this exercise has nothing to do with making name comparisons faster; rather it is mainly concerned with saving space in the TinyTree. By packing the prefix information into the same integer as the local-name and URI, we save a few useful bits. But there are other ways of doing this without involving the NamePool; we could use the same few bits to index into a table of prefixes that is local to the TinyTree itself. 
There are of course a few complications; one of the benefits of the NamePool knowing about prefixes is that it can provide a service of suggesting a prefix to use with a given URI when the system is required to invent one: users like it when the prefix that emerges is one that has previously been associated with that URI by a human being. But there are probably less expensive ways of achieving this.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Let's suppose that we reduced the functionality of the NamePool to a single method, <b>allocate(QName) → int</b>. How would we then implement it to minimize contention? A simple and safe implementation might be</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><br /></p><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><p style="margin-bottom: 0cm">HashMap&lt;QName, Integer&gt; map = new HashMap&lt;QName, Integer&gt;();</p><p style="margin-bottom: 0cm">int next = 0;</p><p style="margin-bottom: 0cm"><br /></p><p style="margin-bottom: 0cm">public synchronized int allocate(QName q) {</p></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">Integer n = map.get(q);</span></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">if (n == null) {</span></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">int m = ++next;</span></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">map.put(q, m);</span></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; 
padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">return m;</span></blockquote></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">} else {</span></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">return n;</span></blockquote></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><span style="font-size: 1em;">}</span></blockquote></blockquote><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><p style="margin-bottom: 0cm">}</p></blockquote> <p style="margin-bottom: 0cm"><br /></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This still serializes all allocate operations, whether or not a new fingerprint is allocated. We can almost certainly do better by taking advantage of Java's concurrent collection classes, though it's not immediately obvious what the best way of doing it is. But in any case, if we can achieve this then we've reduced the NamePool to something much simpler than it is today, so optimization becomes a lot easier. 
It's worth noting that the above implementation still gives us the possibility to discover the fingerprint for a known QName, but not to (efficiently) get the QName for a known fingerprint.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">To get here, we need to start doing two things:</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">(a) get prefixes out of the NamePool, and handle them some other way.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">(b) stop using the NamePool to discover the name associated with a known fingerprint.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">After that, redesign becomes relatively straightforward.</p> How long is a (piece of) string? tag:dev.saxonica.com,2015:/blog/mike//3.209 2015-02-09T14:06:09Z 2015-02-10T16:25:57Z As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount... Michael Kay <p><br /> As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount of detail and complexity here, but I thought it might be interesting to take one simple function (fn:string-length) as an example, to see where the complexity comes from and how it can be reduced.</p> <p>At first sight, fn:string-length looks pretty simple. How long is a (piece of) string? Just ask Java to find out: surely it should just map to a simple call on java.lang.String.length(). Well, no actually.</p> <p>If we look to the specification, there are two complications we have to deal with. Firstly we are counting the number of Unicode characters, not (as Java does) the number of 16-bit UTF-16 code units. 
In the case of surrogate pairs, one character occupies two code units, and that means that a naïve implementation of string-length() takes time proportional to the length of the string.</p> <p>Secondly, there are two forms of the string-length() function. With zero arguments, it's defined to mean string-length(string(.)). That's different from nearly all other functions that have 0-argument and 1-argument forms, where (for example) name() means name(.). Saxon handles functions like name() by converting them statically to name(.), and that conversion doesn't work in this case. To illustrate the difference, consider an attribute code="003", defined in the schema as an xs:integer. The function call string-length(@code) returns 1 (it atomizes the attribute to produce an integer, converts the integer to the string "3", and then returns the length of this string). But @code!string-length() returns 3 - the length of the string value of the attribute node.</p> <p>The other complexity applies specifically to string-length#0 (that is, the zero-argument form). Dynamic calls to context-dependent functions bind the context at the point where the function is created, not where it is called. Consider:</p> <p> &lt;xsl:for-each select="0 to 9"><br /> &lt;xsl:variable name="f" select="string-length#0"/><br /> &lt;xsl:for-each select="21 to 50"><br /> &lt;xsl:value-of select="$f()"/><br /> &lt;/xsl:for-each><br /> &lt;/xsl:for-each></p> <p>This will print the value "1" three hundred times. In each case the context item at the point where $f is bound is a one-digit integer, so $f() returns the length of that integer, which is always one. The context item at the point where $f() is evaluated is irrelevant.</p> <p>Now let's take a look at the Saxon implementation. There's a Java class StringLength which in Saxon 9.6 is about 200 lines of code (including blank lines, comments, etc), and this does most of the work. 
But not all: in the end all it does is to call StringValue.getStringLength(), which is what really does the work. Atomic values of type xs:string are represented in Saxon by an instance of the class StringValue, which encapsulates a Java CharSequence: often, but not always, a String. The reason for the encapsulating class is to provide type safety on methods like Function.call() which returns a Sequence; StringValue implements AtomicValue which implements Item which implements Sequence, so the XDM data model is faithfully represented in the Java implementation classes.</p> <p>In addition there's a class StringLengthCompiler which generates a bytecode implementation of the string-length function. This is another 60 or so lines.</p> <p>Some functions also have a separate streaming implementation to accept streamed input, and one or two (string-join() and concat(), for example), have an implementation designed to produce streamed output. That's designed to ensure that an instruction like &lt;xsl:value-of select="//emp/name" separator=","/&gt;, which compiles down to a call on string-join() internally, doesn't actually assemble the whole output in memory, but rather writes each part of the result string to the output stream as it becomes available.</p> <p>Since the introduction of dynamic function calls, many system functions have two separate implementations, one for static calls and one for dynamic calls. That's the case for string-length: the evaluateItem() method used for static calls is almost identical to the call() method used for dynamic calls. One reason this happened was because of a fear of performance regression that might occur if the existing code for static calls was generalized, rather than introducing a parallel path.</p> <p>In 9.6, the implementation of dynamic calls to context-dependent functions like string-length#0 is rather fudged. In fact, the expression string-length#0 compiles into a call on function-lookup("fn:string-length", 0). 
The implementation of function-lookup() keeps a copy of both the static and dynamic context at the point where it is called, and this is then used when evaluating the resulting function. This is vastly more expensive than it needs to be: for functions like string-length#0 where there are no arguments other than the context, the function can actually be pre-evaluated at the point of creation. In the new 9.7 implementation, the result of the expression string-length#0 is a function implemented by the class ConstantFunction, which encapsulates its result and returns this result when it is called. (It's not quite as simple as this, because the constant function also has to remember its name and arity, just in case the user asks.) </p> <p>The method StringValue.getStringLength() attempts to recognize cases where walking through the codepoints of the string to look for surrogate pairs is not actually necessary. In previous releases there was an extra bit kept in StringValue, set when the string was known to contain no surrogate pairs: so having walked the string once, it would never be done again. In 9.6 this mechanism is replaced with a different approach: Saxon includes several implementations of CharSequence that maintain the value as an array of fixed-size integers (8-bit, 16-bit, or 32-bit, as necessary). If the CharSequence within a StringValue is one of these classes (known collectively as UnicodeString), then the length of the string is the length of the array. And when getStringLength() is called on a string the first time, the string is left in this form, in the hope that future operations on the string will benefit. Of course, this will in some cases be counter-productive (and there's a further refinement in the implementation, which I won't go into, that's designed to overcome this).</p> <p>There are a few other optimizations in the implementation of string-length() that are worth mentioning. 
Firstly, it's quite common for users to write</p> <p> &lt;xsl:if test="string-length($x) != 0"></p> <p>Here we don't need to count surrogate pairs in the string: the string is zero-length if and only if the underlying CharSequence is zero-length. Saxon therefore does a static rewrite of such an expression to boolean(string($x)). (If $x is statically known to be a string, the call string($x) will then be further rewritten as $x.)</p> <p>If string-length#1 is applied to a value that can be computed statically, then the string-length function is itself computed statically. (This optimization, for odd historical reasons, is often called "constant folding". It's possible only when there are no context dependencies.)</p> <p>During type-checking, the implementation of string-length#0 keeps a note of whether a context item is known to exist. This is used during byte-code generation; if it's known that the context item won't be absent, then there is no need to generate code to check for this error condition. It's through tiny optimizations like this that generated bytecode ends up being faster than interpreted code.</p> <p>In my current exercise refactoring the implementation of system functions such as string-length, I've been looking at how much of the logic is duplicated either across the different implementations of a single function (streamed and unstreamed, static and dynamic, bytecode and interpreted) or across the implementations of functions that have a lot in common (such as string(), string-length(), and normalize-space()). I've found that with the exception of the core code in StringValue.getStringLength, and the optimization of string-length()=0, everything else can be vastly reduced. In place of the original StringLength class, there are now two (inner) classes StringLength_0 and StringLength_1 each of which consists of a single one-line method. 
The code for generating byte-code can also be considerably simplified by achieving more reuse across different functions.</p> <p>The essence of the reorganization is that the class StringLength (or rather, each of its two variants) is no longer an Expression; it is now a Function. Previously a call on string-length($x) compiled to an expression, held as a node on the expression tree. Now it compiles into two objects, a StringLength object which is a pure function, and a SystemFunctionCall object which is an expression that calls the function. The SystemFunctionCall object is generic across all functions, while the implementations of SystemFunction contain all the code that is specific to one function. This change was motivated primarily by the need to handle dynamic function calls (and hence first-class function objects) properly, but it has provided a stimulus for a refactoring that achieves much more than this.</p> <p>So, how long is a piece of string? At least we now know how to work it out more efficiently. Sorry this little yarn wasn't shorter.</p>
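<p>For readers who want to see the surrogate-pair point in plain Java, here is a minimal sketch. It is not Saxon's code: the class StringLengthSketch and its two methods are hypothetical names invented for illustration. It shows why counting Unicode characters needs a walk over the UTF-16 code units, and why a test for a zero-length string needs no such walk.</p>

```java
// Illustrative only: these names are assumptions, not part of Saxon.
final class StringLengthSketch {

    // Counts Unicode characters (code points) in a CharSequence.
    // Each well-formed surrogate pair occupies two 16-bit code units
    // but counts as a single character, so the whole sequence is scanned.
    static int codePointLength(CharSequence s) {
        int units = s.length();
        int count = units;
        for (int i = 0; i < units - 1; i++) {
            if (Character.isHighSurrogate(s.charAt(i))
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count--; // the pair is one character, not two
                i++;     // skip the low surrogate
            }
        }
        return count;
    }

    // A zero-length test never needs to look for surrogate pairs:
    // the string is empty exactly when it has no code units.
    static boolean isZeroLength(CharSequence s) {
        return s.length() == 0;
    }

    public static void main(String[] args) {
        // 'a', then U+1D504 written as a surrogate pair, then 'b'
        String s = "a\uD835\uDD04b";
        System.out.println(s.length());         // 4 code units
        System.out.println(codePointLength(s)); // 3 characters
    }
}
```

<p>The standard library call Character.codePointCount(s, 0, s.length()) gives the same answer; the explicit loop is written out here only to make the linear cost visible.</p>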


