<h1>Saxon diaries (Michael Kay's blog)</h1> <h1>Transforming JSON</h1> <p><em>Michael Kay, 2017-11-13</em></p> <p>In my <a href="https://www.saxonica.com/papers/xmlprague-2016mhk.pdf">conference paper at XML Prague</a> in 2016 I examined a couple of use cases for transforming JSON structures using XSLT 3.0. The overall conclusion was not particularly encouraging: the easiest way to achieve the desired results was to convert the JSON to XML, transform the XML, and then convert it back to JSON.</p> <p>Unfortunately this study came too late to get any new features into XSLT 3.0. However, I've been taking another look at the use cases to see whether we could design language extensions to handle them, and this is looking quite encouraging.</p> <h2><strong>Use case 1: bulk update</strong></h2> <p>We start with the JSON document</p> <pre><code>[
  {
    "id": 3,
    "name": "A blue mouse",
    "price": 25.50,
    "dimensions": {"length": 3.1, "width": 1.0, "height": 1.0},
    "warehouseLocation": {"latitude": 54.4, "longitude": -32.7}
  },
  {
    "id": 2,
    "name": "An ice sculpture",
    "price": 12.50,
    "tags": ["cold", "ice"],
    "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5},
    "warehouseLocation": {"latitude": -78.75, "longitude": 20.4}
  }
]</code></pre> <p>and the requirement: for all products having the tag "ice", increase the price by 10%, leaving all other data unchanged. I've prototyped a new XSLT instruction that allows this to be done as follows:</p> <pre><code>&lt;saxon:deep-update
    root="json-doc('input.json')"
    select="?*[?tags?* = 'ice']"
    action="map:put(., 'price', ?price * 1.1)"/&gt;</code></pre> <p>How does this work?</p> <p>First the instruction evaluates the <code>root</code> expression, which in this case returns the map/array representation of the input JSON document. With this root item as the context item, it then evaluates the <code>select</code> expression to obtain a sequence of contained maps or arrays to be updated: these can appear at any depth under the root item. With each of these selected maps or arrays as the context item, it then evaluates the <code>action</code> expression, and uses the returned value as a replacement for the selected map or array. This update then percolates back up to the root item, and the result of the instruction is a map or array that is the same as the original except for the replacement of the selected items.</p> <p>The magic here is in the way that the update is percolated back up to the root. Because maps and arrays are immutable and have no persistent identity, the only way to do this is to keep track of the maps and arrays selected en route from the root item to the items selected for modification as we make the downward selection, and then modify these maps and arrays in reverse order on the way back up. Moreover, we need to keep track of the cases where multiple updates are made to the same containing map or array. All this magic, however, is largely hidden from the user. The only thing the user needs to be aware of is that the select expression is constrained to use a limited set of constructs when making downward selections.</p> <p>The select expression <code>select="?*[?tags?* = 'ice']"</code> perhaps needs a little explanation. The root of the JSON tree is an array of maps, and the initial <code>?*</code> turns this into a sequence of maps. We then want to filter this sequence of maps to include only those where the value of the "tags" field is an array containing the string "ice" as one of its members. The easiest way to test this predicate is to convert the value from an array of strings to a sequence of strings (so <code>?tags?*</code>) and then use the XPath existential "=" operator to compare with the string "ice".</p> <p>The action expression <code>map:put(., 'price', ?price * 1.1)</code> takes as input the selected map, and replaces it with a map in which the <code>price</code> entry is replaced by a new entry having the key "price" and the associated value computed as the old price multiplied by 1.1.</p>
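<p>To make the effect concrete: only the second product carries the tag "ice", so the instruction applied to the document above should yield a result that, serialized as JSON, looks like this (field order and number formatting will vary, since maps are unordered):</p> <pre><code>[
  {
    "id": 3,
    "name": "A blue mouse",
    "price": 25.50,
    "dimensions": {"length": 3.1, "width": 1.0, "height": 1.0},
    "warehouseLocation": {"latitude": 54.4, "longitude": -32.7}
  },
  {
    "id": 2,
    "name": "An ice sculpture",
    "price": 13.75,
    "tags": ["cold", "ice"],
    "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5},
    "warehouseLocation": {"latitude": -78.75, "longitude": 20.4}
  }
]</code></pre>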
<h2><strong>Use case 2: Hierarchic Inversion</strong></h2> <p>The second use case in the XML Prague 2016 paper was a hierarchic inversion (aka grouping) problem. Specifically: we'll look at a structural transformation changing a JSON structure with information about the students enrolled for each course to its inverse, a structure with information about the courses for which each student is enrolled.</p> <p>Here is the input dataset:</p> <pre><code>[
  {
    "faculty": "humanities",
    "courses": [
      {
        "course": "English",
        "students": [
          {"first": "Mary", "last": "Smith", "email": "mary_smith@gmail.com"},
          {"first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com"}
        ]
      },
      {
        "course": "History",
        "students": [
          {"first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com"},
          {"first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"}
        ]
      }
    ]
  },
  {
    "faculty": "science",
    "courses": [
      {
        "course": "Physics",
        "students": [
          {"first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"},
          {"first": "Amisha", "last": "Patel", "email": "amisha_patel@gmail.com"}
        ]
      },
      {
        "course": "Chemistry",
        "students": [
          {"first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"},
          {"first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"}
        ]
      }
    ]
  }
]</code></pre> <p>The goal is to produce a list of students, sorted by last name then first name, each containing a list of courses taken by that student, like this:</p> <pre><code>[
  { "email": "anil_singh@gmail.com", "courses": ["Physics", "Chemistry"] },
  { "email": "john_taylor@gmail.com", "courses": ["History", "Chemistry"] },
  ....
]</code></pre> <p>The classic way of handling this is in two phases: first reduce the hierarchic input to a flat sequence in which all the required information is contained at one level, and then apply grouping to this flat sequence.</p> <p>To achieve the flattening we introduce another new XSLT instruction:</p> <pre><code>&lt;saxon:tabulate-maps
    root="json-doc('input.json')"
    select="?* ! map:find(., 'students')?*"/&gt;</code></pre> <p>Again the <code>root</code> expression delivers a representation of the JSON document as an array of maps. The <code>select</code> expression first selects these maps (<code>?*</code>), then for each one it calls <code>map:find()</code> to get an array of maps each representing a student. The result of the instruction is a sequence of maps corresponding to these student maps in the input, where each output map contains not only the fields present in the input (first, last, email), but also fields inherited from parents and ancestors (faculty, course). For good measure it also contains a field <code>_keys</code> containing an array of keys representing the path from root to leaf, but we don't actually use that in this example.</p>
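<p>To illustrate, the first of the flattened maps delivered by this instruction would contain entries along these lines (the contents of <code>_keys</code> are omitted here, since we don't rely on them):</p> <pre><code>map {
  "faculty": "humanities",
  "course": "English",
  "first": "Mary",
  "last": "Smith",
  "email": "mary_smith@gmail.com",
  "_keys": [ ... ]
}</code></pre>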
<p>Once we have this flat structure, we can construct a new hierarchy using XSLT grouping:</p> <pre><code>&lt;xsl:for-each-group select="$students" group-by="?email"&gt;
  &lt;xsl:map&gt;
    &lt;xsl:map-entry key="'email'" select="?email"/&gt;
    &lt;xsl:map-entry key="'first'" select="?first"/&gt;
    &lt;xsl:map-entry key="'last'" select="?last"/&gt;
    &lt;xsl:map-entry key="'courses'"&gt;
      &lt;saxon:array&gt;
        &lt;xsl:for-each select="current-group()"&gt;
          &lt;saxon:array-member select="?course"/&gt;
        &lt;/xsl:for-each&gt;
      &lt;/saxon:array&gt;
    &lt;/xsl:map-entry&gt;
  &lt;/xsl:map&gt;
&lt;/xsl:for-each-group&gt;</code></pre> <p>This can then be serialized using the JSON output method to produce the required output.</p> <p>Note: the <code>saxon:array</code> and <code>saxon:array-member</code> instructions already exist in Saxon 9.8. They fill an obvious gap in the XSLT 3.0 facilities for handling arrays - a gap that exists largely because the XSL WG was unwilling to create a dependency on XPath 3.1.</p>
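<p>For completeness, the serialization step needs nothing beyond standard XSLT 3.0; declaring the JSON output method is enough for the principal result (a minimal sketch):</p> <pre><code>&lt;xsl:output method="json" indent="yes"/&gt;</code></pre> <p>Alternatively, the result can be passed to the <code>serialize()</code> function with <code>map{"method": "json"}</code> as the serialization options (an XPath 3.1 facility).</p>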
<h2><strong>Use Case 3: conversion to HTML</strong></h2> <p>This use case isn't in the XML Prague paper, but is included here for completeness.</p> <p>The aim here is to construct an HTML page containing the information from a JSON document, without significant structural alteration. This is a classic use case for the recursive application of template rules, so the aim is to make it easy to traverse the JSON structure using templates with appropriate match patterns.</p> <p>Unfortunately, although the XSLT 3.0 facilities allow patterns that match maps and arrays, they are cumbersome to use. Firstly, the syntax is awkward:</p> <pre><code>match=".[. instance of map(...)]"</code></pre> <p>We can solve this with a Saxon extension allowing the syntax</p> <pre><code>match="map()"</code></pre> <p>Secondly, the type of a map isn't enough to distinguish one map from another. To identify a map representing a student, for example, we aren't really interested in knowing that it is a <code>map(xs:string, item()*)</code>. What we need to know is that it has fields (email, first, last). Fortunately another Saxon extension comes to our aid: tuple types, described here: http://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html. With tuple types we can change the match pattern to</p> <pre><code>match="tuple(email, first, last)"</code></pre> <p>Even better, we can use type aliases:</p> <pre><code>&lt;saxon:type-alias name="student" as="tuple(email, first, last)"/&gt;

&lt;xsl:template match="~student"&gt;...&lt;/xsl:template&gt;</code></pre> <p>With this extension we can now render this input JSON into HTML using the stylesheet:</p> <pre><code>&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="#all"
    expand-text="yes"&gt;

  &lt;saxon:type-alias name="faculty" type="tuple(faculty, courses)"/&gt;
  &lt;saxon:type-alias name="course" type="tuple(course, students)"/&gt;
  &lt;saxon:type-alias name="student" type="tuple(first, last, email)"/&gt;

  &lt;xsl:template match="~faculty"&gt;
    &lt;h1&gt;{?faculty} Faculty&lt;/h1&gt;
    &lt;xsl:apply-templates select="?courses?*"/&gt;
  &lt;/xsl:template&gt;

  &lt;xsl:template match="~course"&gt;
    &lt;h2&gt;{?course} Course&lt;/h2&gt;
    &lt;p&gt;List of students:&lt;/p&gt;
    &lt;table&gt;
      &lt;thead&gt;
        &lt;tr&gt;
          &lt;th&gt;Name&lt;/th&gt;
          &lt;th&gt;Email&lt;/th&gt;
        &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
        &lt;xsl:apply-templates select="?students?*"&gt;
          &lt;xsl:sort select="?last"/&gt;
          &lt;xsl:sort select="?first"/&gt;
        &lt;/xsl:apply-templates&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/xsl:template&gt;

  &lt;xsl:template match="~student"&gt;
    &lt;tr&gt;
      &lt;td&gt;{?first} {?last}&lt;/td&gt;
      &lt;td&gt;{?email}&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/xsl:template&gt;

  &lt;xsl:template name="xsl:initial-template"&gt;
    &lt;xsl:apply-templates select="json-doc('courses.json')"/&gt;
  &lt;/xsl:template&gt;

&lt;/xsl:stylesheet&gt;</code></pre> <h2><strong>Conclusions</strong></h2> <p>With only the facilities of the published XSLT 3.0 recommendation, the easiest way to transform JSON is often to convert it first to XML node trees, and then use the traditional XSLT techniques to transform the XML, before converting it back to JSON.</p> <p>With a few judiciously chosen extensions to the language, however, a wide range of JSON transformations can be achieved natively.</p>
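<p>For comparison, the XML round-trip approach that these extensions aim to displace looks roughly like this (a minimal sketch; the mode name is illustrative, and the tweak-mode templates must produce the same XML vocabulary that <code>xml-to-json()</code> expects):</p> <pre><code>&lt;xsl:template name="xsl:initial-template"&gt;
  &lt;xsl:variable name="xml" select="json-to-xml(unparsed-text('input.json'))"/&gt;
  &lt;xsl:variable name="transformed"&gt;
    &lt;xsl:apply-templates select="$xml" mode="tweak"/&gt;
  &lt;/xsl:variable&gt;
  &lt;xsl:sequence select="xml-to-json($transformed)"/&gt;
&lt;/xsl:template&gt;</code></pre>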
<h1>Bugs: How well are we doing?</h1> <p><em>Michael Kay, 2017-02-05</em></p> We're about to ship another Saxon 9.7 maintenance release, with another 50 or so bug clearances. The total number of patches we've issued since 9.7 was released in November 2015 has now reached almost 450. The number seems frightening and the pace is relentless. But are we getting it right, or are we getting it badly wrong?<div><br /></div><div>There are frequently-quoted but poorly-sourced numbers you can find on the internet suggesting a norm of 10-25 bugs per thousand lines of code. Saxon is 300,000 lines of (non-comment) code, so that would suggest we can expect a release to have 3000 to 7500 bugs in it. On one measure, that suggests we're doing a lot better than the norm. Or it could also mean that most of the bugs haven't been found yet.</div><div><br /></div><div>I'm very sceptical of such numbers. I remember a mature product in ICL that was maintained by a sole part-time worker, handling half a dozen bugs a month. When she went on maternity leave, the flow of bugs magically stopped. No-one else could answer the questions, so users stopped sending them in. The same happens with Oracle and Microsoft. I submitted a Java bug once, and got a response 6 years later saying it was being closed with no action. When that happens, you stop sending in bug reports. So in many ways, a high number of bug reports doesn't mean you have a buggy product; it means you have a responsive process for handling them. I would hate the number of bug reports we get to drop because people don't think there's any point in submitting them.</div><div><br /></div><div>And of course the definition of what is a bug is completely slippery. Very few of the bug reports we get are completely without merit, in the sense that the product is doing exactly what it says on the tin; at the same time, rather few are incontrovertible bugs either. If diagnostics are unhelpful, is that a bug?</div><div><br /></div><div>The only important test really is whether our users are satisfied with the reliability of the product. We don't really get enough feedback on that at a high level. Perhaps we should make more effort to find out; but I so intensely hate completing customer satisfaction questionnaires myself that I'm very reluctant to inflict them on our users. Given that open source users outnumber commercial users by probably ten-to-one, and that the satisfaction of our open source users is just as important to us as the satisfaction of our commercial customers (because it's satisfied open source users who do all the sales work for us); and given that we don't actually have any way of "reaching out" to our open source users (how I hate the marketing jargon); and given that we really wouldn't know what to do differently if we discovered that 60% of our users were "satisfied or very satisfied": I don't really see very much value in the exercise. But I guess putting a survey form on the web site wouldn't be difficult, and some people might interpret it as a signal that we actually care.</div><div><br /></div><div>With 9.7 there was a bit of a shift in policy towards fixing bugs pro-actively (more marketing speak). In particular, we've been in a phase where the XSLT and XQuery specs were becoming very stable but more test cases were becoming available all the time (many of them, I might add, contributed by Saxonica - often in reaction to queries from our users). So we've continuously been applying new tests to the existing release, which is probably a first. Where a test showed that we were handling edge cases incorrectly, and indeed when the spec was changed in little ways under our feet, we've raised bugs and fixes to keep the conformance level as high as possible (while also maintaining compatibility). So we've shifted the boundary a little between feature changes (which traditionally only come in the next release) and bug fixes, which come in a maintenance release.
That shift also helps to explain why the gap between releases is becoming longer - though the biggest factor holding us back, I think, is the ever-increasing amount of testing that we do before a release.</div><div><br /></div><div>Fixing bugs pro-actively (that is, before any user has hit the bug) has the potential to improve user satisfaction if it means that they never do hit the bug. It's always as well to remember that for every user who reports a bug there may be a dozen users who hit it and don't report it. One reason we monitor StackOverflow is that a lot of users feel more confident about reporting a problem there, rather than reporting it directly to us. Users know that their knowledge is limited and they don't want to make fools of themselves, and you need a high level of confidence to tell your software vendor that you think the product is wrong.</div><div><br /></div><div>On the other hand, destabilisation is a risk. A fix in one place will often expose a bug somewhere else, or re-awaken an old bug that had been laid to rest. As a release becomes more mature, we try to balance the benefits of fixing problems with the risk of de-stabilisation.</div><div><br /></div><div>So, what about testing? Can we say that because we've fixed 450 bugs, we didn't run enough tests in the first place?</div><div><br /></div><div>Yes, in a sense that's true, but how many more tests would we have had to write in order to catch them? We probably run about a million test cases (say, 100K tests in an average of ten product configurations each) and these days the last couple of months before a major release are devoted exclusively to testing. (I know that means we don't do enough continuous testing. But sorry, it doesn't work for me. If we're doing something radical to the internals of the product then things are going to break in the process, and my style is to get the new design working while it's still fresh in my head, then pick up the broken pieces later. If everything had to work in every nightly build, we would never get the radical things done. That's a personal take, and of course what works with a 3-4 person team doesn't necessarily work with a larger project. We're probably pretty unusual in developing a 300Kloc software package with 3-4 people, so lots of our experience might not extrapolate.)</div><div><br /></div><div>We've had a significant number of bug reports this time on performance regression. (This is of course another area where it's arguable whether something is a bug or not. Sometimes we will change the design in a way that we know benefits some workloads at the expense of others.) Probably most of these are extreme scenarios, for example compilation time for stylesheets where a single template declares 500 local variables. Should we have run tests to prevent that? Well, perhaps we should have more extreme cases in our test suite: the vast majority of our test cases are trivially small. But the problem is, there will always be users who do things that we would never have imagined. Like the user running an XSD 1.1 schema validation in which tens of thousands of assertions are expected to "fail", because they've written it in such a way that assertion failures aren't really errors, they are just a source of statistics for reporting on the data.</div><div><br /></div><div>The bugs we hate most (and therefore should do most to prevent) are bugs in bytecode generation, streaming, and multi-threading.
The reason we hate them is that they can be a pig to debug, especially when the user-written application is large and complex.</div><div><br /></div><div><ul><li>For bytecode generation I think we've actually got pretty good test coverage, because we not only run every test in the QT3 and XSLT3 test suites with bytecode generation enabled, we also artificially complicate the tests to stop queries like 2+5 being evaluated by the compiler before bytecode generation kicks in. We've also got an internal recovery mechanism, so if we detect that we've generated bad code, we fall back to interpreted mode and the user never notices (the problem with that, of course, is that we never find out).</li><li>Streaming is tricky because the code is so convoluted (writing everything as inverted event-based code can be mind-blowing) and because the effects of getting it wrong often give very little clue as to the cause. But at least the failure is "in your face" for the user, who will therefore report the problem, and it's likely to be reproducible. Another difficulty with streaming is that because not all code is streamable, tests for streaming needed to be written from scratch.</li><li>Multi-threading bugs are horrible because they occur unpredictably. If there's a low probability of the problem happening then it can require a great deal of detective work to isolate the circumstances, and this often falls on the user rather than on ourselves. Fortunately we only get a couple of these a year, but they are a nightmare when they come. In 9.7 we changed our Java baseline to Java 6 and were therefore able to replace much of the hand-built multithreading code in Saxon with standard Java libraries, which I think has helped reliability a lot. But there are essentially no tools or techniques to protect you from making simple thread-safety blunders, like setting a property in a shared object without synchronization. Could we do more testing to prevent these bugs? I'm not optimistic, because the bugs we get are so few, and so particular to a specific workload, that searching the haystack just in case it contains a needle is unlikely to be effective.</li></ul><div>Summary: Having the product perceived as reliable by our users is more important to us than the actual bug count. Fixing bugs quickly before they affect more users is probably the best way of achieving that. If the bug count is high because we're raising bugs ourselves as a result of our own testing, then that's no bad thing. It hasn't yet got to the level where we can't cope with the volumes, or where we have to filter things through staff who are only employed to do support. If we can do things better, let us know.</div></div><div><br /></div> <h1>Guaranteed Streamability</h1> <p><em>Michael Kay, 2016-12-09</em></p>
The XSLT 3.0 specification in its current form provides a set of rules (that can be evaluated statically, purely by inspecting the stylesheet) for determining whether the code is (or is not) guaranteed streamable.<div><br /></div><div>If the code is guaranteed streamable then every processor (if it claims to support streaming at all) must use streaming to evaluate the stylesheet; if it is not guaranteed streamable then the processor can choose whether to use streaming or not.</div><div><br /></div><div>The tricky bit is that there's a requirement in the spec that if the code isn't guaranteed streamable, then a streaming processor (on request) has to detect this and report it. The status section of the spec says that this requirement is "at risk", meaning it might be removed if it proves too difficult to implement. There are people on the working group who believe passionately that this requirement is really important for interoperability; there are others (including me) who fully understand why users would like to have this, but have been arguing that it is extremely difficult to deliver.</div><div><br /></div><div>In this article I'm going to try to explain why it's so difficult to achieve this requirement, and to explore possibilities for overcoming these difficulties.</div><div><br /></div><div>Streamability analysis can't be performed until various other stages of static analysis are complete. It generally requires that names have been resolved (for example, names of modes and names of streamable functions). It also relies on rudimentary type analysis (determining the static type of constructs). For Saxon, this means that streamability analysis is done after parsing, name fixup, type analysis, and rewrite optimization.</div><div><br /></div><div>When Saxon performs these various stages of analysis, it modifies the expression tree as it goes: not just to record the information obtained from the analysis, but to make use of the information at execution time. It goes without saying that in modifying the expression tree, it's not permitted to replace a streamable construct with a non-streamable one, and that isn't too hard to achieve (though these things are relative...). But the requirement to report departures from guaranteed streamability imposes a second requirement, which is proving much harder. If we are to report any deviations from guaranteed streamability, then up to the point where we do the streamability analysis, we must never replace a non-streamable construct with a streamable one.</div><div><br /></div><div>There are various points at which we currently replace a non-streamable construct with a streamable one.</div><div><br /></div><div><ul><li>Very early in the process, the expression tree that is output by the parsing phase uses the same data structure to represent equivalent constructs in the source. For example, the expression tree produced by &lt;xsl:if test="$a=2"&gt;&lt;xsl:sequence select="3"/&gt;&lt;/xsl:if&gt; will be identical to the expression tree produced by &lt;xsl:sequence select="if ($a=2) then 3 else ()"/&gt;. But streamability analysis makes a distinction between these two constructs. It's not a big distinction (in fact, the only thing it affects is exactly where you are allowed to call the accumulator-after() function) but it's big enough to count.</li><li>At any stage in the process, if we spot a constant expression then we're likely to replace it with its value.
For example, if we see the expression $v+3, and $v is a global variable whose value is 5, we will replace the expression with the literal 8. This won't usually affect streamability one way or the other. However, there are a few cases where it does. The most obvious is where we work out that an expression is void (meaning it always returns an empty sequence). For example, according to the spec, the expression (author[0], author[1]) is not streamable because it makes two downward selections. But Saxon spots that author[0] is void and rewrites the expression as (author[1]), which is streamable. Void expressions often imply some kind of user error, so we often output a warning when this happens, but just because we think the user has written nonsense doesn't absolve us from the conformance requirement to report on guaranteed streamability. Void expressions are particularly likely to be found with schema-aware analysis.</li><li>Inlining of calls to user-defined functions will often make a non-streamable expression streamable.</li><li>Many other rewrites performed by the optimizer have a similar effect, for example replacing (X|Y) by *[self::X|self::Y].</li></ul><div>My first attempt to meet the requirement is therefore (a) to add information to the expression tree where it's needed to maintain a distinction that affects streamability, and (b) to try to avoid those rewrites that turn non-streamable expressions into streamable ones. As a first cut, skipping the optimization phase completely seems an easy way to achieve (b). But it turns out it's not sufficient, firstly because some rewrites are done during the type-checking phase, and secondly because it turns out that without an optimization pass, we actually end up finding that some expressions that should be streamable are not. The most common case for this is sorting into document order. Given the expression A/B, Saxon actually builds an expression in the form sort(A!B), relying on the sort operation to sort nodes into document order and eliminate duplicates. This relies on the subsequent optimization phase to eliminate the sort() operation when it can. If we skip the optimization phase, we are left with an unstreamable expression.</div></div><div><br /></div><div>The other issue is that the streamability rules rely on type inferencing rules that are much simpler than the rules Saxon uses. It's only in rare cases that this will make a difference, of course: in fact, it requires considerable ingenuity to come up with such cases. The most obvious case where types make a difference to streamability is with a construct like &lt;xsl:value-of select="$v"/&gt;: this is motionless if $v is a text or attribute node, but consuming if it is a document or element node. If a global variable with private visibility is initialized with select="@price", but has no "as" attribute, Saxon will infer a type of attribute(price) for the variable, but the rules in the spec will infer a type of item()*. So to get the same streamability answer as the spec gives, we need to downgrade the static type inferencing in Saxon.</div><div><br /></div>
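<div>To spell out that last example as a concrete fragment (illustrative only):</div><div><br /></div><pre><code>&lt;xsl:variable name="v" select="@price" visibility="private"/&gt;
...
&lt;xsl:value-of select="$v"/&gt;</code></pre><div>Under Saxon's richer inference the variable has type attribute(price), so the xsl:value-of is motionless; under the spec's simpler rules its inferred type is item()*, which leads to a more pessimistic classification. Both analyses are defensible, but they give different answers to the guaranteed-streamability question.</div><div><br /></div>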
<div>So I think the changes needed to replicate exactly the streamability rules of the XSLT 3.0 spec are fairly disruptive; moreover, implementing the changes by searching for all the cases that need to change is going to be very difficult to get right (and is very difficult to test unless there is another trustworthy implementation of the rules to test against).</div><div><br /></div><div>This brings us to Plan B. Plan B is to meet the requirement by writing a completely free-standing tool for streamability analysis, separate from the current static analysis code. One way to do this would be to build on the tool written by John Lumley and demonstrated at Balisage a couple of years ago. Unfortunately that's incomplete and out of date, so it would be a significant effort to finish it. Meeting the requirement in the spec is different here from doing something useful for users: what the spec demands is a yes/no answer as to whether the code is streamable; what users want to know is why, and what they need to change to make the code streamable. The challenge is to do this without users having to understand the difficult abstractions in the spec (posture, sweep, and the rest). John's tool produces an annotated expression tree revealing all the properties: that's great for a user who understands the methodology but probably rather bewildering to the typical end user. Doing the minimum for conformance, a tool that just says yes or no without saying why, involves a lot of work to get a "tick in the box" with a piece of software that no-one will ever use, but would be a lot easier to produce. Conformance has always been a very high priority for Saxonica, but I can't see anyone being happy with this particular solution.</div><div><br /></div><div>So, assuming the WG maintains its insistence on having this feature (and it seems to me likely that it will), what should we do about it?</div><div><br /></div><div>One option is simply to declare a non-conformance. Once upon a time, standards conformance was very important to Saxon's reputation in the market, but I doubt that this particular non-conformance would affect our sales.</div><div><br /></div><div>Another option is to declare conformance, do our best to achieve it using the current analysis technology, and simply log bugs if anyone reports use cases where we get the answer wrong. That seems sloppy and dishonest, and could leave us with a continuing stream of bugs to be fixed or ignored.</div><div><br /></div><div>Another option is the "minimal Plan B" analyser - a separate tool for streamability analysis that simply reports a yes/no answer (without explanation). It would be a significant piece of work to create this and test it, and it's unclear that anyone would use it, but it's probably the cheapest way of getting the conformance tick-in-the-box.</div><div><br /></div><div>A final option is to go for a "fully featured" but free-standing streamability analysis tool, one which aims not only to answer the conformance question about guaranteed streamability, but also to provide genuinely useful feedback and advice helping users to create streamable stylesheets. Of course ideally such a tool would be integrated into an IDE rather than being free-standing. I've always argued that there's only a need for one such tool: it's not something that every XSLT 3.0 processor needs to provide. Doing this well would be a large project and involves different skills from those we currently have available.</div><div><br /></div><div>In the short term, I think the only honest and affordable approach would be the first option: declare a non-conformance.
Unfortunately that could threaten the viability of the spec, because we can only get a spec to Recommendation status if all features have been shown to be implementable.</div><div><br /></div><div>No easy answers.</div><div><br /></div><div><b>LATER</b></div><div><b><br /></b></div><div>I've been thinking about a Plan C which might fly...</div><div><br /></div><div>The idea here is to try and do the streamability analysis using the current expression tree structure and the current streamability logic, but applying the streamability rules to an expression tree that faithfully represents the stylesheet as parsed, with no modifications from type checking or optimization.</div><div><br /></div><div>To do this, we need to:</div><div><br /></div><div>* Define a configuration flag --strictStreamability which invokes the following logic.</div><div><br /></div><div>* Fix places where the initial expression tree loses information that's needed for streamability analysis. The two that come to mind are (a) losing the information that something is an instruction rather than an expression (e.g. we lose the distinction between xsl:map-entry and a singleton map expression) - this distinction is needed to assess calls on accumulator-after(); (b) turning path expressions A/B into docSort(A!B). There may be other cases that we will discover along the road (or fail to discover, since we may not have a complete set of test cases...)</div><div><br /></div><div>* Write a new type checker that attaches type information to this tree according to the rules in the XSLT 3.0 spec. This will be much simpler than the existing type checker, partly because the rules are much simpler, but more particularly because the only thing it will do is to assign static types: it will never report any type errors, and it will never inject any code to do run-time type checking or conversion.</div><div><br /></div><div>* Immediately after this type-checking phase, run the existing streamability rules against the expression tree. As far as I'm aware, the streamability rules in Saxon are equivalent to the W3C rules (at any rate, most of the original differences have now been eliminated).</div><div><br /></div><div>There are then two options. We could stop here: if the user sets the --strictStreamability flag, they get the report on streamability, but they don't get an executable that can actually be run. The alternative would be, if the streamability analysis succeeds, attempt to convert the expression tree into a form that we can actually use, by running the existing simplify / typecheck / optimize phases. The distinctions introduced to the expression tree by the changes described above would be eliminated by the simplify() phase, and we would then proceed along the current lines, probably including a rerun of the streamability analysis against the optimised expression tree (because the posture+sweep annotations are occasionally needed at run-time).</div><div><br /></div><div>I will do some further exploration to see whether this all looks feasible. It will be very hard to prove that we've got it 100% right. 
But in a sense that doesn't matter: so long as the design is sound and we are passing the known tests, we can report honestly that to the best of our knowledge the requirement is satisfied, which is not the case with the current approach.</div><div><br /></div> <h1>Tuple types, and type aliases</h1> <p><em>Michael Kay, 2016-09-08</em></p> I've been experimenting with some promising Saxon extensions.<div><br /></div><div>Maps and arrays greatly increase the flexibility and power of the XPath / XSLT / XQuery type system. But one drawback is that the type declarations can be very cumbersome, and very uninformative.</div><div><br /></div><div>Suppose you want to write a library to handle arithmetic on complex numbers. How are you going to represent a complex number? There are several possibilities: as a sequence of two doubles (<b>xs:double*</b>); as an array of two doubles (<b>array(xs:double)</b>); or as a map, for example <b>map{"r": 0.0e0, "i": 0.0e0}</b> (which has type <b>map(xs:string, xs:double)</b>).</div><div><br /></div><div>Note that whichever of these choices you make, (a) your choice is exposed to the user of your library by the way you declare the type in your function signatures, (b) the type allows many values that aren't legitimate representations of complex numbers, and (c) there's nothing in the type declaration that tells the reader of your code that this has anything to do with complex numbers.</div><div><br /></div><div>I think we can tackle these problems with two fairly simple extensions to the language.</div><div><br /></div><div>First, we can define type aliases. For XSLT, I have implemented an extension that allows you to declare (as a top-level element anywhere in the stylesheet):</div><div><br /></div> <pre><b>&lt;saxon:type-alias name="complex" type="map(xs:string, xs:double)"/&gt;</b></pre> <div>and then you can use this type alias (prefixed by a tilde) anywhere an item type is allowed, for example</div><div><br /></div> <pre><b>&lt;xsl:variable name="i" as="~complex" select="cx:complex(0.0, 1.0)"/&gt;</b></pre> <div>Secondly, we can define tuple types. So we can instead define our complex numbers as:</div><div><br /></div> <pre><b>&lt;saxon:type-alias name="complex" type="tuple(r: xs:double, i: xs:double)"/&gt;</b></pre> <div>We're not actually introducing tuples here as a fundamental new type with their own set of functions and operators. Rather, a tuple declaration defines constraints on a map. It lists the keys that must be present in the map, and the type of the value to be associated with each key. The keys here are the strings "r" and "i", and in both cases the value must be an xs:double. The keys are always NCNames, which plays well with the map lookup notation M?K; if $c is a complex number, then the real and imaginary parts can be referenced as $c?r and $c?i respectively.</div><div><br /></div><div>For this kind of data structure, tuple types provide a much more precise constraint over the contents of the map than the current map type does. They also provide much better static type checking: an expression such as $c?i can be statically checked (a) to ensure that "i" is actually a defined field in the tuple declaration, and (b) to ensure that the expression is used in a context where an xs:double value is expected.</div><div><br /></div>
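<div>As a sketch of how the alias might be used in practice (the cx:complex and cx:add function names are illustrative, not part of the extension):</div><div><br /></div> <pre><b>&lt;xsl:function name="cx:complex" as="~complex"&gt;
  &lt;xsl:param name="r" as="xs:double"/&gt;
  &lt;xsl:param name="i" as="xs:double"/&gt;
  &lt;xsl:sequence select="map{'r': $r, 'i': $i}"/&gt;
&lt;/xsl:function&gt;

&lt;xsl:function name="cx:add" as="~complex"&gt;
  &lt;xsl:param name="a" as="~complex"/&gt;
  &lt;xsl:param name="b" as="~complex"/&gt;
  &lt;xsl:sequence select="map{'r': $a?r + $b?r, 'i': $a?i + $b?i}"/&gt;
&lt;/xsl:function&gt;</b></pre> <div>The function bodies are ordinary XPath 3.1 map constructors; only the ~complex annotations depend on the extension.</div><div><br /></div>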
<div>I've been a little wary in the past of putting syntax extensions into Saxon; conformance to standards has always been a primary goal. But the standards process seems to be running out of steam, and I'm beginning to feel that it's time to push a few innovative ideas out in the product to keep things moving forward. For those who would prefer to stick entirely to stuff defined by W3C, rest assured that these features will only be available if you explicitly enable extensions.</div> <h1>Improving Compile-Time Performance</h1> <p><em>Michael Kay, 2016-06-22</em></p> For years we've been putting more and more effort into optimizing queries and stylesheets so that they would execute as fast as possible. For many workloads, in particular high throughput server-side transformations, that's a good strategy. But over the last year or two we've become aware that for some other workloads, it's the wrong thing to do.<div><br /></div><div>For example, if you're running a DocBook or DITA transformation from the command line, and the source document is only a couple of KB in size, then the time taken to compile the stylesheet greatly exceeds the actual transformation time. It might take 5 seconds to compile the stylesheet, and 50 milliseconds to execute it. (Both DocBook and DITA stylesheets are vast.) For many users, that's not an untypical scenario.</div><div><br /></div><div>If we look at the XMark benchmarks, specifically a query such as Q9, which is a fairly complex three-way join, the query executes against a 10Mb source document in just 9ms. But to achieve that, we spend 185ms compiling and optimizing the query. We also spend 380ms parsing the source document. So in an ad-hoc processing workflow, where you're compiling the query, loading a source document, and then running a query, the actual query execution cost is about 2% of the total. But it's that 2% that we've been measuring, and trying to reduce.</div><div><br /></div><div>We haven't entirely neglected the other parts of the process. For example, one of the most under-used features of the product is document projection, which enables you, during parsing, to filter out the parts of the document that the query isn't interested in. For query Q9 that cuts down the size of the source document by 65%, and reduces the execution time of the query to below 8ms. Unfortunately, although the memory saving is very useful, it actually increases the parsing time to 540ms. Some cases are even more dramatic: with Q2, the size of the source document is reduced by 97%; but parsing is still slowed down by the extra work of deciding which parts of the document to retain, and since the query only takes 2ms to execute anyway, there's no benefit other than the memory saving.</div><div><br /></div><div>For the DocBook and DITA scenarios (unlike XMark) it's the stylesheet compilation time that hurts, rather than the source document parsing time.
For a typical DocBook transformation of a small document, I'm seeing a stylesheet compile time of around 3 seconds, source document parsing time of around 0.9ms, and transformation time also around 0.9ms. Clearly, compile time here is far more important than anything else.</div><div><br /></div><div>The traditional answer to this has always been to compile the stylesheet once and then use it repeatedly. That works if you're running hundreds of transformations using the same stylesheet, but there are many workflows where this is impractical.</div><div><br /></div><div>Saxon 9.7 makes a big step forward by allowing the compiled form of a stylesheet to be saved to disk. This work was done as part of the implementation of XSLT 3.0 packages, but it doesn't depend on packages in any way and works just as well with 1.0 and 2.0 stylesheets. If we export the docbook stylesheets as a compiled package, and then run from this version rather than from source, the time taken for loading the compiled stylesheet is around 550ms rather than the original 3 seconds. That's a very useful saving, especially if you're processing lots of source documents using a pipeline written say using a shell script or Ant build where the tools constrain you to run one transformation at a time. (To ensure that exported stylesheet packages work with tools such as Ant, we've implemented it so that in any API where a source XSLT stylesheet is accepted, we also accept an exported stylesheet package.)</div><div><br /></div><div>But the best performance improvements are those where you don't have to do anything different to get the benefits (cynically, only about 2% of users will ever read the release notes). So we've got a couple of further projects in the pipeline.</div><div><br /></div><div>The first is simply raw performance tuning of the optimizer. There's vast potential for this once we turn our minds to it. What we have today has grown organically, and the focus has always been on getting the last ounce of run-time performance regardless of how long it takes to achieve it. One approach is to optimize a bit less thoroughly: we've done a bit of that recently in response to a user bug report showing pathological compilation times on an extremely large (20Mb) automatically generated stylesheet. But a better approach is to think harder about the data structures and algorithms we are using.</div><div><br /></div><div>Over the last few days I've been looking at how we do loop-lifting: that is, identifying subexpressions that can be moved out of a loop because each evaluation will deliver the same result. The current approach is that the optimizer does a recursive walk of the expression tree, and at each node in the tree, the implementation of that particular kind of expression looks around to see what opportunities there are for local optimization. Many of the looping constructs (xsl:for-each, xsl:iterate, for expressions, filter expressions, path expressions) at this point initiate a search of the subtree for expressions that can be lifted out of the loop. This means that with nested loops (a) we're examining the same subtrees once for each level of loop nesting, and (b) we're hoisting the relevant expressions up the tree one loop at a time, rather than moving them straight to where they belong. This is not only a performance problem; the code is incredibly complex, it's hard to debug, and it's hard to be sure that it's doing as effective a job as it should (for example, I only found during this exercise that we aren't loop-lifting subexpressions out of xsl:for-each-group).</div><div><br /></div>
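<div>A minimal example of the kind of rewrite at stake (illustrative, not Saxon's actual internal form): in</div><div><br /></div><pre><code>&lt;xsl:for-each select="//item"&gt;
  &lt;xsl:value-of select="@price * count($catalogue//discount)"/&gt;
&lt;/xsl:for-each&gt;</code></pre><div>the subexpression count($catalogue//discount) delivers the same value on every iteration, so it can be evaluated once (lazily) outside the loop, as if it had been bound to a variable before the xsl:for-each. With nested loops, an invariant expression should move straight out to the outermost level at which it is still invariant, which is exactly what hoisting one loop at a time fails to do cleanly.</div><div><br /></div>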
<div>In 9.7, as reported in previous blog posts, we made some improvements to the data structures used for the expression tree, but so far we've been making rather little use of this. One improvement was to add parent pointers, which enables optimizations to work bottom-up rather than top-down. Another improvement was a generic structure for holding the links from a parent node to its children, using an Operand object that (a) holds properties of the relationship (e.g. it tells you when the child expression is evaluated with a different focus from the parent), and (b) is updatable, so a child expression can replace itself by some different expression without needing the parent expression to get involved. These two improvements have enabled a complete overhaul of the way we do loop-lifting. Without knowing anything about the semantics of different kinds of expressions, we can now do a two-phase process: first we do a scan over the expression tree for a function or template to identify, for each node in the tree, what its "innermost scoping node" is: for example an expression such as "$i + @x" is scoped both by the declaration of $i and by the instruction (e.g. xsl:for-each) that sets the focus, and the innermost scoping expression is the inner one of these two. Then, in a second pass, we hoist every expression that's not at the same looping level as its innermost scoping expression to be evaluated (lazily) outside that loop. The whole process is dramatically simpler and faster than what we were doing before, and at least as effective - possibly in some cases more so.</div><div><br /></div><div>The other project we're just starting on is to look at just-in-time compilation. The thing about stylesheets like DocBook is that they contain zillions of template rules for processing elements which typically don't appear in your average source document. So why waste time compiling template rules that are never used? All we really need to do is make a note of the match patterns, build the data structures we use to identify which rule is the best match for a node, and then do the work of compiling that rule the first time it is used. Indeed, the optimization and byte-code generation work can be deferred until we know that the rule is going to be used often enough to make it worthwhile. We're starting this project (as one should start all performance projects) by collecting instrumentation, so we can work out exactly how much time we are spending in each phase of compilation; that will tell us how much we should be doing eagerly and how much we should defer. There's a trade-off with usability here: do users want to be told about errors found while type-checking parts of the stylesheet that aren't actually exercised by a particular run?</div><div><br /></div><div>Plenty of ideas to keep us busy for a while to come.</div> <h1>Introducing Saxon-JS</h1> <p><em>Michael Kay, 2016-02-13</em></p>
<p>At XML Prague yesterday we got a spontaneous round of applause when we showed the animated Knight's tour application, reimplemented to use XSLT 3.0 maps and arrays, running in the browser using a new product called Saxon-JS.</p> <p>So, people will be asking, what exactly is Saxon-JS?</p> <p>Saxon-EE 9.7 introduces a new option -export which allows you to export a compiled stylesheet, in XML format, to a file: rather like producing a .so file from a C compiler, or a JAR file from a Java compiler. The compiled stylesheet isn't executable code, it's a decorated abstract syntax tree containing, in effect, the optimized stylesheet execution plan. There are two immediate benefits: loading a compiled stylesheet is much faster than loading the original source code, so if you are executing the same stylesheet repeatedly the cost of compilation is amortized; and in addition, it enables you to distribute XSLT code to your users with a degree of intellectual property protection analogous to that obtained from compiled code in other languages. (As with Java, it's not strong encryption - it wouldn't be too hard to write a fairly decent decompiler - but it's strong enough that most people won't attempt it.)</p> <p>Saxon-JS is an interpreter, written in pure Javascript, that takes these compiled stylesheet files and executes them in a Javascript environment - typically in the browser, or on Node.js. Most of our development and testing is actually being done using Nashorn, a Javascript engine bundled with Java 8, but that's not a serious target environment for Saxon-JS because if you've got Nashorn then you've got Java, and if you've got Java then you don't need Saxon-JS.</p> <p>Saxon-JS can also be seen as a rewrite of Saxon-CE. Saxon-CE was our first attempt at doing XSLT 2.0 in the browser. It was developed by producing a cut-down version of the Java product, and then cross-compiling this to Javascript using Google's GWT cross-compiler. The main drawbacks of Saxon-CE, at a technical level, were the size of the download (800Kb or so), and the dependency on GWT which made testing and debugging extremely difficult - for example, there was no way of testing our code outside a browser environment, which made running of automated test scripts very time-consuming and labour-intensive. There were also commercial factors: Saxon-CE was based on a fork of the Saxon 9.3 Java code base and re-basing to a later Saxon version would have involved a great deal of work; and there was no revenue stream to fund this work, since we found a strong expectation in the market that this kind of product should be free. As a result we effectively allowed the product to become dormant.</p> <p>We'll have to see whether Saxon-JS can overcome these difficulties, but we think it has a better chance. Because it depends on Saxon-EE for the front end (that is, there's a cost to developers but the run-time will be free) we're hoping that there'll be a revenue stream to finance support and ongoing development; and although the JS code is not just a fork but a complete rewrite of the run-time code, the fact that it shares the same compiler front end means that it should be easier to keep in sync.</p>
<p>Development has been incredibly rapid - we only started coding at the beginning of January, and we already have about 80% of the XSLT 2.0 tests running - partly because Javascript is a powerful language, but mainly because there's little new design involved. We know how an XSLT engine works, we only have to decide which refinements to leave out. We've also done client-side XSLT before, so we can take the language extensions of Saxon-CE (how to invoke templates in response to mouse events, for example), the design of its Javascript APIs, and also some of its internal design (like the way event bubbling works), and reimplement these for Saxon-JS.</p> <p>One of the areas where we have to make design trade-offs is deciding how much standards conformance, performance, and error diagnostics to sacrifice in the interests of keeping the code small. There are some areas where achieving 100% conformance with the W3C specs will be extremely difficult, at least until ES6 is available everywhere: an example is support for Unicode in regular expressions. For performance, memory usage (and therefore expression pipelining) is important, but getting the last ounce of processor efficiency less so. An important factor (which we never got quite right for Saxon-CE) is asynchronous access to the server for the doc() and document() functions - I have ideas on how to do this, but it ain't easy.</p> <p>It will be a few weeks before the code is robust enough for an alpha release, but we hope to get this out as soon as possible. There will probably then be a fairly extended period of testing and polishing - experience suggests that when the code is 90% working, you're less than half way there.</p> <p>I haven't yet decided on the licensing model. Javascript by its nature has no technical protection, but that doesn't mean we have to give it an open source license (which would allow anyone to make changes, or to take parts of the code for reuse in other projects).</p> <p>All feedback is welcome: especially on opportunities for exploiting the technology in ways that we might not have thought of.</p> <h1>Parent pointers in the Saxon expression tree</h1> <p><em>Michael Kay, 2015-09-11</em></p>
A while ago (<a href="http://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html">http://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html</a>) I wrote about my plans for the Saxon expression tree. This note is an update.<div><br /></div><div>We've made a number of changes to the expression tree for 9.7.</div><div><br /></div><div><ul><li>Every node in the tree (every expression) now references a Location object, providing location information for diagnostics (line number, column number, etc). Previously the expression node implemented the SourceLocator interface, which meant it provided this information directly. The benefit is that we can now have different kinds of Location object. In XQuery we will typically hold the line and column and module URI. In XSLT, for a subexpression within an XPath expression, we can now hold both the offset within the XPath expression, and the path to the containing node within the XSLT stylesheet. Hopefully debuggers and editing tools such as oXygen and Stylus Studio will be able to take advantage of the improved location information to lead users straight to the error location in the editor. Where an expression has the same location information as its parent or sibling expressions, the Location object is shared.</li></ul></div><div><br /></div><div>Another reason for changing the way we hold location information is connected with the move to separately-compiled packages in XSLT 3.0. This means that the system we previously used, of globally-unique integer "location identifiers" which are translated into real location information by reference to a central "location provider" service, is no longer viable.</div><div><br /></div><div><ul><li>Every node in the tree now points to a RetainedStaticContext object which holds that part of the static context which can vary from one expression to another, and which can be required at run-time. Previously we only attempted to retain the parts of the static context that each kind of expression actually used. The parts of the static context that this covers include the static base URI, in-scope namespaces, the default collation, and the XPath 1.0 compatibility flag. Retaining the whole static context might seem extravagant. But in fact, it very rarely changes, so a child expression will nearly always point to the same RetainedStaticContext object as its parent and sibling expressions.</li></ul></div><div><br /></div><div><ul><li>Every node in the tree now points to its parent node. This choice has proved tricky. It gives many advantages: it means that the code for every expression can easily find details of the containing package, the configuration options, and a host of details about the query or stylesheet as a whole. The fact that we have a parent node eliminates the need for the "container" object (typically the containing function or template) which we held in previous releases. It also reduces the need to pass additional information to methods on the Expression class, for example methods to determine the item type and cardinality of the expression. There is a significant downside to holding this information, which is the need to keep it consistent.
Some of the tree rewrite operations performed by the optimizer are complex enough without having to worry about keeping all the parent pointers correct. And it turns out to be quite difficult to enforce consistency through the normal "private data, public methods" encapsulation techniques: those work when you have to keep the data in a single object consistent, but they aren't much use for maintaining mutual consistency between two different objects. In any case it seems to be unavoidable that to achieve the kind of tree rewrites we want to perform, the tree has to be temporarily inconsistent at various stages.</li></ul></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><span style="font-size: 1em;">Using parent pointers means that you can't share subtrees. It means that when you perform operations like inlining a function, you can't just reference the subtree that formed the body of the function, you have to copy it. This might seem a great nuisance. But actually, this is not a new constraint. It never was safe to share subtrees, because the </span>optimiser would happily make changes to a subtree without knowing that there were other interested parties. The bugs this caused have been an irritation for years. The introduction of parent pointers makes the constraint more explicit, and makes it possible to perform integrity checking on the tree to discover when we have inadvertently violated the constraints.</div><div><br /></div><div>During development we've had diagnostic code switched on that checks the integrity of the tree and outputs warnings if problems are found. We've gradually been examining these and eliminating them. The problems can be very hard to diagnose, because the detection of a problem in the data may indicate an error that occurred in a much earlier phase of processing. We've developed some diagnostic tools for tracing the changes made to a particular part of the tree and correlating these with the problems detected later. Most of the problems, as one might expect, are connected with optimization rewrites. A particular class of problem occurs with rewrites that are started but then not completed (because problems are found), or with "temporary" rewrites that are designed to create an equivalent expression suitable for analysis (say for streamability analysis or for schema-aware static type-checking) but which are not actually intended to affect the run-time interpreted tree. The discipline in all such cases is to copy the part of the tree you want to work on, rather than making changes in-situ.</div><div><br /></div><div>For some non-local rewrites, such as loop-lifting optimizations, the best strategy seems to be to ignore the parent pointers until the rewrite is finished, and then restore them during a top-down tree-walk.</div><div><br /></div><div>The fact that we now have parent pointers makes context-dependent optimizations much easier. Checking, for example, whether a variable reference occurs within a loop (a "higher-order expression" as the XSLT 3.0 spec calls it) is now straightforward: it can be done by searching upwards from the variable reference rather than retaining context information in an expression visitor as you walk downwards.</div>
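<div><br /></div><div>As a concrete illustration, the check can now be written as a simple upward walk. This is a sketch only: the method names here are invented for illustration, not Saxon's actual API.</div><div><br /></div><pre><code>// Sketch: is a variable reference evaluated repeatedly (a "higher-order"
// expression), relative to the expression that binds the variable? Walk
// the parent pointers upwards from the reference towards the binding.
private static boolean isHigherOrder(Expression ref, Expression binding) {
    Expression child = ref;
    Expression parent = ref.getParentExpression();
    while (parent != null &amp;&amp; parent != binding) {
        // hypothetical test: does this parent evaluate the child more than
        // once, as a for-expression body or a predicate would?
        if (parent.evaluatesChildRepeatedly(child)) {
            return true;
        }
        child = parent;
        parent = parent.getParentExpression();
    }
    return false;
}
</code></pre>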
<div>Similarly, if there is a need to replace one expression by another (a variable reference by a literal constant, say), the fact that the variable reference knows its own parent makes the substitution much easier.</div><div><br /></div><div>So although the journey has had a few bumps, I'm reasonably confident that we will see long-term benefits.</div></blockquote><div><br /></div> Lazy Evaluation tag:dev.saxonica.com,2015:/blog/mike//3.212 2015-06-28T23:15:13Z 2015-06-28T23:27:43Z We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using &lt;xsl:for-each&gt; with the saxon:threads attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the... Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using </span><font face="American Typewriter, monospace" style="font-size: 1em;">&lt;xsl:for-each&gt;</font><span style="font-size: 1em;"> with the </span><font face="American Typewriter, monospace" style="font-size: 1em;">saxon:threads</font><span style="font-size: 1em;"> attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the lazy evaluation code in Saxon to try and understand all the permutations of how it works. A blog posting is a good way to try and capture that understanding before I forget it all again. But I won't go into the extra complexities of parallel execution just yet: I'll come back to that at the end.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Lazy evaluation applies when a variable binding, for example "<font face="American Typewriter, monospace">let $v := //x[@y=3]</font>", isn't evaluated immediately when the variable declaration is encountered, but only when the variable is actually referenced. This is possible in functional languages because evaluating an expression has no side-effects, so it doesn't matter when (or how often) it is done. In some functional languages such as Scheme, lazy evaluation happens only if you explicitly request it. In others, such as Haskell, lazy evaluation is mandated by the language specification (which means that a variable can hold an infinite sequence, so long as you don't try to process its entire value). In XSLT and XQuery, lazy evaluation is entirely at the discretion of the compiler, and in this post I shall try to summarize how Saxon makes use of this freedom.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Internally, when a local variable is evaluated lazily, Saxon, instead of putting the variable's value in the relevant slot on the stack, puts there a data structure that contains all the information needed to evaluate the variable: that is, the expression itself, and any part of the evaluation context on which it depends. In Saxon this data structure is called a Closure. The terminology isn't quite right, because it's not quite the same thing as the closure of an inline function, but the concepts are closely related: in some languages, lazy evaluation is implemented by storing, as the value of the variable, not the variable's actual value, but a function which delivers that value when invoked, and the data needed by this function to achieve that task is correctly called a closure. 
(If higher-order functions had been available in Saxon a few years earlier, we might well have implemented lazy evaluation this way.) </p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We can distinguish two levels of lazy evaluation. We might use the term "deferred evaluation" to indicate that a variable is not evaluated until it is first referenced, and "incremental evaluation" to indicate that when it is referenced, it is only evaluated to the extent necessary. For example, if the first reference is the function call <font face="American Typewriter, monospace">head($v)</font>, only the first item in the sequence $v will be evaluated; remaining items will only be evaluated if a subsequent reference to the variable requires them.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Lazy evaluation can apply to global variables, local variables, parameters of templates and functions, and return values from templates and functions. Saxon handles each case slightly differently.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We should mention some static optimizations which are not directly related to lazy evaluation, but are often confused with it. First, a variable that is never referenced is eliminated at compile-time, so its initializing expression is never evaluated at all. Secondly, a variable that is only referenced once, and where the reference is not in any kind of loop, is inlined: that is, the variable reference is replaced by the expression used to initialize the variable, and the variable itself is then eliminated. So when someone writes "<font face="American Typewriter, monospace">let $x := /a/b/c return $x[d=3]</font>", Saxon turns this into the expression "<font face="American Typewriter, monospace">(/a/b/c)[d=3]</font>". (Achieving this of course requires careful attention to the static and dynamic context, but we won't go into the details here.)</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Another static optimization that interacts with variable evaluation is loop-lifting. If an expression within a looping construct (for example the content of xsl:for-each, or of a predicate, or the right-hand-side of the "/" operator) will have the same value for every iteration of the loop, then a new local variable bound to this expression is created outside the loop, and the original expression is replaced by a reference to the variable. In this situation we need to take care that the expression is not evaluated unless the loop is executed at least once (both to avoid wasted evaluation cost, and to give the right behaviour in the event that evaluating the expression fails with a dynamic error.) So lazy evaluation of such a variable becomes mandatory.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The combined effect of these static optimizations, together with lazy evaluation, is that the order of evaluation of expressions can be quite unintuitive. To enable users to understand what is going on when debugging, it is therefore normal for some of these rewrites to be suppressed if debugging or tracing are enabled.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For global variables, Saxon uses deferred evaluation but not incremental evaluation. 
A global variable is not evaluated until it is first referenced, but at that point it is completely evaluated, and the sequence representing its value is held in memory in its entirety.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For local variables, evaluation is generally both deferred and incremental. However, the rules are quite complex.</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">If the static type shows that the value will be a singleton, then it will be evaluated eagerly. [It's not at all clear that this rule makes sense. Certainly, incremental evaluation makes no sense for singletons. But deferred evaluation could still be very useful, for example if the evaluation is expensive and the variable is only referenced within a branch of a conditional, so the value is not always needed.]</p> </li><li><p style="margin-bottom: 0cm">Eager evaluation is used when the binding expression is very simple: in particular when it is a literal or a reference to another variable.</p> </li><li><p style="margin-bottom: 0cm">Eager evaluation is used for binding expressions that depend on <font face="American Typewriter, monospace">position()</font> or <font face="American Typewriter, monospace">last()</font>, to avoid the complexities of saving these values in the Closure.</p> </li><li><p style="margin-bottom: 0cm">There are some optimizations which take precedence over lazy evaluation. For example if there are variable references using predicates, such as <font face="American Typewriter, monospace">$v[@x=3]</font>, then the variable will not only be evaluated eagerly, but will also be indexed on the value of the attribute <font face="American Typewriter, monospace">@x</font>. Another example: if a variable is initialized to an expression such as <font face="American Typewriter, monospace">($v, x)</font> - that is, a sequence that appends an item to another variable - then we use a "shared append expression" which is a data structure that allows a sequence to be constructed by appending to an existing sequence without copying the entire sequence, which is a common pattern in algorithms using head-tail recursion.</p> </li><li><p style="margin-bottom: 0cm">Lazy evaluation (and inlining) need special care if the variable is declared outside a try/catch block, but is referenced within it. In such a case a dynamic error that occurs while evaluating the initialization expression must not be caught by the try/catch; it is logically outside its scope. (Writing this has made me realise that this is not yet implemented in Saxon; I have written a test case and it currently fails.)</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">If none of these special circumstances apply, lazy evaluation is chosen. There is one more choice to be made: between a <font face="American Typewriter, monospace">Closure</font> and a <font face="American Typewriter, monospace">MemoClosure</font>. The common case is a <font face="American Typewriter, monospace">MemoClosure</font>, and in this case, as the variable is incrementally evaluated, the value is saved for use when evaluating subsequent variable references. A (non-memo) closure is used when it is known that the value will only be needed once. Because most such cases have been handled by variable inlining, the main case where a non-memo closure is used is for the return value of a function. 
Functions, like variables, are lazily evaluated, so that the value returned to the caller is not actually a sequence in memory, but a closure containing all the information needed to materialize the sequence. (Like most rules in this story, there is an important exception: tail-call optimization, where the last thing a function does is to call itself, takes precedence over lazy evaluation).</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So let's look more closely at the <font face="American Typewriter, monospace">MemoClosure</font>. A <font face="American Typewriter, monospace">MemoClosure</font> is a data structure that holds the following information:</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">The Expression itself (a pointer to a node in the expression tree). The Expression object also holds any information from the static context that is needed during evaluation, for example namespace bindings.</p> </li><li><p style="margin-bottom: 0cm">A copy of the dynamic context at the point where the variable is bound. This includes the context item, and values of any local variables referenced by the expression.</p> </li><li><p style="margin-bottom: 0cm">The current evaluation state: one of UNREAD (no access to the variable has yet been made), MAYBE_MORE (some items in the value of the variable are available, but there may be more to come), ALL_READ (the value of the variable is fully available), BUSY (the variable is being evaluated), or EMPTY (special case of ALL_READ in which the value is known to be an empty sequence).</p> </li><li><p style="margin-bottom: 0cm">An <font face="American Typewriter, monospace">InputIterator</font>: an iterator over the results of the expression, relevant when evaluation has started but has not finished</p> </li><li><p style="margin-bottom: 0cm">A reservoir: a list containing the items delivered by the InputIterator so far.</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"> Many variable references, for example <font face="American Typewriter, monospace">count($v)</font>, or <font face="American Typewriter, monospace">index-of($v, 'z')</font> result in the variable being evaluated in full. 
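</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Putting those pieces together, the skeleton looks roughly like this (an illustrative sketch only; the real class carries considerably more machinery, and the field types here are simplified):</p> <pre><code>// Illustrative skeleton of a MemoClosure
public class MemoClosure implements Sequence {
    private Expression expression;          // the unevaluated binding expression
    private XPathContext savedContext;      // snapshot of the dynamic context at the point of binding
    private int state;                      // UNREAD, MAYBE_MORE, ALL_READ, BUSY, or EMPTY
    private SequenceIterator inputIterator; // delivers items not yet read from the expression
    private List&lt;Item&gt; reservoir;      // items delivered by the inputIterator so far
}
</code></pre> <p style="margin-bottom: 0cm">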
If this is the first reference to the variable, that is if the state is UNREAD, the logic is essentially:</p> <pre><code>inputIterator = expression.iterate(savedContext);
for item in inputIterator {
    reservoir.add(item);
}
state = ALL_READ;
return new SequenceExtent(reservoir);
</code></pre> <p style="margin-bottom: 0cm">(However, Saxon doesn't optimize this case, and it occurs to me on writing this that it could.)</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Other variable references, such as <font face="American Typewriter, monospace">head($v)</font>, or <font face="American Typewriter, monospace">$v[1]</font>, or <font face="American Typewriter, monospace">subsequence($v, 1, 5)</font>, require only partial evaluation of the expression. In such cases Saxon creates and returns a <font face="American Typewriter, monospace">ProgressiveIterator</font>, and the requesting expression reads as many items from the <font face="American Typewriter, monospace">ProgressiveIterator</font> as it needs. Requests to get items from the <font face="American Typewriter, monospace">ProgressiveIterator</font> fetch items from the reservoir to the extent they are available; on exhaustion of the reservoir, they then attempt to fetch items from the InputIterator until either enough items are available, or the InputIterator is exhausted. Items delivered from the <font face="American Typewriter, monospace">InputIterator</font> are copied to the reservoir as they are found.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So far so good. This has all been in place for years, and works well. We have no evidence that it is in any way optimal, but it has been carefully tweaked over the years to deal with particular cases where it was performing badly. What has changed recently is that local variables can be referenced from multiple threads. There are two particular cases where this happens today: when <font face="American Typewriter, monospace">xsl:result-document</font> is used in Saxon-EE, it executes by default asynchronously in a new thread; and when the extension attribute <font face="American Typewriter, monospace">saxon:threads</font> is used on <font face="American Typewriter, monospace">xsl:for-each</font>, the items selected by the <font face="American Typewriter, monospace">xsl:for-each</font> are processed in parallel rather than sequentially.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The effect of this is that the MemoClosure object needs to be thread-safe: multiple requests to access the variable can come simultaneously from different threads. To achieve this a number of methods are synchronized. 
One of these is the <font face="American Typewriter, monospace">next()</font> method of the <font face="American Typewriter, monospace">ProgressiveIterator</font>: if two threads reference the variable at the same time, each gets its own <font face="American Typewriter, monospace">ProgressiveIterator</font>, and the <font face="American Typewriter, monospace">next()</font> method on one of these iterators is forced to wait until the other has finished.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This works, but it is risky. Brian Goetz in his excellent book <i>Java Concurrency in Practice</i> recommends that a method should not be synchronized unless (a) its execution time is short, and (b) as the author of the method, you know exactly what code will execute while it is active. In this case neither condition is satisfied. The <font face="American Typewriter, monospace">next()</font> method of <font face="American Typewriter, monospace">ProgressiveIterator</font> calls the <font face="American Typewriter, monospace">next()</font> method of the <font face="American Typewriter, monospace">InputIterator</font>, and this may perform expensive computation, for example retrieving and parsing a document using the <font face="American Typewriter, monospace">doc()</font> function. Further, we have no way of analyzing exactly what code is executed: in the worst case, it may include user-written code (for example, an extension function or a <font face="American Typewriter, monospace">URIResolver</font>). The mechanism can't deadlock with itself (because there cannot be a cycle of variable references) but it is practically impossible to prove that it can't deadlock with other subsystems that use synchronization, and in the face of maliciously-written user code, it's probably safe to assume that deadlock <b>can</b> occur. We haven't seen deadlock happen in practice, but it's unsatisfactory that we can't prove its impossibility.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So what should we do about it?</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I think the answer is: add yet another exception to the list of cases where lazy evaluation is used; specifically, don't use it for a variable that can be referenced from a different thread. I'm pretty sure it's possible to detect such cases statically, and they won't be very common. In such cases, use eager evaluation instead.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We must be careful not to do this in the case of a loop-lifted variable, where the correct error semantics depend on lazy evaluation. So another tweak to the rules is: don't loop-lift code out of a multithreaded execution block.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This investigation also suggests a few other refinements we might make.</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">It seems worth optimizing for the case where the entire value of a variable is needed, since this case is so common. 
The problem is, it's not easy to detect this case: a calling expression such as <font face="American Typewriter, monospace">count($v)</font> will ask for an iterator over the variable value, without giving any indication that it intends to read the iterator to completion.</p> </li><li><p style="margin-bottom: 0cm">We need to reassess the rule that singleton local variables are evaluated eagerly.</p> </li><li><p style="margin-bottom: 0cm">We currently avoid using lazy evaluation for expressions with certain dependencies on the dynamic context (for example, <font face="American Typewriter, monospace">position()</font> and <font face="American Typewriter, monospace">last()</font>). But in the course of implementing higher-order functions, we have acquired the capability to hold such values in a saved copy of the dynamic context.</p> </li><li><p style="margin-bottom: 0cm">We could look at a complete redesign that takes advantage of higher-order functions and their closures. This might be much simpler than the current design; but it would discard the benefits of years of fine-tuning of the current design.</p> </li><li><p style="margin-bottom: 0cm">I'm not convinced that it makes sense for a <font face="American Typewriter, monospace">MemoClosure</font> to defer creation of the <font face="American Typewriter, monospace">InputIterator</font> until the first request for the variable value. It would be a lot simpler to call <font face="American Typewriter, monospace">inputIterator = expression.iterate(context)</font> at the point of variable declaration; in most cases the implementation will defer evaluation to the extent that this makes sense, and this approach saves the cost of the elaborate code to save the necessary parts of the dynamic context. It's worth trying the other approach and making some performance measurements. </p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><br /> </p> A redesign of the NamePool tag:dev.saxonica.com,2015:/blog/mike//3.211 2015-06-24T14:39:33Z 2015-06-24T15:19:25Z As explained in my previous post, the NamePool in Saxon is a potential problem for scaleability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum... Michael Kay As explained in my previous post, the NamePool in Saxon is a potential problem for scaleability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum of one million QNames, and performance starts getting seriously bad long before this limit is reached.<div><br /></div><div>Essentially, the old NamePool is a home-grown hash table. It uses a fixed number of buckets (1024), and when hash collisions occur, the chains of hash duplicates are searched serially. The fact that the number of buckets is fixed, and entries are only added to the end of a chain, is what makes it (reasonably) safe for read access to the pool to occur without locking.</div><div><br /></div><div>One thing I have been doing over a period of time is to reduce the amount of unnecessary use of the NamePool. Most recently I've changed the implementation of the schema component model so that references from one schema component to another are no longer implemented using NamePool fingerprints. 
But this is peripheral: the core usage of the NamePool for comparing names in a query against names in a source document will always remain the dominant usage, and we need to make this scaleable as parallelism increases.</div><div><br /></div><div>Today I've been exploring an alternative design for the NamePool (and some variations on the implementation of the design). The new design has at its core two Java ConcurrentHashMaps, one from QNames to fingerprints, and one from fingerprints to QNames. The ConcurrentHashMap, which was introduced in Java 5, doesn't just offer safe multi-threaded access, it also offers very low contention: it uses fine-grained locking to ensure that multiple writers, and any number of readers, can access the data structure simultaneously.</div><div><br /></div><div>Using two maps, one of which is the inverse of the other, at first seemed a problem. How can we ensure that the two maps are consistent with each other, without updating both under an exclusive lock, which would negate all the benefits? The answer is that we can't completely, but we can get close enough.</div><div><br /></div><div>The logic is like this:</div><div><br /></div><div><pre><code>private final ConcurrentHashMap&lt;StructuredQName, Integer&gt; qNameToInteger = new ConcurrentHashMap&lt;StructuredQName, Integer&gt;(1000);
private final ConcurrentHashMap&lt;Integer, StructuredQName&gt; integerToQName = new ConcurrentHashMap&lt;Integer, StructuredQName&gt;(1000);
private final AtomicInteger unique = new AtomicInteger();

// Allocate a fingerprint to a QName
Integer existing = qNameToInteger.get(qName);
if (existing != null) {
    return existing;
}
Integer next = unique.getAndIncrement();
existing = qNameToInteger.putIfAbsent(qName, next);
if (existing == null) {
    integerToQName.put(next, qName);
    return next;
} else {
    return existing;
}
</code></pre></div><div><br /></div><div>Now, there are several things slightly unsafe about this. We might find that the QName doesn't exist in the map on our first look, but by the time we get to the "putIfAbsent" call, someone else has added it. The worst that happens here is that we've used up an integer from the "unique" sequence unnecessarily. Also, someone else doing concurrent read access might see the NamePool in a state where one map has been updated and the other hasn't. But I believe this doesn't matter: clients aren't going to look for a fingerprint in the map unless they have good reason to believe that fingerprint exists, and it's highly implausible that this knowledge comes from a different thread that has only just added the fingerprint to the map.</div><div><br /></div><div>There's another ConcurrentHashMap involved as well, which is a map from URIs to lists of prefixes used in conjunction with that URI. I won't go into that detail.</div><div><br /></div><div>The external interface to the NamePool doesn't change at all by this redesign. We still use 20-bit fingerprints plus 10-bit prefix codes, so we still have the limit of a million distinct names. But performance no longer degrades when we get close to that limit; and the limit is no longer quite so hard-coded.</div><div><br /></div><div>My first attempt at measuring the performance of this found the expected benefits in scalability as the concurrency increases and as the size of the vocabulary increases, but the performance under more normal conditions was worse than the existing design: execution time of 5s versus 3s for executing 100,000 cycles, each of which performed an addition (from a pool of 10,000 distinct names, so 90% of the additions were already present) followed by 20 retrievals.</div><div><br /></div><div>I suspected that the performance degradation was caused by the need to update two maps, whereas the existing design only uses one (it's cleverly done so that the fingerprint generated for a QName is closely related to its hash key, which enables us to use the fingerprint to navigate back into the hash table to reconstruct the original QName).</div><div><br /></div><div>But it turned out that the cause was somewhere else. The old NamePool design was hashing QNames by considering only the local part of the name and ignoring the namespace URI, whereas the new design was computing a hash based on both the local name and the URI. Because URIs are often rather long, computing the hash code is expensive, and in this case it adds very little value: it's unusual for the same local name to be associated with more than one URI, and when it happens, the hash table is perfectly able to cope with the collision.</div>
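<div><br /></div><div>In other words, a hash function along these lines is all that is needed (a sketch: the class shape and field names are assumed for illustration, not taken from the actual source):</div><div><br /></div><pre><code>// Hash on the local part only: hashing a long namespace URI is expensive
// and rarely adds any discriminating power.
@Override
public int hashCode() {
    return localPart.hashCode();
}

@Override
public boolean equals(Object obj) {
    if (!(obj instanceof StructuredQName)) {
        return false;
    }
    StructuredQName other = (StructuredQName) obj;
    // equality must still compare the URI, even though the hash ignores it
    return localPart.equals(other.localPart) &amp;&amp; uri.equals(other.uri);
}
</code></pre>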
<div>By changing the hashing on QName objects to consider only the local name, the costs for the new design came down slightly below the current implementation (about 10% better, not enough to be noticeable).</div><div><br /></div><div>So I feel comfortable putting this into production. There are a dozen test cases failing (out of 20,000) which I need to sort out first, but it all looks very promising.</div> Another look in the NamePool tag:dev.saxonica.com,2015:/blog/mike//3.210 2015-06-22T08:59:45Z 2015-06-22T09:09:59Z I've been looking again at the implementation of some of the parallel processing features in Saxon (see my XMLPrague 2015 paper) and at how best to make use of the facilities in the Java platform to support them. In... Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">I've been looking again at the implementation of some of the parallel processing features in Saxon (see my <a href="http://www.saxonica.com/papers/xmlprague-2015mhk.pdf">XMLPrague 2015 paper</a></span><span style="font-size: 1em;">) and at how best to make use of the facilities in the Java platform to support them. In the course of this I've been studying Brian Goetz's excellent book Java Concurrency in Practice, which although dated, is still probably the best text available on the subject; and the fact that it only takes you up to Java 6 is an advantage in our case, because we still want Saxon to run on Java 6.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Reading the book has made me think again about the venerable design of Saxon's NamePool, which is the oldest thing in the product where multithreading is relevant. The NamePool is basically a shared data structure holding QNames.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The design of the NamePool hasn't changed much over the years. On the whole it works well, but there are a number of limitations:</p> <p style="margin-bottom: 0cm"><br /> </p> <ul> <li><p style="margin-bottom: 0cm">Updating the NamePool requires an exclusive lock, and on occasions there has been heavy contention on that lock, which reduces the effectiveness of running more threads.</p> </li><li><p style="margin-bottom: 0cm">We read the NamePool without acquiring a lock. All the textbooks say that's bad practice because of the risk of subtle bugs. We've been using this design for over a dozen years without a single one of these subtle bugs coming to the surface, but that doesn't mean they aren't there. It's very hard to prove that the design is thread-safe, and it might only take an unusual JVM, or an unusual hardware architecture, or a massively parallel application, or pathological data (such as a local name that appears in hundreds of namespaces) for a bug to suddenly appear: which would be a serious embarrassment.</p> </li><li><p style="margin-bottom: 0cm">The fact that we read the NamePool with no lock means that the data structure itself is very conservative. We use a fixed number of hash buckets (1024), and a chain of names within each bucket. We only ever append to the end of such a chain. If the vocabulary is large, the chains can become long, and searching then takes time proportional to the length of the chains. 
Any attempt to change the number of buckets on the fly is out of the question so long as we have non-locking readers. So performance degrades with large vocabularies.</p> </li><li><p style="margin-bottom: 0cm">We've got a problem coming up with XSLT 3.0 packages. We want packages to be independently compiled, and we want them to be distributed. That means we can't bind names to fingerprints during package construction, because the package will have to run with different namepools at run-time. We can probably solve this problem by doing the binding of names at package "load" or "link" time; but it's a change of approach and that invites a rethink about how the NamePool works.</p> </li></ul> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Although the NamePool design hasn't changed much over the years, we've changed the way it is used: which essentially means, we use it less. Some while ago we stopped using the NamePool for things such as variable names and function names: it is now used only for element names, attribute names, and type names. Around Saxon 9.4 we changed the Receiver interface (Saxon's ubiquitous interface for passing push-mode events down a pipeline) so that element and attribute names were represented by a NodeName object instead of an integer fingerprint. The NodeName can hold a name either in string form or as an integer code, or both, so this change meant that we didn't have to allocate a NamePool fingerprint just to pass events down this pipeline, which in turn meant that we didn't have to allocate fingerprints to constructed elements and attributes that were going to be immediately serialized. We also stopped using the NamePool to allocate codes to (prefix, uri) pairs representing namespace bindings.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">These changes have been pretty effective and it's a while since we have seen a workload suffer from NamePool contention. However, we want to increase the level of parallelism that Saxon can support, and the NamePool remains a potential pinch point.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">There are a number of things we can do that would make a difference. We could for example use a Java <a href="http://tutorials.jenkov.com/java-util-concurrent/readwritelock.html">ReadWriteLock</a>&nbsp;to allow either a single writer or multiple readers; this would allow us to introduce operations such as reconfiguring the hash table as the size of the vocabulary increases, without increasing contention because of the high frequency of read access.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">But let's first try and remind ourselves what the NamePool is actually for. It is there, first and foremost, to allow fast run-time testing of whether a particular node satisfies a NameTest. Because we use the same NamePool when constructing a source tree and when compiling a stylesheet or query, the node on the tree and the NameTest in the compiled code both know the integer fingerprint of the name, and testing the node against the NameTest therefore reduces to a fast integer comparison. This is undoubtedly far faster than doing a string comparison, especially one involving long URIs.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">If that was the only thing we used the NamePool for, then we would only need a single method, <b>allocateFingerprint(namespace, localName)</b>. 
What are all the other methods there for?</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Well, firstly, the NamePool knows the mapping of fingerprints to names, so we have read-only get methods to get the fingerprint corresponding to a name, or the name corresponding to a fingerprint. These are a convenience, but it seems that they are not essential. The client code that calls the NamePool to allocate a fingerprint to a name could retain the mapping somewhere else, so that there is no need to go back to a shared NamePool to rediscover what is already known.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The most obvious case where this happens is with the TinyTree. The TinyTree holds the names of elements and attributes as fingerprints, not as strings, so operations like the XPath <b>local-name()</b> and <b>namespace-uri()</b> functions get the fingerprint from the TinyTree and then call on the NamePool to translate this back to a string. We could avoid this by keeping a map from integers to strings within the TinyTree itself. This could potentially have other benefits: we could make fewer calls on the NamePool to allocate fingerprints during tree construction; and retargeting a TinyTree to work with a different NamePool would be easier.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Secondly, there's a lot of code in the NamePool to manage prefixes. This isn't needed for the core function of matching a node against a NameTest, since that operation ignores namespace prefixes. The detail here is that when we call <b>NamePool.allocate()</b>, we actually supply prefix, uri, and local-name, and we get back a 32-bit <b>nameCode</b> which uniquely represents this triple; the bottom 20 bits uniquely represent the local-name/uri pair, and it is these 20 bits (called the <b>fingerprint</b>) that are used in QName comparisons. The purpose of this exercise has nothing to do with making name comparisons faster; rather it is mainly concerned with saving space in the TinyTree. By packing the prefix information into the same integer as the local-name and URI, we save a few useful bits. But there are other ways of doing this without involving the NamePool; we could use the same few bits to index into a table of prefixes that is local to the TinyTree itself. There are of course a few complications; one of the benefits of the NamePool knowing about prefixes is that it can provide a service of suggesting a prefix to use with a given URI when the system is required to invent one: users like it when the prefix that emerges is one that has previously been associated with that URI by a human being. But there are probably less expensive ways of achieving this.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Let's suppose that we reduced the functionality of the NamePool to a single method, <b>allocate(QName) → int</b>. How would we then implement it to minimize contention? 
A simple and safe implementation might be:</p> <p style="margin-bottom: 0cm"><br /></p><pre><code>private final HashMap&lt;QName, Integer&gt; map = new HashMap&lt;QName, Integer&gt;();
private int next = 0;

public synchronized int allocate(QName q) {
    Integer n = map.get(q);
    if (n == null) {
        int m = ++next;
        map.put(q, m);
        return m;
    } else {
        return n;
    }
}
</code></pre> <p style="margin-bottom: 0cm"><br /></p> <p style="margin-bottom: 0cm">This still serializes all allocate operations, whether or not a new fingerprint is allocated. We can almost certainly do better by taking advantage of Java's concurrent collection classes, though it's not immediately obvious what the best way of doing it is. But in any case, if we can achieve this then we've reduced the NamePool to something much simpler than it is today, so optimization becomes a lot easier. 
It's worth noting that the above implementation still gives us the possibility to discover the fingerprint for a known QName, but not to (efficiently) get the QName for a known fingerprint.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">To get here, we need to start doing two things:</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">(a) get prefixes out of the NamePool, and handle them some other way.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">(b) stop using the NamePool to discover the name associated with a known fingerprint.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">After that, redesign becomes relatively straightforward.</p> How long is a (piece of) string? tag:dev.saxonica.com,2015:/blog/mike//3.209 2015-02-09T14:06:09Z 2015-02-10T16:25:57Z As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount... Michael Kay <p><br /> As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount of detail and complexity here, but I thought it might be interesting to take one simple function (fn:string-length) as an example, to see where the complexity comes from and how it can be reduced.</p> <p>At first sight, fn:string-length looks pretty simple. How long is a (piece of) string? Just ask Java to find out: surely it should just map to a simple call on java.lang.String.length(). Well, no actually.</p> <p>If we look to the specification, there are two complications we have to deal with. Firstly, we are counting the number of Unicode characters, not (as Java does) the number of 16-bit UTF16 codepoints. In the case of surrogate pairs, one character occupies two codepoints, and that means that a naïve implementation of string-length() takes time proportional to the length of the string.</p> <p>Secondly, there are two forms of the string-length() function. With zero arguments, it's defined to mean string-length(string(.)). That's different from nearly all other functions that have 0-argument and 1-argument forms, where (for example) name() means name(.). Saxon handles functions like name() by converting them statically to name(.), and that conversion doesn't work in this case. To illustrate the difference, consider an attribute code="003", defined in the schema as an xs:integer. The function call string-length(@code) returns 1 (it atomizes the attribute to produce an integer, converts the integer to the string "3", and then returns the length of this string). But @code!string-length() returns 3 - the length of the string value of the attribute node.</p> <p>The other complexity applies specifically to string-length#0 (that is, the zero-argument form). Dynamic calls to context-dependent functions bind the context at the point where the function is created, not where it is called. Consider:</p> <p> &lt;xsl:for-each select="0 to 9"><br /> &lt;xsl:variable name="f" select="string-length#0"/><br /> &lt;xsl:for-each select="21 to 50"><br /> &lt;xsl:value-of select="$f()"/><br /> &lt;/xsl:for-each><br /> &lt;/xsl:for-each></p> <p>This will print the value "1" three hundred times. 
In each case the context item at the point where $f is bound is a one-digit integer, so $f() returns the length of that integer, which is always one. The context item at the point where $f() is evaluated is irrelevant.</p> <p>Now let's take a look at the Saxon implementation. There's a Java class StringLength which in Saxon 9.6 is about 200 lines of code (including blank lines, comments, etc), and this does most of the work. But not all: in the end all it does is to call StringValue.getStringLength(), which is what really does the work. Atomic values of type xs:string are represented in Saxon by an instance of the class StringValue, which encapsulates a Java CharSequence: often, but not always, a String. The reason for the encapsulating class is to provide type safety on methods like Function.call() which returns a Sequence; StringValue implements AtomicValue which implements Item which implements Sequence, so the XDM data model is faithfully represented in the Java implementation classes.</p> <p>In addition there's a class StringLengthCompiler which generates a bytecode implementation of the string-length function. This is another 60 or so lines.</p> <p>Some functions also have a separate streaming implementation to accept streamed input, and one or two (string-join() and concat(), for example) have an implementation designed to produce streamed output. That's designed to ensure that an instruction like &lt;xsl:value-of select="//emp/name" separator=","/&gt;, which compiles down to a call on string-join() internally, doesn't actually assemble the whole output in memory, but rather writes each part of the result string to the output stream as it becomes available.</p> <p>Since the introduction of dynamic function calls, many system functions have two separate implementations, one for static calls and one for dynamic calls. That's the case for string-length: the evaluateItem() method used for static calls is almost identical to the call() method used for dynamic calls. One reason this happened was a fear of performance regression that might occur if the existing code for static calls was generalized, rather than introducing a parallel path.</p> <p>In 9.6, the implementation of dynamic calls to context-dependent functions like string-length#0 is rather fudged. In fact, the expression string-length#0 compiles into a call on function-lookup("fn:string-length", 0). The implementation of function-lookup() keeps a copy of both the static and dynamic context at the point where it is called, and this is then used when evaluating the resulting function. This is vastly more expensive than it needs to be: for functions like string-length#0 where there are no arguments other than the context, the function can actually be pre-evaluated at the point of creation. In the new 9.7 implementation, the result of the expression string-length#0 is a function implemented by the class ConstantFunction, which encapsulates its result and returns this result when it is called. (It's not quite as simple as this, because the constant function also has to remember its name and arity, just in case the user asks.)</p> <p>The method StringValue.getStringLength() attempts to recognize cases where walking through the codepoints of the string to look for surrogate pairs is not actually necessary. In previous releases there was an extra bit kept in StringValue, set when the string was known to contain no surrogate pairs: so having walked the string once, it would never be done again.</p>
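<p>A minimal sketch of that earlier mechanism might look like this (the field names are invented for illustration; note that Character.codePointCount counts a surrogate pair as a single codepoint):</p> <pre><code>// Count Unicode characters rather than UTF-16 code units, remembering
// whether the walk found any surrogate pairs so that it need not be repeated.
private boolean knownNoSurrogates = false;

public int getStringLength() {
    CharSequence s = value;
    if (knownNoSurrogates) {
        return s.length();
    }
    int chars = Character.codePointCount(s, 0, s.length());
    knownNoSurrogates = (chars == s.length());
    return chars;
}
</code></pre>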
<p>In 9.6 this mechanism is replaced with a different approach: Saxon includes several implementations of CharSequence that maintain the value as an array of fixed-size integers (8-bit, 16-bit, or 32-bit, as necessary). If the CharSequence within a StringValue is one of these classes (known collectively as UnicodeString), then the length of the string is the length of the array. And when getStringLength() is called on a string the first time, the string is left in this form, in the hope that future operations on the string will benefit. Of course, this will in some cases be counter-productive (and there's a further refinement in the implementation, which I won't go into, that's designed to overcome this).</p> <p>There are a few other optimizations in the implementation of string-length() that are worth mentioning. Firstly, it's quite common for users to write</p> <p> &lt;xsl:if test="string-length($x) != 0"></p> <p>Here we don't need to count surrogate pairs in the string: the string is zero-length if and only if the underlying CharSequence is zero-length. Saxon therefore does a static rewrite of such an expression to boolean(string($x)). (If $x is statically known to be a string, the call string($x) will then be further rewritten as $x.)</p> <p>If string-length#1 is applied to a value that can be computed statically, then the string-length function is itself computed statically. (This optimization, for odd historical reasons, is often called "constant folding". It's possible only when there are no context dependencies.)</p> <p>During type-checking, the implementation of string-length#0 keeps a note of whether a context item is known to exist. This is used during byte-code generation; if it's known that the context item won't be absent, then there is no need to generate code to check for this error condition. It's through tiny optimizations like this that generated bytecode ends up being faster than interpreted code.</p> <p>In my current exercise refactoring the implementation of system functions such as string-length, I've been looking at how much of the logic is duplicated either across the different implementations of a single function (streamed and unstreamed, static and dynamic, bytecode and interpreted) or across the implementations of functions that have a lot in common (such as string(), string-length(), and normalize-space()). I've found that with the exception of the core code in StringValue.getStringLength, and the optimization of string-length()=0, everything else can be vastly reduced. In place of the original StringLength class, there are now two (inner) classes StringLength_0 and StringLength_1, each of which consists of a single one-line method. The code for generating bytecode can also be considerably simplified by achieving more reuse across different functions.</p> <p>The main essence of the reorganization is that the class StringLength (or rather, its two variants) are no longer Expressions; they are now Functions. Previously a call on string-length($x) compiled to an expression, held as a node on the expression tree. Now it compiles into two objects: a StringLength object which is a pure function, and a SystemFunctionCall object which is an expression that calls the function. The SystemFunctionCall object is generic across all functions, while the implementations of SystemFunction contain all the code that is specific to one function.</p>
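<p>To give a flavour of the new shape (a hypothetical sketch; the actual class and method signatures in Saxon 9.7 may differ):</p> <pre><code>// The function object knows nothing about the expression tree; the generic
// SystemFunctionCall expression invokes it for both static and dynamic calls.
public static class StringLength_1 extends SystemFunction {
    @Override
    public Sequence call(XPathContext context, Sequence[] arguments) throws XPathException {
        StringValue sv = (StringValue) arguments[0].head();
        return new Int64Value(sv.getStringLength());
    }
}
</code></pre> <p>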
This change was motivated primarily by the need to handle dynamic function calls (and hence first-class function objects) properly, but it has provided a stimulus for a refactoring that achieves much more than this.</p> <p>So, how long is a piece of string? At least we now know how to work it out more efficiently. Sorry this little yarn wasn't shorter.</p> Functions, Function Calls, Function Items tag:dev.saxonica.com,2015:/blog/mike//3.208 2015-02-01T14:05:36Z 2015-02-01T14:06:40Z XSLT and XQuery, in their 3.0 incarnations, have become fully functional programming languages, where functions are first class values that can be manipulated in the same way as other values, for example they can be bound to variables and... Michael Kay <p style="margin-bottom: 0cm">XSLT and XQuery, in their 3.0 incarnations, have become fully functional programming languages, where functions are first class values that can be manipulated in the same way as other values, for example they can be bound to variables and passed as arguments to other functions. So you would expect that functions play a pretty central role in the Saxon implementation. In this article I shall review how functions work within Saxon, and how I think this needs to change.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">One of the things that happens when features are added to a complex piece of software is that they tend to be added around the edges, rather than in the core. You can tell when something was added by how central it is to the class hierarchy. It shouldn't be that way, but there are good reasons why it happens. And when we look at how functions work in Saxon, that's what we see. For example, if we had always had dynamic function calls, then we would probably find that static function calls (where the function to be called is known at compile time) were treated simply as a special case of dynamic calls, much as ($x+1) is treated as a special case of ($x+$y). But because they came along later, dynamic calls are actually handled very differently.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I've been doing work recently on the design of Saxon's expression tree (the data structure produced by the compilation phase for use by the execution phase of stylesheet and query evaluation). One aspect of that has been working out how to write the expression tree to persistent storage, and then load it back again for execution. In doing this, I've been struck by the sheer number of different classes that are somehow related to functions, and by the lack of coherence between them.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The first thing we notice, which is in itself a bit smelly, is that there is no class called Function. In fact, for many functions, there is no object in Saxon that represents the function itself. For example, with system functions (recall that in XSLT 1.0 and XPath 1.0 that's the only kind there were), we have two things: a data table containing general information about functions, such as their type signatures and context dependencies, and a class SystemFunctionCall which actually represents a call to the function, not the function itself. The implementation of functions such as name() or substring() is in a class called Name or Substring which is a subclass of SystemFunctionCall. 
This in turn is a subclass of Expression, and as such it forms a node in the expression tree.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">This works fine for static function calls, but what happens to an expression such as substring#3 or true#0 or name#0? Answer: different things in each of these three cases.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For substring#3, this is a pure function: there are no context dependencies. We create something called a SystemFunctionItem, which is an Item (and therefore a Sequence), we wrap this inside a Literal, and we put the Literal on the expression tree. This is much the same as the way we handle a constant such as "London" or 39. Internally, the SystemFunctionItem contains a reference to an instance of the Substring class, which is where things get a bit peculiar, because Substring is designed to be a function call, not a function. In fact, Substring has a dual personality: it tries to act in both roles, depending on which methods you call. That's a hack, and like all hacks, it comes back to bite you.</p> <p style="margin-bottom: 0cm">For true#0 there's a further bit of hackery, because the function call true() doesn't actually generate a SystemFunctionCall; it generates a BooleanValue: it's treated as a constant, just like "London" or 39. But we have to support dynamic calls on true(), so we introduced a SystemFunctionCall called True to handle this case, even though it doesn't have the dual personality: it acts only as a function, never as a function call.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The name#0 function is different again, because it encapsulates context. It's defined to return the name of the node that was the context node at the point where the name#0 expression was evaluated. So this should remind us that name#0 is not a value but an expression: the function it represents has no static existence; it can only be created dynamically by evaluating the expression with a particular dynamic context. We solve this with another hack: for context-dependent function "literals" like name#0 or position#0, the compiler actually generates a call on function-lookup("name", 0), which is an expression rather than a value, and which has to be evaluated at run-time.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The SystemFunctionItem class implements an internal interface called FunctionItem, and as one might expect, a FunctionItem is an Item and therefore a Sequence: that is, it's a run-time value. Other subclasses of FunctionItem are used for calls to user-defined functions, Java extension functions, or constructor functions. But they are only ever used for dynamic calls, never for static calls.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Although there is no Function class, some functions are in fact represented as objects in their own right. The most important is UserFunction, which represents a user-written function in XSLT or XQuery. Another is InlineFunction (representing an anonymous inline function), and another is ExtensionFunctionDefinition, which represents a function whose implementation is provided in Java code: this is used both for user-written extension functions, and for many (but not all) vendor-supplied extension functions in the Saxon namespace. But these classes have nothing in common with each other; there is no common superclass. 
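<p style="margin-bottom: 0cm">To picture the dual personality, here is a schematic sketch (hypothetical names, heavily simplified, with the real three-argument substring() semantics glossed over); this is the shape of the problem, not Saxon's actual code:</p> <pre><code>class Context { /* the dynamic evaluation context (placeholder) */ }

abstract class Expression {                     // a node on the expression tree
    abstract Object evaluate(Context context);
}

abstract class SystemFunctionCall extends Expression {
    Expression[] arguments;                     // argument subexpressions
}

class Substring extends SystemFunctionCall {
    // Role 1: a static call site, evaluated as part of the tree
    Object evaluate(Context context) {
        String s = (String) arguments[0].evaluate(context);
        int start = (Integer) arguments[1].evaluate(context);
        return s.substring(start - 1);          // 1-based start, simplified
    }

    // Role 2: pressed into service as the function itself, invoked by
    // SystemFunctionItem for dynamic calls with pre-evaluated arguments
    Object call(Object[] args) {
        return ((String) args[0]).substring(((Integer) args[1]) - 1);
    }
}
</code></pre> <p style="margin-bottom: 0cm">One object, two calling conventions: the static machinery calls evaluate(), while dynamic calls reach call() via the wrapping SystemFunctionItem, which is exactly the kind of duality that comes back to bite.</p>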
This has the consequence that not only is there one set of machinery for static function calls and another quite different set for dynamic calls, but that in each case, there are many different flavours depending on what kind of function you happen to be calling.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Getting this right involves a cleaner data model. As always, a clean data model leads to cleanly structured code, and the data model should be designed to accurately reflect the specification, not for the convenience of the implementation. The specification says we have two objects of interest, a function, and a function call. Quite rightly, the spec no longer uses the term "function item" as something distinct from the function: a function is an item. There's also something called the "function implementation" which we should recognize, because two functions may share an implementation. For example, the expression name#0 returns a different function each time it is evaluated: these functions differ in the data they hold (a snapshot of the dynamic context), but they share the same implementation.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">We should be trying to move towards a structure where we have a hierarchy of subclasses of Function (for system functions, constructor functions, user functions, Java extension functions, etc); where the Function is an Item; where a Function may reference a FunctionImplementation to provide for sharing; and where a FunctionCall is as far as possible independent of what kind of function it is calling, but simply has two flavours for static function calls and dynamic function calls.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">It will be interesting to see how close we can get to that situation.</p><p style="margin-bottom: 0cm"><br /></p> Redesigning the Saxon expression tree tag:dev.saxonica.com,2014:/blog/mike//3.206 2014-11-11T23:45:00Z 2014-11-12T00:22:53Z I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as... Michael Kay I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as part of a compiled XSLT 3.0 package, so it would be nice to get the structure into better shape first.<div><br /></div><div>One of the problems is that there's 150-odd different implementation classes for nodes on the expression tree, and that's before you start counting individual function calls; and there's far too much duplication between these classes. Implementing a new kind of expression involves far too much code. In addition, there's far too much scope to get this code wrong, leading to obscure bugs when a new kind of expression gets involved in some particular optimization rewrite, such as function or variable inlining. There's a steady flow of bugs, perhaps a couple a month, caused by tree corruptions of one kind or another, and it would be nice to make the whole structure more robust.</div><div><br /></div><div>One thing I want to do is to be more systematic about the way in which static context information is held on the tree. At present the principle is that each expression saves that part of the static context that it thinks it might need. 
For example, some expressions need run-time access to the static base URI, some to the static namespace context, and some to the register of collation names. (This last case has been simplified in 9.6 because all collation names are now global to a Configuration, though the default collation can still vary from one expression to another.) A recent shock discovery is that in XPath 3.0, general comparisons (that is, the humble "=" operator) can depend on the namespace context - if one operand turns out to be untyped atomic and another is a QName, then the untyped atomic value needs to be converted to a QName using the in-scope namespaces. This means the namespace context needs to be saved with every general comparison expression, which is heavy stuff. Given that the static context is almost always the same for all expressions in a module, it would be much better, rather than saving what each expression thinks it might need, to save the changes to the static context on the tree, so that each expression can discover the whole static context by processing a set of deltas held with its ancestors in the tree.</div><div><br /></div><div>This leads to another point, which is that it would be nice to have parent pointers in the expression tree. At present you can navigate from a node to its children (that is, its subexpressions), but not in the other direction. Saxon gets around this by keeping a dynamic stack of expressions visited in some of its operations on the tree (such as type-checking and optimization), so the ancestor expressions are maintained dynamically during the recursive tree-walk rather than being maintained statically on the tree.</div><div><br /></div><div>In 9.6, mainly to support XSLT 3.0 streamability analysis, we introduced a generalized mechanism for obtaining the subexpressions of any node: the operands() method. This returns a set of Operand objects, each of which contains a pointer to the child expression itself, and also properties, in effect, of the parent-child expression relationship, such as the XSLT 3.0 posture and sweep properties, and the operand usage, used in the 3.0 general streamability rules. This mechanism has proved very successful (and not just for streaming) in enabling more generalized operations on the tree. But a limitation is that modifications to the tree, such as substituting one child expression for another (which is very common during type-checking and optimization), are still entirely ad hoc, and have to be managed independently by each class of expression node.</div><div><br /></div><div>As a first step in redesigning the tree for 9.7, I have extended the way in which we use Operand objects. As well as being used to navigate to subexpressions, they are also now used to modify subexpressions: the Operand object has a method setChildExpression() which can be used to replace the existing child expression with another. All structural changes to the tree are required to go via this method, which is enforced by encapsulating the reference to the child expression within the Operand object. 
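<div>A minimal sketch of that arrangement (hypothetical and much simplified, not Saxon's real code; it includes the "owner" reference whose housekeeping role is described next):</div><div><br /></div><pre><code>abstract class Expression {
    void resetCachedProperties() { /* e.g. drop the cached inferred type */ }
    void setLocation(Object loc) { /* keep location info for error reporting */ }
    Object getLocation() { return null; }
}

final class Operand {
    private final Expression owner;   // the parent node in the tree
    private Expression child;         // encapsulated: nobody else holds it

    Operand(Expression owner, Expression child) {
        this.owner = owner;
        this.child = child;
    }

    Expression getChildExpression() { return child; }

    // the single choke point for all structural changes to the tree
    void setChildExpression(Expression newChild) {
        if (newChild != child) {
            // housekeeping: propagate location information, and invalidate
            // properties cached on the owner (such as the inferred type)
            newChild.setLocation(owner.getLocation());
            owner.resetCachedProperties();
            child = newChild;
        }
    }
}
</code></pre><div><br /></div><div>Because the child reference is private to Operand, a rewrite cannot splice in a new subexpression while forgetting the housekeeping: the housekeeping is part of the only available mutator.</div>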
The Operand also holds a reference to its "owner" expression, so when a child expression is changed, the single setChildExpression() method can take responsibility for housekeeping such as making sure the child expression has location information for use in error reporting, and making sure that expression properties cached on the parent node (such as the inferred type) are invalidated and recomputed when the children of the expression change.</div><div><br /></div><div>This process is complicated by the fact that the nodes on the expression tree are highly diverse; in fact, they don't even all represent expressions. The tree also has to cater, for example, for XSLT patterns and XQuery FLWOR expression clauses.</div><div><br /></div><div>Making updates go through the Operand object enables many expressions to inherit a generic implementation of common rewrite methods such as typeCheck and optimize. For example, the default action of optimize is to call optimize() on each subexpression, and if any changes have occurred, replace the subexpression with its rewritten self. The redesign means that this "replace" operation can now be done in a generic way, meaning that the default optimize() method does not need to be tailor-made for each class of expression. The same is true of other methods such as promote().</div><div><br /></div><div>I was hoping that channelling all updates through Operand.setChildExpression() would also make it easy to maintain parent pointers in the tree. Unfortunately this is not the case. It's easy enough, when B is set as a child of A, for B's parent pointer to be updated to point to A. The problem arises when B's parent pointer was previously set to C: what happens to C's children? Can B be a child of both A and C simultaneously (making it not a tree at all)? It turns out that some rewrites on the tree involve creating a new structure over existing leaf nodes in the tree, which might then be discarded if not all the conditions for optimization are met. So we've updated the parent pointers in the leaf nodes to point to this new superstructure, which is then discarded, reverting to the original. It's difficult then to make sure that the parent pointers are reset properly when the rewrite is abandoned. It can be done in an ad-hoc way, of course, but we're looking for something more robust: an API for tree rewriting that doesn't allow the tree to become inconsistent. This is proving hard to achieve. We may have to resort to a different approach, doing a bulk sweep of the tree to set all parent pointers before the typecheck and optimize operations on each template or function. This is intellectually unsatisfactory because it means accepting that the tree could be temporarily inconsistent in the middle of such an operation, but it may be the best option available.</div><div><br /></div><div>This is work in progress, but it's looking promising. I appreciate that for most users it's about as interesting as the repair works to the sewers beneath your street. Hopefully, though, the bottom line will be fewer product bugs, and more ability to take the product forward into new areas.</div> Sharing key indexes tag:dev.saxonica.com,2014:/blog/mike//3.205 2014-06-04T23:10:24Z 2014-06-04T23:30:22Z For ever and a day, Saxon has tried to ensure that when several transformations are run using the same stylesheet and the same source document(s), any indexes built for those documents are reused across transformations. This has always required some... 
Michael Kay For ever and a day, Saxon has tried to ensure that when several transformations are run using the same stylesheet and the same source document(s), any indexes built for those documents are reused across transformations. This has always required some careful juggling of Java weak references to ensure that the indexes are dropped from memory as soon as either the executable stylesheet or the source document is no longer needed.<div><br /></div><div>I've now spotted a flaw in this design. It wasn't reported by a user, and it didn't arise from a test case; it simply occurred to me as a theoretical possibility, and I have now written a test case that shows it actually happens. The flaw is this: if the definition of the key includes a reference to a global variable or a stylesheet parameter, then the content of the index depends on the values of global variables, and these are potentially different in different transformations using the same stylesheet.</div><div><br /></div><div>Fixing this is not easy, and poses some interesting dilemmas, given that no-one has hit the problem in a dozen years and it's quite likely that no-one ever will. It would be nice to roll back my discovery of the problem and pretend that it doesn't exist, but that's not my style.</div><div><br /></div><div>One solution would be to abandon the strategy of reusing indexes. It's likely that very few users have workloads that benefit from this strategy. On the other hand, those that do could see a big performance hit. Removing a powerful optimization because of a problem that doesn't apply to those workloads is not very friendly.</div><div><br /></div><div>The right solution is to identify those key definitions that allow indexes to be shared and distinguish them from those that don't, and then manage the two kinds of index differently. Both the identification and the management can be done, but they introduce more complexity and hence more risk of bugs, and the whole thing feels like overkill given that we don't know of any users who have the problem.</div><div><br /></div><div>Civil engineers design structures to withstand storms that are likely to happen more than once in N years. It would be nice if we could adopt a similar approach here. But software doesn't seem to be like that; it seems to follow Murphy's law, which states that if something can go wrong, it will.</div><div><br /></div><div>So with heavy heart, I think that there's no alternative to "doing the right thing", with all the complexity it entails.</div><div><br /></div><div>Reference: Saxon bug 1968; W3C test case key-085a/b.</div> Another regex rewrite tag:dev.saxonica.com,2014:/blog/mike//3.204 2014-02-11T07:52:21Z 2014-02-11T07:54:42Z In Saxon 9.5, a new regular expression engine was introduced. This was obviously a risky thing to do, but I'm not averse to such risks, and although there is often some short-term inconvenience to users, I believe such things... Michael Kay <p style="margin-bottom: 0cm"><span style="font-size: 1em;">In Saxon 9.5, a new regular expression engine was introduced. This was obviously a risky thing to do, but I'm not averse to such risks, and although there is often some short-term inconvenience to users, I believe such things usually bring long-term benefits. The risks relate to both functionality and performance. 
In both cases we were able to minimize the risks thanks to a good collection of test material, but we always knew that contact with real life problems might throw up things that the test suite didn't.</span></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Why did we need the new engine? There were several factors. On Saxon-CE, the GWT technology didn't offer access to the JDK regex library, only to the Javascript regex engine, which has different syntax and semantics, and no support for non-BMP Unicode characters. We could have translated XPath regexes to Javascript regexes the way we were doing for the JDK, but the expansions to handle non-BMP characters are very cumbersome. (We could also have abandoned W3C regex conformance, but that's not my style.) For "server-side Saxon", I had been impatient with the existing design for some time. There are bugs I reported in the JDK regex engine 5 years ago which Sun/Oracle have shown no inclination to fix; the approach of translating one regex dialect to another is slow and clunky; and the JDK wasn't keeping pace with new Unicode versions. I felt it was time we brought the regex engine in-house.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">I looked around for an open source regex engine suitable for forking, and chose the Apache Jakarta engine. Unlike some other options the code was clear and well commented, the performance seemed reasonable, and the license was compatible. I made all the changes needed to support the XPath regex syntax and semantics, and a few other changes including some optimizations, and it passed all the tests. In fact, it has proved very reliable; there have been no bug reports against its functionality.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">But there have been two performance glitches. </p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><ul><li><span style="font-size: 1em;">One is the way it handles integer repetition bounds such as a{3,7}. The algorithm it uses for this is the same as Saxon uses in its schema processor, which is described in a paper by Thompson and Tobin [1]; it essentially translates such an expression to aaaa?a?a?a?, and then generates a finite state machine accordingly. This becomes problematic when the integer bounds start to become significantly large; the size of the finite state machine and the time taken to generate it both increase rapidly with size. (For the technically minded, I'm not sure how rapidly, but even if it's only linear, it's a significant problem.)</span></li></ul><br /><ul><li><span style="font-size: 1em;">The second problem is the handling of backtracking. The Jakarta engine, like most regex engines, relies on backtracking to handle ambiguous expressions. Backtracking relies on maintaining a stack so you can go back to where you were. In Jakarta, this data was kept on the Java program stack; each checkpoint was represented by a recursive function call. In many cases each character position in the input string is a potential checkpoint, so the depth of recursion was essentially equal to the length of the input string. That puts a limit of around 1K to 2K on the size of the input, which is not really acceptable.</span></li></ul></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">For the 9.5 release I mitigated this problem by introducing an optimization: in cases where repetition was unambiguous, the recursive calls were eliminated. 
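<p style="margin-bottom: 0cm">Returning for a moment to the first of the two glitches: a rough sketch of the Thompson/Tobin-style expansion (a hypothetical helper, not the actual Jakarta or Saxon code) shows why large bounds are painful:</p> <pre><code>// a{min,max} becomes min required copies of the atom followed by
// (max - min) optional ones, so the state machine grows with the bounds
static String expandBoundedRepeat(String atom, int min, int max) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i &lt; min; i++) sb.append(atom);               // required
    for (int i = min; i &lt; max; i++) sb.append(atom).append('?'); // optional
    return sb.toString();
}
// expandBoundedRepeat("a", 3, 7) returns "aaaa?a?a?a?"
</code></pre>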
The unambiguous-repetition optimization catches many cases like A*B, where you know that you have to keep reading A's until you hit a B, and when you do hit a B you know you won't have to try again from somewhere in the middle of the sequence of A's. But unfortunately, ambiguous expressions are common (an example would be .*B), and we have to handle them.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">So I did something that I very rarely do; I studied how the JDK regex engine avoided this problem, by following its activity through the debugger. I was surprised to see that the JDK engine doesn't actually build a finite state machine. Instead it uses a pure interpreter pattern: it builds an abstract syntax tree representing the grammar of the regex, and then evaluates the regex by direct interpretation of the constructs on this tree. The maximum recursive depth with this approach is represented by the depth of the abstract syntax tree (that is, it depends on the regex but not on the input string), and the data needed for backtracking is held on the heap rather than on the Java stack.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">Over the last week I have been rewriting the back end of the (ex-)Jakarta regex engine to use a similar design. It's passing all the functionality tests; I still have some performance testing to do, but it's looking good. The front-end code, that is the regex parser, is largely unchanged, but it now generates an operation tree rather than an FSM. I retained many of the features of the existing design, such as the use of a variety of implementations of integer sets and predicates to support the different kinds of character classes, but the operations have all been re-implemented to work interpretively. I departed from the JDK design by using an approach similar to Saxon's XPath interpreter, in that the implementation of each operation returns an iterator over all the strings matched by the relevant expression. The key method supported by each operation is</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm"><b>IntIterator iterateMatches(final REMatcher matcher, final int position)</b></p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">which is passed an REMatcher (containing all necessary information about the match operation, including the input string and regex flags, plus support for captured groups and backreferences), and a start position for matching; it returns an IntIterator representing a sequence of integers which are the possible positions where a match can end.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The class handling a repeat operation such as (a|aa)* implements this operation by repeated calls on the underlying choice operation. The first choice operation will return an iterator over the two possible end positions (1,2); the next time round there are four possible positions (1,2,2,3) (there is no attempt to eliminate duplicates), and so on. The method maintains a stack of IntIterator objects representing these successive calls; initially only the first hit from each iterator is used, but when backtracking is necessary, the stack is popped, and the iterator now at the top of the stack is called to deliver its next match.</p> 
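<p style="margin-bottom: 0cm">To make the protocol concrete, here is a hypothetical sketch (not the real Saxon classes: plain JDK types stand in for IntIterator and REMatcher, and matches are enumerated eagerly where the real engine is lazy):</p> <pre><code>import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

// each operation reports the positions at which a match could end,
// in the order a backtracking search would try them
interface RegexOp {
    PrimitiveIterator.OfInt iterateMatches(String input, int position);
}

// (a|aa): from position p, a match may end at p+1 or p+2
final class AorAA implements RegexOp {
    public PrimitiveIterator.OfInt iterateMatches(String input, int position) {
        IntStream.Builder ends = IntStream.builder();
        if (input.startsWith("a", position))  ends.add(position + 1);
        if (input.startsWith("aa", position)) ends.add(position + 2);
        return ends.build().iterator();
    }
}

// X*: enumerate end positions depth-first; a lazy version would keep a
// stack of the operand's iterators and pop it in order to backtrack
final class Repeat implements RegexOp {
    private final RegexOp operand;
    Repeat(RegexOp operand) { this.operand = operand; }

    public PrimitiveIterator.OfInt iterateMatches(String input, int position) {
        IntStream.Builder ends = IntStream.builder();
        collect(input, position, ends);
        return ends.build().iterator();
    }

    private void collect(String input, int position, IntStream.Builder ends) {
        ends.add(position);                      // zero repetitions always match
        PrimitiveIterator.OfInt it = operand.iterateMatches(input, position);
        while (it.hasNext()) {
            int next = it.nextInt();
            if (next > position) {               // guard against empty matches
                collect(input, next, ends);
            }
        }
    }
}
</code></pre> <p style="margin-bottom: 0cm">On the input "aaa" starting at position 0, new Repeat(new AorAA()) reports the end positions 0, 1, 2, 3, 3, 2, 3, duplicates included, just as described above.</p>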
<p style="margin-bottom: 0cm">Saxon continues to take advantage of the optimization that recognizes unambiguous repetitions. With patterns such as A*B, or A{5}A, there is no need for backtracking, so the stack of iterators is not needed, and only the first match from each iterator is taken.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">There is scope for further optimization here. Some regular expression engines such as PCRE are known to handle highly-ambiguous expressions much more efficiently than this crude backtracking approach, mainly by eliminating the duplication that arises when a backtrack attempt ends up doing the same matches at the same positions as a previous attempt. Such improvements lie in the future; for the moment, being as good as the Java engine in performance, and better in features such as Unicode support and buglessness, is enough to declare success.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">The new code is currently operational on the Saxon 9.6 development branch. In due course it will be retrofitted to Saxon-CE, and probably also to a Saxon 9.5 maintenance release. Probably only 10% of the original Jakarta code remains, but that's probably enough that I can't get rid of the Apache license that goes with it, for those who care about these things.</p> <p style="margin-bottom: 0cm"><br /> </p> <p style="margin-bottom: 0cm">[1] Using Finite-State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking. Henry Thompson and Richard Tobin. XML Europe 2003.</p>