Sausalito's XQuery Data Definition Facility

Sausalito extends the XQuery language with support for collections and indexes. This is accomplished via a combination of new Prolog declarations, new kinds of expressions or extensions to existing expressions, and new built-in functions. Furthermore, both the static and the dynamic contexts are extended with new components that store information about collections and indexes. Collectively, all these extensions are called the XQuery Data Definition Facility (XQDDF). XQDDF is an extension of XQuery 1.1, XQuery Update Facility, and XQuery Scripting Facility.

As part of the XQDDF implementation, Sausalito includes a new "built-in" module that contains the declarations of all the new built-in functions introduced by XQDDF to manipulate collections and indexes. The module is located at <zorba-root-directory>/modules/com/zorba-xquery/www/modules/xqddf.xq and its namespace is "http://www.zorba-xquery.com/modules/xqddf". In the remaining of this document, we will refer to this module as the "xqddf" module, and the functions in it as the "xqddf" functions. As usual, the xqddf module must be imported by any other module that wants to invoke any xqddf functions.

1 Collections in Sausalito

The current W3C XQuery specification defines collections simply as sequences of nodes that are accessible via the fn:collection function. Everything else about collections is implementation-dependent. For example, XQuery programmers have no direct way to control the contents or the life-cycle of collections. XQDDF attempts to close this gap by making explicit most of the collection-related issues that are left as implementation-dependent by W3C. This way, XQuery developers can manipulate collections directly from XQuery programs, rather than relying in some external hosting environment or application. As explained below, collections as defined by XQDDF have some important differences from collections as defined by the W3C specification. As a result, in the remainder of the this document we will use the terms "XQDDF collection" and "W3C Collection" to make clear the distinction between the two kinds of collections. We will also use the term "document" to refer to any XML tree whose root is a parent-less node of any kind (not necessarily a document node).

An XQDDF collection

is defined as an ordered set of documents that is disjoint from any other collection and is uniquely identified by a QName. Furthermore, with respect to document order, the relative order of two nodes belonging to different documents within the same collection is defined to be the same as the relative position of their containing documents within the collection. We will say that a node belongs to a collection if it is inside a document that belongs to that collection.

Like a W3C collection, an XQDDF collection can also be viewed as a sequence of nodes: it is the sequence containing the root nodes of the documents that belong to the collection (and as we will see later, the function xqddf:collection returns exactly this sequence of nodes). However, even when viewed as sequences of nodes, XQDDF collections differ from W3C collections in the following ways:

For brevity, in the remaining of this document we will the term "collection" to mean XQDDF collection. For backward compatibility with the W3C XQuery specification, Sausalito retains some basic support for W3C collections (see section 1.7 W3C Collections). However, users are encouraged to use XQDDF collections instead.

Sausalito supports five kinds of operations on collections: collection declaration, collection creation, collection deletion, collection update, and node retrieval. These are explained briefly in the following simple example. Full details for each operation are provided in the subsequent chapters.

1.1 Collections in action - A simple example

Let us assume an application that models a news organization. The application models its data as XML documents grouped into collections of logically related entities. In this example, we show how three such collections may be created and used; the first collection contains employee data, the second contains news articles, and the third contains information about the months of the year (e.g., the name, number of days, and fixed holidays for each month).

Before a collection can be created, it must be declared. A collection declaration describes the collection by providing a unique name for it and specifying certain properties for the collection itself and for the documents in the collection. As explained in 1.2 Collection Declaration, collections must be declared inside library modules. In terms of the XQuery language, collection declarations become part of a module's static context.

In this example, the declarations are placed inside the "news-data" library module (shown below). The declarations assign the names news-data:employees, news-data:articles, and news-data:months to the three collections, respectively. Documents in both the employees and the months collections are assumed to have a well-known structure, which is reflected in an XML schema ("news-schema"). The schema declares two global elements for employees and months respectively. Accordingly, the collection declarations for employees and months specify that their root nodes are elements whose name and type matches the name and type of the corresponding global element declarations in "news-schema". In contrast, articles may come from various sources (including external ones), and as a result, article documents do not have any particular schema. Therefore, the declaration for the articles collection specifies node() as the type of the root nodes. Both employee and article documents may be updated during their lifetime. Instead, the months-related information is fixed (can not change), so the nodes of the months collection are declared as "read-only". Furthermore, the collection itself is declared "const", meaning that no months may be added to or deleted from this collection after it is created and initialized. Finally, we want the order of the month documents within their containing collection to be the same as the actual order of the months within the year. To achieve this, we have to declare the collection as "ordered", so that when we later insert the month documents in the collection, the system will store and return them in the same order as their insertion order. In contrast, the position of employees or articles inside their respective collections does not have any special meaning for the application, so the corresponding declarations do not specify any ordering property. This allows the system to store and access the contents of these collections in what it considers as the most optimal order.

  (: The "news-data" Library Module :)

  module namespace news-data = "http://www.news.org/data";

  import schema namespace news-schemas = "http://www.news.org/schemas";

  declare collection news-data:employees as schema-element(news-schema:employee)*;

  declare collection news-data:articles as node()*;

  declare const ordered collection news-data:months as schema-element(news-schema:month)* with read-only nodes;

  declare variable $news-data:employees := xs:QName("news-data:employees");
  declare variable $news-data:articles := xs:QName("news-data:articles");
  declare variable $news-data:months := xs:QName("news-data:months");

Having been declared, the collections can now be created. Collection creation is illustrated by the "admin-script-1" script shown below. First, the collection descriptions must be made visible to the script. This is done by importing the "news-data" library module that contains the collection declarations. Then, the collections are created by calling the xqddf:create-collection function. There are two versions of this function: the first takes a QName as input and the second takes both a QName and a node-producing expression. In the first version, an empty document container is created by Sausalito's storage system and registered inside a collections table that maps collection names to document containers. In the second version, the given expression is evaluated first, and (deep) copies are made of the nodes in the result sequence. This way, a sequence of distinct documents is produced. This is called the "insertion sequence". Then, as in the first version of the function, the document container is created and registered. Finally, the container is populated with the documents in the insertion sequence. In "admin-script-1", this second version is used to create and initialize the months collection. In fact, months must be initialized during creation because it is a constant collection, so no documents can be added to it later. The months are inserted in the collection in the order from January to December, and since the collection was declared as ordered, this order is preserved by the associated document container.

  (: "admin-script-1" :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import schema namespace news-schemas = "http://www.news.org/schemas";

  import module namespace news-data = "http://www.news.org/data";

  xqddf:create-collection($news-data:employees);

  xqddf:create-collection($news-data:articles);

  xqddf:create-collection($news-data:months, (<month name="Jan">...</month>, ..., <month name="Dec">...</month>)));

The next script ("user-script-1") shows how collections may be used. First the necessary modules and schemas are imported. Next, the employees collection is populated using the xqddf:insert-nodes function. The first argument to this function is the QName of a collection, and the second is a node-producing expression (called the source expression). The QName is used to lookup the collection declaration and the collection itself (i.e., its document container). Then, the nodes produced by the source expression (source nodes) are copied and the copies are added to the document container, making sure that the actual type of each node matches the static type found in the collection declaration. Copying the source nodes (and their sub-trees) guarantees that the nodes in the insertion sequence are indeed parent-less nodes that do not belong to any other collection already and are distinct from each other. Notice that the need to validate the root nodes against the type specified in the collection declaration is the reason why the "news-schema" must be imported, even though no type defined by the schema is referenced explicitly in the query.

In this example, the employees collection is populated by a single call to the xqddf:insert-nodes function, whose source expression is a concatenation of explicitly constructed documents. The articles collection is populated using the xqddf:insert-nodes function as well, but in a slightly different fashion: The article documents are assumed to exist already, either as text files in the local filesystem, or at various web sites. As a result, the articles collection is populated via a concatenation of xqddf:insert-function calls, each reading and parsing a single XML document and inserting the generated XML tree in the collection. Although there is one function call per article, the articles will be inserted all together in an atomic (all-or-nothing) operation, when the ";" at line 16 is processed. This is because, as explained in 1.5 Updating Collections, the xqddf:insert-nodes function (and all other xqddf functions that create, delete, or update collections) is an //updating function//, that is, rather than applying the insertion immediately, it produces an updating primitive that becomes part of a pending updates list (PUL), which is applied atomically when the next ";" appears in the program.

After populating the two collections, "user-script-1" runs a query expression that uses the xqddf:collection function to access their root nodes. The expression returns, for each journalist, the articles authored by that journalist ordered by their date.

Finally, "user-script-1" uses the xqddf:remove-nodes function to remove from the articles collection all articles that were published before 2000. Like xqddf:insert-nodes, xqddf:remove-nodes takes as input the QName of a collection and a node-producing source expression. The source nodes must be parent-less nodes that belong to the collection. The function looks up the collection declaration and the collection container, and removes the source nodes from the collection container.

Notice that the whole user-script-1 is organized as a concatenation of three block expressions. Only the second block produces an actual result, the other two are purely updating blocks (their result is the empty sequence). Writing the query as a concatenation of blocks (instead of a single sequential expression), allows the result of the script to be the concatenation of the results of each block (instead of the result of just the last expression in the sequential expression).

  (: "user-script-1":)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import schema namespace news-schemas = "http://www.news.org/schemas";

  import module namespace news-data = "http://www.news.org/data";

  block
  {
    xqddf:insert-nodes($news-data:employees, (<employee id="100">....</employee>, ..., <employee id="500">...</employee>);

    (xqddf:insert-nodes($news-data:articles, doc("article1.xml)/article),
     xqddf:insert-nodes($news-data:articles, rest:get("http://www.reuters.com.xml/article234.xhtml")//article),
     ....,
     xqddf:insert-nodes($news-data:articles, doc("article100.xml)/article));
  },
  block
  {
    for $emp in xqddf:collection($news-data:employees)[./position/@kind eq "journalist"]
    let $articles := for $art in xqddf:collection($news-data:articles)[.//author//name eq $emp/name]
                     order by $art//date
                     return $art
    return <result>{$emp}<articles>{$articles//title}</articles></result>,
  },
  block
  {
    xqddf:delete-nodes($news-data:articles, xqddf:collection($news-data:articles)[.//date lt xs:date("01/01/2000")];
  }

We conclude this example with the "admin-script-2" script, which simply destroys the collections using the xqddf:delete-collection function. The function de-registers the collection from the collections table, destroys all the documents in the collection and all the indexes associated the collection, and finally destroys the document container itself.

  (: admin-script2 :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data";

  xqddf:delete-collection($news-data:employees);

  xqddf:delete-collection($news-data:articles);

  xqddf:delete-collection($news-data:months);

1.2 Collection Declaration

  Prolog ::= ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* 
             ((VarDecl | ContextItemDecl | FunctionDecl | OptionDecl | CollectionDecl | IndexDecl | ICDecl) Separator)*

  CollectionDecl ::= 'declare' CollectionProperties 'collection' QName CollectionTypeDecl? ('with' NodeModifier 'nodes)?

  CollectionProperties ::= ('const' | 'mutable' | 'append-only' | 'queue' | 'ordered' | 'unordered')*

  CollectionTypeDeclaration ::= 'as' KindTest OccurenceIndicator?

  NodeModifier ::= ('read-only' | 'mutable')

Collections are defined by collection declaration statements, which specify a unique name for a collection as a QName, a set of collection properties, the collection's static type, and whether the collection nodes can be modified or not. Syntactically, collection declarations are placed inside module prologs. The Prolog syntax is extended accordingly, as shown above. An additional constraint (not expressible syntactically) is that only library modules may contain collection declarations [err:XDST0003]. This is because library modules can be shared among queries, whereas if a collection was declared inside a main module, then every other query that would like to use this collection would have to redeclared it in its main module. Worse, allowing collection declarations in "user" queries can lead to "data leaks": a collection declared and created by a user query and not destroyed by the same query will be unknown to the rest of the application, and may stay in the database indefinitely. In contrast, library modules containing XQDDF declarations are expected to be under the jurisdiction of a system administrator who makes sure that queries see the data that they must see, and no data inconsistencies or leaks can arise.

To accommodate collection declarations, XQDDF extends the static context with a component called the statically known collections. This is a map whose entries associate an expanded QName with an implementation-dependent representation of the information contained in a collection declaration with the same QName. The effect of a collection declaration is to add an entry to the statically known collections of the module containing the declaration. If the expanded QName of the collection is equal (as defined by the eq operator) to the expanded QName of another collection in the statically known collections of the same module, a static error is raised [err:XDST0001]. Like variables and functions, the statically known collections of a module that is imported by another module are copied into the statically known collections of the importing module. It is a static error [err:XDST0002] if the expanded QName of a collection declared in an imported module is equal (as defined by the eq operator) to the expanded QName of a collection declared in the importing module or in another imported module (even if the declarations are consistent).

Sausalito defines two collection properties: update mode (with possible values 'const', 'mutable', 'append-only', or 'queue') and ordering mode (with possible values 'ordered' or 'unordered'). The syntax allows the values for these properties to be listed in any order or not be specified at all. If not specified, the default values for update and ordering mode are 'mutable' and 'unordered', respectively. It is a static error [err:XDST0004] if a collection declaration contains more than one value for the same property. An ordered collection is a collection into which the ordering of documents is assumed to be meaningful for the application, and as a result, programmers can explicitly control the placement of documents via appropriate updating functions. In contrast, the ordering of documents inside unordered collections is implementation dependent, but stable (see 1.4 Accessing Collections for details). A constant collection is one that is created with an initial set of documents and does not allow any subsequent insertions to or deletions from this initial set. An append-only collection does not allow any deletions at all and restricts insertions to take place at the "end" only, i.e., all new documents must be inserted after all existing ones. This implies a user-visible document ordering, and as a result, an append-only collection must also be declared as ordered [err:XDST0005]. A queue collection forbids both insertions and deletions in/from the "middle"; only documents at the front of the collection may be deleted, and new documents can be inserted only at the collection's end. Like append-only, queue collections must be declared as ordered [err:XDST0005].

In addition to the two properties described above, a collection declaration also specifies the collection static type, i.e., the static type for the result of the xqddf:collection function. This is specified as a sequence type that adheres to the syntax and semantics of a KindTest plus an (optional) occurrence indicator. If no static type is specified, it is assumed to be document-node(element(*, xs:untyped))*. The static type without the occurrence indicator is the static type of the collection's root nodes.

The final component of a collection declaration is the document update mode. Possible values for this property are 'read-only' or 'mutable', with 'mutable' being the default value if none is specified. If the document update mode for a collection is 'read-only', then an error is raised [err:XDDY0010] every time a node of the collection appears as the target node of an updating expression; otherwise, no such error is raised.

1.3 Creating Collections

As explained already, collections are just sets of parent-less XML trees (called "documents" in XQDDF terminology). In terms of the XQuery language, these sets "live" in the dynamic context. In particular, XQDDF extends the dynamic context with a component called the available collections. This is a map whose entries associate the expanded QName of a collection with the collection's document set. If an entry for a collection appears in the available collections of a module, the collection is said to be available to that module.

In practice, the available collections component is implemented by Sausalito's storage system. To begin with, each document set is implemented by some appropriate data structure that acts as a document container. The description of potential data structures is beyond the scope of this document, but the choice will, in general, depend on the properties of the collection and the contained documents. In addition to managing the document containers, the store maintains a collections table, which maps collection names to document containers. The collections table is accessible by all queries, so once an entry is added to the table, the associated collection is assumed to be available to every query and every module that participates in the execution of that query.

Creation of a collection involves creating an initially empty document container and "registering" that container in the collections table. XQDDF provides two functions for creating collections. Both are updating functions, so instead of actually performing the updates, they generate pending update primitives that become part of a pending update list (PUL) to be applied at a later time (see 4 Extensions to the XQUF updates). The functions and their associated update primitives are described below:

  declare updating function xqddf:create_collection($collectionName as xs:QName)

  upd:createCollection($collectionName as xs:QName).

The function is evaluated as follows:

The update primitive is applied as follows:

The second create-collection function creates the collection and populates it with an initial set of trees.

  declare updating function xqddf:create_collection($collectionName as xs:QName, $nodes as node()*)

The function is evaluated as follows:

The upd:createCollection primitive was described above. The upd:insertNodesFirst will be described in 1.5 Updating Collections, in the context of the xqddf:insert-nodes-first function.

1.4 Accessing Collections

To access the root nodes of a collection, XQDDF provides the xqddf:collection function.

  declare function xqddf:collection($collectionName as xs:QName) as node()*

The function is evaluated as follows:

Another non-updating XQDDF function that accesses a collection implicitly, is the index-of function:

  declare function xqddf:index_of($collectionName as xs:QName, $node as node()) as xs:integer

The function is evaluated as follows:

1.5 Updating Collections

A collection update is an operation that either inserts or deletes a number of root nodes (and their subtrees) to/from a collection. XQDDF provides 5 functions that insert root nodes, and another 5 functions that delete root nodes. All of these functions are //updating functions// (in the terminology of the XQUF). As a result, rather than applying the update immediately, they produce an updating primitive that becomes part of a pending updates list (PUL), which is applied atomically when the next ";" appears in a script. The signature and semantics of each function and its associated update primitive are described in this section. The order in which the various update primitives are applied and constraints in how update primitives may be combined in a PUL are described in 4 Extensions to the XQUF updates.

  declare updating function xqddf:insert-nodes($collectionName as xs:QName, $nodes as node()*)

  upd:insertIntoCollection($collectionName as xs:QName, $nodes as node()*)

The insert-nodes function is evaluated as follows:

The update primitive is applied as follows:

  declare updating function xqddf:insert-nodes-first($collectionName as xs:QName, $nodes as node()*)

  upd:insertFirstIntoCollection($collectionName as xs:QName, $nodes as node()*)

The insert-nodes-first function is evaluated as follows:

The update primitive is applied as follows:

  declare updating function xqddf:insert-nodes-last($collectionName as xs:QName, $nodes as node()*)

  upd:insertLastIntoCollection($collectionName as xs:QName, $nodes as node()*)

The insert-nodes-last function is evaluated the same way as the insert-nodes-first function except:

The update primitive is applied as follows:

  declare updating function xqddf:insert-nodes-before($collectionName as xs:QName, $target as node(), $nodes as node()*)

  upd:insertBeforeIntoCollection($collectionName as xs:QName, $target as node(), $nodes as node()*)

The insert-nodes-before function is evaluated as follows:

The update primitive is applied as follows:

  declare updating function xqddf:insert-nodes-after($collectionName as xs:QName, $target as node(), $nodes as node()*)

  upd:insertAfterIntoCollection($collectionName as xs:QName, $target as node(), $nodes as node()*)

The insert-nodes-after function is evaluated the same way as the insert-nodes-before function except:

The update primitive is applied as follows:

  declare updating function xqddf:delete-nodes($collectionName as xs:QName, $nodes as xs:node()*)

  upd:deleteFromCollection($collectionName as xs:QName, $nodes as xs:node()*)

The delete-nodes function is evaluated as follows:

The update primitive is applied as follows:

  declare updating function xqddf:delete-nodes-first($collectionName as xs:QName, $number as xs:unsignedLong)

The delete-nodes-first function is evaluated as follows:

  declare updating function xqddf:delete-node-first($collectionName as xs:QName)

The delete-node-first function is a special case of the delete-nodes-first function. Specifically, delete-node-first($collectionName) is equivalent to delete-nodes-first($collectionName, 1).

  declare updating function xqddf:delete-nodes-last($collectionName as xs:QName, $number as xs:unsignedLong)

The delete-nodes-last function is evaluated as follows:

  declare updating function xqddf:delete-node-last($collectionName as xs:QName)

The delete-node-last function is a special case of the delete-nodes-lasst function. Specifically, delete-node-last($collectionName) is equivalent to delete-nodes-last($collectionName, 1).

1.6 Destroying Collections

To destroy a collection, XQDDF provides the delete-collection updating function. The function itself and its associated update primitive are described below.

  declare updating function xqddf:delete-collection($collectionName as xs:QName)

  upd:deleteCollection($collectionName as xs :QName)

The delete-collection function is evaluated as follows:

The update primitive is applied as follows:

1.7 W3C Collections

2 Indexes in Sausalito

Sausalito introduces support for value indexes. A value index is a set whose contents (called index entries) are defined by a "domain" expression and a number of "key" expressions. Informally, an index is created by evaluating its domain expression first, resulting in a sequence of nodes (called the index domain sequence). Then, for each node D in the domain sequence, the key expressions are evaluated with node D serving as their context node. Each key expression must not return more than one value. If a value returned by a key expression is not atomic, it is converted to an atomic value via atomization. Thus, for each domain node D, an associated key tuple K of N atomic values is constructed, where N is the number of key expressions. The purpose of the index is to map key tuples to domain nodes. In general, the relationship from key tuples to domain nodes is 1:N (i.e., a key tuple may be produced from more than one domain nodes). As a result, each index entry is a pair consisting of a key tuple and the set of domain nodes that produced the key tuple.

Sausalito supports five index-related operations: index declaration, index creation, index deletion, index probing and index maintenance. These are explained briefly in the following simple example. Full details for each operation are provided in the subsequent chapters.

2.1 Indexes in action - A simple example

Let us assume the same news application we used in 1.1 Collections in action - A simple example. In this section we will how to create and use indexes on the collections of the news organization. First, let us assume that each employee has a city where he/she is currently stationed at. We want to create an index that maps city names to the employees that are stationed in those cities. The index will contain one entry for each city where at least one employee is stationed in. Let us also assume that we want to search for journalists based on the number of articles they have written. For this, we will create an index that maps article counts to the employees who are journalists and have produced that number of articles.

Before an index can be created, it must be declared. An index declaration describes the index by providing its domain expression, its key expressions, and certain index properties; it also specifies a name for referencing the index in subsequent operations. Like collections, indexes must be declared inside the prolog of library modules. In terms of the XQuery language, index declarations become part of a module's static context.

In this example, the index declaration is placed inside the "news-data" library module shown below (same as the module we saw in 1.1 Collections in action - A simple example, except for the additional index declarations). The first index declaration assigns the name news-data:CityEmp to the index. It uses the "on nodes" and "by" keywords to specify the domain and key expressions respectively. The "as" keyword specifies a target atomic data type which the results of the key expression must match with (after atomization). The index is declared as a "value equality" index. This means that it can be used to find the employees in a particular city, but not in a "range" of cities. In other words, the index is not aware of any ordering among city names. Finally, the maintenance property of the index is set to "automatically maintained". Briefly, an automatically maintained index is one whose maintenance is the responsibility of Sausalito rather than the xquery programmers. The second index declaration assigns the name news-data:ArtCountEmp to the index. Its domain expression selects all employees who are journalists. Its key expression computes the number of articles written by the "current" journalist. This index is declared as a "value range" index, which means that it can be used to find journalists whose article count is within a given range. Finally, the index is also declared as "manually maintained", which means that programmers must explicitly request that the index be synchronized with the underlying data.

  (: The "news-data" Library Module :)

  module namespace news-data = "http://www.news.org/data";

  import schema namespace news-schemas = "http://www.news.org/schemas";

  declare collection news-data:employees as schema-element(news-schema:employee)*;

  declare collection news-data:articles as node()*;

  declare const ordered collection news-data:months as schema-element(news-schema:month)* with read-only nodes;

  declare automatically maintained value equality index news-data:CityEmp
  on nodes xqddf:collection(xs:QName("news-data:employees"))/employee
  by .//station/city as xs:string;

  declare manually maintained value range index news-data:ArtCountEmp
  on nodes xqddf:collection(xs:QName("news-data:employees"))/employee[./position/@kind eq "journalist"]
  by count(for $art in xqddf:collection(xs:QName("news-data:articles"))//article
           where $art/empid eq ./id
           return $art) as xs:integer;

  declare variable $news-data:employees := xs:QName("news-data:employees");
  declare variable $news-data:articles := xs:QName("news-data:articles");
  declare variable $news-data:months := xs:QName("news-data:months");
  declare variable $news-data:CityEmp := xs:QName("news-data:CityEmp");
  declare variable $news-data:ArtCountEmp := xs:QName("news-data:ArtCountEmp");

Having declared the indexes in a library module, they can now be created. This is done by the "admin-script-3" script shown below. The script must first import the "news-data" module. As far as indexes are concerned, the effect of this import is to create two entries in the static context of the main module, mapping the index names to the index definitions (domain expression, key specification, and properties). Then, the query creates the indexes by invoking the xqddf:create-index function, passing the name of the index as input.

Let us consider the creation of the CityEmp index (the process is the same for the ArtCountEmp index). Index creation starts with retrieving the index definition from the static context, using the index name. Then, an index container is created, whose entries will be pairs associating a city name with a set of employees. Next, the index container is populated using the process outlined above: The domain expression is evaluated, and for each employee node E in the domain sequence, the name of the city C where the employee is currently stationed in is retrieved by evaluating the key expression, atomizing its result, and checking that the atomic value matches the specified target type. Finally, the pair [E, C] is inserted in the index: if an entry for C exists already, E is inserted in the set associated with C; otherwise, an new entry is created mapping C to the set { E }. The last step in index creation involves registering the index inside an indexes table that maps index names to index containers. The index container will remain registered until it is destroyed by a call to the xqddf:delete-index function (see the "admin-script-4" script below).

  (: The "admin-script-3" script :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  xqddf:create-index($news-data:CityEmp);

  xqddf:create-index($news-data:ArtCountEmp);

The next step in this example is to show how the index can be used to to optimize query performance, which of course, is the primary motivation for supporting indexes in any data-processing system. XQDDF provides two functions for index probing: xqddf:probe-index-point and xqddf:probe-index-range. The xqddf:probe-index-range function is available only for indexes declared as "value range".

The "probe-1" query illustrates the use of xqddf:probe-index-point. The query returns the names of all employees stationed in Paris. As shown, the xqddf:probe-index-point function takes the index name and the keyword "Paris" as inputs. It uses the index name to find the index container via the indexes tables, looks-up the entry for "Paris" inside this container, and returns all the associated employee nodes.

The "probe-2" query illustrates index probing via the xqddf:probe-index-range function. The query returns all journalists who have written more than 100 articles. As shown, the first parameter of the xqddf:probe-index-range function is the index name, followed by 6 parameters per key expression. The 6 parameters specify a range of value for the key values: the first 2 are the lower and upper values of the range, the next two are booleans that specify whether the range does indeed have a lower and/or upper bound, and the last 2 are also booleans that specify whether the range is open or closed from below or above (i.e., whether the lower/upper bound are included in the range or not).

The "no-probe-1" and "no-probe-2" queries return the same results as the "probe-1" and "probe-2" queries, respectively, but without using any index. Normally, the performance of the probe queries will be much better than that of the corresponding no-probe queries. This is because, in general, indexes organize their entries in ways that make the execution of the probe functions very efficient. Typically, some kind of a hash table (for value equality indexes) or ordered tree (for value range indexes) data structure is employed, and as we will see, Sausalito support both kinds of indexes. So, for example, the "probe-1" query does not have to access every entry in the index until it finds the one for Paris, whereas the "no-probe-1" query has to access every employee in the collection and check his/her city.

People familiar with SQL and modern relational DBMSs would probably expect the query optimizer to be able to automatically rewrite queries like "no-probe-1" and "no-probe-2" to queries like "probe-1" and "probe-2". Unfortunately, in Sausalito the query optimizer is not yet smart enough with respect to indexes. Although we do plan to offer automatic index-related rewrites in the near future, we also expect the probing functions to remain useful for manual rewrites because both the XQuery language and the kind of indexes that are allowed in Sausalito can be much more complex than their relational counterparts.

  (: The "probe-1" query :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  xqddf:probe-index-point($news-data:CityEmp, "Paris")

  (: The "probe-2" query :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  xqddf:probe-index-range($news-data:ArtCountEmp, 100, (), true, false, true, false)

  (: The "no-probe-1" query :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  xqddf:collection($news-data:employees)/employee[.//station/city eq "Paris"]

  (: The "no-probe-2" query :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  for $emp in xqddf:collection($news-data:employees)/employee[./position/@kind eq "journalist"]
  where 100 <= count(for $art in xqddf:collection(xs:QName("news-data:articles"))//article
                     where $art/empid eq $emp/id
                     return $art)
  return $emp

Now, let us consider what happens when the data on which an index is built gets updated. In general, index maintenance is the operation where the index contents are updated so that they reflect the index definition with respect to the current snapshot of the data. Sausalito offers two maintenance modes: manual and automatic. If an index is declared as "manually maintained", index maintenance is done only when the function xqddf:refresh-index (described in 2.6 Index maintenance) is invoked inside a query. Essentially, in manual mode maintenance is in the control of the query programmers, and the index may become stale between two consecutive calls to the xqddf:refresh-index function. In contrast, if an index is declared as "automatically maintained", Sausalito guarantees that the index stays up-to-date at any given time.

In this example, the CityEmp index was declared as automatic. The "index-maintenance" query shown below transfers the employee with id "007" from his current city, say Paris, to Beijing. Since index CityEmp is automatic, after the update is applied, Sausalito will initiate a maintenance operation on the index, whereby the employee node will be removed from the node set associated with Paris and inserted into the node set associated with Beijing (if there is no other employee stationed in Beijing already, an entry for it will be created first). Notice that although the index is not explicitly referenced anywhere in this query, its definition must still be available to the query because it is needed to perform the index maintenance. In this example, the query imports the "news-data" module because it contains the declaration for the employees collection, which is referenced by the query. But the "news-data" module contains the index declaration as well, so index maintenance can find the index definition. In general, it is a best practice to declare an index in the same module as the collections that are referenced by the index.

The ArtCountEmp index is more complex than the CityEmp index, so the system may not be able to maintain it in an efficient way. Furthermore, the index contains "statistical" information, so it may be acceptable if its contents are not always in sync with the underlying data. For these reasons, the ArtCountEmp index was declared as "manually maintained".

  (: The "index-maintenance" query :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  replace node value xqddf:collection($news-data:employees)/employee[@id eq "007"]//station/city
  with "Beijing"

Finally, we conclude this example with a query that shows how to destroy an index. As shown in "admin-script-4" below, index deletion is done via the xqddf:delete-index function. The function simply destroys the index container and removes the mapping between the index name and the index container from the indexes table. After the index is deleted, any query that tries to access the index will receive an error.

  (: The "admin-script-4" query :)

  import module namespace xqddf = "http://www.zorba-xquery.com/modules/xqddf";

  import module namespace news-data = "http://www.news.org/data" at "news_data.xqlib";

  ddf:delete-index($news-data:CityEmp);

2.2 Index declaration

  Prolog ::= ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)*
             ((VarDecl | ContextItemDecl | FunctionDecl | OptionDecl | CollectionDecl | IndexDecl | ICDecl) Separator)*

  IndexDecl ::= 'declare' IndexProperties 'index' IndexName
               'on' 'nodes' IndexDomainExpr
               'by' IndexKeySpec (',' IndexKeySpec)*

  IndexName ::= QName

  IndexProperties ::= ('unique' |
                      'non' 'unique' |
                      'value' 'range' |
                      'value' 'equality' |
                      'automatically' 'maintained' |
                      'manually' 'maintained')*

  IndexDomainExpr ::= PathExpr

  IndexKeySpec ::= IndexKeyExpr IndexKeyTypeDecl IndexKeyCollation

  IndexKeyExpr ::= PathExpr

  IndexKeyTypeDecl ::= 'as' AtomicType

  AtomicType ::= QName

  IndexKeyCollation ::= ('collation' UriLiteral)?

Syntactically, each index is defined by an index declaration statement, which specifies a unique name for the index as a QName, the index domain expression, a number of key specifications, and a set of index properties. Index declarations must be placed inside module prologs. The Prolog syntax is extended accordingly, as shown above. An additional constraint (not expressible syntactically) is that only library modules may contain index declarations [err:XDST0023]. The reasons for this rule are the same as those for collections (see 1.2 Collection Declaration).

To accommodate index declarations, XQDDF extends the static context with a component called the statically known indexes. This is a map whose entries associate an expanded QName with an implementation-dependent representation of the information contained in an index declaration with the same QName. Each index declaration adds an entry to the statically known indexes of the module containing the declaration. If the expanded QName of the index is equal (as defined by the eq operator) to the expanded QName of another index in the statically known indexes of the same module, a static error is raised [err:XDST0021]. Like the statically known collections, the statically known indexes of a module that is imported by another module are copied into the statically known indexes of the importing module. It is a static error [err:XDST0022] if the expanded QName of an index declared in an imported module is equal (as defined by the eq operator) to the expanded QName of an index declared in the importing module or in another imported module (even if the declarations are consistent).

Sausalito defines three index properties: uniqueness (with possible values 'unique' or 'non unique'), usage (with possible values 'value range' or 'value equality'), and maintenance mode (with possible values 'manually maintained' or 'automatically maintained'). The syntax allows the values for these properties to be listed in any order or not be specified at all. If not specified, the default values for uniqueness, usage, and maintenance mode are 'non unique', 'value equality', and 'automatically maintained', respectively. It is a static error [err:XDST0024] if more than one value is listed in an index declaration for any of these properties.

The uniqueness property determines the kind of relationship between keys (or key tuples, more generaly) and domain nodes: if the relationship is one-to-one, the index may be declared as unique; otherwise (one-to-N relationship), the index must be declared as non-unique. As we will see, a unique index actually enforces the one-to-one relationship between keys and domain nodes.

The usage property specifies the kind of query expressions that may be optimized by using the index. A value equality index can optimize expressions involving value equality predicates only. The "probe-1" and "no-probe-1" queries in 2.1 Indexes in action - A simple example are an example of such usage. As shown there, a value equality index supports the xqddf:probe-index-point function. A value range index can optimize expressions involving any kind of value comparison. The "probe-2" and "no-probe-2" queries in 2.1 Indexes in action - A simple example are an example of such usage. A value range index supports both the xqddf:probe-index-point and the xqddf:probe-index-range functions.

The maintenance mode specifies how index maintenance is done. Sausalito offers two maintenance modes: manual and automatic. For a manual index, maintenance is done only when the function xqddf:refresh-index (described in 2.6 Index maintenance) is invoked inside a query. Essentially, in manual mode maintenance is in the control of the query programmers, and the index may become stale between two consecutive calls to the xqddf:refresh-index function. In contrast, for an automatic index, Sausalito guarantees that the index stays up-to-date at any given time.

The index declaration syntax is very liberal with respect to the expressions that can appear as domain or key expressions. However, the following semantic restrictions are imposed by Sausalito on the domain expression and each of the key expressions:

Furthermore, the domain expression must satisfy the following additional semantic restrictions:

With each key expression, an index declaration associates a key type and a key collation. The triplet IndexKeyExpr, IndexKeyTypeDecl, IndexKeyCollation is called a keyspec. The IndexKeyTypeDecl provides an atomic type that the result of the associated key expression (for each domain node) must match with according to the rules of sequence type matching. This atomic type must not be xs:untypedAtomic or xs:anyAtomicType [err:XDST0027]. Furthermore, if the index is a value range one, an ordering must exist among the values in the type domain [err:XDST0027] (this rules excludes the following atomic types and their subtypes: QName, NOTATION, hexBinary, hex64Binary, gYearMonth, gYear, gMonthDay, gMonth, and gDay). If the key type in a keyspec is xs:string (or subtype of), the IndexKeyCollation specifies the collation to use when comparing key values from this keyspec. If no collation is specified, the default collation from the static context of the declaring module is used.

2.3 Index creation

As explained already, indexes are just sets of index entries, mapping key tuples to domain nodes. In terms of the XQuery language, these sets "live" in the dynamic context. In particular, XQDDF extends the dynamic context with a component called the available indexes. This is a map whose entries associate the expanded QName of an index with the entry set of the index. If an entry for an index appears in the available indexes of a module, the index is said to be available to that module.

In practice, the available indexes component is implemented by Sausalito's storage system. To begin with, each index entry set is implemented by some appropriate data structure that acts as an index entry container. The description of potential data structures is beyond the scope of this document, but the typical choices are either some sort of hash table (for equality indexes) or some kind of ordered tree (for range indexes). In addition to managing the index entry containers, the store maintains an indexes table, which maps index names to index entry containers. The indexes table is accessible by all queries, so once an entry is added to the table, the associated index is assumed to be available to every query and every module that participates in the execution of that query.

Creation of an index involves creating an initially empty index entry container, populating that container with the entries computed by the domain and key expressions of the index, and "registering" that container in the indexes table. All this is done by the xqddf:create-index function that is described below. In fact, xqddf:create-index is an updating function, so instead of actually creating the index, it generates a pending update primitive that becomes part of a pending update list (PUL) to be applied at a later time (see Extensions to the XQUF updates routines). The update primitive is also described below.

  declare updating function xqddf:create-index($indexName as xs:QName)

  upd:createIndex($indexName as xs:QName).

The create-index function is evaluated as follows:

The update primitive is applied as follows:

2.4 Index deletion

To destroy an index, XQDDF provides the delete-index updating function. The function itself and its associated update primitive are described below.

  declare updating function xqddf:delete-index($indexName as xs :QName)

  upd:deleteIndex($indexName as xs:QName)

The delete-index function is evaluated as follows:

The update primitive is applied as follows:

2.5 Index probing

Probing an index means retrieving the domain nodes associated with a particular search condition, which is either a single key tuple, or a "range" of key tuples. Probing can be done via the xqddf functions xqddf:probe-index-point or probe-index-range. Both functions accept a variable number of arguments. The first argument is always a QName identifying an index. The rest of the arguments specify the search condition. For both functions, the index must exist in both the statically known indexes and the available indexes of the invoking module; otherwise error XDDY0021 or XDDY0023 is raised, respectively.

  xqddf:probe-index-point($indexUri as xs:QName,
                          $key1     as xs:anyAtomic,
                          ...,
                          $keyM     as xs:anyAtomic) as node()*

For the probe-index-point function, the search condition is specified as a number of atomic items comprising a key tuple. This number must be equal to the number of indexspecs found in the index declaration [err:XDDY0025]. If the index contains an entry with the given key tuple, the associated domain nodes are returned. Otherwise, the empty sequence is returned.

  probe-index-range($indexUri                 as xs:QName,
                    $rangeLowerBound1         as xs:anyAtomic?,
                    $rangeUpperBound1         as xs:anyAtomic?,
                    $rangeHaveLowerBound1     as xs:boolean,
                    $rangeHaveupperBound1     as xs:boolean,
                    $rangeLowerBoundIncluded1 as xs:boolean,
                    $rangeupperBoundIncluded1 as xs:boolean,
                    ....,
                    $rangeLowerBoundM         as xs:anyAtomic?,
                    $rangeUpperBoundM         as xs:anyAtomic?,
                    $rangeHaveLowerBoundM     as xs:boolean,
                    $rangeHaveupperBoundM     as xs:boolean,
                    $rangeLowerBoundIncludedM as xs:boolean,
                    $rangeupperBoundIncludedM as xs:boolean) as node()*

The probe-index-range function can be invoked on value range indices only [err:XDDY0026]. To describe the semantics of this function, we start by defining the ith key column of an index as the set of key items produced by evaluating the ith keyspec of the index for every domain node. Then, the search condition of a range probe can be defined as a number of rangespecs, where a rangespec describes a constraint on the values of a key column. The first rangespec applies to the first key column, the second rangespec to the second key column, etc. The number of rangespecs must be less or equal to the number of keyspecs found in the declaration of the given index [err:XDDY0025]. Each rangespec consists of 6 values:

If the number of rangespecs is less than the number of key columns, then the missing rangespecs are assumed to have the following value: [(), (), false, false, false, false].

A key tuple K = [k1, ..., kM] satisfies a range search condition if for each i = 1, 2, ..., M the following is true:

effectiveLowerBoundi lOp ki uOp effectiveUpperBoundi

where lOp (uOp) is either the le operator if rangeLowerBoundIncludedi (rangeUpperBoundIncludedi) is true, or the lt operator if rangeLowerBoundIncludedi (rangeUpperBoundIncludedi) is false. The index-probe-range function finds all key tuples in the index that satisfy the given search condition, creates the union of the domain nodes associated with each such key tuple, puts the resulting sequence of nodes in document order, and finally returns this sequence to its caller.

2.6 Index maintenance

An index is said to be up-to-date if its contents reflect the index definition on the current data snapshot, i.e., the contents are the same as those that would be produced if the xqddf:create-index function was invoked on the same index and with the same underlying data. An index is said to be stale if it is not up-to-date. Indexes become stale when documents in collections are updated or when documents are inserted/removed in/from collections. Index maintenance is the operation by which stale index contents are updated so that the index becomes up-to-date. Sausalito offers two maintenance modes: manual and automatic.

If an index is declared as "automatically maintained", Sausalito guarantees that every time a PUL is applied, the index is made up-to-date before the upd:apply-updates function returns. Ideally, all indexes should be automatically maintained, but in general, index maintenance can be a very exprensive operation perormance-wise. As a result, Sausalito will reject a declaration for an automatic index if it determines that it cannot maintain the index in an "efficient" way. The definition of efficiency with respect to index maintenance is implementation dependent, but in general, it means that the index can be maintained in some incremental way that is faster than simply re-creating the whole index from scratch. However, even incremental maintenance can have a high cost, which may make the manual mode described below the preferred choice.

If an index is declared as "manually maintained", it is the responsibility of the programmers to keep the index up-to-date. This can be done using the xqddf:refresh-index updating function described below. Since Sausalito does not take any maintenance action during PUL applications, manually maintenained indexes may become stale in between calls to the xqddf:refresh-index function. Obviously, the manual mode must be used if an index cannot be maintained automatically. However, even for automatically maintainable indexes, the manual mode may be preferable if users can tolerate a stale index in return for better performance during updates.

  declare updating function xqddf:refresh-index($indexName as xs:QName)

  upd:refreshIndex($indexName as xs:QName)

The refresh-index function is evaluated as follows:

The update primitive is applied as follows: