XesamSearchUpdates

Proposed Updates to the Xesam Spec

Legends

2007-09-22

Proposed changes to the xesam spec, 2007-09-22, by MikkelKamstrupErlandsen

1. Exceptions in Search API

The current draft of the search api does not specify under what circumstances dbus errors should be returned. My proposals follow:

2. Richer Format for Extension Property via Quirks

The current format for the vendor.extensions property is as (an array of strings), where each string in the array denotes an extension as specified in the query language. However, some extensions, such as the regExp selector, are loosely specified at the moment: the spec doesn't say what regex syntax to use.

Skipping the details of the different regexp syntaxes, it has become clear to me that it is hard to force everyone to use the same syntax. Posix Extended and Perl-like seem to be the most dominant. To remedy this I think we should allow the extensions listed in vendor.extensions to specify a quirk. So if some engine supports the fuzzy and regExp extensions it would return

  ["fuzzy", "regExp"]

To hint that it uses Perl regExps it can add the quirk perl to the regExp extension string, like so

  ["fuzzy", "regExp:perl"]

So in short, any extension can have a quirk (and only one) added by appending :quirk_name to the extension name. The vendor is not required to specify a quirk; quirks are optional.

Note that this really only affects clients of xesam servers.
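A client could split such entries as sketched below. This is only an illustration of the proposed format, following the ["fuzzy", "regExp:perl"] example above; the function name parse_extension is hypothetical and not part of any xesam API.

```python
def parse_extension(entry):
    """Split a vendor.extensions entry into (extension, quirk).

    The quirk is None when the vendor did not specify one.
    """
    name, sep, quirk = entry.partition(":")
    if not name:
        raise ValueError(f"missing extension name in {entry!r}")
    # The proposal allows at most one quirk per extension.
    if sep and (not quirk or ":" in quirk):
        raise ValueError(f"expected exactly one quirk after ':' in {entry!r}")
    return name, (quirk if sep else None)


# Map each advertised extension to its quirk (or None).
extensions = dict(parse_extension(e) for e in ["fuzzy", "regExp:perl"])
```

With the example list, extensions maps fuzzy to None and regExp to perl, so a client can check the regExp quirk before deciding which syntax to send.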

3. regExp Quirks

Add the quirks perl and extended to the regExp extension to specify what type of regexps are used. If no quirk is appended, the regExp syntax is undefined.

4. Tweaks to Type Extension

Original proposal: Currently the type query extension has the form (there is an error in the written spec; the xml schema is the true reference)

  <type name="xesam:Source" value="xesam:File"/>

I propose to change it to the following more descriptive syntax

  <type category="xesam:Source" class="xesam:File"/>

5. Rename storedAs Back to source

I propose to rename storedAs back to source. See item 3 from 2007-08-02 update proposals.

6. GetHits and GetHitData Return Type

Following up on item 4 from the 2007-08-02 update proposals. I propose to simply define the following default values for unset fields

The reason for this is that I find it unlikely that it is mission critical to clients whether fields are set or not. The search API is not the interface for a semantic metadata storage; it is a search api, and one of its prime qualities is ease of use for end user applications. When we define the metadata api in xesam iteration 2, the question of unset values will be of much higher importance.

7. Ontology Format

Very short version: Use both a .ini and RDF/XML version. We will provide a small tool written in python called xesam-ontology-install which accepts both .ini and RDF/XML as input, and installs both versions (the other compiled from the input).

While it does indeed suck to have two formats installed it is not necessarily bad installing third party ontologies via a tool. This has the benefit of being able to validate the ontology before it is actually installed.

8. Punt Xesam UI API

The spec for the xesam ui methods has received very little feedback and is not mission critical in any way. I suggest we punt it to a later iteration (if we even want it).

9. User Search Lang: Split Type Selector

The XesamUserSearchLanguage defines a special type selector. This has an unclear mapping to the XesamQuery Language. It is proposed to replace it by two selectors called content and source with the obvious mappings to the query language.

10. User Search Lang: Keywords and Aliases

The keyword definition of the user search language has yet to be defined. Keywords are understood as the first part of a Select statement in the search language definition.

It is proposed to let the keywords consist of the fields defined in the upcoming xesam ontology without the xesam namespace. I.e. to search in the xesam:title field, do

title:algorithms

Formally xesam will be the default namespace for the keywords. Fields from third party ontologies are also accessed this way; by stripping the namespace. So if a third party food ontology has a food:taste, you can search for a specific taste with:

taste:salty
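The lookup from keyword to field could be sketched as follows. The field list and the function name resolve_keyword are made up for illustration, and the handling of name clashes between ontologies (prefer the xesam namespace, otherwise report ambiguity) is an assumption; the proposal does not say what should happen in that case.

```python
# Hypothetical set of fields known to the engine.
KNOWN_FIELDS = ["xesam:title", "xesam:author", "food:taste"]


def resolve_keyword(keyword, fields=KNOWN_FIELDS, default_ns="xesam"):
    """Map a namespace-less keyword to a fully qualified field name."""
    matches = [f for f in fields if f.split(":", 1)[1] == keyword]
    preferred = f"{default_ns}:{keyword}"
    if preferred in matches:
        # xesam is the default namespace for keywords.
        return preferred
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"unknown keyword {keyword!r}")
    # Assumption: clashes between third party ontologies are an error.
    raise ValueError(f"ambiguous keyword {keyword!r}: {matches}")
```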

Using the field names as the only keywords might prove to be too advanced for ordinary users. To this end we introduce keyword aliases. A keyword alias is just another name for the keyword. The following list probably needs more items

11. Xesam Versioning Convention

Use a MajorMajor.MinorMinor versioning scheme. We shouldn't need micro versions. The session property vendor.xesam is a uint mapped to this in the obvious way. For example, 1.0 becomes 100 and 1.1 becomes 110. Note: the RC1 will be versioned 0.9, i.e. the vendor.xesam property will be 90.
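One reading of the mapping that reproduces all three examples (1.0, 1.1 and 0.9) is to treat the version as a decimal number and multiply it by 100. The sketch below assumes that reading; the function name is hypothetical.

```python
from decimal import Decimal


def xesam_version_to_uint(version):
    """Convert a version string such as "1.1" to the vendor.xesam value."""
    # Decimal avoids the rounding surprises of binary floats (e.g. 1.1 * 100).
    value = Decimal(version) * 100
    if value < 0 or value != value.to_integral_value():
        raise ValueError(f"not a MajorMajor.MinorMinor version: {version!r}")
    return int(value)
```

Under this assumption "1.0" maps to 100, "1.1" to 110 and "0.9" to 90, matching the values given above.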



History

2007-08-02

Proposed changes to the xesam spec, 2007-08-02, by MikkelKamstrupErlandsen

1. One Field Per Selector

In the XML Query language, disallow using multiple fields in the baseSelectorType. Just allow exactly one field. Currently the two field elements in the query below mean match-in-any-one-of-these-fields (meaning OR).

<equals>
  <field name="xesam:author"/>
  <field name="xesam:Title"/>
  <string>Andersen</string>
</equals>

To ease parsing it is proposed to ban this behavior and enforce exactly one field element per selector:

<or>
  <equals>
    <field name="xesam:author"/>
    <string>Andersen</string>
  </equals>
  <equals>
    <field name="xesam:Title"/>
    <string>Andersen</string>
  </equals>
</or>
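The rewrite from the old multi-field form to the proposed one-field form is mechanical, as the following sketch shows. It uses Python's standard xml.etree.ElementTree module; the helper name split_fields is hypothetical, and the sketch assumes the selector's children are only field elements plus the value element(s).

```python
import copy
import xml.etree.ElementTree as ET


def split_fields(selector):
    """Return an equivalent element with exactly one <field> per selector."""
    fields = selector.findall("field")
    if len(fields) <= 1:
        return selector
    others = [child for child in selector if child.tag != "field"]
    wrapper = ET.Element("or")
    for field in fields:
        # One copy of the selector per field, each with the same value(s).
        clone = ET.SubElement(wrapper, selector.tag, dict(selector.attrib))
        clone.append(copy.deepcopy(field))
        clone.extend([copy.deepcopy(other) for other in others])
    return wrapper


legacy = ET.fromstring(
    '<equals>'
    '<field name="xesam:author"/>'
    '<field name="xesam:Title"/>'
    '<string>Andersen</string>'
    '</equals>'
)
rewritten = split_fields(legacy)
```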

2. Type Attribute in Query Language

The semantics of the "type" attribute on the query element in the XesamQueryLanguage is unclear. It is proposed to either

  1. Remove the type attribute entirely since it can be represented by ANDing with a comparison on the xesam:category field.

  2. Split it out into two attributes category and source.

The idea behind it is that some backends can optimize their queries against particular data types (source/cat). Maybe some data types are stored in another index or so, and it is convenient to know this before the actual query parsing begins.

Another benefit is readability of the query. If a particular category (or source) is specified in an attribute on the query element it is easier to discern what the rest of the query does. Look at the examples

<query category="xesam:Audio">
  <contains>
    <field name="xesam:artist"/>
    <string>hendrix</string>
  </contains>
</query>

without the category attribute this becomes

<query>
  <and>
    <contains>
      <field name="xesam:artist"/>
      <string>hendrix</string>
    </contains>
    <equals>
      <field name="xesam:category"/>
      <string>xesam:Audio</string>
    </equals>
  </and>
</query>
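The equivalence between the two forms can be expressed as a small rewrite, sketched here with Python's standard xml.etree.ElementTree module. The helper name expand_category is hypothetical, and the sketch only covers the category attribute, not source.

```python
import copy
import xml.etree.ElementTree as ET


def expand_category(query):
    """Rewrite <query category="..."> into an explicit <and> comparison."""
    category = query.get("category")
    if category is None:
        return query
    attrib = {k: v for k, v in query.attrib.items() if k != "category"}
    expanded = ET.Element("query", attrib)
    conjunction = ET.SubElement(expanded, "and")
    # Keep the original selectors, then AND them with the category test.
    conjunction.extend([copy.deepcopy(child) for child in query])
    equals = ET.SubElement(conjunction, "equals")
    ET.SubElement(equals, "field", name="xesam:category")
    ET.SubElement(equals, "string").text = category
    return expanded


short_form = ET.fromstring(
    '<query category="xesam:Audio">'
    '<contains><field name="xesam:artist"/><string>hendrix</string></contains>'
    '</query>'
)
long_form = expand_category(short_form)
```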

3. Ontology: Rename "Source" to "DataType"

According to XesamOntologyDraft, indexed items have two core metadata fields that define them, "source" and "category". The ontology tree is split in three main branches, Source, Category, and Fields. The meaning of the source metadata field has become this-item-is-stored-as-a. In this view the term Source seems misleading. It is proposed to use DataType instead. Another option is to use storedAs. Examples of datatypes would be xesam:archiveItem, xesam:File, and xesam:EmailAttachMent.

MikkelKamstrupErlandsen: Discussion on IRC has led to another proposal: Rename "source" to "storage" and "category" to "content" and don't use *type postfixes (fx "DataType").

4. Problems With Hit Data

There are a few problems with the return type aav in GetHits and GetHitData

To remedy these problems it is proposed to use strings instead of variants, ie aas instead of aav. This makes the returned data easier to present and solves the first issue partially. For some fields the empty string could actually be the value, but this is not a big problem. At least it is smaller than the one we have now.

5. Session Properties search.blocking and search.live

As discussed on the list, search.blocking and search.live have been a great cause of confusion. It is proposed to scrap search.blocking entirely and make calls to the methods GetHits, GetHitData and CountHits always block until the requested data is ready. DBus can make the client side async anyway.

6. Introduce search.readonce

Introduce a new session property search.readonce which, if True, indicates to the search engine that the client guarantees it will not request refreshed or additional metadata on retrieved hits. Hits will be retrieved once with GetHits, and once they are fetched the search engine is free to drop any reference to them. The default value is True.

This can be useful for lighter queries that are spawned rapidly, for example by Gnome's deskbar-applet (interactive search field on the desktop).

7. Rename CountHits to GetHitCount

CountHits is inconsistent with the naming used in the rest of the api. Rename it to GetHitCount.

8. search.maxhits

Introduce a read-only session property search.maxhits which determines the maximum number of hits that can be retrieved for an arbitrary query. This fixes a common scalability issue for queries with large result sets.

9. Split out vendor.fieldnames

In the light of the current ontology it does not make much sense to have vendor.fieldnames as the only ontology introspection property. It is proposed to replace it with three properties

Each returns an as (array of strings) with the fields/cats/sources supported by the search engine. I.e. not all from the ontology, but all that it actually supports.

10. Use uint Instead of Int Where Applicable

The session properties hit.snippet.length, vendor.version, and vendor.xesam can all safely use unsigned integers. It is proposed to use u instead of i in the dbus signatures.

Furthermore the methods GetHitData, GetHits and CountHits use integers where u could be used instead (they are just array offsets). It is proposed to replace these as well. Likewise in the signals Hits{Added,Modified,Removed}.

Older Proposals

1. Ontology Introspection and Installation

Add a session property vendor.ontologies that has value type aas: an array of ontology definitions, which are triplets (unique_name, version, path). For example, hooking up to a shared online search service and calling GetProperty(session, "vendor.ontologies") might return Yahoo and Google ontologies

[
        ["yahoo", "1.0", "/usr/share/ontologies/yahoo-1.0"],
        ["google", "1.0", "/usr/share/ontologies/google-1.0"]
]

The values of the ontology-triples (unique_name, version, path) deserve description:

Ontologies are installed in a directory under {XDG_USER_DATA_DIR,XDG_SYSTEM_DATA_DIR}/ontologies named <unique_name>-<version>. There should be some kind of metadata for the ontology itself such as a vendor name (the unique name as in the dir-name), ontology version, full vendor name (free form string), ontology description, ontology license. Whether this is stored in a separate file or embedded in the ontology itself (could be done in RDF/XML for example) is another matter to be decided later.
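The naming rule can be checked against the example triples with a few lines of code. This is only a sketch: the function name ontology_dir is hypothetical, and /usr/share is taken as the data directory because that is what the example paths use.

```python
import posixpath


def ontology_dir(data_dir, unique_name, version):
    """Build <data_dir>/ontologies/<unique_name>-<version>."""
    return posixpath.join(data_dir, "ontologies", f"{unique_name}-{version}")


# The triples from the vendor.ontologies example.
triples = [
    ["yahoo", "1.0", "/usr/share/ontologies/yahoo-1.0"],
    ["google", "1.0", "/usr/share/ontologies/google-1.0"],
]

# Each advertised path should agree with the naming rule.
consistent = all(
    ontology_dir("/usr/share", name, version) == path
    for name, version, path in triples
)
```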

Perspectives : This allows alternative search engines (on different dbus paths than the main xesam desktop search service) to expose custom ontologies.

Why we need it : We have to have some way of namespacing the actual files installed by an ontology to prevent different packages from overwriting each other's files. Also we need a way to point a client to where they might find the actual ontologies.

Alternative : An ontology could be specified to be contained in a single file. This way we can have all ontologies installed in the same directory.

2. No Way to Tell When Search is Done

Currently there is no way to tell when the search is "done". This makes some sort of sense for live queries too: when they are "done" it means that they have finished searching the index and are now listening for changes. I propose to add a new signal:

Alternative : If people object to my proposal we could specify that emitting HitsAdded(search_handle,0) means that the search has run through the entire index.

3. User Search Language - No Parallel to 'w' Modifier

Currently the w phrase modifier in the XesamUserSearchLanguage has no parallel in the XesamQueryLanguage.

I propose that we add a new attribute on the string element called wordBreak. This will be defined as an extendedStringAttribute in the xml schema for the query language. This way it is optional for xesam implementations to support this.

4. User Search Language - Not Possible to Translate to XML Query Language

Consider the task of converting a query from the XesamUserSearchLanguage to the XesamQueryLanguage. For example, the search "hello world"c (case sensitive search for the words hello and world) would translate to (skipping <request> and <query> elements)

<fullText>
        <string caseSensitive="true" phrase="false">hello world</string>
</fullText>

Now consider the case where we use an r modifier instead, for example in the search "hello|world"r. This would translate to

<regExp>
        <string>hello|world</string>
</regExp>

But this is not legal according to the schema. You must specify a field to query; for example, the following would be legal:

<regExp>
        <field name="xesam:Content.title"/>
        <string>hello|world</string>
</regExp>

The problem is that there is currently no way for a parser user->xml to know what fields to query when we are not using a fullText selector (where no fields are needed).

I propose to allow a choice between one <fullTextFields/> tag or a list of <field/> tags in the proximity and regExp selectors. If you use the fullTextFields tag it implies (unsurprisingly) that you want to search in the same scope as a fullText select. The user search string "hello|world"r would now translate to

<regExp>
        <fullTextFields/>
        <string>hello|world</string>
</regExp>
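With the fullTextFields element a user->xml parser no longer needs to guess a field. The sketch below covers only the two examples discussed in this item (a quoted phrase with the c or r modifier); it is not a parser for the full user search language, and the function name translate_phrase is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A quoted phrase followed by an optional one-letter modifier.
_PHRASE = re.compile(r'^"(?P<text>[^"]*)"(?P<modifier>[a-z]?)$')


def translate_phrase(user_query):
    """Translate a quoted user search phrase to a query language selector."""
    match = _PHRASE.match(user_query)
    if match is None:
        raise ValueError(f"not a quoted phrase: {user_query!r}")
    text, modifier = match["text"], match["modifier"]
    if modifier == "r":
        # Regular expression over the same scope as a fullText select.
        selector = ET.Element("regExp")
        ET.SubElement(selector, "fullTextFields")
        ET.SubElement(selector, "string").text = text
    elif modifier == "c":
        # Case sensitive full text search, as in the first example.
        selector = ET.Element("fullText")
        string = ET.SubElement(
            selector, "string", caseSensitive="true", phrase="false"
        )
        string.text = text
    else:
        raise ValueError(f"modifier {modifier!r} is not handled by this sketch")
    return selector
```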

My reason for preferring this over other solutions is that this way we do not meddle with the other selectors which work fine in this case.

Note: I've looked closely at this and it seems that regexp and prox are really the only selectors where a fullTextFields element would make sense.

Alternatives : Should my proposal not be well received, here are some alternatives:

5. Extended Selection Types not Allowed in Query Schema

In the current form of the query schema extendedSelectionTypes are not allowed in the selectionTypes, which means regExp and proximity selectors are not allowed by the schema. I propose the following patch, which adds a new type baseSelectionTypes with the current contents of selectionTypes and changes selectionTypes to be either a baseSelectionType or an extendedSelectionType

=== modified file 'xesam-query.xsd'
--- xesam-query.xsd     2007-06-11 21:34:36 +0000
+++ xesam-query.xsd     2007-06-17 11:17:56 +0000
@@ -140,9 +140,21 @@
 
 <!--
 Elements listed here are called "select statements". A Xesam compliant
-search engines must suport these selectors.
+search engines must suport these selectors. A selection type is thus
+either a baseSelectionType or an extendedSelectionType
 -->
 <xs:group name="selectionTypes">
+       <xs:choice>
+               <xs:group ref="baseSelectionTypes"/>
+               <xs:group ref="extendedSelectionTypes"/>
+       </xs:choice>
+</xs:group>
+
+<!--
+These select statements must be supported by all xesam compliant search
+engines.
+-->
+<xs:group name="baseSelectionTypes">
   <xs:choice>
     <xs:element name="equals" type="selectBaseType"/>
     <xs:element name="contains" type="selectBaseType"/>

6. Race Condition Inherent in the Search API

There is a race condition built into XesamSearchLive. When you call NewSearch the search engine might start firing HitsAdded signals before you connect to that signal - in fact the search engine might even start firing these signals before your call to NewSearch returns!

Proposed solution: Add a new method StartSearch(in s search_handle) to the search interface.

Note: Technically we can work around this race condition by exploiting that dbus signals are bus-wide, meaning that you can pick them up even though you don't know the sending object. This is a hack though and I prefer that we don't require hacks for proper usage of the API.

7. RegExp Not Well Defined

The behavior of the regexp selector is not clearly defined. Should we match each field exactly against the expression or do we look for sub-strings in the fields matching the expression?

I propose to define that we match on any substring and not just the whole field.
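The difference between the two interpretations is easy to see with Python's re module (a Perl-like syntax, used here only for illustration; the field value is made up).

```python
import re

pattern = re.compile(r"hello|world")
field_value = "say hello to everyone"

# Proposed semantics: any substring of the field may match.
substring_match = pattern.search(field_value) is not None

# Rejected semantics: the whole field must match the expression.
whole_field_match = pattern.fullmatch(field_value) is not None
```

Here substring_match is True while whole_field_match is False, so the proposed definition returns the hit and the stricter one does not.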

Outstanding Issues

XML Namespace

We need a target namespace for the query schema. It would be nice to have a project home on fdo before we settle this.

Ontology Representation Language

Good olde .ini vs RDF discussion. In my opinion we are down to .ini vs RDF/XML. Turtle is ruled out because it requires massive work to integrate well in the Gnome stack. The KDE camp says they should have no problem with Turtle though.