Sources

Tabulator comes with a set of "Source" classes, which make it easy to pull data from a variety of places. These include flat-file, JSON file, XML file, JDBC, XCC and JMX. Nothing stops you adding your own.

The "file" based sources actually get their data from a URL, which is often a file: URL. You can configure SSL parameters which are used for https: URLs and basic authentication userid and password which can be used for http: or https: URLs. This makes it practical to consume data from simple web services.

With all these sources, you get the opportunity to supply an enricher, which looks at the attribute values obtained by the source and the attribute groups requested by the user, and creates additional "computed/enriched" attribute values from them.

The enricher is obliged to check whether these new additional values match any match expression that may have been supplied.
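
As an illustration only (the real enricher interface isn't reproduced here, so the class, method and MatchExpr type names below are hypothetical), an enricher might compute a "total" attribute from "price" and "quantity" values obtained by the source, and then check the new value against any match expression supplied for it :-

// Hypothetical sketch only - the actual Tabulator enricher interface may differ.
import java.util.Map;

public class TotalEnricher
{
  // row     maps attribute name to the value obtained by the source
  // matches maps attribute name to any match expression supplied by the user
  public boolean enrich(Map<String,Object> row, Map<String,MatchExpr> matches)
  {
    // create an additional "computed/enriched" attribute value
    Double price    = (Double) row.get("price");
    Long   quantity = (Long)   row.get("quantity");
    if ( price != null && quantity != null )
      row.put("total", price * quantity.doubleValue());
    // the enricher is obliged to check the new value against any match expression
    MatchExpr m = matches.get("total");
    return ( m == null ) || m.matches(row.get("total"));
  }
}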

Flat file

The flat-file source reads data from a URL and parses it as a CSV or similar file.

It understands CSV (comma-separated-value), TSV (tab-separated-value) and PSV (pipe-separated-value) formats.

It understands that values containing the separator, escapes or quotes must be escaped.

It can be told to ignore lines starting with # (aka comment lines).

It can be told to ignore the first (non-comment) line. It's quite common for the first line of a flat-file to contain the column names, rather than a row of column values.

The only way values can be NULL is if a line in the file doesn't have as many columns as it should. It's not possible to have NULL values for earlier attributes but not for later ones. If NULLs are important to you, perhaps the XML file source would be more appropriate.
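
For example, a small CSV file (the content here is invented purely for illustration) might look like this :-

# fish stocked in the pond
variety,name,cost
koi,Kumonryu,500
goldfish,Nemo,5
goldfish,Dory

The first line is a comment line to be ignored, the second is a first line of column names to be ignored, and the last line has fewer columns than it should, so its cost attribute would be NULL.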

When you use the flat-file source, it always reads the whole file and filters the results with the match expressions.

The Music Database example sources some of its data from CSV and XML files (and some from JSON files), which is handy as it makes this example entirely standalone.

JSON file

This source reads data from a URL and parses it as a JSON file.

Tabulator allows you to use a subset of JSONPath to identify where the results and attributes within them are to be found. "Bracket" and "dot" notation are supported. Only paths from the root object/element are supported. Indexes can only be numbers or *. In the bracket notation, names can include special characters using JSON encoding conventions (with ' quotes). The recursive descent operator, union, slices, filters and script expressions are not supported.
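
For example, these two forms of the same path are equivalent, the first in bracket notation and the second in dot notation :-

$['data'][*]['name']
$.data[*].name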

Ordinarily, you'd expect to read and process a document looking like this (ie: an array of objects), which conveniently happens to match the output of the JSON formatter :-

[
  {
    "attrName1" : attrValue1,
    "attrName2" : attrValue2,
    ... more attributes
  },
  ... more objects of interest
]

In the above example, JSONPath $[*] identifies the results and we might be mapping :-

JSONPath             Attribute
$[*]['attrName1']    attrName1
$[*]['attrName2']    attrName2

However, JSON can be much more complicated than this, especially if it is returned from a RESTful web service. eg:

[
  {
    "id" : 1,
    "name" : { "first" : "andy", "last" : "key" }
  },
  ... more objects of interest
]

Here $[*] identifies the results, and we might map :-

JSONPath                 Attribute
$[*]['id']               id
$[*]['name']['first']    firstname
$[*]['name']['last']     lastname

Sometimes the data we want can be nested in the file :-

{
  "header" : "some header",
  "data" :
  [
    { "id" : 1, "name" : "john" },
    { "id" : 2, "name" : "bill" }
  ]
}

To handle this, we can use a JSONPath of $['data'][*] to identify the results.

Sometimes we may be lacking a primary key and might want to synthesize one :-

[
  { "name" : "john" },
  { "name" : "bill" }
]

Here we might synthesize a new attribute with a different numeric value for each object within one level of array (ie: within a single [). Index attributes can be synthesized at various levels with names of your choice. Index values start from 0.
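
For example, if we synthesize an index attribute called idx (the name is chosen here purely for illustration), the results from the file above would be :-

idx    name
0      john
1      bill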

Sometimes you may need to denormalize the hierarchical structure in a JSON file, and copy context attributes from a higher level of the structure into each instance of an object at a lower level. eg:

[
  {
    "outer-id" : 1,
    "inner-things" :
    [
      { "inner-id" : 1, "name" : "something"      },
      { "inner-id" : 2, "name" : "something else" }
    ]
  },
  {
    "outer-id" : 2,
    "inner-things" :
    [
      { "inner-id" : 1, "name" : "different again" }
    ]
  }
]

The primary key for an inner-thing is the outer-id combined with the inner-id, so we need both. To return inner-things, you can identify the results with $[*]['inner-things'][*] and then use these mappings :-

JSONPath                               Attribute
$[*]['outer-id']                       outer-id
$[*]['inner-things'][*]['inner-id']    inner-id
$[*]['inner-things'][*]['name']        name
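
With these mappings, the results returned from the file above would be :-

outer-id    inner-id    name
1           1           something
1           2           something else
2           1           different again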

It doesn't matter that outer-id is at a higher level than the level of each result. The wrinkle here is that the outer-id must appear in the JSON file before the array(s) of inner-things we want to return.

The Complex JSON example shows off many of these features.

The QPid REST example uses the QPid Queue REST and QPid Message REST types, and these make good use of the JSON file source to parse the results of calling the QPid REST API.

Attributes that are not specified are NULL.

Note that JSON has no representation of +Infinity, -Infinity or NaN.

Date, time and datetime related primitive datatypes are expected to be stored in ISO 8601 format (ie: the format understood and displayed by Tabulator) as string data.
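
For example, a datetime attribute (the attribute name here is purely illustrative) might be stored in the JSON as :-

"lastUpdated" : "2015-06-30T09:30:00+01:00"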

XML file

This source reads data from a URL and parses it as XML.

It expects a document looking like this :-

<?xml version="1.0" encoding="UTF-8"?>
<results>
  <typeName>
    <attrName1>attrValue1</attrName1>
    <attrName2>attrValue2</attrName2>
    ... more attributes
  </typeName>
  ... more results
</results>

It doesn't care about the names of the elements shown as results or typeName above.

It is given a mapping from the name of the tags shown as attrNameN to the actual attribute names in Tabulator. If a tag name isn't in this mapping, the attribute name is assumed to be exactly the same as the tag name.

Attributes that are not specified are NULL.

If there are any more deeply nested elements, these are ignored. "Ignorable whitespace" (according to the SAX API) is also ignored.

Namespaces and namespace prefixes are ignored.

This source should be able to read the output of the XML formatter.

JDBC

The JDBC source is designed to make it easy to fetch data from relational databases over JDBC.

You pass it certain fixed fragments of SQL, and depending upon what attribute groups and matches have been requested, it builds the most appropriate SQL and executes it.

If the matches can't be perfectly translated into SQL WHERE clauses, it then filters the false positives from the results.

Most of the Tabulator match expression language can be converted into SQL WHERE clauses. LIKE "likestring" expressions with [ in the "likestring" can't be converted. In Tabulator, LIKE can be applied to non-String data. The pattern is applied to the data after it has been converted to a String for display purposes. In SQL, LIKE only applies to textual data. PATTERN "regexp" expressions can't be converted.

The algorithm used to translate to WHERE clauses is quite sophisticated. Its first approach is to resolve into IN (list) or NOT IN (list) clauses, although if the list has a single value, it can use = value or <> value. It uses a ValueCollection helper class to try to do this.
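
For example, assuming a String column variety (a column name invented here for illustration), a match expression on that attribute might translate as :-

IN ('koi','goldfish')    ->    WHERE variety IN ('koi','goldfish')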

If part of a match expression is too complex, it doesn't necessarily give up on the whole expression. It can insert 1=1 (aka TRUE) in place of the problematic part, so as to maximize the use of the database for what's left. For example, assuming the id column is an Integer :-

LIKE "%00" AND LT 5000    ->    WHERE 1=1 AND (id < 5000)

The column names don't need to be the same as the attribute names, as you can supply a mapping from attribute name to column name when calling the source.

This is important, because you may wish to search a single table, or the JOIN of multiple tables, potentially with clashing column names. eg:

// this is what we're going to SELECT FROM
fromTables = "ORDERS o JOIN ITEMS i ON o.id = i.order_id"

// These are the attributes available as a result
attrNamesTables = { "OrderId",
                    "OrderDesc",
                    "ItemId",
                    "ItemOrderId",
                    "ItemDesc" }

// And this is how you convert each attribute to a SQL column reference
columns = { ( "OrderId"     -> "o.id"          ),
            ( "OrderDesc"   -> "o.description" ),
            ( "ItemId"      -> "i.id"          ),
            ( "ItemOrderId" -> "i.order_id"    ),
            ( "ItemDesc"    -> "i.description" ) }

Type mappings (refer here) :-

If the table column is of type      The attribute should be of type
CHAR, VARCHAR, LONGVARCHAR          String
DECIMAL, NUMERIC                    Double (Tabulator has no absolute precision primitive types)
BINARY, VARBINARY, LONGVARBINARY    String (Tabulator has no byte[] primitive type)
BIT                                 Boolean
TINYINT                             Short (Tabulator has no Byte primitive type)
SMALLINT                            Short
INTEGER                             Integer
BIGINT                              Long
REAL                                Float
FLOAT, DOUBLE                       Double
DATE                                DateWithoutOffset
TIME                                TimeWithoutOffset
TIMESTAMP                           DateTimeWithoutOffset

The Orders Database example uses the JDBC source to access a MySQL database.

MarkLogic XCC

MarkLogic is an "Enterprise NoSQL" database. I tend to think of it as an XML database, primarily queried using XQuery.

The XCC source class requires the MarkLogic XML Contentbase Connector for Java (aka XCC).

The XCC source class provides a mechanism for you to submit XQuery to MarkLogic. Alternatively, you can supply a module URI to invoke.

As the query only reads (and does not update) the database, a performance uplift can be realised by including an option which stops query execution from placing locks on the data it reads in anticipation of an update :-

declare option xdmp:update "false";

You construct the XQuery based upon the attribute groups and match expressions the user has requested. To avoid security issues, if you are placing literal strings into the query, use the Utils.encodeXQueryString() method.
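
For example, when placing a user-supplied value into the query text as a literal string (the varietyValue variable is purely illustrative, and this assumes the helper simply escapes and returns the given String) :-

sb.append("cts:element-value-query(xs:QName('variety'),'");
sb.append(Utils.encodeXQueryString(varietyValue));
sb.append("')");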

The XQuery or module is expected to provide a sequence of results, each of which looks like this :-

<result>
  <elem1>value1</elem1>
  <elem2>value2</elem2>
  ...
</result>

You supply a mapping from element names to attribute names. If the element name matches the attribute name, it can be omitted from the mapping.

The source then parses the results and applies matches to them. After this, if you have supplied an enricher, it calls that. If everything is still ok, the result becomes a part of the results displayed or output to a file.

If you wish to return fragments of XML, then remember that Tabulator doesn't support XML as a primitive data type, so you'll need to use xdmp:quote($myXML) to convert it to a String. I can't seem to get xdmp:quote() to reliably indent.

Often you'll structure your queries into a searching step followed by a step which assembles and returns selected data from each matching document.

You can choose from various MarkLogic specific searching strategies.

If you're picking out specific elements from the matched documents, then unfiltered approaches can be faster (assuming that is safe for you). If you're returning the lion's share of the document, then a normal filtered search appears faster.

Bear in mind that XPaths can use > and < even if range indexes aren't defined, but the same thing expressed in query form (eg: cts:element-range-query) requires a range index.
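
For example, both of these express cost < 500 (assuming, purely for illustration, fish documents with a root element fish and a child element cost), but only the second requires a range index on cost :-

(: XPath form - works without a range index :)
/fish[cost < 500]

(: query form - requires a range index on cost :)
cts:element-range-query(xs:QName('cost'), '<', xs:int(500))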

A QueryXCC helper class is provided to help you construct MarkLogic cts: queries from match expressions.

If the user gives these match expressions :-

Attribute name    Match expression
variety           IN ('koi','goldfish')
cost              GE 100 AND LE 500

Assuming all the fish documents are in a "fish" collection, assuming obvious XML element names, and assuming a range index on cost of type xs:int, you might want to construct a query like this :-

let $uris := cts:uris((),(),
  cts:and-query((
    cts:collection-query('fish'),
    cts:element-value-query(xs:QName('variety'),('koi','goldfish')),
    cts:and-query((
      cts:element-range-query(xs:QName('cost'),'>=',xs:int('100')),
      cts:element-range-query(xs:QName('cost'),'<=',xs:int('500'))
      ))
    ))
  )

In your "Type" class, in its search method, you can write code looking like this :-

sb.append("let $uris := cts:uris((),(),\n");
QueryXCC q = QueryXCC.trueIfNullQuery(
  QueryXCC.andQuery(
    QueryXCC.collectionValueQuery("fish"),
    QueryXCC.elementMatchQuery("variety", ptString, matches.get("variety"), true, null),
    QueryXCC.elementMatchQuery("cost", ptInteger, matches.get("cost"), true, "xs:int")
    )
  );
sb.append(q.text);
sb.append("\n)\n");

When documents have repeating elements, and you are providing access to the repeating portion, using the indexes as shown above isn't enough. You'll want to use a where clause to filter the results. If you don't do this, you can return a large number of results to Tabulator, which then filters many of them out. It is best to filter before returning; then we can be sure to get the expected number of results. A class called WhereXCC provides factory methods for producing where clauses, based upon XPath expressions and match expressions.
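
For example, the kind of where clause you'd want ends up in the XQuery looking something like this (hand-written here for illustration, with invented element names, rather than produced via WhereXCC) :-

for $item in $doc/order/items/item
where $item/cost >= 100 and $item/cost <= 500
return $item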

The SourceXCC.search method has a parameter to limit the number of results returned, and this is passed into the request using the Request.setCount method, but do not rely on this alone. It does restrict the number of results returned, but MarkLogic seems to do more work than if you code the XQuery with an explicit constraint on the number of results returned.

The Fishy Database example shows off this SourceXCC source.

JMX

This is designed to scan for MBeans matching a JMX object name pattern and to pull back the values of various MBean attributes.

Each bean found is a result (ie: row in the result table). Each attribute of that result comes from a specific MBean attribute.

If the attribute names don't match the JMX MBean attribute names, this is not a problem, as a mapping from attribute name to JMX MBean attribute name can be provided when using the source.

The object name patterns supported by JMX aren't as flexible as the Tabulator match expression language. If we know the match expression used to identify the MBean name evaluates to a finite set of specific values, it can be quicker to explicitly fetch MBean attributes for those beans, rather than fetch all MBeans matching a wildcard and filtering out those not of interest.

Tabulator supplies a ValueCollection class to help with this. You use it to try to convert a match expression into a set of values. If successful, you use the JMX source for each of them. If unsuccessful, you use the JMX source with a wildcard, and filter. Tabulator also supplies an IntervalCollection class to help by converting matches into ranges of legal values.
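
As a sketch of the underlying JMX calls involved (using the standard javax.management API; the object name pattern and attribute name are invented for illustration, and this is not the actual JMX source code), the two approaches look like this :-

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class JmxFetchSketch
{
  // If the match expression resolved to a finite set of values,
  // fetch the MBean attribute for each specific name directly.
  static Object fetchSpecific(MBeanServerConnection conn, String queueName) throws Exception
  {
    ObjectName name = new ObjectName("org.example:type=Queue,name=" + queueName);
    return conn.getAttribute(name, "MessageCount");
  }

  // Otherwise, fetch all MBeans matching a wildcard pattern, and filter afterwards.
  static void fetchAllMatching(MBeanServerConnection conn) throws Exception
  {
    ObjectName pattern = new ObjectName("org.example:type=Queue,*");
    Set<ObjectName> names = conn.queryNames(pattern, null);
    for ( ObjectName name : names )
    {
      Object messageCount = conn.getAttribute(name, "MessageCount");
      // apply the match expressions here, discarding MBeans not of interest
    }
  }
}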

Due to the nature of the JMX interface, Tabulator has to fetch MBeans and then filter based upon MBean attribute values. In the general case, there is no way to just fetch MBeans having attributes matching particular expressions.

The JMX source doesn't understand invoking methods on MBeans, CompositeData or TabularData. If these are needed, custom code should be written. An example of this is the JMX variant of the QPid message "Type" class.