IBM Watson Content Analytics

Version 3.5

Programming Guide



SC27-6331-00
Note
Before using this information and the product it supports, read the information in “Notices” on page 133.

This edition applies to version 3, release 5, modification 0 of IBM Watson Content Analytics (product number
5724-Z21) and to all subsequent releases and modifications until otherwise indicated in new editions.
© Copyright IBM Corporation 2009, 2014.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents
ibm.com and related resources . . . . . v
  How to send comments . . . . . vi
  Contacting IBM . . . . . vi

API overview . . . . . 3
  API documentation . . . . . 5

REST APIs . . . . . 7

Search and index APIs . . . . . 11
  SIAPI implementation restrictions . . . . . 12
  Enterprise search applications . . . . . 13
    Controlling query behavior . . . . . 17
    Creating a faceted enterprise search application . . . . . 25
  Search and index API federators . . . . . 33
  Retrieving targeted XML elements . . . . . 36
  Fetching search results . . . . . 37

Query syntax . . . . . 39
  Query syntax structure . . . . . 53

Real-time NLP API . . . . . 59

Application security . . . . . 61
  Document-level security . . . . . 61
  Identity management for single sign-on security . . . . . 62
    Creating the user's security context XML string with the identity management API . . . . . 63

Crawler plug-ins . . . . . 65
  Crawler plug-ins for non-web sources . . . . . 66
    Creating a crawler plug-in for type A data sources . . . . . 67
    Creating a crawler plug-in for type B data sources . . . . . 69
    Creating and deploying a plug-in for archive files . . . . . 71
    Extending the archive plug-in to view extracted files . . . . . 74
  Web crawler plug-ins . . . . . 75
    Creating a prefetch plug-in for the web crawler . . . . . 76
    Deploying a prefetch plug-in . . . . . 78
    Creating a postparse plug-in for the web crawler . . . . . 79

Creating and deploying a plug-in for post-filtering search results . . . . . 83

Creating and deploying a plug-in for exporting documents or deep inspection results . . . . . 85

Creating and deploying a plug-in to add custom widgets for user applications . . . . . 87

Creating and deploying a custom global analysis plug-in . . . . . 91
  Jaql scripts for custom global analysis . . . . . 92

Creating and deploying a custom analyzer for document ranking filters . . . . . 97

Sample REST API scenarios . . . . . 103
  Compiling the sample REST API applications . . . . . 106
  Running the sample REST API applications in Eclipse . . . . . 107

Sample SIAPI enterprise search and content mining applications . . . . . 109
  Compiling the sample enterprise search and content mining applications . . . . . 110
  Simple and advanced sample enterprise search applications . . . . . 111
  Browse and navigation sample application . . . . . 111
  Time scale view sample application . . . . . 111
  Retrieve all search results sample application . . . . . 112
  Fetch document content sample application . . . . . 113
  Federated search sample application . . . . . 114
  Federated faceted search sample application . . . . . 115
  Faceted search sample application . . . . . 115
  Content mining sample applications . . . . . 116

Sample real-time NLP application . . . . . 119

Sample plug-in application for non-web crawlers . . . . . 121

Sample plug-in for post-filtering search results . . . . . 123

Sample plug-ins for custom document export . . . . . 125

Sample plug-ins for custom widgets in user applications . . . . . 127

Sample plug-ins for custom global analysis . . . . . 129

Sample custom analyzer for document ranking filters . . . . . 131

Notices . . . . . 133
  Additional notices . . . . . 135
  Trademarks . . . . . 135
  Privacy policy considerations . . . . . 136

Index . . . . . 137


ibm.com and related resources
Product support and documentation are available from ibm.com®.

Support and assistance

Product support is available on the Web.


IBM® Watson Content Analytics
http://www.ibm.com/support/entry/portal/Overview/Software/
Enterprise_Content_Management/Content_Analytics
IBM Cognos® Business Intelligence
http://www.ibm.com/support/entry/portal/Overview/Software/
Cognos/Cognos_Business_Intelligence
IBM Content Classification
http://www.ibm.com/support/entry/portal/Overview/Software/
Enterprise_Content_Management/Classification_Module
IBM Content Integrator
http://www.ibm.com/support/entry/portal/Overview/Software/
Enterprise_Content_Management/Content_Integrator
IBM InfoSphere® BigInsights
http://www.ibm.com/support/entry/portal/Overview/Software/
Information_Management/InfoSphere_BigInsights

IBM Knowledge Center

You can view the product documentation with a web browser in the IBM
Knowledge Center. Content in the IBM Knowledge Center might be more current
than the PDF publications.

PDF publications

You can view the PDF files online by using the Adobe Acrobat Reader for your
operating system. If you do not have the Adobe Reader installed, you can
download it from the Adobe Web site at http://www.adobe.com.

See the following PDF publication Web sites:

IBM Watson Content Analytics Version 3.5
    http://www.ibm.com/support/docview.wss?uid=swg27037837
IBM Cognos Business Intelligence Version 10.1
    http://www.ibm.com/support/docview.wss?uid=swg27018884
IBM Content Classification Version 8.8
    http://www.ibm.com/support/docview.wss?uid=swg27020843
IBM Content Integrator Version 8.6
    https://www.ibm.com/support/docview.wss?uid=swg27016951
IBM InfoSphere BigInsights Version 1.3
    PDFs are not available. For product information, see
    http://www-01.ibm.com/support/knowledgecenter/SSPT3X_1.3.0/
    com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html?cp=SSPT3X

How to send comments


Your feedback is important in helping to provide the most accurate and highest
quality information.

Send your comments by using the online reader comment form at
https://www14.software.ibm.com/webapp/iwm/web/signup.do?lang=en_US&source=swg-rcf.

Contacting IBM
To contact IBM customer service in the United States or Canada, call
1-800-IBM-SERV (1-800-426-7378).

To learn about available service options, call one of the following numbers:
v In the United States: 1-888-426-4343
v In Canada: 1-800-465-9600

For more information about how to contact IBM, see the Contact IBM Web site at
http://www.ibm.com/contact/us/.

Developing applications
Several application programming interfaces are available for customizing the IBM
Watson Content Analytics system.

You can:
v Develop custom applications for searching collections and exploring the results
of text analysis. You can use the application programming interfaces to create
new applications and use the applications that are provided with Watson
Content Analytics as a model for your own applications.

Important: If you customize a provided application, you must rename it to
ensure that your changes are not overwritten when you install a fix pack or
upgrade to a new version of Watson Content Analytics.
v Develop applications for administering the system.
v Create Java™ plug-ins to influence how documents are crawled, accessed, and
exported. Different types of plug-ins enable you to add searchable metadata and
content to documents in a collection, specify rules for controlling access to
documents in the search results, and control how documents are exported for
use in other applications.
v Use the real-time natural language processing (NLP) API to perform ad-hoc text
analytics on documents.

API overview
Watson Content Analytics provides several sets of application programming
interfaces (APIs) so that you can create search and administration applications,
modify crawled documents, filter search results, export documents, set up an
identity management component to enforce document-level security, and perform
ad-hoc text analysis on documents.

For information about how to use the Watson Content Analytics APIs, see the
examples in the ES_INSTALL_ROOT/samples directory.

REST APIs

Use the REST APIs to create search, content mining, and administration
applications. The search REST API is available on Watson Content Analytics search
servers and is deployed on the search application port, which by default is port
8393 if you use the embedded web application server. If you use WebSphere
Application Server, the default port is 9081 or 80 if IBM HTTP Server is configured.
The administrative REST API is available on the master server if you use the
embedded web application server and uses the same port number as the
administrative console, which by default is 8390. If you use WebSphere Application
Server, the administrative REST API is available on the search application port,
which by default is 9081 or 80 if IBM HTTP Server is configured. You can change
these port numbers when you install Watson Content Analytics.

For more information about using the REST APIs, see the API documentation in
the ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios that demonstrate
how to perform administrative and search tasks are available in the
ES_INSTALL_ROOT/samples/rest directory.

IBM search and index APIs

You can use the search and index application programming interfaces to create
custom enterprise search applications. The Watson Content Analytics
implementation of the search and index API (SIAPI) allows the search server to be
accessed remotely.

Restriction: The SIAPI administration APIs are deprecated and are no longer
supported. The SIAPI search APIs are being deprecated and will not be supported
in future releases. Use the REST APIs instead of the SIAPI APIs to create custom
applications.

You can use applications that are provided with Watson Content Analytics as a
base from which to develop your custom applications.
search This application shows you how to do basic search and retrieval tasks,
such as selecting collections for search, querying those collections,
configuring the display of search results, and narrowing results through
faceted browsing.
analytics
This application shows you how to use content mining capabilities to

explore different facets of content analytics collections. For example, you
can see how frequencies of facet values change over time and analyze
deviations and trends in the data.

Important: If you customize a provided SIAPI application, you must rename it to
ensure that your changes are not overwritten when you install a fix pack or
upgrade to a new version of Watson Content Analytics.

Plug-in APIs

Plug-in APIs allow you to customize the Watson Content Analytics system in the
following ways:
v Use the crawler plug-ins to modify documents after they are crawled, but before
they are parsed and indexed for search. You can add, change, or delete
information in the document or the document metadata. You can also indicate
that the document is to be ignored (skipped) and not indexed.
v Use the post-filtering plug-in to apply your own security logic for post-filtering
search results.
v Use the export plug-in to apply your own logic for exporting crawled, analyzed,
or searched documents and the output from deep inspection requests.

Identity management component APIs

Access to sensitive information that is contained in multiple repositories is
typically controlled and enforced by the managing software. You identify yourself
to the host system with a user ID and password. After the system authenticates
your user ID and password, the managing software controls which documents you
are allowed to see based on your access rights. Unless a single sign-on policy is
implemented, you must have several different user IDs and passwords for each
repository.

Watson Content Analytics provides an identity management component that
enables users to search multiple repositories with a single query and see only the
documents that they are allowed to see. You can build this component into your
applications so that users can sign on with only one user ID and password when
searching secure collections.

See the Javadoc documentation for details about the APIs that can be used to
create your own identity management component or customize the provided
solution.

Real-time natural language processing (NLP) API

Use this API to perform ad-hoc text analytics on documents without adding the
documents to the index. Both SIAPI and REST API versions of the real-time NLP
API are provided. The NLP REST API accepts both text and binary content, but the
SIAPI version only accepts content in text format.

Restriction: The SIAPI version of the real-time NLP API is being deprecated and
will not be supported in future releases. Use the REST API version instead of the
SIAPI version to create custom applications.
Related concepts:
“API documentation” on page 5
Related reference:
“Sample REST API scenarios” on page 103

API documentation
API documentation is available for the REST APIs, search and index APIs,
plug-ins, and the identity management component.

The API documentation is installed in these default locations:

REST APIs: ES_INSTALL_ROOT/docs/api/rest
Search and index APIs: ES_INSTALL_ROOT/docs/api/siapi
Fetch API: ES_INSTALL_ROOT/docs/api/fetch
Type A data source crawler plug-ins: ES_INSTALL_ROOT/docs/api/crawler
Type B data source crawler plug-ins (for the Agent for Windows file systems,
    BoardReader, Case Manager, Exchange Server, FileNet P8, and SharePoint
    crawlers): ES_INSTALL_ROOT/docs/api/ilelcrawler
Web crawler plug-ins: ES_INSTALL_ROOT/docs/api/crawler/com/ibm/es/wc/pi
Plug-in for post-filtering search results: ES_INSTALL_ROOT/docs/api/postfilter
Plug-in for exporting documents and deep inspection results: ES_INSTALL_ROOT/docs/api/export
Identity management APIs: ES_INSTALL_ROOT/docs/api/imc

Related concepts:
“Search and index APIs” on page 11
“Crawler plug-ins” on page 65
“Extending the archive plug-in to view extracted files” on page 74
“API overview” on page 3
Related tasks:
“Creating and deploying a plug-in for post-filtering search results” on page 83
“Creating and deploying a plug-in for archive files” on page 71
“Creating and deploying a plug-in for exporting documents or deep inspection
results” on page 85
Related reference:
“Enterprise search applications” on page 13

REST APIs
The Watson Content Analytics REST application programming interfaces (APIs)
enable you to create applications to search, explore, and administer collections.

REST (REpresentational State Transfer) APIs rely on a stateless, client-server,
cacheable communications protocol. REST applications use HTTP requests to post
data (create and update), read data (such as running queries), and delete data.
REST is a lightweight alternative to mechanisms like RPC (Remote Procedure
Calls) and Web Services (such as SOAP and WSDL). Much like Web Services, a
REST service is:
v Platform-independent
v Language-independent
v Standards-based (runs on top of HTTP)
v Able to be used in the presence of firewalls

The REST APIs provide capabilities that IBM search and index APIs (SIAPI) offer,
such as:
v Managing collections
v Controlling and monitoring components
v Adding documents to a collection
v Searching a collection and federated collections
v Searching and browsing facets

The REST APIs offer the following benefits:


v Language independent and pure remote calls.
v No special client modules are required to call the API.
v Easy to understand and use. Almost all communication between client and
server is in text format, and you can use your web browser to try the API.
v Because any client that supports HTTP can use the REST API, you can build
client applications on various platforms.

REST API categories

The REST API consists of two categories of APIs:


v APIs for search and content mining tasks
v APIs for administration tasks

The search REST API is available on search servers and listens on the search
application port, which by default is port 8393 if you use the embedded web
application server. If you use WebSphere Application Server, the default port is
9081 or 80 if IBM HTTP Server is configured. The administrative REST API is
available on the master server if you use the embedded web application server and
uses the same port number as the administrative console, which by default is 8390.
If you use WebSphere Application Server, the administrative REST API is available
on the search application port, which by default is 9081 or 80 if IBM HTTP Server
is configured. You can change these port numbers when you install Watson
Content Analytics.

Authentication

When application server login is properly configured, BASIC authentication is
required for the administration and search REST APIs. In addition, the
administration REST API needs to be authenticated with the api_username and
api_password keywords.

HTTP methods

You can use both HTTP GET and HTTP POST methods to call most REST APIs.
For the HTTP GET method, you can directly enter a REST API URL into a web
browser. The POST method is recommended for security reasons.

REST API URLs

The base URI for the REST APIs is:

http://host:port/api/v10/

For the administration REST APIs, create URLs in the following format:

http://Index_server_hostname:Administration_console_port/api/v10/admin/
API_name?method=method_name&parameters

For example, use the following administration REST API URL to return
information about the status of the index:

http://Index_server_hostname:8390/api/v10/admin/indexer?method=monitor
&api_username=user_name&api_password=password&collectionId=collection_ID

For the search REST APIs, create URLs in the following format:

http://Search_server_hostname:Search_server_port/api/v10/API_name?parameters

For example, use the following search REST API URL to return a list of all
available namespaces of facets for the specified collection:

http://Search_server_hostname:8393/api/v10/facets/namespaces?collection=sample

The /about/providers API returns all available search REST APIs. The
/about/providerdetail API returns detailed information about all the available
search REST APIs. These APIs are especially useful if you develop an application
that uses the REST API and you cannot access a computer on which Watson
Content Analytics is installed to view the REST API reference documentation.

For example, place the following URL in a web browser:

http://Search_server_hostname:8393/api/v10/about/providerdetail?path=/
collections&output=application/xml
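
The following minimal Java sketch shows one way to call a search REST API from
a client program and print the raw response. It is an illustration only: the provider
path (/api/v10/search), the host name, and the parameter values are assumptions
based on the URL patterns above, so confirm the actual provider paths for your
installation with the /about/providers API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class RestSearchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder provider path and parameters; confirm with /about/providers
        String query = URLEncoder.encode("\"big apple\"", "UTF-8");
        URL url = new URL("http://Search_server_hostname:8393/api/v10/search"
                + "?collection=sample&query=" + query
                + "&output=application/xml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // parse the returned XML in a real application
            }
        }
    }
}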

Tips:
v To create proper URLs, ensure that they are URL encoded. For example,
output=application/atom+xml should be encoded as output=application/atom
%2Bxml.

8 IBM Watson Content Analytics: Programming Guide


v Try REST API calls in a browser before using them in a program. Trying a REST
API call in a browser gives you an opportunity to see the output from the REST
API call before you attempt to parse it in your application.
v You might need a series of API calls to arrive at the necessary result.
v Ensure that search queries conform to the Watson Content Analytics query
syntax. For example, if you want to search for a phrase, the query parameter
phrase value must be in double quotation marks.

For more information about using the REST APIs, see the API documentation in
the ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios that demonstrate
how to perform administrative and search tasks are available in the
ES_INSTALL_ROOT/samples/rest directory.

Restriction: The following functions are not available in the REST API:
v com.ibm.es.siapi.admin.AdminServiceImpl.associateApplicationWithCollection
v com.ibm.es.siapi.admin.AdminServiceImpl.registerApplication
v com.ibm.es.siapi.admin.AdminServiceImpl.unregisterApplication
v com.ibm.es.siapi.admin.AdminServiceImpl.disassociateApplicationFromCollection
v com.ibm.es.siapi.admin.AdminServiceImpl.performAdminCommand
(changeRankingModel and revisitURLs options)
Related reference:
“Sample REST API scenarios” on page 103

Search and index APIs
The IBM search and index API (SIAPI) is a programming interface that enables you
to search and explore collections.

Restriction: The SIAPI search APIs are being deprecated and will not be supported
in future releases. Use the REST APIs instead of the SIAPI APIs to create custom
applications. For more information about using the REST APIs, see the API
documentation in the ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios
that demonstrate how to perform administrative and search tasks are available in
the ES_INSTALL_ROOT/samples/rest directory.

SIAPI is a factory based interface that allows for different implementations of the
search engine. By using SIAPI, your custom application can use different search
engines that are provided by IBM without changing your SIAPI application. For
example, if you create a SIAPI application in WebSphere® Portal that uses the
portal search engine, you can use the Watson Content Analytics search engine
without the need to change your enterprise search application.

SIAPI supports the following types of search and content mining tasks:
v Searching collections
v Customizing the information that is returned in the search results
v Searching and browsing facets
v Querying several enterprise search collections as if they were one collection
(search federation)
v Viewing results with URIs that you can click and viewing scoring information
(ranking)
v Searching and retrieving documents from a broad range of data sources, such as
IBM Content Integrator repositories and Lotus Notes® databases
v Performing real-time text analytics on documents without adding the analyzed
documents to the index

The following figure shows the relationships among the SIAPI search APIs.

[Figure 1. Search APIs. The figure shows the relationships among the SIAPI search
interfaces: the SearchFactory Java interface (createApplicationInfo, createQuery,
getSearchService, createLocalFederator) obtains a SearchService (getAvailableSearchables,
getSearchable, getAvailableFederators, getFederator, getCollectionInfo), which in turn
obtains Searchable objects (search, count, setSpellCorrectionEnabled,
isSpellCorrectionEnabled, getSpellCorrections, setSynonymExpansionEnabled,
isSynonymExpansionEnabled, getSynonymExpansions, getDefaultLanguage,
getAvailableAttributeValues, getAvailableFields, setProperty, getProperty,
getProperties).]

Related concepts:
“API documentation” on page 5

SIAPI implementation restrictions


Not all IBM search and index application programming interface (SIAPI) classes
and methods are supported by Watson Content Analytics.

Deprecated packages

The com.ibm.siapi.index and com.ibm.siapi.admin packages, also known as the
SIAPI Administration APIs, are deprecated and are no longer supported.

The com.ibm.siapi.search and com.ibm.siapi.browse packages, also known as the
SIAPI Search APIs, are still supported, but they are being deprecated in this release
and will not be supported in future releases.

Use the REST APIs instead of the SIAPI APIs to create custom applications. For
more information about using the REST APIs, see the API documentation in the

ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios that demonstrate how
to perform administrative and search tasks are available in the
ES_INSTALL_ROOT/samples/rest directory.

Unsupported methods

The following methods of the SIAPI classes are not implemented:


Class:
com.ibm.siapi.search.ResultsIterator
Methods:
next(int)

Class:
com.ibm.siapi.search.RemoteFederator
Methods:
searchStreaming(Query)
searchStreaming(Query, String[])
searchStreaming(Query, String[], String[])

Class:
com.ibm.siapi.search.StreamingResultSet
Methods:
getEstimatedNumberOfResults()
getPredefinedResults()
getProperties()
getProperty(String)
getSearchState()
getSpellCorrections()
getSynonymExpansions()
hasUnconstrainedResults()
isEvaluationTruncated()
addMessage(SiapiMessage)
addMessages(List)
clearMessages()
getMessages()

Class:
com.ibm.siapi.browse.BrowseFactory
Methods:
createApplicationInfo(String, String)
createApplicationInfo(String, String, String)

Enterprise search applications


Enterprise search applications can access collections, issue queries, and process
query results.

See the Javadoc documentation for examples of the search and index APIs.

To create an enterprise search application with the search and index APIs:
1. Instantiate an implementation of a SearchFactory object.
2. Use the SearchFactory object to obtain a SearchService object.
The SearchService object is configured with the connection information that is
necessary to communicate with the search engine. With the SearchService
object, you can access searchable collections. Configure the SearchService
object with the Watson Content Analytics administrator user name and
password, host name, and port. Configuration parameters are set in a
java.util.Properties object. The parameters are then passed to the
getSearchService factory method that generates the SearchService object.

Tip: Search server authentication is enabled by default. To disable
authentication:
a. Back up and open the $ES_NODE_ROOT/master_config/searchserver/dock/
dock.xml file in a text editor that supports UTF-8.
b. Change the value of the <portSecurity> element to false and save the file.
c. Restart Watson Content Analytics by entering the following commands:
esadmin system stop
esadmin system start
Watson Content Analytics applications support Secure Sockets Layer (SSL)
version 3. However, applications that use SSL must include a reference to an
existing keystore. WebSphere Application Server provides a utility called
iKeyman.exe in the Java Runtime Environment bin directory that can be used
for working with keystores.
With SSL, you can establish a security-enabled website on the Internet or on
your private intranet. A browser that does not support HTTP over SSL cannot
request URLs that use HTTPS.
When you request a search and index API service, such as SearchService
searchService = factory.getSearchService(Properties);, you can use any of
the following properties for a service object. The property names are case
sensitive.
Table 1. Property values for service API objects
protocol
    HTTP or HTTPS for SSL. The default is HTTP. If the protocol is HTTPS, the
    host name must be fully qualified and the port must be the SSL port. The
    default port is 443.
trustStore
    The fully qualified path to the keystore. If the operating system is Windows,
    the backslashes must be escaped with a double backslash, for example,
    c:\\temp\\WASWebContainer.jks.
    Restriction: If the protocol is HTTPS, the trustStore value must not be empty.
    An exception is thrown if the trustStore property is null.
trustPassword
    The password to access the keystore.
    Restriction: If the protocol is HTTPS, the trustPassword value must not be
    empty. An exception is thrown if the trustPassword property is null.
proxyHost
    The host name of the proxy server.
proxyPort
    The port number for the proxy server.
proxyUser
    If the proxy server requires HTTP basic authentication, the user name for that
    login request.
proxyPassword
    The password for the user that is specified by the proxyUser parameter.
timeout
    How much time can elapse before the request to the search server
    (ESSearchServer) times out.
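
For example, the following sketch configures a connection that uses HTTPS,
assuming the factory object from the earlier example. All values are placeholders
for your environment; the property names are those listed in Table 1:

Properties configuration = new Properties();
configuration.setProperty("hostname", "es.mycompany.com"); // fully qualified for HTTPS
configuration.setProperty("port", "443"); // the SSL port
configuration.setProperty("protocol", "HTTPS");
// The Java literal below produces the escaped value c:\\temp\\WASWebContainer.jks
configuration.setProperty("trustStore", "c:\\\\temp\\\\WASWebContainer.jks");
configuration.setProperty("trustPassword", "keystorePassword");
SearchService searchService = factory.getSearchService(configuration);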

3. Obtain a Searchable object.


After you obtain a SearchService object, you can use it to obtain one or more
Searchable objects. Each search and index API searchable object is associated
with one collection. You can also use the SearchService object to obtain a
federator object. A federator object is a special kind of Searchable object that
enables you to submit a single query across multiple Searchable objects
(collections) at the same time.

When you request a Searchable object, you need to identify your application
by using an application ID. Contact your Watson Content Analytics
administrator for the appropriate application ID.
4. Issue queries.
The enterprise search application passes search queries to the search runtime on
the search server.
After the Searchable object is obtained, you issue a query to that Searchable
object. To issue a query to the Searchable object:
a. Create a Query object.
b. Customize the Query object.
c. Submit the Query object to the Searchable object.
d. Get the query results, which are specified in a ResultSet object.
5. Process query results.
Process queries with the ResultSet interface object and the Result interface
object. The search and index APIs have a variety of methods for interacting
with the ResultSet interface and individual Result interface objects.

The search and index APIs are a factory-based Java API. All of the objects that are
used in the enterprise search application are created by calling search and index
API object-factory methods or are returned by calling methods of factory-generated
objects. You can easily switch between search and index API implementations by
loading different factories.

The search and index API implementation in Watson Content Analytics is provided
by the com.ibm.es.api.search.RemoteSearchFactory class.

Use the following search and index API packages to create an enterprise search
application:
com.ibm.siapi
Root package
com.ibm.es.api.browse
Contains taxonomy browsing interfaces
com.ibm.siapi.common
Common SIAPI interfaces
com.ibm.siapi.search
Interfaces for searching collections
com.ibm.siapi.search.facets
Interfaces for faceted search

Obtaining a SearchFactory object

To create a search and index API enterprise search application, obtain the
implementation of the SearchFactory object as in the following example:
Class cls = Class.forName("com.ibm.es.api.search.RemoteSearchFactory");
SearchFactory factory = (SearchFactory) cls.newInstance();

Obtaining a SearchService object

Use the SearchFactory object to obtain a SearchService object. With the
SearchService object, you can access searchable collections.

Configure the SearchService object with the host name, port, and, if WebSphere
global security is enabled, a valid WebSphere user name and password for the
search server.

Configuration parameters are set in a java.util.Properties object. The parameters
are then passed to the getSearchService factory method that generates the
SearchService object. The following example shows how to obtain a SearchService
object:
Properties configuration = new Properties();
configuration.setProperty("hostname", "es.mycompany.com");
configuration.setProperty("port", "80");
configuration.setProperty("username", "websphereUser");
configuration.setProperty("password", "webspherePassword");
SearchService searchService =
factory.getSearchService(configuration);

Obtaining a Searchable object

Use the SearchService object to obtain a Searchable object. A Searchable object is
associated with a searchable collection. With a Searchable object, you can issue
queries and get information about the associated collection. Each collection has an
ID.

When you request a Searchable object, you need to identify your application by
using an application ID. Contact your Watson Content Analytics administrator for
the appropriate application ID.

The following example shows how to obtain a Searchable object:


ApplicationInfo appInfo = factory.createApplicationInfo
("my_application_id","my_password");
Searchable searchable =
searchService.getSearchable(appInfo, "some_collection_id");

Call the getAvailableSearchables method to obtain all of the Searchable objects
that are available for your application.
Searchable[] searchables =
searchService.getAvailableSearchables(appInfo);

Issuing queries

After the Searchable object is obtained, you issue a query to that Searchable
object. To issue a query to the Searchable object:
1. Create a Query object.
2. Customize the Query object.
3. Submit the Query object to the Searchable object.
4. Get the query results, which are specified in a ResultSet object.

The following example shows how to issue a query:


String queryString = "big apple";
Query query = factory.createQuery(queryString);
query.setRequestedResultRange(0, 10);
ResultSet resultSet = searchable.search(query);

Processing query results

With the ResultSet interface and Result interface, you can process query results,
as in the following example:
Result[] results = resultSet.getResults();
for ( int i = 0 ; i < results.length ; i++ ) {
System.out.println
( "Result " + i + ": " + results[i].getDocumentID()
+ " - " + results[i].getTitle() );
}
Related concepts:
“API documentation” on page 5
Related reference:
“Federated search sample application” on page 114

Controlling query behavior


With the methods and properties that belong to the Query interface, you can
control many aspects of query behavior, including how the query is processed,
how results are returned, and what metadata is returned with each result.

See the Javadoc documentation for more details about each method and property.
Related concepts:
“Query syntax” on page 39
Related reference:
“Query syntax structure” on page 53

Creating secure searches with access control list constraints


You can set the access control list constraints for a query for secure searches by
using the setACLConstraints(java.lang.String aclConstraints) method.

About this task

The setACLConstraints(java.lang.String aclConstraints) method supports the
following XML query string:
@SecurityContext::'<XML query string>'
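
For example, assuming that securityContextXml is a String that holds the user's
security context XML string (see “Creating the user's security context XML string
with the identity management API” on page 63) and query is the Query object
from the earlier examples:

query.setACLConstraints("@SecurityContext::'" + securityContextXml + "'");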

Setting query properties


You can control query processing by using the setProperty method.

About this task

The setProperty method for query object has the following format:
query.setProperty("String name", "String value");

You can set the following query properties:


v HighlightingMode: Enables query terms to be highlighted in several areas of the
search result details. Values are:
– DefaultHighlighting: This is the default value and is equivalent to
ExtendedHighlighting.
– ExtendedHighlighting: Extends the highlighting of query terms to other areas
of the search result, for example, title, URL, and other fields.
– SummaryHighlighting: Highlights query terms in the document summary
only.

v FuzzyNGramSearch: Fuzzy search enables a non-strict search in n-gram
collections to be performed. This property is Boolean and its values are:
– false: A strict search will be performed. This is the default if your enterprise
search application does not set the FuzzyNGramSearch property.
– true: Fuzzy search will be performed.
v AllowStopwordRemoval: Determines whether stop words are removed during
query parsing. If this property is not set, the engine removes or does not remove
stop words according to the policy of the search engine. This property is Boolean
and its values are:
– false: Stop words are not removed during query parsing.
– true: Stop words are removed during query parsing.
v NearDuplicateDetection: Specifies whether documents that are nearly identical
are to be suppressed when search results are displayed. The default value is No.
– Yes: Enables documents with similar titles and summaries to be suppressed
when a user views search results. For nearly duplicate document analysis to
be performed, an administrator must ensure that the config.properties file
for the enterprise search application specifies the property
preferences.nearDuplicateDetection=Yes.
– No: Search results are not filtered to suppress documents that have similar
titles and summaries to documents that are already displayed in the search
results.
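
For example, the following sketch applies two of these properties to the Query
object named query from the earlier examples:

query.setProperty("HighlightingMode", "SummaryHighlighting");
query.setProperty("NearDuplicateDetection", "Yes");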

Enabling fuzzy searches in n-gram collections


A fuzzy search query matches character sequences that are similar to the query
term, not only identical to it. All possible n-grams are treated as search terms, and
the query returns documents that include the specified n-grams. However, this
does not always mean that the documents have character sequences that are
similar to the query term.

In the following examples, the capital letters represent characters of the query
term and the at sign (@) represents arbitrary other characters. For example, if the
query term is ABCDE, a typical n-gram fuzzy search returns a document that
includes a character sequence such as
@@AB@@@BC@@@@@@CD@@@@@DE because this document has all the n-grams
that are generated from the specified query. However, for some languages, this
query result is not preferable because the text often has a completely different
meaning if those n-grams are far apart.

To improve fuzzy search results, you can control the level of ambiguity in the
query by specifying the FuzzyNGramAmbiguity property and optionally the
FuzzyNGramAmbiguityCondition property.

The FuzzyNGramAmbiguity property returns documents with the most (not
necessarily all) n-grams that are more closely related, such as @@ABC@DE@@@, by
using an ambiguity calculation that is based on each query term.

FuzzyNGramSearch property

Fuzzy search performs a non-strict search in n-gram collections. This property is
Boolean:
v false: A strict search is performed. This is the default if your enterprise search
application does not set the FuzzyNGramSearch property.
v true: Fuzzy search is performed.

FuzzyNGramAmbiguity property

This property is activated only if the FuzzyNgramSearch property is set to true.


These properties are configured by the Query.setProperty method.

The ambiguity must be greater than 0.0 and less than or equal to 1.0. If the
ambiguity is set to 1.0, it is equivalent to an exact match. The lower the ambiguity
is set, the more loosely the engine determines whether each document has
character sequences that are similar to the search term. Thus, the search query
retrieves more documents.

Ambiguity is similar to the ratio of characters appearing in the same position and
the same order to the search query.
v Format: ambiguity
v Ambiguity: float value, 0.0< ambiguity <= 1.0, to specify ambiguity

This property is used to set the ambiguity that is applied to all search terms of the
query except for the terms that are specified by the FuzzyNGramAmbiguityCondition
property. The higher the ambiguity is, the more similar the returned document will
be. In other words, the document includes character sequences closer to the
original search term if the higher ambiguity is specified.

FuzzyNGramAmbiguityCondition property

Optional: Activated only if the FuzzyNgramSearch property is set to true. These
properties are configured by the Query.setProperty method.
v Format: term1=ambiguity1[,term2=ambiguity2]...[,termn=ambiguityn]
v Term: character string, to specify the term which overrides the ambiguity. (Note
that Term does not include operands such as +, -, ~, but it does include field
names such as tablename)
v Ambiguity: float value, 0.0< ambiguity <= 1.0, to specify ambiguity of the Term

This optional property specifies the ambiguity for individual terms. When a term
includes a comma (,), it must be written as two commas (,,) to escape the
delimiter. This rule does not apply to the equal sign (=) because the last equal sign
before each comma is treated as the end of the term.

In the following example, Watson Content Analytics searches for all terms except
tablename:DATA_TBL with ambiguity 0.8, and searches for tablename:DATA_TBL
with ambiguity 1.0 (exact match):
q.setProperty("FuzzyNGramAmbiguity", "0.8");
q.setProperty("FuzzyNGramAmbiguityCondition",
"tablename:DATA_TBL=1.0");

Specifying query languages


You can use the setQueryLanguage(java.lang.String lang) method to specify a
language other than the collection default language on the search server.

About this task

Use Unicode identifiers for languages to set a specific language. For example, for
English, the query language parameter is en. For Chinese, use zh-CN for simplified
Chinese and zh-TW for traditional Chinese.
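
For example, to issue the query in simplified Chinese:

query.setQueryLanguage("zh-CN");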

Setting linguistic modes
Use the setLinguisticMode(int mode) method to specify how you want the search
engine to match query terms.

About this task

The setLinguisticMode(int mode) method sets the linguistic mode for a query. You
can set one of the following modes:
LINGUISTIC_MODE_ENGINE_DEFINED
Unmodified terms are matched according to the engine's best-effort policy.
This is the default mode. Base and exact form matching is performed by
default.
LINGUISTIC_MODE_EXACT_MATCH
Unmodified terms are matched as entered without undergoing linguistic
processing. This method allows the search engine to find exact results.
LINGUISTIC_MODE_BASEFORM_MATCH
Unmodified terms are matched by their base form after undergoing
linguistic processing. For example, the query term jumping matches
documents that contain jump, jumped, jumps, and so on.
LINGUISTIC_MODE_EXACT_AND_BASEFORM
Unmodified terms are matched by their base form and their exact form
after undergoing linguistic processing. For example, the query term
jumping matches documents that contain jump, jumped, jumps, and so on.
The difference from the LINGUISTIC_MODE_BASEFORM_MATCH mode
is that although linguistic base form matching relies on the query language
that matches the identified languages of the result documents, the
LINGUISTIC_MODE_EXACT_AND_BASEFORM mode assures that
documents that contain the exact form jumping are returned regardless of
their identified language.

If your enterprise search application supports the ability to search within results,
the linguistic mode that you specify for the application influences the number of
results returned. If the application is configured to use
LINGUISTIC_MODE_ENGINE_DEFINED, then a search within results might
return more documents than the original search. For example, if a user searches for
the term Lien, and then searches within results for the term Custody, the query is
expanded to be the query Lien ^Custody, which can show documents that contain
Lien or Custody.

If this is not the behavior that you want to see in your enterprise search
application, use one of the other linguistic modes. If you do not want users to see
the No preference option when they configure preferences for the enterprise search
application, you can edit the WebContent/options.jsp file to comment out the
HTML text, including the item prompt.selection.mode.engine.
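
For example, the following call requests base form matching. This sketch assumes
that the mode constant is visible on the query interface; see the Javadoc for its
exact location:

query.setLinguisticMode(Query.LINGUISTIC_MODE_BASEFORM_MATCH);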

Returning metadata fields


You can use the setReturnedFields(String[] fieldNames) method to control
which metadata fields are returned in the Result object.

About this task

By default, no metadata fields are returned, so you must use this method to return
metadata fields.
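
For example, to return two metadata fields with each result (the field names here
are placeholders; use fields that are defined for your collection):

query.setReturnedFields(new String[] { "title", "date" });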

Enabling predefined result attribute values
You can use the setReturnedAttribute(int attributeType, boolean isReturned)
method to enable or disable any of the predefined result attribute values that are
returned with each Result object.

About this task

By default, all of the predefined result attribute values are returned except for the
RETURN_RESULT_FIELDS metadata fields attribute.

The following values are valid for the attributeType object:


v RETURN_RESULT_TITLE: The Result.getTitle object returns null if the
isReturned object is set to false.
v RETURN_RESULT_DESCRIPTION: The Result.getDescription object returns
null if the isReturned object is set to false.
v RETURN_RESULT_FIELDS: The Result.getFields object and the
Result.getFields(String) object return null if the isReturned object is set to
false.
v RETURN_RESULT_CATEGORIES: The Result.getCategories object returns null
if the isReturned object is set to false.
v RETURN_RESULT_TYPE: The Result.getDocumentType object returns null if the
isReturned object is set to false.
v RETURN_RESULT_SOURCE: The Result.getDocumentSource object returns null
if the isReturned object is set to false.
v RETURN_RESULT_LANGUAGE: The Result.getLanguage object returns null if
the isReturned object is set to false.
v RETURN_RESULT_DATE: The Result.getDate object returns null if the
isReturned object is set to false.
v RETURN_RESULT_SCORE: The Result.getScore object returns 0.0 if the
isReturned object is set to false.
v RETURN_RESULT_URI: The Result.getDocumentURI object returns null if the
isReturned object is set to false.
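
For example, the following sketch enables the return of metadata fields and
suppresses document summaries. It assumes that the attribute constants are
visible on the query interface; see the Javadoc for their exact location:

query.setReturnedAttribute(Query.RETURN_RESULT_FIELDS, true);
query.setReturnedAttribute(Query.RETURN_RESULT_DESCRIPTION, false);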

Specifying the range of results


You can use the setRequestedResultRange(int fromResult, int numberOfResult)
method to specify the range of returned results.

About this task

The fromResult value controls which ranked document your result set starts from.
For example, a value of 0 means that you are requesting the first document in the
query results.

The numberOfResults value controls how many results to return in the current page
of results. The numberOfResults value must be smaller than the maximum number
of results that is configured in the administration console minus the fromResult
value.
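
For example, to request the second page of results when 10 results are displayed
per page:

query.setRequestedResultRange(10, 10); // results 11 through 20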

Setting category details


You can specify the required category detail level for query results by using the
setResultCategoriesDetailLevel(int detailLevel) method.

About this task

The setResultCategoriesDetailLevel(int detailLevel) method is used if the
categories attribute RETURN_RESULT_CATEGORIES is enabled. The default value is
RESULT_CATEGORIES_ALL.
v RESULT_CATEGORIES_ALL: Each result category is returned with its complete path
(starting at the root path) information.
v RESULT_CATEGORIES_NO_PATH_TO_ROOT: Each result category is returned without
the full path information; that is, ResultCategory.getPathFromRoot() will return
null. Use the setReturnedAttribute(RETURN_RESULT_CATEGORIES, false) attribute
to stop the retrieval of result categories completely.
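
For example, to return result categories without their full path information
(assuming the constant is visible on the query interface):

query.setResultCategoriesDetailLevel(Query.RESULT_CATEGORIES_NO_PATH_TO_ROOT);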

Enabling site collapse


You can use the setSiteCollapsingEnabled(boolean value) method to control
whether the top results can contain more than two results from the same website
or data source.

About this task

For example, if a particular query returned 100 results from http://www.ibm.com
and site collapsing was enabled, the ResultSet object contains only two of those
results in the top results. The other results from that site appear only after results
from other sites are listed.

To retrieve more results from that same site, use the samegroupas:result URL query
syntax or re-issue the same query with the site http://www.ibm.com added to the
query string. See “Query syntax” on page 39 for more information.

Setting predefined links


You can set predefined links by using the setPredefinedResultsEnabled (boolean
value) method.

You can specify whether query results contain predefined links in addition to the
regular results. Predefined links are enabled by default.

Sorting by relevance, date, numeric fields, or text fields


You can use the setSortKeys method to specify multiple sort keys to help you sort
results.

About this task

Any field that is defined for the collection and declared as "sortable" (for text
fields) or "parametric" (for numeric fields) can be specified as one of the sort keys
that are represented in SortKey objects in the call to the setSortKeys method.
Textual keys are sorted lexicographically according to the specified sort locale and
numeric keys are sorted arithmetically. Create a SortKey object by calling the
SearchFactory#createSortKey(String) method and modifying the sort order and
locale for the object. Then construct an array of SortKey objects and associate the
array to the query by calling the setSortKeys method.

Important:
v Because specifying multiple sort keys requires increased system resources such
as memory, specifying multiple sort keys might affect performance. Optimize

your application to not use more sort keys than are required. For example, the
sample enterprise search application allows a maximum of three sort keys per
query.
v In previous releases, only a single sort key was supported for a query by using
the Query#setSortKey() method. This method has been deprecated. Use the
Query#setSortKeys() method instead.
v The setSortPoolSize method no longer affects the search results. All results are
now sorted.

The collating sequence (that is, the order of characters in the alphabet to use for
sorting) is by default the sequence that is used by the collection. You can specify a
different sequence by providing a locale name as a second argument to the
setSortKeys method. For example, if you create a sortKey object
SearchFactory#createSortKey("title"), call the method setLocale("de_AT") for the
object, and then call the method Query#setSortKeys(new SortKey[]{sortKey}),
results are sorted by the value of their title field by using the alphabetic order that
is common in German as used in Austria. Use the standard five character locale
format xx_XX. For example, the locale for American English is en_US. The locale
for Japanese is ja_JP.
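
For example, the following sketch builds the title sort key that is described above
and associates it with a query:

SortKey sortKey = factory.createSortKey("title");
sortKey.setLocale("de_AT"); // German as used in Austria
sortKey.setSortOrder(com.ibm.siapi.search.BaseQuery.SORT_ORDER_ASCENDING);
query.setSortKeys(new SortKey[] { sortKey });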

You can specify multiple SortKey objects. The order of SortKey objects in the array
specifies priority of sort keys. Search results are initially sorted by the first SortKey
object. After the search results are grouped by the first sort key, the results are
sorted by the next sort key.

Sorting order (ascending or descending) is specified by a call to the method
SortKey#setSortOrder. Two constants, SORT_ORDER_DESCENDING and
SORT_ORDER_ASCENDING, are defined in com.ibm.siapi.search.BaseQuery and
can be used as arguments to the method. For example, the following method call
causes sorting to be done in ascending order:
setSortOrder(com.ibm.siapi.search.BaseQuery.SORT_ORDER_ASCENDING)

The default sort order is descending: The first results to be output are those at the
top of the order. For example, the most relevant results are displayed at the top if
they are sorted by relevance, the most recent results are displayed at the top if they
are sorted by date, and so on.

Results whose sort key value is missing, undefined, or unavailable are sorted to
the end of the results list regardless of their sort order.

Several reserved field names can be used as an argument to the setSortKeys
method to indicate sort by relevance, date, or no sort. These predefined values are
defined in com.ibm.siapi.search.BaseQuery:
SORT_KEY_NONE
Specifies that results are not to be sorted.
SORT_KEY_DATE
Sorts results by date.
SORT_KEY_RELEVANCE
Sorts results by relevance. This is the default value.

You cannot specify both SORT_KEY_RELEVANCE and SORT_KEY_NONE as
arguments. If both sort keys are specified with other SortKey objects, or an invalid
field name is specified, the default SORT_KEY_RELEVANCE sort order is used.

The following restrictions apply to sorting:
v Unlike other text, sortable fields are used as is with no tokenization,
normalization, or stop word removal.
v Sorting is incompatible with streaming: when the engine is used in streaming
mode, results are output in the order in which they are encountered in the
collection without any sorting.

To indicate that the results from a query should be sorted by a field, use the
following methods:
BaseQuery.setSortKeys(<field_name>) or BaseQuery.setSortKeys(<field_name>, <locale>)
BaseQuery.setSortOrder({SORT_ORDER_ASCENDING | SORT_ORDER_DESCENDING})
BaseQuery.setSortPoolSize({<int> | SORT_ALL_RESULTS})

Setting the sort order for results


You can use the setSortKeys(SortKey[] sortKeys) method to specify how search
results are sorted.

About this task

Specify the sort order as SORT_ORDER_ASCENDING or
SORT_ORDER_DESCENDING. If you specify a sortable text field as a sort key, you
can specify the locale for each sort key. Otherwise, the default locale of the
collection is used.

The sort order is ignored if the sort key is SORT_KEY_RELEVANCE or
SORT_KEY_NONE. The sort locale is ignored if the sort key is not a sortable text
field.

Enabling spelling correction


You can specify whether suggested spelling corrections are to be provided with the
search results.

About this task

Use the setSpellCorrectionEnabled(boolean enable) method to specify that spelling
corrections for terms in the query are to be included in the search results. Spelling
correction is disabled by default.
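
For example, to request spelling corrections with the results of a query:

query.setSpellCorrectionEnabled(true);

The suggested corrections are returned with the search results; see the Javadoc for
the accessor methods.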

Setting synonym expansion


You can set the synonym expansion mode for a query by using the
setSynonymExpansionMode (int mode) method.

About this task

You can use one of the following modes:


v SYNONYM_EXPANSION_OFF: Pass this constant to the
setSynonymExpansionMode method to prevent synonyms from being expanded
even if the query contains the synonym operator.
v SYNONYM_EXPANSION_MANUAL: Pass this constant to the
setSynonymExpansionMode method to expand synonyms only for the query
terms that are affected by the synonym operator.
v SYNONYM_EXPANSION_AUTOMATIC: Pass this constant to the
setSynonymExpansionMode method to do a best effort to expand all applicable
query terms.
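
For example, to expand synonyms only for the query terms that are affected by
the synonym operator (assuming the constant is visible on the query interface):

query.setSynonymExpansionMode(Query.SYNONYM_EXPANSION_MANUAL);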

Determining query evaluation times and query timeouts
You can use ResultSet methods to see how much time it takes to evaluate queries
and whether a query timeout occurred.

About this task

The ResultSet.getQueryEvaluationTime method returns the amount of time, in
milliseconds, that it takes to evaluate queries.

The ResultSet.isEvaluationTruncated method can show whether a query timed
out before it was completely processed or whether any filtering, such as site
collapsing or near duplicate detection, eliminated results.
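
For example, the following sketch reports both values after a search:

ResultSet resultSet = searchable.search(query);
System.out.println("Evaluation time: " + resultSet.getQueryEvaluationTime() + " ms");
if (resultSet.isEvaluationTruncated()) {
    System.out.println("The query timed out or filtering eliminated results.");
}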

Creating a faceted enterprise search application


Faceted enterprise search applications can access enterprise search and content
analytics collections, issue queries, and process facets associated with the query
results.

About this task

See the Javadoc documentation for examples of the search and index APIs.

Procedure

To create a faceted enterprise search application with the search and index APIs:
1. Instantiate an implementation of a FacetsFactory object. The FacetsFactory
can then be used to obtain a FacetsService object.
2. Use the FacetsFactory object to obtain a FacetsService object. The
FacetsService object is configured with the connection information that is
necessary to communicate with the search engine. With the FacetsService
object, you can access faceted searchable collections. Configure the
FacetsService object with the host name, port, and, if WebSphere Application
Server global security is enabled, a valid WebSphere user name and password
for the search server. Configuration parameters are set in a
java.util.Properties object. The parameters are then passed to the
getFacetsService factory method that generates the FacetsService object.
3. Obtain a FacetedSearchable object. After you obtain a FacetsService object,
you can use it to obtain one or more FacetedSearchable objects. Each search
and index API searchable object is associated with one enterprise search or
content analytics collection. You can also use the FacetsService object to obtain
a federator object. A federator object is a special kind of FacetedSearchable
object that enables you to submit a single FacetedQuery object across multiple
FacetedSearchable objects (collections) at the same time.
When you request a FacetedSearchable object, you need to identify your
application by using an application ID. Contact your administrator for the
appropriate application ID.
4. Issue queries. The faceted enterprise search application passes search queries to
the search runtime on the search server. After the FacetedSearchable object is
obtained, you issue a query to that FacetedSearchable object. To issue a query
to the FacetedSearchable object:
a. Create a FacetedQuery object.
b. Customize the FacetedQuery object.
c. Submit the FacetedQuery object to the FacetedSearchable object.

d. Get the query results, which are specified in a FacetedResultSet object.
5. Process query results. Process queries with the FacetedResultSet interface
object, the Result interface object, the FacetSet interface object, the Facet
interface object, and the Cube interface object.
Related concepts:
“Search and index API federators” on page 33
Related reference:
“Faceted search sample application” on page 115
“Content mining sample applications” on page 116
“Federated faceted search sample application” on page 115

Faceted search API


The faceted search API is a factory-based Java API. All of the objects that are
used in the faceted enterprise search application are created by calling faceted
search API object-factory methods or are returned by calling methods of
factory-generated objects. You can easily switch between faceted search API
implementations by loading different factories.

The faceted search and index API implementation in Watson Content Analytics is
provided by the com.ibm.es.api.search.facets.RemoteFacetsFactory class.

Obtain a FacetsFactory object

To create a faceted search API enterprise search application, obtain the
implementation of the FacetsFactory object as in the following example:
Class cls = Class.forName("com.ibm.es.api.search.facets.RemoteFacetsFactory");
FacetsFactory factory = (FacetsFactory) cls.newInstance();

Obtain a FacetsService object

Use the FacetsFactory object to obtain a FacetsService object. With the
FacetsService object, you can access faceted searchable collections.

Configure the FacetsService object with the host name, port, and, if WebSphere
Application Server global security is enabled, a valid WebSphere user name and
password for the search server.

Configuration parameters are set in a java.util.Properties object. The parameters
are then passed to the getFacetsService factory method that generates the
FacetsService object. The following example shows how to obtain a FacetsService
object:
Properties configuration = new Properties();
configuration.setProperty("hostname", "es.mycompany.com");
configuration.setProperty("port", "80");
configuration.setProperty("username", "websphereUser");
configuration.setProperty("password", "webspherePassword");
FacetsService facetsService = facetsFactory.getFacetsService(configuration);

Obtain a FacetedSearchable object

Use the FacetsService object to obtain a FacetedSearchable object. A
FacetedSearchable object is associated with a faceted searchable collection. With a
FacetedSearchable object, you can issue queries and get information about the
associated collection. Each enterprise search and content analytics collection has an
ID.

When you request a FacetedSearchable object, you need to identify your
application by using an application ID. Contact your administrator for the
appropriate application ID.

The following example shows how to obtain a FacetedSearchable object:
ApplicationInfo appInfo = facetsFactory.createApplicationInfo(
"my_application_id","my_password");
FacetedSearchable searchable = facetsService.getFacetedSearchable(
appInfo, "some_collection_id");

Call the getAvailableFacetedSearchables method to obtain all of the
FacetedSearchable objects that are available for your application.
FacetedSearchable[] searchables = facetsService.getAvailableFacetedSearchables(appInfo);

Issue faceted queries

After the FacetedSearchable object is obtained, you issue a query to the
FacetedSearchable object. To issue a query to the FacetedSearchable object:
1. Create a FacetedQuery object.
2. Customize the FacetedQuery object.
3. Submit the FacetedQuery object to the FacetedSearchable object.
4. Get the query results, which are specified in a FacetedResultSet object.
You can retrieve all of the facets, as shown in the following example:
FacetedQuery query = factory.createFacetedQuery(queryString);
// set the various query parameters and options
setupQuery(query, config);
// set the facet result size if needed
query.setProperty("FacetResultSize", "5");
FacetedResultSet resultSet = searchable.search(query);

If you do not want to retrieve the facets, set the empty facet context, as in the
following example:
FacetContext facetContext = facetsFactory.createFacetContext();
query.setFacetContext(facetContext);
FacetedResultSet resultSet = searchable.search(query);

If you want to refine search results for a particular facet, specify faceted query
terms such as /country/Japan in the query string. See “Query syntax” on page 39
for more information.
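For example, a minimal sketch that refines a keyword query by a facet value. The
facet path /country/Japan is the example value from the query syntax topic:
// A minimal sketch; /country/Japan is the example facet path from the
// query syntax topic.
FacetedQuery refinedQuery = facetsFactory.createFacetedQuery("laptop /country/Japan");
FacetedResultSet refinedResults = searchable.search(refinedQuery);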

Process faceted query results

With the FacetedResultSet, Result, FacetSet and Facet interfaces, you can process
query results, as in the following example:
FacetedResultSet resultSet = searchable.search(query);
Result[] results = resultSet.getResults();
if (results != null) {
for (int i = 0; i < results.length; i++) {
System.out.println(
"Result " + i + ": " + results[i].getDocumentID() + " - "
+ results[i].getTitle());
}
}
FacetSet[] facetSets = resultSet.getFacetSets();
if (facetSets != null) {
for (int i = 0; i < facetSets.length; i++) {
Facet selfFacet = facetSets[i].getSelfFacet();
System.out.println(
"Facet " + i + ": " + selfFacet.getCategoryInfo().getLabel());
Facet[] facets = facetSets[i].getFacets();
if (facets != null) {
for (int j = 0; j < facets.length; j++) {
System.out.println(
"\t" + facets[j].getCategoryInfo().getLabel() + "\t"
+ facets[j].getCount());
}
}
}
}

Sample programs

The following sample programs for faceted search are provided in the
ES_INSTALL_ROOT/samples/siapi directory:
v FacetedSearchExample
v DocumentsViewExample
Related concepts:
“Search and index API federators” on page 33
Related reference:
“Faceted search sample application” on page 115
“Content mining sample applications” on page 116
“Federated faceted search sample application” on page 115

Faceted search queries in content analytics collections


Use taxonomy browsers to issue a faceted query to the FacetedSearchable object in
a content analytics collection.

Content analytics collections provide the following types of taxonomy browsers:
v Date taxonomy browser
v Facet value taxonomy browser
v Subcategory taxonomy browser
v Rule-based taxonomy browser
v Range taxonomy browser
v Flag taxonomy browser
The following example shows how you can obtain TaxonomyBrowser objects:
BrowseFactory browseFactory = SiapiBrowseImpl.createBrowseFactory(
"com.ibm.es.api.browse.RemoteBrowseFactory");

Properties configuration = new Properties();
configuration.setProperty("hostname", "es.mycompany.com");
configuration.setProperty("port", "80");
configuration.setProperty("username", "websphereUser");
configuration.setProperty("password", "webspherePassword");
BrowseService browseService = browseFactory.getBrowseService(configuration);

ApplicationInfo appInfo = browseFactory.createApplicationInfo(
"my_application_id","my_password");
TaxonomyBrowser[] browsers = browseService.getAvailableTaxonomyBrowsers(
appInfo, "some_collection_id");

TaxonomyBrowser dateBrowser = null;
TaxonomyBrowser keywordBrowser = null;
TaxonomyBrowser subcategoryBrowser = null;
TaxonomyBrowser flagBrowser = null;
TaxonomyBrowser rangeBrowser = null;
TaxonomyBrowser rulebasedBrowser = null;
for (TaxonomyBrowser browser : browsers) {
TaxonomyInfo info = browser.getTaxonomyInfo();
if ("date".equals(info.getID())) {
dateBrowser = browser;
} else if ("keyword".equals(info.getID())) {
keywordBrowser = browser;
} else if ("subcategory".equals(info.getID())) {
subcategoryBrowser = browser;
} else if ("flag".equals(info.getID())) {
flagBrowser = browser;
} else if ("range".equals(info.getID())) {
rangeBrowser = browser;
} else if ("rulebased".equals(info.getID())) {
rulebasedBrowser = browser;
}
}

Date taxonomy browser

Use a date taxonomy browser to issue a faceted query to get the date facets on the
Time Series view, Deviations view, and Trends view of a content analytics
collection. The root category of the date taxonomy browser has the following
categories as children:
v Year (id = "$.year")
v Month (id = "$.month")
v Week (id = "$.week")
v Day (id = "$.day")
v Month of Year (id = "$.month_of_year")
v Day of Month (id = "$.day_of_month")
v Day of Week (id = "$.day_of_week")
Use the FacetsFactory object to obtain the QualifiedCategory object to issue a
faceted query to get the date facets, as in the following example:
QualifiedCategory qualifiedCategory = facetsFactory.createQualifiedCategory(
dateBrowser.getTaxonomyInfo().getID(),
dateBrowser.getCategory("$.day").getInfo());
Constraint constraint = facetsFactory.createConstraint();
constraint.set(Constraint.SUBCATEGORY_COUNT_MODE, -1, false, null);
qualifiedCategory.setConstraint(constraint);

Facet value and subcategory taxonomy browsers

Use facet value and subcategory taxonomy browsers to issue a faceted query to get
the facet values and subcategory facets on the following views of a content
analytics collection:
v Facets view
v Deviations view
v Trends view
v Facet Pairs view
The root category of facet value and subcategory taxonomy browsers has the
following system-defined categories and user-defined categories in the
administration console as children:
v Part of Speech (id = "$._word")
v Phrase Constituent (id = "$._phrase")

v Named entity (id = "$._ne")

Use the FacetsFactory object to obtain the QualifiedCategory object to issue a
faceted query to get the facet values or subcategory facets, as shown in the
following example:
QualifiedCategory qualifiedCategory = facetsFactory.createQualifiedCategory(
keywordBrowser.getTaxonomyInfo().getID(),
keywordBrowser.getCategory("$._word.noun.general").getInfo());
Constraint constraint = facetsFactory.createConstraint();
constraint.set(Constraint.SUBCATEGORY_COUNT_MODE, 100, false, null);
qualifiedCategory.setConstraint(constraint);

Flag, range, and rule-based taxonomy browsers

Use flag, range, and rule-based taxonomy browsers to issue a faceted query to get
the flag, range, and rule-based facets on the following views of a content analytics
collection:
v Facets view
v Deviations view
v Trends view
v Facet Pairs view
These taxonomy browsers are available if you configure document flagging, range
facets, and rule-based categories in the administration console.

Issuing a faceted query

Add the TargetFacet object that was obtained from the FacetsFactory object to the
facet context of a faceted query on the Facets view and Time series view of a
content analytics collection, as shown in the following example:
TargetExpressions targetExpressions = facetsFactory.createTargetExpressions();
// if you want to get the correlation value of facet
targetExpressions.addExpression("correlation", "#correlation");
// if you want to get the expected count value of facet
targetExpressions.addExpression("expected_count", "#expected_count");
TargetFacet targetFacet = facetsFactory.createTargetFacet(qualifiedCategory,
targetExpressions);
FacetContext facetContext = facetsFactory.createFacetContext();
facetContext.add(targetFacet);
query.setFacetContext(facetContext);
FacetedResultSet resultSet = searchable.search(query);

On the Deviations view, Trends view, and Facet Pairs view of a content analytics
collection, add the TargetCube object to the facet context of a faceted query to get a
two dimensional facet. To get the correlation value of the cube, you can use the
following expressions:
v #topic_view_correlation on the Deviations view
v #delta_view_correlation on the Trends view
v #2dmap_view_correlation on the Facet Pairs view
The following example shows how to issue a faceted query on the Facet Pairs view:
QualifiedCategory[] dimensions = { verticalQualifiedCategory,
horizontalQualifiedCategory };
Expression expression = facetsFactory.createExpression(
"correlation", "#2dmap_view_correlation");
TargetExpressions targetExpressions = facetsFactory.createTargetExpressions();
targetExpressions.add(expression);
TargetCube targetCube = facetsFactory.createTargetCube(dimensions,
targetExpressions);
FacetContext facetContext = facetsFactory.createFacetContext();
facetContext.add(targetCube);
query.setFacetContext(facetContext);
FacetedResultSet resultSet = searchable.search(query);

Sample programs

The following sample programs for content analytics collections are provided in
the ES_INSTALL_ROOT/samples/siapi directory:
v FacetsViewExample
v TimeSeriesViewExample
v DeviationsViewExample
v TrendsViewExample
v FacetPairsViewExample
Related reference:
“Faceted search sample application” on page 115
“Content mining sample applications” on page 116
“Federated faceted search sample application” on page 115

Faceted search queries in enterprise search collections


Use taxonomy browsers to issue a faceted query to the FacetedSearchable object in
an enterprise search collection.

Enterprise search collections provide the following types of taxonomy browsers:
v Time scale taxonomy browser
v Facet taxonomy browser
v Flag taxonomy browser
v Range taxonomy browser
v Scopes taxonomy browser
v Rule-based taxonomy browser
The following example shows how you can obtain TaxonomyBrowser objects:
String collectionId = ...; // the collection ID
String taxonomyId = ...; // the taxonomy ID

// obtain the specific SIAPI Browse factory implementation
Class cls = Class.forName("com.ibm.es.api.browse.RemoteBrowseFactory");
// new BrowseFactory instance
BrowseFactory factory = (BrowseFactory) cls.newInstance();

// create a valid Application ID that will be used
// by the Search Node to authorize this access to the collection
String applicationName = config.getProperty("applicationName");
ApplicationInfo applicationInfo = factory.createApplicationInfo(applicationName);
// obtain the Browse service implementation
BrowseService browseService = factory.getBrowseService(config);
// get a TaxonomyBrowser for the specified taxonomy id and collection id
TaxonomyBrowser browser = browseService.getTaxonomyBrowser(applicationInfo,
collectionId, taxonomyId);

The following example shows how you can get available browsers from
BrowseService:
TaxonomyBrowser[] browsers = browseService.getAvailableTaxonomyBrowsers
(applicationInfo, collectionId);

Time scale taxonomy browsers

Use a time scale taxonomy browser to issue a faceted query to get the specified
date scale counts in an enterprise search collection. The following example shows
how to issue a faceted query to get specified date scale counts.
TaxonomyBrowser browser = timescaleBrowser;
// get category id corresponding to the facet path
// /<the date facet name>/<the specified granularity>/
// Available date facet names and granularities can be found by browsing.
Category category = browser.getCategory(getIdFromTaxonomyBrowser(browser,
this.facetPath));

QualifiedCategory qualifiedCategory = facetsFactory.createQualifiedCategory(
browser.getTaxonomyInfo().getID(),
category.getInfo());

Constraint constraint = facetsFactory.createConstraint();
constraint.set(Constraint.SUBCATEGORY_COUNT_MODE, 100, false, null);
qualifiedCategory.setConstraint(constraint);

TargetExpressions targetExpressions = facetsFactory.createTargetExpressions();

// if you want to get the correlation value of facet
targetExpressions.addExpression(facetsFactory.createExpression("correlation",
"#correlation"));
// if you want to get the expected count value of facet
targetExpressions.addExpression(facetsFactory.createExpression("expected_count",
"#expected_count"));
TargetFacet targetFacet = facetsFactory.createTargetFacet(qualifiedCategory,
targetExpressions);

FacetContext facetContext = facetsFactory.createFacetContext();
facetContext.add(targetFacet);
query.setFacetContext(facetContext);

// execute the search by calling the FacetedSearchable's search method.
// A SIAPI FacetedResultSet object will be returned
FacetedResultSet rset = searchable.search(query);

Facet taxonomy browsers

Use a facet taxonomy browser to issue a faceted query to get the facets and facet
values in an enterprise search collection. The first-level child categories of the facet
taxonomy browser return user-defined metadata facets.

Use the FacetsFactory object to obtain the QualifiedCategory object to issue a
faceted query to get facets or facet values, as shown in the following example:
String facetPath = ...; // the facet path

// obtain the specific SIAPI Facets factory implementation
Class facetsCls = Class.forName("com.ibm.es.api.search.facets.RemoteFacetsFactory");
FacetsFactory facetsFactory = (FacetsFactory) facetsCls.newInstance();

// obtain the Facets Service implementation
FacetsService facetsService = facetsFactory.getFacetsService(config);
// obtain a FacetedSearchable object to the specified collection ID
FacetedSearchable searchable = facetsService.getFacetedSearchable(applicationInfo,
collectionId);

// create a new FacetedQuery object using the specified
// query string
FacetedQuery query = facetsFactory.createFacetedQuery(queryString);

// set the target
Category category = browser.getCategory(facetPath);

QualifiedCategory qualifiedCategory = facetsFactory.createQualifiedCategory(
browser.getTaxonomyInfo().getID(),
category.getInfo());

Constraint constraint = facetsFactory.createConstraint();
constraint.set(Constraint.SUBCATEGORY_COUNT_MODE, 100, false, null);
qualifiedCategory.setConstraint(constraint);

TargetExpressions targetExpressions = facetsFactory.createTargetExpressions();

// if you want to get the correlation value of facet
targetExpressions.addExpression(facetsFactory.createExpression("correlation",
"#correlation"));
// if you want to get the expected count value of facet
targetExpressions.addExpression(facetsFactory.createExpression("expected_count",
"#expected_count"));
TargetFacet targetFacet = facetsFactory.createTargetFacet(qualifiedCategory,
targetExpressions);

FacetContext facetContext = facetsFactory.createFacetContext();
facetContext.add(targetFacet);
query.setFacetContext(facetContext);

// execute the search by calling the FacetedSearchable's search method.
// A SIAPI FacetedResultSet object will be returned
FacetedResultSet rset = searchable.search(query);

Flag, range, scope, and rule-based taxonomy browsers

Use flag, range, scope, and rule-based taxonomy browsers to issue a faceted query
to get the flag, range, scope, and rule-based facets in an enterprise search
collection. These taxonomy browsers are available if you configure document
flagging, range facets, scopes, and rule-based categories in the administration
console.

To issue a faceted query to get flag, range, or rule-based facets, add the
TargetFacet object that was obtained from the FacetsFactory object to the facet
context of a faceted query, as shown in the previous example for facet taxonomy
browsers.

Sample programs

The BrowseExample and TimeScaleViewSearchExample sample programs are
provided in the ES_INSTALL_ROOT/samples/siapi directory.

Search and index API federators


Use a federator to issue a federated search request across a set of heterogeneous
searchable collections and get a unified document result set.

Search federators are intermediary components that exist between the requestors of
a service and the agents that perform that service. They coordinate resources to
manage the multitude of searches that are generated from a single request.

The following types of search and index API federators are available:
v Local federator
v Remote federator

In addition to using the local and remote federators to perform traditional
searches, you can use faceted federators to gather results of faceted search from
multiple collections.

Search federators are search and index API searchable objects. Multiple-level
federation is allowed, but too many levels of federation will decrease search
performance.

The local and remote federators can federate over collections that are created with
Watson Content Analytics or collections that are created with another product. You
can federate over collections that are not created with Watson Content Analytics if
those collections use lightweight directory access protocol (LDAP) or Java database
connectivity (JDBC).

To create an LDAP or JDBC searchable object, the application creates an
AdminService object by passing a fully qualified LDAP or JDBC AdminService
object class path. The createCollection method is used to create an LDAP or
JDBC collection. The LDAP or JDBC configuration information is passed through
an XML configuration file. After LDAP or JDBC collections are created, you can
retrieve the searchable objects through the Service interface and use those
searchable objects directly or through local or remote federators.
Related tasks:
“Creating a faceted enterprise search application” on page 25
Related reference:
“Federated faceted search sample application” on page 115
“Federated search sample application” on page 114
“Faceted search API” on page 26

Local federator
A local federator federates from the client over a set of searchable objects. In
addition to using a local federator to perform traditional searches, you can use a
local faceted federator to gather results of a faceted search from multiple
collections.

A local federator is created by using the createLocalFederator method from the
SIAPI SearchFactory class. The set of searchable collections on which the query is
to be run is specified when the federator is created. A subset of searchable objects
can also be specified during search calls.

A local faceted federator is created by using the createLocalFacetedFederator
method from the SIAPI FacetsFactory class. The set of faceted searchable
collections on which the query is to be run is specified when the federator is
created. A subset of faceted searchable objects can also be specified during search
calls.

Before you can create a local federator, you must create or retrieve searchable
objects by using a search and index API SearchFactory. The searchable object that is
passed to the local federator must be ready for search without any additional
information. The local federator uses the searchable object to issue a federated
search request. To complete this request, the local federator environment must have
all the necessary software components for using various searchable objects.

The following code sample shows how to create a LocalFederator object and issue
a search request:

Searchable[] finalSearchables;

// create searchables

// create a query and set query options
Query query = searchFactory.createQuery(queryString);
query.setRequestedResultRange(0, 100);
query.setQueryLanguage("en_US");
query.setSpellCorrectionEnabled(true);
query.setPredefinedResultsEnabled(true);

// create the local federator and call search
LocalFederator federator =
factory.createLocalFederator(finalSearchables);
ResultSet rs = federator.search(query);

Remote federator
A remote federator federates from a server over a set of searchable objects. In
addition to using a remote federator to perform traditional searches, you can use a
remote faceted federator to gather results of a faceted search from multiple
collections.

A remote federator is run on the server and consumes server resources. A remote
federator requires an extra step in which input collection IDs are mapped to the
matching searchable object.

A remote federator is created by using the getAvailableRemoteFederators method
from the SIAPI SearchFactory class. A remote faceted federator is created by using
the getAvailableRemoteFacetedFederators method from the SIAPI FacetsFactory
class. During the construction of the RemoteFederator or RemoteFacetedFederator
object, the set of collection IDs must be passed. The collection IDs are mapped to
SIAPI searchable objects internally by the RemoteFederator. The remote federator
environment does not require any searchable related software components other
than a small proxy that enables the remote federator to be accessible.

Each enterprise search application will have its own federator, so the federator ID
is the same value as the ApplicationInfo ID value.

The following code sample shows how to create a RemoteFederator object and
issue a search request. Use the com.ibm.siapi.search.SearchService.getFederator()
method to obtain a remote federator.
// obtain the SearchFactory implementation
Class cls = Class.forName("com.ibm.es.api.search.RemoteSearchFactory");
SearchFactory factory = (SearchFactory) cls.newInstance();
Properties config = new Properties(); // set connection properties such as hostname and port
String applicationName = "All", federatorId = "Default";

ApplicationInfo applicationInfo = factory.createApplicationInfo(applicationName);

// Obtain the SearchService implementation
SearchService searchService = factory.getSearchService(config);

// Note: federation is performed on the search server side.
RemoteFederator federator =
searchService.getFederator(applicationInfo, federatorId);

The following code sample shows how to create a RemoteFacetedFederator object
and issue a request to gather results of a faceted search from multiple collections.
Use the com.ibm.siapi.search.facets.FacetsService.getFacetedFederator() method to
obtain a remote faceted federator.

// obtain the FacetsFactory implementation
Class cls = Class.forName("com.ibm.es.api.search.facets.RemoteFacetsFactory");
FacetsFactory factory = (FacetsFactory) cls.newInstance();

Properties config = new Properties(); // set connection properties such as hostname and port
String applicationName = "All", federatorId = "Default";

ApplicationInfo applicationInfo = factory.createApplicationInfo(applicationName);

// Obtain the FacetsService implementation
FacetsService facetsService = factory.getFacetsService(config);

// Note: facets federation is performed on the search server side.
RemoteFacetedFederator federator =
facetsService.getFacetedFederator(applicationInfo, federatorId);

Retrieving targeted XML elements


You can specify that a returned document must be accompanied by a result field.

About this task


In the opaque term that specifies the semantic search, you can prepend a pound or
hash sign (#) to one XML element (or annotation) in the xmlf2 query term. This
result field enumerates all the occurrences of an Unstructured Information
Management Architecture (UIMA) annotation that is designated in the XML query
term. These enumerated annotation occurrences are within the returned document,
and each of them makes a part of an occurrence of the whole XML query term in
the document.

The XML element is designated as the targeted XML element whose occurrences
are to be enumerated. When the semantic search is expressed by XPath, then by
definition of XPath, the deepest element that is not inside the bracketed phrase [..]
and not inside a predicate is the target element.

For example, the query <book language=en> <#author> </#author> </book>, or the
equivalent query <book language=en> <#author/> </book>, returns documents that
include at least one occurrence of the annotation book that has the attribute
language=en and includes within its span an occurrence of the annotation author.
The query also returns the enumeration of all the occurrences of the tag <author>
that appear within the occurrence of the tag <book> that has the attribute
language=en.

Each occurrence is enumerated by its unique ID. The UIMA annotators assign a
unique ID to each annotation that they generate. XML elements that are part of the
raw document rather than annotations that are generated by UIMA annotators do
not have unique IDs, and they are not enumerated in that result field. If the
summary field of the retrieved document includes text that is covered in the
document by an enumerated occurrence, that text is highlighted.

The following occurrences of the tag <author> in the retrieved document will not
be enumerated:
v An occurrence of the tag <author> within the span of the tag <journal>
v An occurrence of the tag <author> within the span of the tag <book> that has the
attribute language=ge
v An occurrence of the tag <author> within the span of the tag <book> that does
not have the attribute language

v An occurrence of the tag <author> that is part of an XML document, that is, the
tag <author> is part of the raw document rather than a generated annotation

The enterprise search application can access the enumeration of the occurrences of
the target element through the TargetElement property of the Result object, for
example, Result.getProperty("TargetElement"). The returned value of that
property is a string of integers that are separated by spaces. Each integer is an ID
of a single occurrence of the target element.
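For example, a minimal sketch that reads the property and splits it into individual
occurrence IDs. It assumes that result is one Result object from the result set:
// A minimal sketch; "result" is assumed to be a Result object from the
// returned result set. The property value is a space-separated string of
// integer IDs, one per occurrence of the target element.
String targetElements = result.getProperty("TargetElement");
if (targetElements != null) {
for (String occurrenceId : targetElements.split(" ")) {
System.out.println("Target element occurrence: " + occurrenceId);
}
}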

The actual target elements that correspond to these integer values cannot be
retrieved by the API. If an application must access those elements, it must create
its own mapping table during parsing. For example, you can create a common
analysis for relational database mapping.

Fetching search results


The fetch API enables you to obtain the content of documents returned in the
search results.

About this task

The fetch API enables users to view content by clicking documents in the search
results. This API is especially useful for data sources that do not return a clickable
URI, such as documents from IBM DB2, IBM Content Manager Enterprise Edition,
and file system sources.

The fetch API uses client libraries that are installed when Watson Content
Analytics is installed. In a multiple server installation, the libraries are installed on
the crawler server. No additional application development work is required to take
advantage of this API because the API is provided with the esapi.jar file.

To fetch certain types of documents, an administrator must specify document
content options when the crawler is configured. The following discussion
summarizes the requirements for the various crawler types:
DB2 crawler
A document content field (column) must be specified when the crawler is
configured to crawl a DB2 database.
Content Integrator and Content Manager crawlers
A document content field must be specified when a crawler is configured
to crawl these types of data sources. Data sources that contain only
metadata are not supported.
Agent for Windows file systems, FileNet P8, UNIX file system, and Windows
file system crawlers
The fetch API can retrieve the content of file system documents with no
special configuration by an administrator.
All other crawlers
For other types of crawlers, a clickable URI is returned by using the
getDocumentURI method. The fetch API is not used to retrieve these types
of documents.
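For those crawler types, a minimal sketch of obtaining the clickable URI might
look like this. It assumes that result is a Result object from the search results and
that the getDocumentURI method is available on it:
// A minimal sketch; "result" is assumed to be a Result object from the
// search results of a crawler type that returns a clickable URI.
String clickableUri = result.getDocumentURI();
System.out.println("Clickable URI: " + clickableUri);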

The fetch API supports security at the search server level, collection level (through
application IDs), and at the document level (through indexed access controls and
current user validation). The security policy relies on the security settings in the
enterprise search application. If the enterprise search application returns a
document in the list of search results, the fetch API will retrieve the document
content when the user clicks the document.

A sample program, FetchSearchExample, is provided in the ES_INSTALL_ROOT/
samples/siapi directory. Javadoc documentation is provided in the
ES_INSTALL_ROOT/docs/api/fetch directory.

Query syntax
Extensive query syntax allows you to find specific documents.

Simple query syntax characters

The following list describes the characters that you can use in enterprise search
and content mining applications to refine query results.
Free style query syntax
Free style query syntax is used to describe queries that do not have an
explicit interpretation and for which there is no default behavior defined.
The default implementation for this type of query is to return documents
only if they match all terms in the free style query.
Query: computer software
Result: This query returns documents that include the term computer and
the term software, or something else depending on the semantics
implemented in the application.
~ (prefix)
Precede a term with a tilde sign (~) to indicate that a match occurs
anytime a document contains the word or one of its synonyms.
Query: ~fort
Result: This query finds documents that include the term fort or one of its
synonyms (such as garrison and stronghold).
~ (postfix)
Follow a single term with a tilde sign (~) to indicate that a match occurs
anytime a document contains a term that has the same linguistic base form
as the query term (also known as a lemma or stem).
Query: run~
Result: This query finds documents that include the term run, running, or
ran because run is the base form of the verb.
+ Precede a term with a plus sign (+) to indicate that a document must
contain the term for a match to occur. Because the plus sign is the default,
it is usually omitted. The plus sign is not needed because documents are
included in the search results only if they match all terms in a free style
query. In a free text query (without the plus sign), only matches in exact
form are returned.
Query: +computer +software
Result: This query returns documents that include the term computer and
the term software.
− Precede a term with a minus sign (-) to indicate that the term must be
absent from a document for a match to occur. The minus sign acts as a
filter to remove documents and must be associated with a query that
returns positive results.
Query: computer -hardware
Result: This query returns documents that include the term computer and
not the term hardware.
Query: $language::en -url:qa
Result: This query returns documents in the English language minus
documents that have URLs with the string qa.
Query: url:com -url:support
Result: This query returns documents that have URLs with the string com
minus those documents that have URLs with the string support.
= Precede a term with the equal sign (=) to indicate that the document must
contain an exact match of the term for a match to occur. (Lemmatization is
disabled.)
Query: =apples
Result: This query returns documents if and only if they include the plural
term apples.
\ Precede a character in a term with the backslash (\) escape character to
find terms or phrases that contain restricted characters, such as backslashes
and double quotation marks in phrases. Reserved query syntax terms can
also be escaped with the backslash character. For example, you can escape
terms such as AND, ANY, INORDER, NOT, OR, SENTENCE, and WITHIN
with a backslash character. You cannot escape wildcard characters (* and
?).
Query: "program files\\ibm"
Result: This query returns documents that contain the phrase program
files\ibm.
*:* Use this special query syntax to retrieve all available documents in the
collection without performing score computation. To use this syntax with
enterprise search collections, the Enable the query to return all documents
(*:*) check box must be selected on the Search Server Options page. This
check box is selected by default.
Query: *:*
Result: This query returns all available documents in the collection.
*
Place a wildcard character (*) anywhere in, before, or after a term or a field
to indicate that the document can contain any word that matches any of
the possible combinations. A term with a wildcard character is interpreted
as equivalent to an OR of all its applicable expansions. Wildcard support
applies the following rules:
v The set of expansions contains the maximal configured number of
expansions. If there are more expansions in the index than the maximal
number, those expansions are ignored. If some expansions of the
wildcard term were ignored, the query result will indicate that.
v The set of expansions contains all terms in the index that can be
obtained by replacing the wildcard characters with arbitrary sequences
of characters.
v Wildcard characters are supported only for plain text terms. Wildcard
characters are not supported for XML element names, attribute names,
or attribute values.

v A term that consists solely of a wildcard is not supported.
v Wildcard characters are supported within phrases.
v If the number of expansions for a wildcard term exceeds the configured
maximum number of expansions, the expansions that exceed that
maximum are ignored by the query evaluation. In that case, the
ResultSet object's method isEvaluationTruncated() returns true. This
does not uniquely identify the situation, because it will also return true
if the evaluation was terminated early due to a timeout.
Query: app*
Result: This query finds documents that include the terms apple, apples,
application, and so on because these words begin with app.
Query: DB2 info*
Result: This query finds all documents that contain DB2 followed by a
word that begins with info.
Query: title:tech*
Result: This query finds all documents with titles that begin with tech.

Remember: To specify queries with wildcard characters, an administrator
must enable wildcard support when configuring search options for the
collection in the administration console.
? Replace a character in a term with the question mark (?) wildcard character
to find terms that match all other characters in the term.
Query: m?re
Result: This query returns documents that contain the terms mare, mere,
mire, and more.
""
Use double quotation marks (") to indicate that a document must contain
the exact phrase within the double quotation marks for a match to occur.
Words inside phrases are never lemmatized.
You can also add wildcard characters (* or ?) within phrases. The wildcard
character must be next to a letter or word. Standalone wildcard characters
are not supported. Wildcard character support must be enabled in the
administration console.
Query: "computer software programming"
Result: This query finds documents that include the exact phrase computer
software programming.
Phrases are designated as required by default. Hence the two queries
building "new york" and building +"new york" are equivalent. Phrases
can also be forbidden (-) and required but insufficient (^).
Query: "app* pea*"
Result: This query finds documents that include the terms apples pears,
appears peaceful, appreciate peas, and so on because these words begin with
app and pea. This query does not find documents with apples and pears or
other such combinations.
Query: "apple * pear"

Result: This query matches apples and pears or apples or pears, but it does
not match apples pears.

Restriction: Using double quotation marks for URL or email address
strings does not return appropriate results. To search for URL or email
strings such as www.ibm.com or [email protected], do not enclose the
string in double quotation marks.

To search for phrases that contain double quotation marks (") or backslash
characters (\), use the backslash character to escape the restricted character.
For example, "\"The Godfather\"" or "hardware\\software requirements".
/facet_name/value_level_1/.../value_level_n
If you search a collection that contains facets, you can search for
documents that contain a specific facet or facet value. For facets with
multiple value levels, such as hierarchical and date facets, you can search
for multiple-level facet values.
Query: /country/Japan
Result: This query finds documents that include the facet country with the
facet value Japan.
Query: /date/2009/1/15 /location/US/California
Result: This query finds documents that include the facet date with the
multiple-level facet values 2009, 1, and 15, and the facet location with the
multiple-level facet values US and California.
^boost Follow a search term by a boost value to influence how documents that
contain a specified term are ranked in the search results.
Query: ibm Germany^5.0
Result: This query finds documents that include the terms IBM and
Germany, and increases the relevance of these documents by a factor of 5 in
the search results.
~ambiguity
Follow a search term with a tilde sign (~) and an ambiguity value to
perform a fuzzy search that also matches related forms of the term.
Query: ibm analytics~0.5
Result: This query does a fuzzy search and finds documents that include
the terms IBM and analytics, IBM and analyze, IBM and analysis, and so
on.
() Use parentheses ( ) to indicate that a document must contain one or more
of the terms within the parentheses for a match to occur. Use OR or a
vertical bar ( | ) to separate the terms in parentheses.
Do not use plus signs (+) or minus signs (-) within the parentheses.
Query: +computer (hardware OR software)
Query: +computer (hardware | software)
Result: Both of these queries find documents that include the term
computer and at least one of the terms hardware or software.
An OR of terms is designated as required (+) by default. Therefore, the
previous queries are equivalent to +computer +(hardware | software).

Query syntax for query keywords

The following list describes keywords that you can use to limit a search to specific
documents or specific parts of documents.
IN contextual view
If a content analytics collection contains contextual views, you can include
the IN keyword with other query operators and keywords to search only
the documents that belong to a specific contextual view.
Query: computer IN question "software maintenance" IN answer
Result: This query returns documents that contain the term computer in the
question view and contain the phrase software maintenance in the answer
view.
Query: /keyword$._word.noun/computer IN question IN answer
Result: This query returns documents that include the noun facet with the
facet value computer in the intersection of the question and answer views.
Query: (software maintenance) WITHIN 5 IN answer
Result: This query returns documents that contain the words software and
maintenance, or matching forms of the words, in any order, within 5 words
of each other in the answer view.
Query: @xmlf2::'<title>IBM computers</title>' IN question
Result: This query returns documents that contains the phrase IBM
computers in the <title> element of an XML fragment in the question view.
(terms) WITHIN context IN ORDER
Follow a search term or phrase by proximity search operators to find
documents that contain terms within a specified number of words of each
other, in the same sentence, or in a specified order within a sentence. The
IN ORDER option is optional and specifies that words must appear in the
same order that you specify them in the query. The context can be:
v A positive number. For example, (a b c) WITHIN 5 matches documents
that contain the three specified words or matching forms of the words,
in any order, within 5 words of each other (that is, up to two words
between them).
The query ("a" "b" "c") WITHIN 5 INORDER means that the three words
must appear in the same order, and in their exact form, within five
words of each other. No lemmatization is performed for the terms a, b,
or c.
v WITHIN SENTENCE means that the terms must appear in the same
sentence. Lemmatization does not occur if the terms are specified in
quotation marks.
The WITHIN context requires all terms to appear in the same field. For
example, all terms must appear in the subject field or in the body field. In
addition, the terms must appear in the same document part. For example,
a match does not occur across the body of a document and an attachment.
Sample proximity queries:

( x y z ) WITHIN 5
("x" y z ) WITHIN SENTENCE
( x "y z") WITHIN SENTENCE
subject:(world star) WITHIN SENTENCE
(lemmatization is done for world and star, in any order)
("Hello" "World") WITHIN SENTENCE INORDER
(no lemmatization and order is maintained)
(terms) ANY number
Use the ANY keyword to find documents that contain a certain number of
the specified query terms.
Query: (x y z) ANY 2
Result: This query returns documents that contain at least two of the
specified query terms.
site:text
If you search a collection that contains web content, use the site keyword
to search a specific domain. For example, you can return all pages from a
particular website.
Do not include the prefix http:// in a site query.
Query: +laptop site:www.ibm.com
Result: This query finds all documents on the www.ibm.com domain that
contain the word laptop.
url:text
If you search a collection that contains web content, use the url keyword
to find documents that contain specific words anywhere in the URL.
Query: url:support
Result: This query finds documents that have a URL with the word
support, such as http://www.ibm.com/support/fr/.
Query: url:support url:fr
Result: This query finds documents that have a URL with the words
support and fr in any order.
Query: url:support&fr
Result: This query finds documents that have a URL with the phrase
support fr. This query is similar to using double quotation marks to
search for an exact phrase.
link:text
If you search a collection that contains web content, use the link keyword
to find documents that contain at least one hypertext link to a specific web
page.
Query: link:http://www.ibm.com/us
Result: This query finds all documents that include one or more links to
the page http://www.ibm.com/us.
field:text
If the documents in a collection include fields (or columns), and the
collection administrator made those fields searchable by field name, you
can query specific fields in the collection.
Query: lastname:smith div:software
Result: This query returns all documents about employees with the last
name Smith (lastname:smith) who work for the Software division
(div:software).

docid:documentid
Use the docid keyword to find documents that have a specific URI (or
document ID). Typically, there is at most one document in a collection that
matches a specific URI.
Query: (docid:http://www.ibm.com/solutions/us/ OR
docid:http://www.ibm.com/products/us/)
Result: This query finds all documents with the URI http://www.ibm.com/
solutions/us/ or the URI http://www.ibm.com/products/us/.
samegroupas:URI
By default, IBM Watson Content Analytics treats the URLs with the same
host name as if they belong to the same group, and treats the news articles
from the same thread as if they belong to the same group. For URIs from
all other data sources, each URI forms its own group. However, an
administrator can organize URIs that match specific prefixes into groups.
For example, consider the following group definitions:
http://mycompany.server1.com/hr/ hr
http://mycompany.server2.com/hr/ hr
http://mycompany.server3.com/hr/ hr
http://mycompany.server1.com/finance/ finance

file:///myfileserver1.com/db2/sales/ sale
file:///myfileserver1.com/websphere/sales/ sale
file:///myfileserver2.com/db2/sales/ sale
file:///myfileserver2.com/websphere/sales/ sale
In this example, all the URIs with the prefix http://
mycompany.server1.com/hr/ or http://mycompany.server2.com/hr/ or
http://mycompany.server3.com/hr/ belong to one group: hr. All URIs
with the prefix http://mycompany.server1.com/finance/ belong to another
group: finance. And all the URIs with prefix file:///myfileserver1.com/
db2/sales/ or file:///myfileserver1.com/websphere/sales/ or
file:///myfileserver2.com/db2/sales/ or file:///myfileserver2.com/
websphere/sales/ belong to yet another group: sale. If
file:///myfileserver2.com/websphere/sales/mypath/mydoc.txt is a URI in
the collection, a query with the following search term will restrict the
search to the URIs in the sale group:
samegroupas:file:///myfileserver2.com/websphere/sales/mypath/mydoc.txt

All results for this query will have one of the following prefixes:
file:///myfileserver1.com/db2/sales/
file:///myfileserver1.com/websphere/sales/
file:///myfileserver2.com/db2/sales/
file:///myfileserver2.com/websphere/sales/
Query: samegroupas:http://www.ibm.com/solutions/us/
Result: This query finds all documents with URIs, in this case URLs, that
belong to the same group as http://www.ibm.com/solutions/us/.
facetName::/facet_name_1/.../facet_name_n
In a content analytics collection, you can search for documents that contain
a specific facet.
Query: facetName::/"Part of Speech"/Noun/"General Noun"
Result: This query finds documents that include the facet General Noun in
a content analytics collection.

facetValue::/facet_name_1/.../facet_name_n/value
In a content analytics collection, you can search for documents that contain
a specific facet value.
Query: facetValue::/"Part of Speech"/Noun/"General Noun"/Car
Result: This query finds documents that include the value Car of the facet
General Noun in a content analytics collection.
date::/facet_name/time_scale/value
In a content analytics collection, you can search for documents that contain
a specific date facet value.
Query: date::/date/Year/2010
Result: This query finds documents that include the value 2010 for the
year time scale of the default date facet in a content analytics collection.
Query: date::/modifieddate/Month/200905
Result: This query finds documents that include the value 200905 for the
month time scale of the modifieddate date facet in a content analytics
collection.
facet::/facet_name/value_level_1/.../value_level_n
In an enterprise search collection, you can search for documents that
contain a specific facet or facet value. For facets with multiple value levels,
such as hierarchical and date facets, you can search for multiple-level facet
values.
Query: facet::/country/Japan
Result: This query finds documents that include the facet country with the
facet value Japan in an enterprise search collection.
Query: facet::/date/2009/1/15 facet::/location/US/California
Result: This query finds documents that include the facet date with the
multiple-level facet values 2009, 1, and 15, and the facet location with the
multiple-level facet values US and California.
flag::/flag_name
If an administrator configured document flags for the collection, you can
use the flag prefix to search for documents that are assigned a particular
flag.
Query: flag::/"Important"
Result: This query finds documents that are flagged as Important.
scope::/scope_name
If an administrator configured scopes for the collection, you can use the
scope prefix to search for documents that are in a particular scope.
Query: scope::/TechSupport
Result: This query finds documents that are in the TechSupport scope.
rulebased::category_ID
Use the rulebased keyword to find documents that belong to a specific
rule-based category.
Sample category tree:

Root
juice
lemon
apple
Query: rulebased::.juice.lemon
Result: This query returns documents that belong to the rule-based
category juice.lemon.
$source::source_type
Use the $source keyword to find documents that come from a specific data
source type. Source queries are useful in collections that contain documents
from multiple sources.
To obtain a list of the available source types for a collection, call the
getAvailableAttributeValues(Searchable.ATTRIBUTE_SOURCE) method of
that collection's Searchable object.
Query: $source::DB2 "computer science"
Result: This query finds documents that were added to a collection by the
DB2 crawler and that contain the phrase computer science.
$language::language_id
Use the $language keyword to find documents that were written in a
specific language.
To obtain a list of the available language IDs for a collection, call the
getAvailableAttributeValues(Searchable.ATTRIBUTE_LANGUAGE) method
of that collection's Searchable object.
Query: $language::en "computer science"
Result: This query finds documents in English that contain the phrase
computer science.
$doctype::document_type
Use the $doctype keyword to find documents that have a specific
document format or MIME type.
To obtain a list of the available document types for a collection, call the
getAvailableAttributeValues(Searchable.ATTRIBUTE_DOCTYPE) method of
that collection's Searchable object.
Query: $doctype::application/pdf "computer science"
Result: This query finds Portable Document Format (PDF) documents that
contain the phrase computer science.
$similar::document_id~similarity
Use the $similar keyword to find documents that are near duplicates of
the specified document. The similarity value specifies the level of
strictness to apply. The valid range is from 0.0 to 1.0. Specifying 1.0 does not
mean that exact content matching is performed. It means that the search
for similar documents is based on the highest level of similarity. The
higher the similarity value, the closer the documents must be to being near
duplicates of each other.
Query: $similar::http://www.ibm.com/solutions/us~1.0
Result: This query finds documents that are highly similar to
http://www.ibm.com/solutions/us.

#field::=value
Use parametric constraint syntax to find documents that have a numeric
field with a value equal to the specified number.
Query: #price::=1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value equal to 1700.
#field::>value
Use parametric constraint syntax to find documents that have a numeric
field with a value greater than the specified number.
Query: #price::>1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than 1700.
#field::<value
Use parametric constraint syntax to find documents that have a numeric
field with a value less than the specified number.
Query: #price::<1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value less than 1700.
#field::>=value
Use parametric constraint syntax to find documents that have a numeric
field with a value greater than or equal to the specified number.
Query: #price::>=1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than or equal to 1700.
#field::<=value
Use parametric constraint syntax to find documents that have a numeric
field with a value less than or equal to the specified number.
Query: #price::<=1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value less than or equal to 1700.
#field::>value1<value2
Use parametric constraint syntax to find documents that have a numeric
field with a value that falls between a range of specified numbers.
Query: #price::>1700<3900 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than 1700 and less than 3900.
#field::>=value1<=value2
Use parametric constraint syntax to find documents that have a numeric
field with a value that matches or falls between a range of specified
numbers.
Query: #price::>=1700<=3900 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than or equal to 1700 and less than or
equal to 3900.

#field::>value1<=value2
Use parametric constraint syntax to find documents that have a numeric
field with a value that matches the criteria in the specified range of
numbers.
Query: #price::>1700<=3900 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than 1700 and less than or equal to 3900.
#field::>=value1<value2
Use parametric constraint syntax to find documents that have a numeric
field with a value that matches the criteria in the specified range of
numbers.
Query: #price::>=1700<3900 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than or equal to 1700 and less than 3900.
#field::>"Date"
Use parametric constraint syntax to find documents that match a specific
date or date range.
Query: #date::>"2007-12-01"
Result: This query finds documents that were created or last modified on 1
December, 2007 or later.
ACL constraints: (security_tokens)
For security, you cannot specify access control constraints in the query
string. Use the setACLConstraints(String aclConstraints) method of the
Query interface to specify access control constraints for the query. You can
specify parentheses, plus signs (+), minus signs (-), circumflexes (^), and an
XML security context string in the ACL constraints string
(@SecurityContext::'securityContext'). For information about the
securityContext string syntax, see the Javadoc documentation that
describes the setACLConstraints method. The symbols have the same
meaning as described in the previous syntax descriptions.
ACL constraints string in setACLConstraints method: (michelle_c |
dev_group)
ACL constraints string in setACLConstraints method: michelle_c
@SecurityContext::’securityContext’
Query: thinkpad
Result: This query finds documents that include the term thinkpad and the
security tokens michelle_c or dev_group in the first case, and michelle_c
and the specified security context constraints in the second case.
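For example, a SIAPI application might combine a range constraint in the
query string with ACL constraints that are set through the Query interface.
The following fragment is a minimal sketch that assumes a Query factory
obtained as in the SIAPI sample applications; the security tokens are
illustrative:

Query q = factory.createQuery("#price::>=1700<=3900 laptop");
q.setACLConstraints("(michelle_c | dev_group)");

Documents are returned only if they match the query terms and contain at
least one of the specified security tokens.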

Query syntax characters for opaque terms

You can create query syntax for two types of opaque terms. An opaque term is one
that is expressed and handled by another query language, such as the XML query
languages XML Fragment and XPath. XML Fragment can also be used to query
UIMA structures. The prefix for an opaque term is @xmlf2:: (XML
fragment) or @xmlxp:: (XPath query). The XML fragment or the XPath query is
enclosed in single quotation marks (' ').

The expression xmlf2 is used for XML fragments, and xmlxp is used for XPath
terms. An opaque term has the following syntax: @syntax_name::'value'. The
expression starts with the @ sign, followed by the syntax name (xmlf2 or xmlxp),
two colons (::), and a value that is enclosed in single quotation marks (' '). The
value parameter is sometimes preceded by -, +, or ^. If you need to use a single
quotation mark in the value section of the expression, escape the single quotation
mark by using a backslash (\), for example, \'.

For negative terms, use a minus sign (-) before the @ symbol, for example,
-@xmlf2::'<person>michelle</person>'. However, Watson Content Analytics does
not accept negative unique query terms. The query -@xmlf2::'<person>michelle</
person>' does not return results. To get results, use one positive term in the query,
for example, documentation -@xmlf2::'<person>michelle</person>'.

In an XML fragment query, specify term modifiers inside an XML element. For
example:
@xmlf2::'<Element>IBM +computers</Element>'
@xmlf2::'<Element>IBM =computers</Element>'
@xmlf2::'<Element>IBM computers~</Element>'

In an XPath query, use the contains operator instead of the ftcontains operator to
restrict search results by the occurrence of a word. For example:
@xmlxp::'personarecord[country contains("Germany") or title contains("IBM")]'
@xmlf2::'<tag1> text1 </tag1>'
Use the @xmlf2:: prefix and enclose the query in single quotation marks to
indicate a fragment query as a new search and index API opaque term.
Query: @xmlf2::'<title>"Data Structures"</title>'
Result: This query finds documents that contain the phrase Data Structures
within the span of an indexed annotation called title.
@xmlf2::<tag1><.depth value="$number"><tag2> ... </tag2></.depth></tag1>
@xmlf2::<tag1><.depth value='$number'><tag2> ... </tag2></.depth></tag1>
The first query uses double quotation marks. The second query uses single
quotation marks. However, each query returns the same results. This query
syntax looks for occurrences of tag2 exactly $number levels under tag1.
$number is a positive integer. You can use single quotation marks (' ') or
double quotation marks (" ") around the numerical value. This query
syntax is not applicable to Unstructured Information Management
Architecture (UIMA).
Query (expressed on one line):
@xmlf2::'<author>Albert Camus<.depth value='1'><publisher>Carey Press</publisher></.depth></author>'
Result: This query finds documents in which the publisher occurs one level
under the author. A document with the following XML elements
<author>Albert Camus
<ISBN>002-12345</ISBN>
<country>USA
<publisher>Carey Press</publisher>
</country>
</author>

will not be returned with the example query because the publisher
(<publisher>) element occurs two levels under the author (<author>)
element.

@xmlf2::'<tag1> ... </tag1>'
You can distinguish between elements and attributes. Attributes are written
explicitly within the element.
You can define words and phrases within attributes, in the same way as
the normal terms of the query. However, you can write expressions only of
words and phrases, not of tags. These words or phrases support the same
features as the normal terms of the query.
Query: @xmlf2::'<author country="USA"></author>'
Result: This query finds documents where the author originates from the
USA.
Query:
@xmlf2::'<author country="USA">
<firstName>Michelle</firstName>
<lastName>Ropelatto</lastName></author>'
Result: This query finds documents where the author name is Michelle
Ropelatto and is from the USA.
@xmlf2::'+text1 ... +text2 -text3 ... -text4 text5'
Use a plus sign (+) or a minus sign (-) as a prefix to words or phrases
(phrases are always enclosed in quotation marks (" ")). At each query level,
whether for the text or the tag name, "+" means that the terms must appear,
"-" means that the terms must not appear, and unmodified terms are optional
and contribute only to ranking. If no "+" terms exist, then at least one of the
optional terms must appear. The data under elements creates a new nested query
level.
Query: @xmlf2::’+"Graph Theory" -network’
Result: This query finds documents that contain the phrase Graph Theory,
and do not contain the term network.
Query:
@xmlf2::'<book><author>hemingway</author> -<title>old man</title></book>'
Result: This query finds documents that contain a book by Hemingway
but not the book The Old Man and the Sea.
@xmlf2::'<tag1> <.or> ... </.or> <.and> ... </.and> </tag1>'
Use Boolean syntax for AND (<.and>) and OR (<.or>) expressions in a
query.
Query: @xmlf2::'<book><.or><author>Sylvia Plath</author><title>XML
-Microsoft</title></.or></book>'
Result: This query finds documents that specify a book whose author is
Sylvia Plath or where the title of the book includes the word XML but not
Microsoft.
@xmlf2::'<annotation1+annotation2> ... </annotation1+annotation2>'
You can express the concatenation of consecutive annotations in a fragment
query by using the plus sign (+) between the start and end tags of the
element. The consecutive annotations must overlap by at least one word
(they must intersect). The concatenation of two or more overlapping
annotations is a new virtual annotation that spans the sum of the text
spanned by the annotations.
Query: @xmlf2::'<Report+HoldsDuring> +Pakistan +March
+Reuters</Report+HoldsDuring>'

Result: This query finds documents from Reuters about events in Pakistan
in March that are contained in the concatenated annotation formed by the
“Report” and “HoldsDuring” annotations.
@xmlf2::'<annotation1*annotation2> ... </annotation1*annotation2>'
You can express the intersection of annotations in a fragment query using
the asterisk sign (*) between the start and end tags of an element. The
intersection of two or more overlapping annotations is a new virtual
annotation that spans just the text that is covered by the intersection of the
overlapping annotations.
Query: @xmlf2::'<Inhibits*Activates>Aspirin</Inhibits*Activates>'
Result: This query finds documents in which Aspirin occurs in both the
'Inhibits' and 'Activates' annotations.
@xmlxp::'/tag1/@tag1'
You can distinguish between elements (XML start and end tags) and
attributes. Attributes are written explicitly with a leading @ sign. The @
sign enables you to distinguish between elements and attributes that might
have the same name. Concatenations and intersections are applicable only
to UIMA documents, and not to pure XML documents, where spans do
not cross over by definition.
Query: @xmlxp::'/author[@country="USA"]'
Result: This query finds documents in which USA is included in the
character string that is the value of the attribute country that is associated
with author.
@xmlxp::'/tag1[tag2 or tag3 and tag4]'
Use full Boolean to express AND and OR scope in an XPath query.
Query: @xmlxp::'book[author ftcontains("Jose Perez") or title
ftcontains("XML -Microsoft")]'
Result: This query finds documents that specify a book whose author is
Jose Perez or where the title of the book includes the word XML, but not
Microsoft.
@xmlxp::'tag1//tag2/tag3'
You can distinguish between descendent nodes (//) and child nodes (/).
Query: @xmlxp::'/books//book/name'
Result: This query finds documents that specify a book element as a
descendant of a books element and that specify a name element as a direct
child of the book.
@xmlxp::'tag1/.../tagn'
Use the @xmlxp:: prefix and enclose the query in single quotation marks to
indicate an XPath query as a search and index API opaque term.
Query: @xmlxp::'books[booktitle ftcontains("Data Structures")]'
Result: This query finds documents that contain the phrase "Data
Structures" within the span of an indexed annotation called "booktitle."
Related reference:
“Controlling query behavior” on page 17

Query syntax structure
A query is a list of space-separated terms that follow a specific structure.

A query has the following structure:


<space>*{Query_term <space>+}* Query_term

Query terms can be one of the following types:


v Word
v Phrase
v AttributeConstraint
v CategoryConstraint
v RangeConstraint
v OrTerm
v OpaqueTerm
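
For example, assuming that title and price are defined index fields, the following
query combines several of these term types:

+title:laptop (IBM | Lenovo) #price::>=1700<=3900

Here +title:laptop is a mandatory fielded word, (IBM | Lenovo) is an OR term,
and #price::>=1700<=3900 is a range constraint.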

Appearance modifiers

Appearance modifiers control whether a query term is:


v Mandatory: it must appear within result documents.
v Forbidden: it must not appear within result documents.
v Mandatory but insufficient: documents that contain only insufficient terms do
not qualify as search results.

The semantics correspond to the BaseQueryTerm object's constants of the search
and index API: Appearance_Modifier = { + | - | ^ }
v + denotes a mandatory term
v - denotes a forbidden term
v ^ denotes a mandatory term that is not sufficient for a document to qualify as a
search result

OR terms cannot be denoted as forbidden: OR_Appearance_Modifier = { + | ^ }

Match type modifiers

Prematch type modifiers appear just before the word that they modify:
PreMatch_Type = { = | ~ }
v = denotes that the word should be matched as is, that is, it should not be
stemmed or lemmatized, and that the search should not be expanded to include
synonyms of the word
v ~ denotes that the search should be expanded to include synonyms of the word

Postmatch modifiers appear directly after the word that they modify:
PostMatch_Type = { * | ~ }
v * matches words having the indicated prefix
v ~ matches words that share the same base form, for example, stem or lemma
with this word

By default, words that are explicitly modified by an appearance modifier but not
by a match type use exact-match (“as is”) semantics.

Fielded search notation

A fielded search notation, or field name (token), is immediately followed by a
colon, that is, there is no space between the field name and the colon:
Field = field_name:

OR terms

An OR term is comprised of a sequence of ORable terms, separated by spaces and
an OR-SIGN, enclosed within parentheses. All query term types except “OrTerm”
and “OpaqueTerm” qualify as ORable terms. Parentheses surround the OR
expression, in which terms are separated from each other by three mandatory
sequences:
v One or more spaces
v Either a vertical bar '|' or the upper case word 'OR' (both notations are allowed)
v One or more spaces

Semantically, at least one of the OR-ed terms must appear in documents that
qualify as search results.
ORable_term = Query_term \ { OrTerm, OpaqueTerm }

OR-SIGN = | | OR

ORable_query = <space>*{ORable_term <space>+ OR-SIGN <space>+}* ORable_term

OrTerm = OR_Appearance_Modifier? ( ORable_query )

If no OR_Appearance_Modifier is given, a + is implicitly assumed. The individual
terms of the ORable_query cannot have appearance modifiers of their own.

Words and phrases


Word = PreMatch_Type? Appearance_Modifier? Field? value |
Appearance_Modifier? Field? value PostMatch_Type?
v A word (or fielded word), possibly with a single match type indicator (at the
beginning or the end), and possibly with an appearance modifier
v If neither a match type indicator nor a prefix indicator is given, it is
implementation dependent which form is searched
v If no appearance modifier is given, it is implementation dependent whether the
results must contain the term
v The value can contain the wildcard symbol ‘*' anywhere; however, at least one
non-wildcard character must exist in the value.

A phrase has the following structure (expressed on one line):

Phrase = Appearance_Modifier? Field? "<space>*{value<space>+}* value <space>*"
v A non-empty sequence of space-separated values inside quotation marks. The
field and appearance modifier are both optional.
v If no appearance modifier is given, a + is implicitly assumed.
v Each value inside the phrase is searched as is, that is, with exact-match
semantics.

Attribute, category, and range constraints


Attribute_Name = { language | source | doctype }
AttributeConstraint = $Attribute_Name::value

The $ sign is followed by an attribute name, which is followed by two colons and a
value. If no Appearance_Modifier is given, a + is implicitly assumed.

A category constraint has the following structure (expressed on one line):

CategoryConstraint = PreMatch_Type?Appearance_Modifier?taxonomy_id::category_id

A taxonomy ID is followed by two colons and a category ID:


v A match type of = restricts the documents to be members of the given category,
while ~ means that documents can also belong to descendants (subcategories) of
the given category.
v If no Match_Type is given, ~ is implicitly assumed.
v If no Appearance_Modifier is given, a + is implicitly assumed.

The parametric field must be greater than (or equal to in the second case) the
double value:
Grelation = > double_value | >= double_value

The parametric field must be less than (or equal to, in the second case) the double
value:
Lrelation = < double_value | <= double_value

The # character is followed by the field name, two colons, and at least one relation
(or =). The Appearance_Modifier can be either + or ^ (- is not allowed). If no
Appearance_Modifier is given, a + is implicitly assumed:
RangeConstraint = Appearance_Modifier?# field :: Grelation Lrelation? |
Appearance_Modifier?# field :: Grelation? Lrelation |
Appearance_Modifier?# field :: =double_value

Opaque terms

An @ sign is followed by some syntax name, two colons, and a value enclosed in
single quotation marks. The opaque term can be preceded by an appearance
modifier. If a single quote is needed in the value part, it should be escaped by \,
as in \':
OpaqueTerm = Appearance_Modifier?@ syntax_name :: ' value '

For the semantics of opaque terms, the search and index APIs:
v Do not attempt to parse the value inside the single quotation marks; rather, that
string is passed as-is to a parser that corresponds to the syntax_name.
v Do not define which external query languages should be supported by
implementations.
v Do not define how many opaque terms can exist inside a query, or how they
interact with the rest of the terms. All this is implementation defined. It is
assumed that in most cases, a query either consists solely of an opaque term, or
does not contain such terms at all.

Tokens, field names, and values in queries

Tokens, field names, and values have the following rules:


v Any sequence of characters without any of the special characters is a token.
v The characters = ( " have only special meaning if they are preceded by a space
or at the beginning of the query string. Thus, these characters can exist inside
tokens, but they cannot exist at the beginning of a token.

v An exception to the previous rule is that a '(' can begin a token inside an
OrTerm because OrTerms cannot be nested, and so '(' has no special meaning there.
v The characters + - ^ have special meaning only if they are preceded by a space,
by one of = ~ (, or at the beginning of the query string.
v The colon has meaning only as a separator between a field/constraint-type and
a value. The colon is considered a regular character in all other cases.
v The character ) has special meaning only inside an OrTerm, but outside of a
phrase inside the OrTerm. There, it will terminate the OrTerm. In all other cases,
it is considered a regular character.
v The character * has special meaning only for values; that is, wildcard characters
are not applied to field names.
v The sequences <, <=, >, >= have special meaning only within a range constraint.
v All special characters except " are considered regular characters inside a phrase:
they lose their special functions inside phrases. The " ends the phrase. This rule
trumps all previous rules.
v Wildcard characters are allowed inside phrases.

As a general rule, if one of the special characters has no meaning in a certain
setting, it is considered part of the token or value.

The behavior of the query parser is undefined for nonconforming strings. In some
cases, the parser implicitly overcomes problems, such as ending phrases that are
not terminated, and in some cases it does not overcome such problems.

ACL expression syntax

The syntax of ACL expressions is a subset of the full query syntax. Basically, it
consists of words, OR expressions over several words, and opaque terms.

ACL_Expression = <space>* {ACL_term<space>+}* ACL_term


ACL_term = { ACL_Value |
ACL_OrTerm |
Security_OpaqueTerm }

ACL_Value = Appearance_Modifier? value


v A value, possibly with an appearance modifier
v If no appearance modifier is given, ^ is assumed.

An ACL_OrTerm has the following structure (expressed on one line):

ACL_OrTerm = {+|^}? (<space>*{value <space>+ OR-SIGN<space>+}* value <space>*)

The sequence of values is separated by spaces and an OR-SIGN and enclosed
inside parentheses. The OrTerm optionally has an appearance modifier, either + or
^ (- is not allowed). If no Appearance_Modifier is given, ^ is implicitly assumed.
The individual values inside the OrTerm cannot have appearance modifiers of their
own.

Security opaque terms


Security_OpaqueTerm = Appearance_Modifier?@ syntax_name :: ' value '

An @ sign is followed by some syntax name, two colons, and a value enclosed in
single quotation marks. The opaque term can be preceded by an appearance
modifier. If a single quotation mark is needed in the value, it should be escaped by
\, as in \'.

The semantic disclaimers that were specified with respect to opaque terms in query
strings apply here.

The behavior of the ACL expression parser is undefined for nonconforming strings.
In some cases, the parser implicitly overcomes problems, such as ending OR-terms
that are not terminated, and in some cases it does not overcome such problems.

By default, all terms are assumed to be required but insufficient, as if qualified by
'^'. Although you can qualify an ACL_term by a +, doing so does not match the
typical use of ACLs as filters.

The expressiveness of this syntax is broader than simple document-level security.
First, it allows forbidden tokens, for example, “do not return documents that are
viewable with this ACL”. Second, the ability to include several OrTerms allows this
syntax to support multiple-level security:
(server_ACL1 | server_ACL2) (group_ACL1 | group_ACL2)
(user_ACL1 | user_ACL2)
Related reference:
“Controlling query behavior” on page 17

Real-time NLP API
The real-time natural language processing (NLP) API allows users to perform
ad-hoc text analytics on documents.

Real-time text analysis uses the existing text analytics resources that are defined for
a collection, but analyzes documents without adding them to the index. Users can
immediately check the analysis results without waiting for the index to be built or
updated.

Requirements

The following system setup is required to use the real-time NLP API:
v Real-time NLP requires a content analytics collection that hosts text analytics
resources. The collection must not be enabled to use IBM InfoSphere BigInsights.
v Administrators configure the collection for real-time NLP by configuring the
facet tree, dictionaries, and patterns for text extraction, just as they would for
typical content analytics collections. The result of real-time NLP reflects the
configuration of that collection.
v The parse and index sessions for the collection must be running because these
sessions provide the document processing engine for the real-time NLP API.
v Search sessions for the collection must be running because these sessions serve
as the gateway for the real-time NLP API.

Typical usage

The following steps summarize the typical workflow for using real-time NLP:
v A dictionary developer creates a content analytics collection with dictionaries for
testing results, and uses the real-time NLP API to examine how the dictionaries
attach facets for various input documents.
v A workflow system uses real-time NLP to determine how to process documents
based on the facets attached to the documents.
v An alert system constantly processes input documents, such as chat logs or news
feeds, and sends email to managers immediately if a particular facet is attached
to an input document.

A call of the real-time NLP API might require additional time if the call needs to
initialize a document processor. Document processors are initialized when parse
and index sessions or document processors are started, or when analytic resources
are deployed. Document processors are also initialized after the parse and index
configuration is changed. Real-time NLP API requests and normal document
processing, such as building the index, share the resources of the document
processors. Therefore, index creation might affect real-time NLP performance.
Similarly, real-time NLP API requests might affect the performance of index
creation.

Both SIAPI and REST API versions of the real-time NLP API are provided. The
NLP REST API accepts both text and binary content, but the SIAPI version only
accepts content in text format.

Restriction: The SIAPI version of the real-time NLP API is being deprecated and
will not be supported in future releases. Use the REST API version instead of the
SIAPI version to create custom applications.

The real-time NLP API is also supported with enterprise search collections for
advanced users.

Information about using the NLP REST API is available in the
ES_INSTALL_ROOT/docs/api/rest directory. For more information about the SIAPI
version of the NLP API, see the ES_INSTALL_ROOT/samples/siapi/
RealtimeNLPExample.java sample program.
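
As an illustration only, the following Java sketch posts a document to the
real-time NLP REST API. The host name, port, endpoint path, and query
parameter are placeholders, not the documented interface; look up the actual
specification in the ES_INSTALL_ROOT/docs/api/rest directory before use:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RealtimeNlpClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, path, and collection ID.
        URL url = new URL(
            "http://search-server:8393/api/v10/analysis?collection=col_12345");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        // Send the document text to be analyzed.
        OutputStream out = conn.getOutputStream();
        out.write("IBM opened a new laboratory in Tokyo.".getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}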
Related reference:
“Sample real-time NLP application” on page 119

Application security
The search and index APIs communicate remotely through HTTP to the Watson
Content Analytics search servers.

When the application issues remote search and index API requests that must be
secure, you must set the user name and password on the Service classes with a
valid user name that is stored in the user registry that is used for authentication.
Any requests that do not contain valid user names and passwords are rejected.

In an enterprise search application, the Properties object is passed in the call to the
getSearchService method or getBrowseService method. The Properties object
specifies property names called username and password.
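
For example, a remote application might set the credentials as follows. This is
a sketch: the host name, port, and credentials are placeholders, and the factory
class name follows the SIAPI sample applications that are provided with the
product; verify it against your installation.

import java.util.Properties;
import com.ibm.siapi.search.SearchFactory;
import com.ibm.siapi.search.SearchService;

// Obtain the remote search factory (class name as used in the SIAPI samples).
SearchFactory factory = (SearchFactory) Class.forName(
    "com.ibm.es.api.search.RemoteSearchFactory").newInstance();

Properties config = new Properties();
config.setProperty("hostname", "search-server.example.com"); // placeholder
config.setProperty("port", "8393");                          // placeholder
config.setProperty("username", "searchuser");                // placeholder
config.setProperty("password", "secret");                    // placeholder
SearchService searchService = factory.getSearchService(config);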

Watson Content Analytics supports HTTP basic authentication, HTTPS Secure
Sockets Layer (SSL) version 3 proxy servers, and proxy servers that require a user
name and password for basic authentication.

Applications and collections must have IDs. For applications that need to access
specific collections, the collection ID must be associated with the application ID.
You can specify which collections the application can access in the administration
console.
Related reference:
Programming guidance for developing secure search applications with Java
API

Document-level security
To support prefiltering and post-filtering of search results, the search request must
provide a user's security context by using the setACLConstraints method on the
Query object.

The user's security context is provided as an XML string as part of an opaque
query term, for example:
Query q = factory.createQuery("IBM");
q.setACLConstraints("@SecurityContext::'<User's Security Context XML string>'");

You can create the user's security context XML string in two ways:
v By using the identity management API to programmatically create the XML
string.
v By using Java String classes to create the XML string.
Use this method only if you cannot build applications with the identity
management API.
Related reference:
Programming guidance for developing secure search applications with Java
API

Identity management for single sign-on security
You can use the identity management APIs to create a single sign-on system that
manages the multiple identities of users and to automatically generate the security
context strings of users. IDs can be reused on subsequent searches without users
logging on multiple times.

How the identity management component works

With the identity management Java APIs, you can create an application to manage
the security credentials of your users. The following graphic shows how users log
in to a system such as WebSphere Portal and authenticate with the registry.

[Figure 2. How users log in to WebSphere Portal or other systems. The figure shows
a user logging in to WebSphere Portal, which authenticates the user against the
enterprise user registry. The identity management component maintains the user's
profile and produces the user's security context string, which the search runtime
component uses when the user searches collections.]

When users attempt to access an application, the identity management component
repeats the process of authenticating those users.

Sample code and Java APIs


You can access a sample Java program and Javadoc documentation for the identity
management component in the following locations:
IdentityManagementExample.java
A standalone sample program that is available in the ES_INSTALL_ROOT/
samples/siapi directory. You can build this code by running the ANT
command.
Javadoc documentation
Provides descriptions of the available APIs to build identity management
into your enterprise search applications. The Javadoc documentation is in
the ES_INSTALL_ROOT/docs/api/imc directory.

Running the sample application

To run the Java sample program, make sure that you have the following JAR files
in your class path:
v esapi.jar
v siapi.jar
v es.security.jar
v es.oss.jar

To run the sample program, enter the following command on a single command
line.
Windows
java -classpath %ES_INSTALL_ROOT%\lib\esapi.jar;%ES_INSTALL_ROOT%\lib\
siapi.jar;%ES_INSTALL_ROOT%\lib\es.security.jar;%ES_INSTALL_ROOT%\lib\
es.oss.jar;.
IdentityManagementExample
AIX® or Linux
java -classpath $ES_INSTALL_ROOT/lib/esapi.jar:$ES_INSTALL_ROOT/lib/
siapi.jar:$ES_INSTALL_ROOT/lib/es.security.jar:$ES_INSTALL_ROOT/lib/
es.oss.jar:.
IdentityManagementExample
Related reference:
Programming guidance for developing secure search applications with Java
API

Creating the user's security context XML string with the identity management API
The identity management API provides several Java classes that can be used to
create the user's security context (USC) XML string programmatically.

To create the USC XML string for a particular user, first instantiate a
SecurityContext object. The SecurityContext object contains a user name, an array
of Identity objects, and optionally a Single Sign-On (SSO) token. The user name
that is assigned to the SecurityContext is typically the value that the user specified
to log in to your application.

After you create a SecurityContext object, you create an array of Identity objects.
Each Identity object contains a user name and a password, a String array of
group tokens, a source type, and a domain identifier. If the SecurityContext object
contains an SSO token, then the user name is required but the password is
optional. For example:
SecurityContext context = new SecurityContext();
context.setUserID("uid=wpsadmin,o=default organization");

Identity[] identities = new Identity[1];
identities[0] = new Identity();
identities[0].setDomain("portalserver.ibm.com:9081");
identities[0].setType("wp");
identities[0].setUsername("uid=wpsadmin,o=default organization");

String[] groups = new String[3];
groups[0] = "uid=wpsadmin,o=default organization";
groups[1] = "all authenticated portal users";
groups[2] = "wpsadmins";
identities[0].setGroups(groups);
identities[0].setProperties(new Properties());

context.setIdentities(identities);

After you create the context, you can easily set the ACL constraints in the query by
calling the context.serialize(true) method. The Boolean parameter indicates that
the XML string values should be Base64 encoded to ensure proper transmission to
the search server. For example:
q.setACLConstraints("@SecurityContext::'" + context.serialize(true) + "'");
Related reference:
Programming guidance for developing secure search applications with Java
API

Crawler plug-ins
Crawler plug-ins are Java application programming interfaces (APIs) that you can
use to change content or metadata in crawled documents.

Data source crawler plug-ins

You can apply business and security rules to enforce document-level security and
add, update, or delete the crawled metadata and document content that is
associated with documents in an index. The data source crawler plug-in APIs
cannot be used with the web crawler.

You can also create a plug-in that extracts entries from archive files. The extracted
files can then be parsed individually and included in collections.

Restriction: The following type B data source crawlers do not support plug-ins to
extract or fetch documents from archive files:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler

Web crawler plug-ins

You can add fields to the HTTP request header that is sent to the origin server to
request a document. You can also view the content, security tokens, and metadata
of a document after the document is downloaded. You can add to, delete from, or
replace any of these fields, or stop the document from being parsed.

Web crawler plug-ins support two kinds of filtering: prefetch and postparse. You
can specify only a single Java class to be the web crawler plug-in, but because the
prefetch and postparse plug-in behaviors are defined in two separate Java
interfaces and because Java classes can implement any number of interfaces, the
web crawler plug-in class can implement either or both behaviors.

The web crawler plug-in has two specific plug-in types:


Prefetch plug-in
A prefetch plug-in is called before the crawler downloads a document.
Your plug-in is given the document URL, the fetch method, the HTTP
version, and the HTTP request header. Your plug-in can use these elements
to decide whether to modify the request header (for example, to add
cookies) or even to cancel the download.
Postparse plug-in
The postparse plug-in is called after any download attempt. Before the
plug-in is called, the target content is downloaded and parsed by the
crawler. The plug-in is given the document URL, the metadata that is
extracted by the crawler from various sources, and the document's content.

© Copyright IBM Corp. 2009, 2014 65


The plug-in can determine whether to alter any of these items in the
document and whether to save the content of the document before it is
parsed.

Javadoc documentation for crawler plug-ins

For detailed information about each plug-in API, see the Javadoc documentation in
the following directory: ES_INSTALL_ROOT/docs/api/.
Related concepts:
“Crawler plug-ins for non-web sources”
“Web crawler plug-ins” on page 75
“API documentation” on page 5
Related tasks:
“Creating and deploying a plug-in for archive files” on page 71
Related reference:
“Sample plug-in application for non-web crawlers” on page 121

Crawler plug-ins for non-web sources


Data source crawler plug-ins are Java applications that can change the content or
metadata of crawled documents. You can configure a data source crawler plug-in
for all non-web crawler types.

With the crawler plug-in for data source crawlers, you can add, change, or delete
crawled content or metadata. You can also create a plug-in for extracting files from
archive files and extend that plug-in to enable users to view the extracted content
when they view the search results.

Restriction: The following type B data source crawlers do not support plug-ins to
extract or fetch documents from archive files:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler

When you specify the Java class as the new crawler plug-in, the crawler calls the
class for each document that it crawls.

For each document, the crawler passes to your Java classes the document identifier,
the security tokens, the metadata, and the content that was specified by an
administrator. Your Java class can return a new or modified set of security,
metadata, and content.

Restriction: The crawler plug-in allows you to add security tokens, but it does not
allow you to access the native access control lists (ACLs) that are collected by the
crawlers that are provided with Watson Content Analytics.
Related concepts:
“Crawler plug-ins” on page 65
Related reference:

“Sample plug-in application for non-web crawlers” on page 121

Creating a crawler plug-in for type A data sources


You can create a Java class to programmatically update the value of security
tokens, metadata, and the document content of type A data sources.

About this task

When the crawler session starts, the plug-in process is forked. An
AbstractCrawlerPlugin object is instantiated with the default constructor and the
init, isMetadataUsed, and isContentUsed methods are called one time. During the
crawler session, the activate method is called when the crawler starts its crawling
and the deactivate method is called when the crawler finishes its crawling. When
the crawler session ends, the term method is called and the object is destroyed. If
the crawler scheduler is enabled, the activate method is called when the crawling
is scheduled to start and the deactivate method is called when the crawling is
scheduled to end. Because a single crawler session runs continuously when the
crawler scheduler is enabled, the term method is not called to destroy the object.
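
The following is a minimal sketch of such a plug-in. The method signatures
follow the sample plug-in that is provided with the product; verify them against
the Javadoc documentation in the ES_INSTALL_ROOT/docs/api directory:

import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData;
import com.ibm.es.crawler.plugin.CrawlerPluginException;

public class MyPlugin extends AbstractCrawlerPlugin {
    public void init() throws CrawlerPluginException {
        // One-time setup when the plug-in process starts.
    }
    public boolean isMetadataUsed() {
        // Return true so that the crawler passes metadata to updateDocument.
        return true;
    }
    public boolean isContentUsed() {
        return false;
    }
    public void activate() throws CrawlerPluginException { }
    public void deactivate() throws CrawlerPluginException { }
    public void term() throws CrawlerPluginException { }

    public CrawledData updateDocument(CrawledData crawledData)
            throws CrawlerPluginException {
        // Inspect or modify security tokens, metadata, or content here,
        // then return the (possibly modified) object.
        return crawledData;
    }
}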

Tip: For information about creating a crawler plug-in for the following type B data
sources, see “Creating a crawler plug-in for type B data sources” on page 69:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler

Procedure

To create a Java class for use as a crawler plug-in with content-related functions for
type A data sources:
1. Extend com.ibm.es.crawler.plugin.AbstractCrawlerPlugin and implement the
following methods:
init()
isMetadataUsed()
isContentUsed()
activate()
deactivate()
term()
updateDocument()
The AbstractCrawlerPlugin class is an abstract class. The init, activate,
deactivate, and term methods are implemented to do nothing. The
isMetadataUsed method and isContentUsed method are implemented to return
false by default. The updateDocument method is an abstract method, so you
must implement it.
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
2. Compile the implemented code and make a JAR file for it. Add the
ES_INSTALL_ROOT/lib/dscrawler.jar file to the class path when you compile.
3. In the administration console, follow these steps:
a. Edit the appropriate collection.
b. Select the Crawl page and edit the crawler properties for the crawler that
will use the custom Java class.

c. Specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.ibm.plugins.MyPlugin. When you specify the class name, ensure that
you do not specify the file extension, such as .class or .java.
v The fully qualified class path for the JAR file and the directory in which
all files that are required by the Java class are located. Ensure that you
include the name of the JAR file in your path declaration, for example,
C:\plugins\Plugins.jar. If you need to specify multiple JAR files, ensure
that you use the correct separator depending on your platform, as shown
in the following examples:
– AIX or Linux: /home/esadmin/plugins/Plugins.jar:/home/esadmin/
plugins/3rdparty.jar
– Windows: C:\plugins\Plugins.jar;C:\plugins\3rdparty.jar
4. On the Crawl page, click Monitor. Then, click Stop and Start to restart the
session for the crawler that you edited. Click Details and start a full crawl.

Results

If the crawler stops when it is loading the plug-in, view the log file and verify that:
v The class name and class path that you specified in the crawler properties page
are correct.
v All necessary libraries are specified for the plug-in class path.
v The crawler plug-in does not throw a CrawlerPluginException error.

Tip: If a crawler gets NullPointerException after it is configured to use a custom
crawler plug-in, override
com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#isMetadataUsed() to return
true instead of false.

Metadata field definitions: If you want to add a new metadata field in your
crawler plug-in, you must create an index field and add the metadata field to the
collection by configuring parsing and indexing options in the administration
console. Ensure that the name of the metadata field is the same as the name of the
index field.

The following methods in the FieldMetadata class are deprecated. These field
characteristics are overwritten by field definitions in the parser configuration:
public void setSearchable(boolean b)
public void setFieldSearchable(boolean b)
public void setParametricSearchable(boolean b)
public void setAsMetadata(boolean b)
public void setResolveConflict(String string)
public void setContent(boolean b)
public void setExactMatch(boolean b)
public void setSortable(boolean b)

Using PluginLogger to log messages: The PluginLogger is a class that you can
use to include log statements from the plug-in in the Watson Content Analytics log
files. To use the PluginLogger, specify the following statement in the import
statements:
import com.ibm.es.crawler.plugin.logging.PluginLogger;

Add the following statements after the start of the class declaration:

/** Logger */
private static final PluginLogger logger;
static {
PluginLogger.init(PluginLogger.LOGTYPE_OSS,PluginLogger.LOGLEVEL_INFO);
logger = PluginLogger.getInstance();
}
/** End Logger **/

In the updateDocument section, add the following statements to output test
logging statements of the type INFO, WARN, and ERROR:
/* Testing Logging Statements*/
logger.info("This is info.");
logger.warn("This is warning.");
logger.error("This is error.");
/* End Testing Logging Statements */

With the default collection settings, these statements cause warning and error
messages to be shown in the collection log file. For example:
W FFQD2801W 2013/04/27 23:02:05.619 CDT plugin plugin.WIN_50605.crawlerplugin
FFQD2801W A warning was generated from the crawler plug-in.
Message: This is a warning message.
E FFQD2800E 2013/04/27 23:02:05.681 CDT plugin plugin.WIN_50605.crawlerplugin
FFQD2800E An error was generated from the crawler plug-in.
Message: This is an error message.

To show informational messages in the collection log file, open the administration
console. Select the collection, click Actions > Logging > Configure log file
options, and then select All messages for the type of information to log and trace.
After you stop and restart the crawler session, informational messages appear in
the collection log file.
Related tasks:
Configuring search fields
Related reference:
“Sample plug-in application for non-web crawlers” on page 121

Creating a crawler plug-in for type B data sources


You can create a Java class to programmatically update the value of metadata and
the document content of Agent for Windows file systems, BoardReader, Case
Manager, Exchange Server, FileNet P8, and SharePoint data sources.

About this task

Unlike type A data source crawler plug-ins, the type B data source crawler plug-in
process is not forked. The plug-in always runs in the same process of the crawler.
When the crawler session starts, a CrawlerPlugin object is instantiated with the
default constructor. During the crawler session, the activate method is called
when the crawler starts its crawling and the deactivate method is called when the
crawler finishes its crawling. When the crawler session ends, the object is
destroyed. If the crawler scheduler is enabled, the activate method is called when
the crawling is scheduled to start and the deactivate method is called when the
crawling is scheduled to end. Because a single crawler session runs continuously
when the crawler scheduler is enabled, the object is not destroyed.

Procedure

To create a Java class for use as a crawler plug-in for type B data sources:

1. Extend com.ibm.ilel.crawler.plugin.CrawlerPlugin and implement the
following methods:
activate()
deactivate()
updateDocument()
The CrawlerPlugin class is an abstract class. The activate and deactivate
methods are implemented to do nothing. The updateDocument method is an
abstract method, so you must implement it.

Deprecated methods: The init and term methods in the CrawlerPlugin class
are deprecated. For compatibility purposes, the init method is called at the
same time as the activate method when the crawler starts its crawling and the
term method is called at the same time as the deactivate method when the
crawler stops its crawling. Do not use the init and activate methods in the
same plug-in. Similarly, do not use the deactivate and term methods in the
same plug-in.
For name resolution, use one of the following JAR files:
v AIX or Linux: $ES_INSTALL_ROOT/lib/ilel-crawler.jar
v Windows: %ES_INSTALL_ROOT%\lib\ilel-crawler.jar
2. Compile the implemented code and create a JAR file for it. Add the
ilel-crawler.jar file to the class path when you compile.
3. In the administration console, follow these steps:
a. Edit the appropriate collection.
b. Select the Crawl page and edit the crawler properties for the crawler that
will use the custom Java class.
c. Specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.ibm.plugins.MyPlugin. When you specify the class name, ensure that
you do not specify the file extension, such as .class or .java.
v The fully qualified class path for the JAR file and the directory in which
all files that are required by the Java class are located. Ensure that you
include the name of the JAR file in your path declaration, for example,
C:\plugins\Plugins.jar. If you need to specify multiple JAR files, ensure
that you use the correct separator depending on your platform, as shown
in the following examples:
– AIX or Linux: /home/esadmin/plugins/Plugins.jar:/home/esadmin/
plugins/3rdparty.jar
– Windows: C:\plugins\Plugins.jar;C:\plugins\3rdparty.jar
4. On the Crawl page, click Monitor. Then, click Stop and Start to restart the
session for the crawler that you edited. Click Details and start a full crawl.

Results

If the crawler stops when it is loading the plug-in, view the log file and verify that:
v The class name and class path that you specified in the crawler properties page
are correct.
v All necessary libraries are specified for the plug-in class path.
v The crawler plug-in does not throw a CrawlerPluginException error.

Metadata field definitions: If you want to add a new metadata field in your
crawler plug-in, you must create an index field and add the metadata field to the
collection by configuring parsing and indexing options in the administration
console. Ensure that the name of the metadata field is the same as the name of the
index field.
Related tasks:
Configuring search fields

Creating and deploying a plug-in for archive files


Crawler plug-ins for archive files are Java application programming interfaces
(APIs) to which you can add your own logic. You can use this type of plug-in with
type A data source crawlers to extract entries from archive files, which can then be
parsed and included in collections.

Before you begin

Ensure that the correct version of Java is installed. The crawler plug-in for archive
files must be compiled with the IBM Software Development Kit (SDK) for Java
Version 1.6.

Restriction: You cannot use this plug-in with the following type B data source
crawlers:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler

About this task

Type A data source crawlers provide a plug-in interface that enables you to extend
their crawling capabilities and crawl archive files in Watson Content Analytics. The
crawler uses the specified crawler plug-in for archive files to extract archive entries
from an archive file and send the extracted archive entries to the parsers.

To use this capability, you must develop a crawler plug-in for archive files that
implements the com.ibm.es.crawler.plugin.archive.ArchiveFile interface and register
the plug-in in the crawler configuration file.

Important: To enable users to fetch and view files that are extracted from an
archive file when they view search results, you must extend your archive plug-in
to view extracted files.

Procedure

To create and deploy a plug-in for archive files:


1. Create a Java class to use as a crawler plug-in for archive files.
a. Implement the com.ibm.es.crawler.plugin.archive.ArchiveFile interface and
implement the following methods:
public interface ArchiveFile {
/**
* Creates a new archive file with the specified InputStream instance.
*/
public void open(InputStream input) throws IOException;

/**
* Close this archive file.
*/
public void close() throws IOException;

/**
* Reads the next archive entry and positions stream at the beginning of
* the entry data.
*
* @param charset the name of charset
* @return the next entry
*/
public ArchiveEntry getNextEntry(String charset) throws IOException;

/**
* Returns an input stream of the current archive entry.
*
* @return the input stream
*/
public InputStream getInputStream() throws IOException;
}

For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.


b. Implement the com.ibm.es.crawler.plugin.archive.ArchiveEntry interface and
implement the following methods:
public interface ArchiveEntry {
/**
* Returns the name of this entry.
*
* @return the name of this entry
*/
public String getName();

/**
* Returns the modify time of this entry.
*
* @return the modify time of this entry
*/
public long getTime();

/**
* Returns the length of file in bytes.
*
* @return the length of file in bytes
*/
public long getSize();

/**
* Tests whether the entry is a directory.
*
* @return true if the entry is a directory
*/
public boolean isDirectory();
}
c. Compile the implemented code and create a JAR file for it. Add the
dscrawler.jar file to the class path when you compile. The crawler plug-in
for archive files must be compiled with the IBM Software Development Kit
(SDK) for Java Version 1.6.
2. Verify the crawler plug-in with the
com.ibm.es.crawler.plugin.archive.ArchiveFileTester class. Add the
dscrawler.jar file and your plug-in code to the class path when you run this
Java application.

a. List the archive entries with your plug-in code. Confirm that this command
returns correct information about the archive file.
AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/
dscrawler.jar:path_to_plugin_jar
com.ibm.es.crawler.plugin.archive.ArchiveFileTester
plugin_classname -tv input_archive_filepath
Windows
java -classpath %ES_INSTALL_ROOT%\lib\
dscrawler.jar;path_to_plugin_jar
com.ibm.es.crawler.plugin.archive.ArchiveFileTester
plugin_classname -tv input_archive_filepath
b. Extract the archive entries with your plug-in code. Confirm that this
command extracts all archive entries successfully.
AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/
dscrawler.jar:path_to_plugin_jar
com.ibm.es.crawler.plugin.archive.ArchiveFileTester
plugin_classname -xv input_archive_filepath
Windows
java -classpath %ES_INSTALL_ROOT%\lib\
dscrawler.jar;path_to_plugin_jar
com.ibm.es.crawler.plugin.archive.ArchiveFileTester
plugin_classname -xv input_archive_filepath
3. Deploy the crawler plug-in.
a. In the administration console, stop the crawler that you want to use with
your crawler plug-in for archive files.
b. Create a configuration file named crawler_typecrawler_ext.xml in the
following location, where crawler_ID identifies the crawler that
you want to configure, and crawler_type identifies the prefix of the existing
crawler configuration file. The existing file is named
crawler_typecrawler.xml and it is located in the ES_NODE_ROOT/
master_config/crawler_ID directory.
AIX or Linux
$ES_NODE_ROOT/master_config/crawler_ID/
crawler_typecrawler_ext.xml
Windows
%ES_NODE_ROOT%/master_config/crawler_ID/
crawler_typecrawler_ext.xml
c. Use a text editor to update the crawler_typecrawler_ext.xml file and add
the rules for your crawler plug-in for archive files. Here is a template
crawler configuration file for enabling your crawler plug-in for archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">archive_file_type</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">plugin_classname</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">path_to_required_jars</SetAttribute>

<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">archive_file_extension</AppendChild>
</ExtendedProperties>

where:
archive_file_type
Specifies the type of the archive files.
plugin_classname
Specifies the fully qualified class name of your crawler plug-in for
archive files.
path_to_required_jars
Specifies the class path entries, delimited by the path separator, that are
required to run your crawler plug-in for archive files.
archive_file_extension
Specifies the file extension of the archive files that you want to process
with your crawler plug-in for archive files.
d. Restart the crawler that you stopped.

Example

Here is a sample crawler configuration for enabling the crawler plug-in for LZH
archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">lzh</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">.lzh</AppendChild>
</ExtendedProperties>
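
For comparison, the following is a minimal sketch of an ArchiveFile
implementation for ZIP archives that is backed by java.util.zip. The sketch
ignores the charset argument and assumes that returning null from getNextEntry
signals the end of the archive; verify this behavior against the Javadoc
documentation.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import com.ibm.es.crawler.plugin.archive.ArchiveEntry;
import com.ibm.es.crawler.plugin.archive.ArchiveFile;

public class SampleZipFile implements ArchiveFile {
    private ZipInputStream zipStream;

    public void open(InputStream input) throws IOException {
        zipStream = new ZipInputStream(input);
    }

    public void close() throws IOException {
        zipStream.close();
    }

    public ArchiveEntry getNextEntry(String charset) throws IOException {
        // The charset parameter is ignored in this sketch; ZipInputStream
        // decodes entry names itself.
        final ZipEntry entry = zipStream.getNextEntry();
        if (entry == null) {
            return null;
        }
        return new ArchiveEntry() {
            public String getName() { return entry.getName(); }
            public long getTime() { return entry.getTime(); }
            public long getSize() { return entry.getSize(); }
            public boolean isDirectory() { return entry.isDirectory(); }
        };
    }

    public InputStream getInputStream() throws IOException {
        // ZipInputStream itself reads the data of the current entry.
        return zipStream;
    }
}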
Related concepts:
“Crawler plug-ins” on page 65
“API documentation” on page 5

Extending the archive plug-in to view extracted files


You can create a crawler plug-in that enables users to view documents that are
extracted from archive files, such as .zip, .tar, or .rar files.

Watson Content Analytics provides Java APIs for implementing a crawler plug-in
that extracts archive entries from archive files that are crawled by type A data
source crawlers. The fetch capabilities, however, do not allow users to view the
extracted files. You can extend the archive plug-in so that users can fetch and view
documents that are extracted from archive files. To implement the plug-in, you use
the same implementation that you use for other type A data source crawler
plug-ins.

Restriction: You cannot use this plug-in with the following type B data source
crawlers:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler

To register the plug-in, update the customcommunication.properties file and add
the following properties:
es.ext.dirs.type=classpath
archive.plugin.type=classname;.extension

where:
type
Specifies the identifier of the archive document type, such as .rar or .lzh. You
can also choose your own type.
classpath
Specifies the list of paths for the class path that is required to run your archive
plug-in. Separate the paths by a semicolon (;) on Windows or a colon (:) on
AIX or Linux.
classname
Specifies the class name of your archive plug-in.
extension
Specifies the file extension. Your archive plug-in is invoked for the files that
match this extension.

The following example shows a sample customcommunication.properties file that
registers an archive plug-in named RarFile to view documents extracted from .rar
files:
# extension files and directories
es.ext.dirs=C:\\Program Files\\IBM\\es\\lib\\es.repo.jar;C:\\Program
Files\\IBM\\es
\\lib\\rdsutil.jar;C:\\Program Files\\IBM\\es\\lib\\ESSearchServer.jar;C:\\Program
Files\\IBM\\es
\\lib\\trevi.tokenizer.jar;C:\\Program Files\\IBM\\es\\lib\\es.workmgr.jar;
C:\\Program Files\\IBM\\es\\lib\\dscrawler.jar;

es.ext.dirs.rar=C:\\rarplugin;C:\\rarplugin\\rarplugin.jar;
archive.plugin.rar=RarFile;.rar
Related concepts:
“API documentation” on page 5

Web crawler plug-ins


The web crawler plug-in provides two types of plug-ins: a prefetch plug-in and a
postparse plug-in.

With the prefetch plug-in, you can use Java APIs to add fields to the HTTP request
header that is sent to the origin server to request a document.

With the postparse plug-in, you can use Java APIs to view the content, security
tokens, and metadata of a document before the document is parsed and tokenized.
You can add to, delete from, or replace any of these fields, or stop the document
from being sent to the parser.

If your plug-in requires Java classes or non-Java libraries or other files besides the
plug-in, you must write the plug-in to handle that requirement. For example, your
plug-in can invoke a class loader to bring in more Java classes and can also load
libraries, make network connections, make database connections, or do anything
else that it needs.

Plug-ins run as part of the crawler JVM process. Exceptions and errors will be
caught, but crawler performance is affected by plug-in execution. You should write
plug-ins to do the minimum amount of processing and catch all anticipated
exceptions. Plug-in code must be multithread-safe. If you have 200 concurrent
downloads, you might have 200 concurrent calls to your plug-in.

Using a plug-in to crawl secure WebSphere Portal sites

If application security is enabled in WebSphere Application Server and you want to
crawl secure WebSphere Portal sites with the web crawler, you must create a
crawler plug-in to handle the form-based authentication requests. For a discussion
about form-based authentication and a sample program that you can adapt for
your custom web crawler plug-in, see http://www.ibm.com/developerworks/
db2/library/techarticle/dm-0707nishitani.

The plug-in is required if you use the web crawler to crawl any sites through
WebSphere Portal, including Workplace Web Content Management sites and
Lotus® Quickr® sites.
Related concepts:
“Crawler plug-ins” on page 65

Creating a prefetch plug-in for the web crawler


To create a prefetch plug-in, you write a Java class that implements the interface
com.ibm.es.wc.pi.PrefetchPlugin.

Procedure

To create a prefetch plug-in:


1. Create a class that implements the com.ibm.es.wc.pi.PrefetchPlugin interface
and implement the following methods:
public class MyPrefetchPlugin implements com.ibm.es.wc.pi.PrefetchPlugin {
    public MyPrefetchPlugin() { ... }
    public boolean init() { ... }
    public boolean processDocument(PrefetchPluginArg[] args) { ... }
    public boolean release() { ... }
}
The init method is called once when the plug-in is instantiated. If you specify
a plug-in class, the crawler loads that class when the crawler is started and
creates a single instance of it. Your plug-in class must have a no-argument
constructor. After creating the instance of the class, the crawler calls the
init method before the first use. This method does the required setup tasks
that cannot be done until an instance of the class is in memory.

If the plug-in should not be used or if other errors occur, the init method
can return false, and the crawler removes this instance from the list of
prefetch plug-ins. If the init method returns true, the plug-in is ready for
use. The init method cannot throw an exception.
The processDocument method is called on the single plug-in instance for every
document that will be downloaded. The crawler uses from one to several hundred
download threads, which run asynchronously, so this method can be called from
multiple threads concurrently and must be thread-safe. You can make it
thread-safe by wrapping its entire contents in a synchronized block, but that
permits only one thread to execute the method at a time, which makes the
crawler effectively single-threaded during plug-in operation and creates a
performance bottleneck. A better way is to use local (stack) variables for all
state, which minimizes the amount of global data, and to synchronize only
access to objects that are shared between threads. The processDocument method
cannot throw an exception. It returns true to indicate successful processing
of a document or false to indicate a problem; a false return value is logged
with the URL by the crawler.

The release method is called once when the crawler stops to allow the plug-in
object to release any system resources or flush any queued objects. This
method cannot throw exceptions. A return value of true means success; a false
result is logged.
For name resolution, use the ES_INSTALL_ROOT/lib/URLFetcher.jar file.
2. Compile the implemented code and make a JAR file for it. Add the
ES_INSTALL_ROOT/lib/URLFetcher.jar file to the class path when you compile.
3. In the administration console, follow these steps:
a. Edit the appropriate collection.
b. Select the Crawl page and edit the crawler properties for the crawler that
will use the custom Java class.
c. Specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.ibm.plugins.MyPlugin. When you specify the class name, ensure that
you do not specify the file extension, such as .class or .java.
v The class path for the plug-in, including all needed JAR files. Ensure that
you include the name of the JAR files in your path declaration, for
example, /ics/plugins/Plugins.jar
d. Stop and restart the session for the crawler that you edited. Then, start a
full crawl.

Results

If an error occurs and the web crawler stops while it is loading the plug-in,
view the log file and verify that:
v The class name and class path that you specified on the crawler properties
page are correct.
v All necessary JAR files were specified for the plug-in class path.
v The crawler plug-in does not throw CrawlerPluginException or any other
unexpected exception, and no fatal errors occur in the plug-in.


Prefetch plug-in example

You can use a prefetch plug-in to add a cookie to the HTTP request header before
the document is downloaded.
package com.mycompany.ofpi;

import com.ibm.es.wc.pi.PrefetchPlugin;
import com.ibm.es.wc.pi.PrefetchPluginArg;
import com.ibm.es.wc.pi.PrefetchPluginArg1;

public class MyPrefetchPlugin implements PrefetchPlugin {
    public boolean init() { return true; }
    public boolean release() { return true; }
    public boolean processDocument(PrefetchPluginArg[] args) {
        PrefetchPluginArg1 arg = (PrefetchPluginArg1)args[0];
        String header = arg.getHTTPHeader();
        header = header.substring(0, header.lastIndexOf("\r\n"));
        header += "Cookie: class=TestPrefetchPlugin\r\n\r\n";
        arg.setHTTPHeader(header);
        return true;
    }
}

This example shows:

v The first element ([0]) in the argument array that is passed to your plug-in is an
object of type PrefetchPluginArg1, which is an interface that extends the
interface PrefetchPluginArg. This is the only argument and the only argument
type that is passed to the prefetch plug-in. You can safely cast to it. To be
completely safe, you can enclose the cast in a try/catch block and look for a
ClassCastException object or do an "instanceof" test first.
v After you have the argument, you can call any method in the
PrefetchPluginArg1 interface. The getURL method returns the URL (in String
form) of a document that the crawler downloads. You can use this URL to
decide if the document requires additional information in the request header,
such as a cookie.
v The getHTTPHeader method returns a String that contains all of the content
of the HTTP request header that the crawler sends to download the document.
The plug-in can inspect and modify this header if necessary. For example, a
single cookie, or any other information that is valid for an HTTP request
header, can be added to the header. You can also remove any of this
information. If you modify the header, you must conform to HTTP protocol
requirements. For example, every line must end with a CRLF sequence, and the
header must use ISO-8859-1 encoding.
v The setHTTPHeader method sets the request header that you modified. After
the processDocument method returns, the web crawler parses the modified
header and extracts the additional headers to add to the actual request
header. The method line and internally generated headers, such as the
authentication and host headers, are protected against this modification.
v The processDocument method is called once for every document that the
crawler downloads. If the processDocument method returns false, its results
are ignored. If it returns true, the crawler applies the changes that the
plug-in made. To stop the download, the plug-in calls the setFetch(false)
method, as shown in the sketch after this list.
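
The following sketch combines these techniques: it stops downloads from a
hypothetical excluded path with setFetch(false) and adds a cookie for one site
only, based on getURL. The mysite.com URLs and the cookie value are
placeholders:
package com.mycompany.ofpi;

import com.ibm.es.wc.pi.PrefetchPlugin;
import com.ibm.es.wc.pi.PrefetchPluginArg;
import com.ibm.es.wc.pi.PrefetchPluginArg1;

public class ConditionalPrefetchPlugin implements PrefetchPlugin {
    public boolean init() { return true; }
    public boolean release() { return true; }
    public boolean processDocument(PrefetchPluginArg[] args) {
        PrefetchPluginArg1 arg = (PrefetchPluginArg1)args[0];
        String url = arg.getURL();
        if (url.startsWith("http://mysite.com/private/")) {
            arg.setFetch(false); // stop this download entirely
            return true;
        }
        if (url.startsWith("http://mysite.com/")) {
            // Add a cookie for this site only, preserving the final CRLF pair
            String header = arg.getHTTPHeader();
            header = header.substring(0, header.lastIndexOf("\r\n"));
            header += "Cookie: session=example\r\n\r\n";
            arg.setHTTPHeader(header);
        }
        return true;
    }
}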

Deploying a prefetch plug-in

To identify your plug-in class to the crawler, put the class in a JAR file and
enter the name of the plug-in class and the location of the JAR file in the
crawler window in the administration console. You must enter the fully
qualified name of the plug-in class and the absolute path name of the JAR file.
Procedure

To deploy a prefetch plug-in:


1. Compile the Java file and create a JAR file for it. The JAR file can also
contain supporting classes and resources, so you might name it ofplugins.jar.

Requirement: Use the Java JAR utility that is included in the Java
Development Kit (JDK).
2. Copy this JAR file to the computer that runs the web crawler. Enter the
absolute path for the JAR file in the administration console on the crawler
window when you enable the plug-in.
3. In the administration console, specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.mycompany.ofpi.MyPrefetchPlugin
v The qualified class path for the JAR file
Ensure that the information that you enter is correct. The system does not
check that the JAR file exists.
When the crawler is started and finds a plug-in JAR file and class name, the
crawler loads the JAR and instantiates the class by using the no-argument
constructor. The crawler then initializes the instance by calling the init method.
If that method returns true, the plug-in is added to the list of prefetch plug-ins.

Results

After you run the crawler, the return value is logged in the collection log
file as an informational message. To see informational messages, choose All
messages as the type of information to log.

Creating a postparse plug-in for the web crawler

With the postparse plug-in, you can use Java APIs to view the content, security
tokens, and metadata of a document that is parsed by the simple HTML parser
that is provided by the web crawler. You can add to, delete from, or replace
any of these fields, or stop the document from being sent to the document
processing pipeline, including the parser, tokenizer, and indexer.

About this task

To create a postparse plug-in, you write a Java class that implements the
interface com.ibm.es.wc.pi.PostparsePlugin, for example:
public class MyPostparsePlugin implements com.ibm.es.wc.pi.PostparsePlugin {
    public MyPostparsePlugin() { ... }
    public boolean init() { ... }
    public boolean processDocument(PostparsePluginArg[] args) { ... }
    public boolean release() { ... }
}

The plug-in class can implement both interfaces, but it needs only one init
method and one release method. If the class does both prefetch and postparse
processing, you need to initialize and release resources for both tasks. Both the
init method and the release method are called once.

The processDocument method is called on the single plug-in instance for every
URL for which a download was attempted. Not all downloads return content. The
HTTP return codes, such as 200, 302, or 404, can be used by your plug-in to
determine what to do when it is called. If content was obtained and if the
content was suitable for HTML parsing, the content is put through the parser,
and the results of parsing are available when your plug-in is called.

Postparse plug-in examples

The following example shows how to add security ACLs to the metadata that the
crawler sends with documents that are downloaded from a particular site. You can
use a postparse plug-in to add those ACLs just before the crawler writes the
document to the parser's input buffer:
package com.mycompany.ofpi; // Plug-ins

import com.ibm.es.wc.pi.*;

public class MyPostparsePlugin implements PostparsePlugin {

    public MyPostparsePlugin() { }
    public boolean init() { return true; }
    public boolean release() { return true; }

    public boolean processDocument(PostparsePluginArg[] args) {
        try {
            PostparsePluginArg1 arg = (PostparsePluginArg1)args[0];
            if (arg.getURL().startsWith("http://mysite.com/users/")) {
                // Extract the user name from the URL, look up the
                // appropriate tokens, and build a comma-separated list
                // of the additional ACLs.
                String acls = "";
                arg.addSecurityACLs(acls);
            }
            return true;
        } catch (Exception e) {
            return false; // disregard returned results
        }
    }
}

You can also use a postparse plug-in to add a new metadata field to your
crawled documents. For example, if some of your documents contain a particular
facet value, you might want to add a metadata field called
"MyUserSpecificMetadata", with various "searchability" attributes, that
contains a string that you look up while the crawler is running. In another
example, because the built-in parsers cannot extract metadata from binary
documents, you might want to add enterprise-specific metadata to binary
documents after they are crawled to ensure that the metadata fields can be
searched when users search the collection.

The following example shows how to add a metadata field:
import com.ibm.es.wc.pi.*;

public class MyPostparsePlugin implements PostparsePlugin {

    private static final String keyword = "mykeyword"; // keyword to look for (placeholder)

    public MyPostparsePlugin() { }
    public boolean init() { return true; }
    public boolean release() { return true; }

    public boolean processDocument(PostparsePluginArg[] args) {
        try {
            PostparsePluginArg1 arg = (PostparsePluginArg1)args[0];
            if (arg.getContent() != null && arg.getContent().length > 0) {
                String content = new String(arg.getContent(), arg.getEncoding());
                if (content.indexOf(keyword) > 0) {
                    final String userdata = "..."; // look up the string by keyword
                    FieldMetadata mf = new FieldMetadata(
                        "MyUserSpecificMetadata", // field name
                        userdata,                 // field value
                        false,                    // searchable?
                        true,                     // field-searchable?
                        false,                    // parametric-searchable?
                        true,                     // can be extracted by search?
                        "MetadataPreferred",      // metadata value rather than content
                        false);                   // show in summary?
                    arg.addMetadataField(mf);     // add it to the list
                    return true;                  // use results
                }
            }
            return false; // ignore results
        } catch (Exception e) {
            return false; // disregard returned results
        }
    }
}

The document content is available from the plug-in argument (arg.getContent),
and the encoding that the crawler detected is also available
(arg.getEncoding()). With the content and encoding, you can create a String.
You can then look for a keyword (content.indexOf(...)), associate new data
with it (userdata = ...), and insert that new data as the content of the new
field.

To define a new metadata field, create an instance of the FieldMetadata object
and set its field values.

Creating and deploying a plug-in for post-filtering search
results
You can create a Java class to programmatically apply your own security logic for
post-filtering search results.

Before you begin

To be able to use a custom post-filtering plug-in, you must enable
document-level security in your collection.

About this task

When a search for a collection is started, the plug-in is also initialized. An object
that implements the SecurityPostFilterPlugin interface is instantiated with the
default constructor. When the search is stopped, the object is destroyed. Before the
interim search result candidates that are returned by a query are post-filtered by
the plug-in, the init method is invoked. The term method is called after the plug-in
finishes filtering results for a query.

If you want to apply only your custom plug-in for post-filtering the search results
and not use the system-defined post-filtering functions, you can disable the
system-defined function. If any crawler is configured to use the system-defined
post-filtering function, your custom plug-in is applied in addition to the
post-filtering that is done automatically by the system.

To disable system-defined post-filtering of documents in the crawl space, use
the administration console to edit the crawl space for a crawler. Edit the
document-level security options and clear the Validate current credentials
during query processing check box. Then, run a full crawl of all documents in
the crawl space and build a new index.

Important: Your custom plug-in is processed in the ESSearchServer process. If
the custom plug-in causes processes to stop unexpectedly or causes other
problems, the ESSearchServer process might stop providing search services.

Procedure

To create a Java class and deploy a plug-in for post-filtering search results:
1. Create a Java class that implements the
com.ibm.es.security.plugin.SecurityPostFilterPlugin interface and implement the
following methods:
v init()
v term()
v verifyUserAccess()
For name resolution, use the ES_INSTALL_ROOT/lib/trevi.tokenizer.jar file.
2. Compile the implemented code and create a JAR file for it. To deploy the
plug-in, you must provide the plug-in as a JAR file. Add the file
trevi.tokenizer.jar to the class path when you compile.
3. Do the following steps on all search servers.



a. Log in as the Watson Content Analytics administrator.
b. Ensure that the following properties identify the plug-in in the
ES_NODE_ROOT/master_config/searchserver/dock/config.properties file:
PostFilterPluginClassName=fully_qualified_class_path_name
PostFilterPluginClassPath=absolute_path_to_jar_files
For example, on Windows, if your plug-in class name is
user.plugin.SecurityPostFilterPlugin and all classes are bundled in the
C:\user\plugin.jar file, update the two properties as follows:
PostFilterPluginClassName=user.plugin.SecurityPostFilterPlugin
PostFilterPluginClassPath=C\:\\user\\plugin.jar

If these properties are not configured in the config.properties file, your
custom post-filtering plug-in is disabled.
c. Stop and restart the Watson Content Analytics system.

Results

If you see an error message similar to the following message, and no results are
returned the first time that you submit a search after you configure the plug-in,
then it is possible that your plug-in was not applied successfully:
FFQR0648E A general exception was caught while processing document level security.
Exception text: com.ibm.es.security.plugin.SecurityPostFilterPluginException:
Failed to load plug-in class.

In this case, check the ESSearchServer log file and verify the following
conditions:
v The class name and class path that you specified in the ESSearchServer
properties file are correct. To ensure that the plug-in can be found on
Windows, escape characters such as colons (:) and backslashes in the class
path by using the backslash character (\).
v All necessary libraries are specified for the plug-in class path.
v The plug-in does not log a SecurityPostFilterPluginException exception in
the ESSearchServer log file.

Example

The sample plug-in application for post-filtering search results shows how you can
eliminate documents that users are not authorized to view from the search results.
Related concepts:
“API documentation” on page 5
Related reference:
“Sample plug-in for post-filtering search results” on page 123



Creating and deploying a plug-in for exporting documents or
deep inspection results
You can create a Java class to programmatically apply your own logic for
exporting crawled, analyzed, or searched documents from collections. You can
also create a custom plug-in to export analysis results for each document that
is included in a deep inspection request.

Before you begin

The plug-in must be compatible with Java 6.

About this task


For name resolution, use the ES_INSTALL_ROOT/lib/es.indexservice.jar JAR file.

Procedure

To create a Java class and deploy a plug-in for exporting documents or deep
inspection results:
1. Create a Java class that extends the
com.ibm.es.oze.api.export.ExportDocumentPublisher abstract class. The
com.ibm.es.oze.api.export.ExportDocumentPublisher class has the following
methods:
v init()
v initPublish()
v publish()
v termPublish()
v term()
The init, initPublish, termPublish, and term methods have default
implementations that do nothing. The publish method is abstract, so you must
implement it.
If you plan to export content from an InfoSphere BigInsights collection and
export directly from Hadoop MapReduce tasks, the plug-in class must have the
annotation com.ibm.es.oze.api.export.ExecuteOnHadoop. The plug-in can override
the abortPublish method to clean up the output of an aborted Hadoop task. The
abortPublish method is called when a Hadoop task is aborted and calls the
termPublish method by default.
2. Optional: If you want to control which documents are exported, extend the
com.ibm.es.oze.api.export.ExportDocumentFilter abstract class. The class has the
following method:
v accept()
3. Optional: If you want to export deep inspection results, implement the
following interfaces (a formatting sketch that uses these interfaces follows
these steps):
interface: com.ibm.es.oze.api.export.document. InspectionContent
Use this interface to export metadata about the deep inspection request.
package com.ibm.es.oze.api.export.document;
public interface InspectionContent extends Content {
public InspectionRecord[] getInspectionRecords();
}



interface: com.ibm.es.oze.api.export.document.InspectionRecord
Use this interface to export analysis results for each document that is
included in a deep inspection request.
package com.ibm.es.oze.api.export.document;
public interface InspectionRecord {
public double getIndex();
public String[] getFacetNames();
public int getCount();
}
4. Compile the implemented code and create a JAR file for it. To deploy the
plug-in, you must provide the plug-in as a JAR file. Add the
ES_INSTALL_ROOT/lib/es.indexservice.jar file to the class path when you
compile.
If you plan to export content from an InfoSphere BigInsights collection and
export directly from Hadoop MapReduce tasks, all required resources for the
plug-in, such as classes and resource files, must be included in JAR files. All
JAR files must be explicitly listed in the class path.
5. To integrate the custom plug-in for exporting documents, configure export
options for a collection in the administration console and specify the class path
of the JAR files, the class name, and the properties that you want to pass to the
plug-in. If no filter class is specified, all documents are exported.
To integrate the custom plug-in for exporting deep inspection results,
configure text analytics options for a collection in the administration
console and specify the class path of the JAR files, the class name, and the
properties that you want to pass to the plug-in.
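
As a minimal sketch of working with the deep inspection interfaces from step
3, the following hypothetical helper formats inspection records as
tab-separated lines. The class and package names are examples only, and how
the helper is called from your ExportDocumentPublisher subclass depends on the
publish method signature, which you can verify in the Javadoc in the
ES_INSTALL_ROOT/docs/api directory:
package com.example.export;

import java.io.FileWriter;
import java.io.IOException;

import com.ibm.es.oze.api.export.document.InspectionContent;
import com.ibm.es.oze.api.export.document.InspectionRecord;

public class InspectionResultFormatter {
    // Write one line per inspection record: facet path, count, and index value
    public void append(InspectionContent content, FileWriter out) throws IOException {
        for (InspectionRecord record : content.getInspectionRecords()) {
            StringBuilder path = new StringBuilder();
            for (String name : record.getFacetNames()) {
                if (path.length() > 0) {
                    path.append('/');
                }
                path.append(name);
            }
            out.write(path + "\t" + record.getCount() + "\t" + record.getIndex() + "\n");
        }
    }
}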
Related concepts:
“API documentation” on page 5
Related tasks:
Exporting documents for use in other applications
Exporting documents from search results
Exporting deep inspection results
Integrating with IBM InfoSphere BigInsights
Related reference:
“Sample plug-ins for custom document export” on page 125
Exporting crawled or analyzed documents



Creating and deploying a plug-in to add custom widgets for
user applications
You can create plug-ins to add custom widgets to search and analytics applications.

About this task

You can create custom widgets by using the Dojo Toolkit. You must create a
separate plug-in for each custom widget.

Sample plug-ins are provided in the ES_NODE_ROOT/master_config/searchapp/
icaplugin directory. You can start developing your custom plug-ins by
modifying the sample code.

Procedure

To create and deploy a plug-in that adds a custom widget in a search or analytics
application:
1. Develop the JavaScript (.js) file for the custom plug-in by using the Dojo
Toolkit. Develop the plug-in as a Dojo widget that extends the
ica/pane/PanePluginBase class. For information about the available functions,
see the MyFirstSearchPane and MyFirstAnalyticsPane sample plug-ins and the
sketch after these steps.
2. Add the plug-in file to the ES_NODE_ROOT/master_config/searchapp/icaplugin
directory.
3. Register the widget.
a. Back up and edit the appropriate widgets.json file for the type of
application to which you want to add the custom widget:
v To register a custom widget for a search application, edit the
ES_NODE_ROOT/master_config/searchserver/repo/search/
Application_Name/widgets.json file.
v To register a custom widget for an analytics application, edit the
ES_NODE_ROOT/master_config/searchserver/repo/analytics/
Application_Name/widgets.json file.
Application_Name is the application ID, such as default, social, or advanced.
You can determine the ID by viewing the list of applications in the
application customizer.
b. Add an entry for the widget in the following format:
} ,
"MyCustomAnalyticsPane" : {
    "available" : true,
    "label" : "My Custom Analytics Pane" ,
    "widgetName" : "icaplugin/MyCustomAnalyticsPane" ,
    "properties": [
        {"value":"test", "name":"defaultQuery", "editable":true, "sync":false,
         "type":"TextBox", "label":"Default Query", "widgetOptions":{},
         "required":false}
    ]
}
The MyCustomAnalyticsPane field is the internal ID of this widget. You can
assign any value that includes alphabetic and numeric characters only.



The label field is the name to display for this widget in the list of available
widgets in the layout customizer.
The widgetName field is the Dojo module path of the plug-in. You must
include the icaplugin directory in the path.
The properties field defines the customizable properties for the widget.
You must specify the following parameters for each property:
value Default value of the widget property.
name Name of the widget property.
editable
Specify true to display the property in Preferences and the layout
customizer so that users can change the value.
sync Always set this parameter to false. true is not supported.
type Type of customization user interface for this property. Only TextBox
is supported.
label Label to display for this property in Preferences and the layout
customizer.
widgetOptions
Not supported. Always set this parameter to {}.
required
Specify true to display an asterisk (*) next to the property to
indicate that this property is required.
In this example, the entry defines the defaultQuery property that is
displayed in the layout customizer with the label Default Query and the
default value test.

Tip: Ensure that you include a comma (,) before each entry to conform to
JSON syntax.
4. Restart the user application.
v If you use the embedded web application server, enter the following
commands, where node_ID identifies the search server:
esadmin searchapp.node_ID stop
esadmin searchapp.node_ID start
To determine the node ID for the search server, run the esadmin check
command to view a list of session IDs. Look for the node ID that is listed for
the searchapp session.
v If you use WebSphere Application Server:
a. Enter the following command:
esadmin config sync
b. Stop and restart the user application.

Tip: To test plug-in code without restarting the server, you can add the
plug-ins to the ES_INSTALL_ROOT/wlpapps/servers/searchapp/apps/
commonui.ear/commonui.war/icaplugin directory. After you update the contents
of this directory, clear the browser cache to immediately view the changes in
your application. However, this directory is automatically overwritten when
the server is restarted: the ES_NODE_ROOT/master_config/searchapp/icaplugin
directory is copied to the ES_INSTALL_ROOT/wlpapps/servers/searchapp/apps/
commonui.ear/commonui.war/icaplugin directory.
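
The following is a minimal sketch of such a widget, written as a Dojo AMD
module. It uses only standard Dojo widget lifecycle conventions; the widget
ID, property name, and rendering are hypothetical, and the functions that
PanePluginBase provides are documented in the MyFirstSearchPane and
MyFirstAnalyticsPane samples:
// icaplugin/MyCustomAnalyticsPane.js
define([
    "dojo/_base/declare",
    "ica/pane/PanePluginBase"
], function(declare, PanePluginBase) {
    return declare("icaplugin.MyCustomAnalyticsPane", [PanePluginBase], {
        // Customizable property; the default value is supplied by widgets.json
        defaultQuery: "",

        postCreate: function() {
            this.inherited(arguments);
            // Render placeholder content into the widget's DOM node
            this.domNode.innerHTML = "Default query: " + this.defaultQuery;
        }
    });
});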



Related tasks:
Customizing applications
Related reference:
“Sample plug-ins for custom widgets in user applications” on page 127

Creating and deploying a custom global analysis plug-in
You can create plug-ins to use custom logic in addition to the default global
analysis tasks that occur during the indexing process.

About this task

Restriction: Custom global analysis is available only for collections that use IBM
InfoSphere BigInsights. Jaql must be installed on the InfoSphere BigInsights server.

Sample plug-ins are provided in the ES_INSTALL_ROOT/samples/jaql directory.

Procedure

To create and deploy a custom global analysis plug-in:

1. Develop one or more Jaql scripts to specify the custom global analysis
processing to perform.
processing to perform.
2. Create the custom global analysis configuration file. In a text editor, create a file
with the name install.jaql. The format of the file is a JSON record, as shown
in the following example.
{
"name" : "CustomGAJob-TfIdf",
"description" :"Compute \"TF-IDF\" of noun words for each document",
"variables" : {
"jaql.files.list" : [ "./modules/tfidfMain.jaql" ],
"jars.list" : ["./lib/es.jaql.example.jar","./lib/es.jaql.example2.jar"],
"es.jaql.path" : "./modules"
}
}
The record has three primary fields.
name The name of the custom global analysis task. This field is required and
is used as the ID for monitoring and logging.
description
The description of the custom global analysis task. This field is
optional. If specified, the description is displayed in the administration
console.
variables
Contains one or more of the following fields that specify paths to
required files. All paths are relative to the install.jaql file.
jaql.files.list
An array of paths to the Jaql script files to run.
jars.list
An array of paths to the JAR files that are used by the Java
user-defined functions (UDFs). If Java UDFs are not used, this
entry is not required.
es.jaql.path
A string that specifies the directory that contains additional Jaql
script files. These Jaql script files contain functions that are
imported by the Jaql script files that are specified in the
jaql.files.list field.



3. Add all required files for the plug-in to an archive file that has the .zip
file extension. In addition to the Jaql scripts, the archive file must contain
the custom global analysis configuration file (install.jaql) and any JAR files
that are needed by the Jaql scripts. Save the install.jaql file at the top
level of the archive file, as shown in the following example (a packaging
sketch follows these steps):
./install.jaql
./modules/tfidfMain.jaql
./modules/tfidf.jaql
./modules/icautil.jaql
./lib/es.jaql.example.jar
./lib/es.jaql.example2.jar
4. To deploy the custom global analysis plug-in, configure a custom global
analysis task for a collection in the administration console to specify which
fields and facets to pass to the script for analysis. After you configure the task,
you must restart the parse and index services for the collection. If you do not
set a schedule for custom global analysis, the task starts automatically after all
documents are parsed and indexed.
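
You can produce the archive in step 3 with any tool that creates a standard
.zip file. For example, on AIX or Linux, assuming a hypothetical plug-in
directory /home/user/myplugin that contains the layout shown in step 3:
cd /home/user/myplugin
zip -r customga.zip install.jaql modules lib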
Related concepts:
Custom global analysis
Custom global analysis
Related reference:
“Sample plug-ins for custom global analysis” on page 129

Jaql scripts for custom global analysis


The custom global analysis logic is implemented by creating a Jaql (Query
Language for JSON) script.

The inputs for the script are the fields, facets, and text that are extracted from the
content during the document processing stage. Use the readGAInput(GAOptions)
function to get document fields, facets, and text content in JSON format. The
output from the script can be stored as document fields or facets in the Watson
Content Analytics index by using the writeGAOutput(GAOptions) function.
GAOptions is a JSON record that contains the necessary parameters. GAOptions can
be obtained by using the getGAOptions($MetaTrackerJaqlVars) function.
$MetaTrackerJaqlVars is always needed as an argument. To call these functions,
modules with the namespace ica::ga must be imported. The following example
shows a sample custom global analysis Jaql script:
import ica::ga(*);

options:=getGAOptions($MetaTrackerJaqlVars);

readGAInput(options)
-> someOperation()
-> anotherOperation()
-> writeGAOutput(options);

Input JSON format

The function readGAInput() returns an array of JSON records. Each record
represents a separate document that was processed by Watson Content Analytics.
Each record can contain field values, facet values, and textual content, as
configured in the administration console. The following list describes the
fields that can be included in the JSON records:
uri (required)
The document ID, such as the URL.
content (required)
Contains the text that the parser extracted from the document content.
metadata (required)
Contains information about the document metadata.
fields (optional)
An array of records that contain information about the metadata fields.
Each record contains the name and value fields.
name (optional)
The name of a document field.
value (optional)
The value of a document field.
docfacets (optional)
An array of records that contain information about the metadata facets.
Each record contains the path and keyword fields.
path (optional)
The facet path.
keyword (optional)
A value that is associated with this facet.
textfacets (optional)
An array of records that contain information about the facets that come
from an annotation. Each record contains the begin, end, path, and
keyword fields.
begin (optional)
For facets that come from an annotation, the character position that
marks the beginning of the annotation.
end (optional)
For facets that come from an annotation, the character position that
marks the end of the annotation.

The following code is an example of input data for two documents:
[
{
"uri" : "jdbc://ICA/APP.CLAIM/ID/0",
"content" : "[Pack] The straw was peeled off from the juice pack.",
"metadata" : {
"fields" : [ {
"name" : "date",
"value" : "1199113200000"
}, {
"name" : "title",
"value" : "lemon tea - Package / container"
} ],
"docfacets" : [ {
"path" : [ "date", "2008", "1", "1", "0" ],
"keyword" : ""
}, {
"path" : [ "product" ],
"keyword" : "lemon tea"
} ],
"textfacets" : [ {
"begin" : 1,
"end" : 5,
"path" : [ "_word", "noun", "general" ],
"keyword" : "pack"
}, {
"begin" : 11,

Creating and deploying a custom global analysis plug-in 93


"end" : 16,
"path" : [ "_word", "adj" ],
"keyword" : "straw"
}]
}
}, {
"uri" : "jdbc://ICA/APP.CLAIM/ID/1",
"content" : "I got some ice cream for my children, but there was something like
a piece of thread inside the cup.",
"metadata" : {
"fields" : [ {
"name" : "date",
"value" : "1199199600000"
},{
"name" : "title",
"value" : "vanilla ice cream - Contamination / tampering"
} ],
"docfacets" : [ {
"path" : [ "date", "2008", "1", "2", "0" ],
"keyword" : ""
}, {
"path" : [ "product" ],
"keyword" : "vanilla ice cream"
} ],
"textfacets" : [ {
"begin" : 2,
"end" : 5,
"path" : [ "_word", "verb" ],
"keyword" : "get"
}, {
"begin" : 11,
"end" : 14,
"path" : [ "_word", "noun", "general" ],
"keyword" : "ice"
}]
}
}
]

Output JSON format

To store the output in the Watson Content Analytics index, pass an array of
JSON records as the first argument of the writeGAOutput() function. Each
record must include a field with the name uri. The values of the record are
stored in the index for the document whose URI matches the value of the uri
field. Any field in the record other than the uri field is stored as an index
field or document-level facet for the document; the field name in the JSON
record determines which index field or facet the data is stored in. For
example, the following array of JSON records adds values for the rank field
and ranking facet in the index for the documents with the URIs
jdbc://ICA/APP.CLAIM/ID/0 and jdbc://ICA/APP.CLAIM/ID/1.
[{"uri":"jdbc://ICA/APP.CLAIM/ID/0","rank":"1","$.ranking":"1"},
{"uri":"jdbc://ICA/APP.CLAIM/ID/1","rank":"2","$.ranking":"2"}
]

Requirement: To store data in fields and facets in the index, you must first create
the fields and facets in the administration console. If a field or facet does not exist,
the value is not added to the index.

For index fields, the value of the JSON record field is stored in a new index
field. For the name of the new index field, the prefix custom_ is added to the
name of the original index field. In the previous example, if an index field
with the name rank is configured for the collection, a new index field with
the name custom_rank and the value 1 is added to the document in the index.
Some attributes of the custom_ index fields are inherited from the original
index field, as described in the following list. For example, if the
parametric searchable attribute is selected for the rank index field, the
custom_rank index field is also parametric searchable.

Returnable
Inherited from the original index field.
Faceted search
To create a facet, directly assign a value to the facet by using notation
that starts with $. to indicate the facet path.
Free text search
Not free text searchable.
In summary
Not shown in the summary.
Fielded search
Inherited from the original index field.
Exact match
Inherited from the original index field.
Case sensitive
Inherited from the original index field.
Parametric search
Inherited from the original index field.
Text sortable
Inherited from the original index field.
Analyzable
Not analyzable.

For facets, if the collection includes a facet with the same facet path as the
JSON record field name, the value of the JSON record field is stored in that
facet. In the previous example, if there is an existing facet with the facet
path $.ranking, the value 1 is stored in that facet. When you specify the
facet path, ensure that the facet path starts with $ and that each facet path
component is separated by a period. For example, the facet path $.ranking
corresponds to the root facet with the name ranking.

You can also write the output from the Jaql script to a file or some other
format so that another application can use the data. For example, you can
output the data to a JSON file on the local computer:
readGAInput(options)
-> someOperation()
-> write(file('/home/biadmin/ica_out.json'));
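
Putting the input and output formats together, the following minimal sketch
copies each document URI and assigns a constant value to the rank field and
the ranking facet from the earlier example. A real script would compute the
values from the input records, and the field and facet must already be created
in the administration console:
import ica::ga(*);

options := getGAOptions($MetaTrackerJaqlVars);

// Copy each document URI and emit a constant value as both an
// index field (rank) and a document-level facet ($.ranking).
readGAInput(options)
-> transform { uri: $.uri, rank: "1", "$.ranking": "1" }
-> writeGAOutput(options);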
Related concepts:
Custom global analysis
Custom global analysis
Related reference:
“Sample plug-ins for custom global analysis” on page 129

Creating and deploying a custom analyzer for document
ranking filters
You can create custom analyzers for use with document ranking filters in an
enterprise search collection. Custom analyzers are used for parsing the text that is
extracted to the associated field.

About this task

A sample analyzer is provided in the ES_INSTALL_ROOT/samples/customAnalyzer
directory.

Important: Before you can upload custom analyzers or associate analyzers with
fields in the administration console, you must enable the custom analyzer support.

Procedure

To create and deploy a custom analyzer:

1. Develop a Java program that extends the Apache Lucene
org.apache.lucene.analysis.Analyzer class. This class specifies how to
generate tokens when provided with a field name and text. Ensure that the
custom analyzer is compatible with Apache Lucene 3.5 libraries. For more
information about the org.apache.lucene.analysis.Analyzer class, see the
Apache Lucene documentation. A minimal example analyzer is shown after these
steps.
2. Package your Java classes into one or more JAR files. Ensure that you include
all Java classes and JAR files that are required for the custom analyzer.
However, you do not need to include the lucene-core-3.5.0.jar file because it
is installed with Watson Content Analytics.
3. Create the custom analyzer configuration file. This file specifies the
Java class path and the path to the analyzer definition file. In a text
editor, create a file with the name stg.xml. The file is an XML file, as
shown in the following example:
<?xml version="1.0" encoding="UTF-8"?>
<stg>
<descriptor>config/analyzers_definition.xml</descriptor>
<classpath>
<pathelement path="CustomCode.jar"/>
<pathelement path="DependingLibrary.jar"/>
</classpath>
</stg>
The file contains the following XML elements:
descriptor
This required element specifies the path to the analyzer definition
file that defines one or more custom analyzers. Each analyzer consists
of two Java classes that extend the
org.apache.lucene.analysis.Analyzer class, one for use at indexing
time and one for use at run time. The indexing analyzer is used to
tokenize documents when parsing and extracting text to the index, and
the runtime analyzer is used to tokenize the search query. The
analyzer definition file is an XML file, as shown in the following
example:



<definition>
<field name="MyFirstAnalyzer">
<indexingAnalyzer impl="com.example.my.MyIndexingAnalyzer"/>
<runtimeAnalyzer impl="com.example.my.MyRuntimeAnalyzer"/>
</field>
<field name="MySecondAnalyzer">
<indexingAnalyzer impl="com.example.my.MyIndexingAnalyzer2"/>
<runtimeAnalyzer impl="com.example.my.MyRuntimeAnalyzer2"/>
</field>
</definition>

The XML file consists of the following elements:


field This element specifies a set of indexing and runtime analyzers
for each custom analyzer. The value of the name attribute is
used as the display name for the analyzer in the administration
console. The element must contain <indexingAnalyzer> and
<runtimeAnalyzer> elements.
indexingAnalyzer
This element specifies a class that extends the
org.apache.lucene.analysis.Analyzer class to tokenize documents
at indexing time. The impl attribute specifies the name of the
class. Ensure that the class can be loaded from the specified
class path.
runtimeAnalyzer
This element specifies a class that extends the
org.apache.lucene.analysis.Analyzer class to tokenize query
text. The impl attribute specifies the name of the class. Ensure
that the class can be loaded from the specified class path.
classpath
This element specifies the Java class path. It contains at least one
<pathelement> element. There must be separate <pathelement>
elements for each JAR file.
pathelement
This element specifies each entry in the Java class path. The path
attribute specifies the path to the JAR file.
4. Add all required files to an archive file that has the .zip file extension. The
archive file must contain the custom analyzer configuration file (stg.xml), the
analyzer definition file, and all JAR files that contain the Java classes that are
used by the analyzer. Save the stg.xml file at the top level of the archive file, as
shown in the following example:
./stg.xml
./config/analyzers_definition.xml
./DependingLibrary.jar
./CustomCode.jar
5. In the administration console, deploy the custom analyzer:
a. Upload the archive file that contains the custom analyzer. Open the System
page, click the Parse tab, and click Configure custom analyzer packages.
Click Add Package and browse to the archive file that contains the custom
analyzer.
b. Define document ranking filters to associate the custom analyzer with one
or more index fields that are enabled for fielded search. In the Parse and
Index pane for an enterprise search collection, click Configure > Custom
analyzers for document ranking filters and click Associate Analyzer with
Field.
c. Configure the document ranking filter groups. In the Search pane for the
enterprise search collection, click Configure > Rules to tune queries and
results. Click Edit Document Ranking Filters to add the document filters to
a document ranking filter group. Ensure that you select the Enable document
ranking filters check box in the Document Ranking Filters area.
d. Restart the search servers to apply the changes.
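
For reference, the following is a minimal sketch of a Lucene 3.5-compatible
analyzer that tokenizes text on whitespace and lowercases the tokens. The
package and class names match the com.example.my.MyIndexingAnalyzer entry in
the analyzer definition example above and are placeholders; see the sample in
the ES_INSTALL_ROOT/samples/customAnalyzer directory for a complete
implementation:
package com.example.my;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class MyIndexingAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the text on whitespace, then normalize the tokens to lower case
        return new LowerCaseFilter(Version.LUCENE_35,
                new WhitespaceTokenizer(Version.LUCENE_35, reader));
    }
}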
Related tasks:
Configuring document ranking filters
Enabling custom analyzer support
Related reference:
“Sample custom analyzer for document ranking filters” on page 131

Sample code
To help you customize the system, sample code is provided with IBM Watson
Content Analytics.

You can use the sample code as a guideline when you do the following application
development activities:
v Create enterprise search or content mining applications
v Run real-time text analytics on documents without adding them to the index
v Create administration applications
v Create plug-ins for crawlers, pre-filtering and post-filtering search results,
exporting documents, and exporting deep inspection results

Sample REST API scenarios
The REST API includes several sample programs that demonstrate how to perform
administrative and search tasks.

The sample programs are installed in the ES_INSTALL_ROOT/samples/rest directory.


Documentation about the REST API is available in the ES_INSTALL_ROOT/docs/api/
rest directory.

The following sample scenarios are provided for the Search REST API:
v Search
v Faceted search against single index

For the administration REST APIs, several sample REST API Java programs are
provided. The programs illustrate two different methods of using the REST
APIs: Apache HttpClient or Java API for XML Web Services (JAX-WS). After the
sample programs are compiled, their class files are packaged in the
ES_INSTALL_ROOT/samples/rest/admin/es.admin.rest.jar file.

HttpClient sample programs

The following REST API sample programs for HttpClient are provided in the
ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/admin/control/api/samples/
commons directory:
DocumentExample sample program
The DocumentExample class provides an example of how to add or
remove a document in a collection. The sample program allows you to add
a document to a collection and remove a document from the collection.
The type of function depends on the method that you input as a command
argument. This program requires that a collection was already created.
The program builds the HTTP request based on the specified host name,
port, method, collection ID, and other options. It uses the user name and
password to perform authentication. With the HTTP request, the program
obtains the HTTPMethod object that is executed by HttpClient. For the add
and addMultiDocs functions, the program uses the file path of an existing
file and some of the metadata of the document to add the file content as a
document in the collection.
The usage statement is as follows:
DocumentExample -hostname host_name -port port -method method
-username user_name -password password -collectionId collection_ID
additional_parameters

The following example shows how to add the content of the
C:\samples\sample1.txt file as an English document in the col_12345
collection:
DocumentExample -hostname es.ibm.com -port 8390 -method add
-username user1 -password mypassword -collectionId col_12345
-language en -file C:\samples\sample1.txt
AddMultiDocsExample sample program
The AddMultiDocsExample class provides an example of how to add two
sample documents, a PDF file and an HTML file, to a collection. This
program requires that a collection was already created.
The program builds the HTTP POST request based on the specified host
name, port, collection ID, and other parameters. The program specifies the
document ID, title, format, and language of the documents in JSON format
as the value of the docs parameter and the content of the documents is
specified as the value of the file parameter. The program uses the
specified user name and password to perform authentication. With the
HTTP request, the program obtains the HTTPMethod object that is
executed by HttpClient.
The usage statement is as follows:
AddMultiDocsExample -hostname host_name -port port -username user_name
-password password -collectionId collection_ID

The following example shows how to add the documents to the col_12345
collection:
AddMultiDocsExample -hostname es.ibm.com -port 8390 -username user1
-password mypassword -collectionId col_12345
Field administration sample program
The FieldExample class provides an example of performing the available
methods on search fields within a collection. It allows you to add a search
field, list the fields, map a search field to a crawler, map the search
field to a facet, and remove a search field from the collection. The type
of field function depends on the method that you input as a command
argument. This program requires that a collection was already created.
The program builds the HTTP request URL based on the specified host name,
port, method, and other options. It uses the specified user name and
password to perform authentication. With the HTTP request, the program
obtains the HTTPMethod object that is executed by HttpClient.
The usage statement is as follows:
FieldExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters

The following shows an example of obtaining a list of fields for the
col_12345 collection:
FieldExample -hostname es.ibm.com -port 8390 -method getList
-username user1 -password mypassword -collectionId col_12345
GetAccess sample program
The GetAccess example shows how to use the REST API by directly inputting
URLs. You pass the URL with the required parameters to this program. The
program then executes the HTTPMethod that is based on the URL that you pass
in as an argument and prints the HTTPMethod response.
The usage statement is as follows:
GetAccess URL -username user_name -password password

The following example shows how to obtain the state of the index in XML
format for the Test collection:
GetAccess "http://localhost:8390/api/v10/admin/indexer?
method=getState&output=xml&collectionId=Test" -username esadmin
-password password
IndexerExample sample program
The IndexerExample class shows how to perform various operations on the
index, such as starting and stopping the index and getting the state of the
indexer. The type of function depends on the method that you input as a
command argument.
The usage statement is as follows:
IndexerExample -hostname host_name -port port -method method
-username user_name -password password

The following example shows how to start the index for the col_12345
collection:
IndexerExample -hostname es.ibm.com -port 8390 -username esadmin
-password password -method start -collectionId col_12345
Administering PEAR files sample program
The PearExample shows how to manipulate a custom annotator PEAR
file. The program allows you to add a PEAR file to the system, associate
and disassociate it with a collection, obtain a list of deployed PEAR files,
and remove a PEAR file from the system. The type of function depends on
the method that you input as a command argument.
The usage statement is as follows:
PearExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters

The following example shows how to add the of_regex.pear PEAR file
and name it RegexPear in Watson Content Analytics.
PearExample -hostname es.ibm.com -port 8390 -method add
-username esadmin -password password -pearName RegexPear
-content "C:\\IBM\\es\\packages\\uima\\regex\\of_regex.pear"

JAX-WS sample programs

The following REST API sample programs for JAX-WS are provided in the
ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/admin/control/api/samples/
jaxws directory:
Administering facets sample program
The FacetExample shows how to work with facets for a specified collection
by using the JAX-WS service to access the REST API. Some of the available
functions for the Facet API include adding and removing a facet from a
specified collection and obtaining a list of facets. The type of function
depends on the method that you input as a command argument.
The usage statement is as follows:
FacetExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters

The following example shows how to obtain a list of facets for the
col_12345 collection.
FacetExample -hostname es.ibm.com -port 8390 -method getList
-username esadmin -password password -collectionId col_12345
GetAccess sample program
The GetAccess example shows how to use the REST API by directly inputting
URLs. You pass the URL with the required parameters to this program. It then
invokes the JAX-WS service to invoke the REST API based on the URL that you
passed in as an argument and prints the result stream.
The usage statement is as follows:



GetAccess URL

The following example shows how to obtain the state of the indexer for the
Test collection:
GetAccess "http://localhost:8390/api/v10/admin/indexer?
method=getState&output=xml&collectionId=Test
&api_username=esadmin&api_password=password"
Administering PEAR files sample program
The PearExample shows how to manipulate a custom annotator PEAR file
by using the JAX-WS service to access the REST API. It allows you to add
a PEAR file to the system, associate and disassociate it with a collection,
obtain a list of deployed PEAR files, and remove a PEAR file from the
system. The type of function depends on the method that you input as a
command argument.
The usage statement is as follows:
PearExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters

The following example shows how to add the of_regex.pear PEAR file
and name it RegexPear in Watson Content Analytics.
PearExample -hostname es.ibm.com -port 8390 -method add
-username esadmin -password password -pearName RegexPear
-content "C:\\IBM\\es\\packages\\uima\\regex\\of_regex.pear"
Related concepts:
“API overview” on page 3
“REST APIs” on page 7

Compiling the sample REST API applications


The sample REST API code must be compiled with the IBM Software Development
Kit (SDK) for Java 1.6.

Before you begin

Before you can build the sample REST API Java applications, you must install and
configure Apache ANT, a Java-based build tool. For information about how to
install and configure Apache ANT, see http://ant.apache.org/.

The sample applications in the ES_INSTALL_ROOT/samples/rest directory must
run in a JRE Version 1.6 environment.

Procedure

To compile and run the sample REST API applications:


1. From the command line, change to the ES_INSTALL_ROOT/samples/rest/admin
directory. The default installation paths are:
v AIX: /opt/IBM/es/samples/rest/admin
v Linux: /opt/IBM/es/samples/rest/admin
v Windows: C:\Program Files\IBM\es\samples\rest\admin
2. Run the ANT script by entering the following command. The directory
includes a build.xml file that ANT uses to compile the applications.
ant
You see the following message after the Java source code compiles:



BUILD SUCCESSFUL
Total time: xx seconds
3. Run the application by specifying the appropriate command on a single line,
where application is the name of the application that you want to run:
For the samples in the ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/
admin/control/api/samples/jaxws directory
java -cp es.admin.rest.jar
com.ibm.es.admin.control.api.samples.jaxws.application
arguments_to_run_the_application

For example:
java -cp es.admin.rest.jar
com.ibm.es.admin.control.api.samples.jaxws.GetAccess
For the samples in the ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/
admin/control/api/samples/commons directory
java -cp "%ES_INSTALL_ROOT%\lib\axis2\commons-fileupload-1.2.jar;
%ES_INSTALL_ROOT%\lib\axis2\commons-httpclient-3.1.jar;
%ES_INSTALL_ROOT%\lib\axis2\commons-logging-1.1.1.jar;
%ES_INSTALL_ROOT%\lib\axis2\commons-codec-1.3.jar;.\es.admin.rest.jar"
com.ibm.es.admin.control.api.samples.commons.application
arguments_to_run_the_application

For example:
AIX or Linux
java -cp "/opt/IBM/es/lib/axis2/commons-fileupload-1.2.jar:
/opt/IBM/es/lib/axis2/commons-httpclient-3.1.jar:
/opt/IBM/es/lib/axis2/commons-logging-1.1.1.jar:
/opt/IBM/es/lib/axis2/commons-codec-1.3.jar:./es.admin.rest.jar"
com.ibm.es.admin.control.api.samples.commons.GetAccess
Windows
java -cp
"C:\Program Files\IBM\es\lib\axis2\commons-fileupload-1.2.jar;
C:\Program Files\IBM\es\lib\axis2\commons-httpclient-3.1.jar;
C:\Program Files\IBM\es\lib\axis2\commons-logging-1.1.1.jar;
C:\Program Files\IBM\es\lib\axis2\commons-codec-1.3.jar;
.\es.admin.rest.jar"
com.ibm.es.admin.control.api.samples.commons.GetAccess

On Windows, you can also run the applications by using the resttest.bat
sample batch file in the ES_INSTALL_ROOT/samples/rest/admin directory. The
sample batch file contains the command to run the
com.ibm.es.admin.control.api.samples.jaxws.GetAccess sample application. You
can edit the batch file to run other sample applications. To run samples in the
ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/admin/control/api/samples/
commons directory, ensure that you specify the correct class path, as shown in
the previous command.

Running the sample REST API applications in Eclipse

You can run the sample REST API applications in an Eclipse environment.

Before you begin

Before you can run the sample REST API Java applications, you must install and
configure Apache ANT, a Java-based build tool. For information about how to
install and configure Apache ANT, see http://ant.apache.org/.



In addition, you must configure Eclipse to use a JRE Version 1.6 environment.

Procedure

To run the sample REST API applications in Eclipse:


1. Create an Eclipse project and import the files from the ES_INSTALL_ROOT/
samples/rest/admin directory.
2. Right-click the project, click Properties, and click Java Build Path > Libraries >
Add jars. Select es.admin.rest.jar as the JAR file to include.
3. Right-click the project, click Properties, and click Java Build Path > Add
external Archives. Select the following files from the ES_INSTALL_ROOT\lib\
axis2 directory:
v commons-codec-1.3.jar
v commons-fileupload-1.2.jar
v commons-httpclient-3.1.jar
v commons-io-1.4.jar
v commons-logging-1.1.1.jar
4. To run the application, right-click the project and click Run as > Run
Configurations.
5. Click the Arguments tab and type the command arguments for the REST API
sample.
6. Click Run.



Sample SIAPI enterprise search and content mining
applications
The search and index API (SIAPI) includes several sample applications that show
you how to create simple or advanced enterprise search applications and
applications for exploring content analytics collections.

After you install Watson Content Analytics, the Javadoc documentation for SIAPI
enterprise search applications is available in the ES_INSTALL_ROOT/docs/api/siapi
directory.

The following sample applications demonstrate how to do various search tasks:


Simple and advanced search
The SearchExample class provides a simple example of the minimum
requirements that are needed to submit a search to the search server. The
AdvancedSearchExample class is an example that demonstrates some of the
advanced query settings and result processing options.
Streaming queries
The StreamingSearchExample class gives a simple example of how to
submit and process a streaming query against the search server. Streaming
in this case is used to return all results from a particular collection. The
results are returned unsorted and only the document ID and the score are
provided.
Browse and navigate
The BrowseExample class provides an example of accessing a collection's
taxonomy tree and displaying some of the basic navigation properties.
Retrieve all search results
The sample code provided in the “Retrieve all search results sample
application” on page 112 topic shows how to set a query to return
unsorted results and loop over the query results.
Fetch search result documents
The FetchSearchExample class provides an example of how to submit a
fetch request to retrieve the content of search result documents.
Federated search
The FederatedSearchExample class provides a simple example of the
minimum tasks that are required to submit a federated search to the search
server.
Federated faceted search
The FederatedFacetedSearchExample class provides a simple example of
the minimum tasks that are required to submit a federated faceted search
to the search server.
Secured search
The SecuredSearchExample class gives a simple example of how to submit
a search to the search server when document-level security is enabled for
the collection. This example takes a user name and looks up the user's
credentials in the identity management credential store, then it passes that
information through the SIAPI Query.setACLConstraints method, as
outlined in the sketch after this list.



Identity management
The IdentityManagementExample class provides a working sample program
that demonstrates how to use the Identity Management API.
Faceted search
The FacetedSearchExample class demonstrates faceted search in an
enterprise search application.
Content analytics
Several sample applications are provided for working with facets in a
content analytics collection and various analytical views in a content
mining application:
v DocumentsViewExample
v FacetsViewExample
v TimeSeriesViewExample
v DeviationsViewExample
v TrendsViewExample
v FacetPairsViewExample
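
The following sketch isolates the secured-search pattern that the
SecuredSearchExample class uses. It is a minimal sketch, not the shipped sample:
it assumes that a SearchFactory and a Searchable object were already obtained as
in the other search samples, and that the security context XML string was built
with the identity management API.

import com.ibm.siapi.search.Query;
import com.ibm.siapi.search.ResultSet;
import com.ibm.siapi.search.SearchFactory;
import com.ibm.siapi.search.Searchable;

public class SecuredSearchSketch {

    /**
     * Runs a query with document-level security applied. The securityContext
     * argument is the XML string that the identity management API builds for
     * the logged-in user.
     */
    static ResultSet searchAsUser(SearchFactory factory, Searchable searchable,
            String queryText, String securityContext) throws Exception {
        Query query = factory.createQuery(queryText);
        // @SecurityContext::'...' is the SIAPI ACL constraints format; the
        // fetch sample later in this guide uses the same format
        query.setACLConstraints("@SecurityContext::'" + securityContext + "'");
        return searchable.search(query);
    }
}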

Compiling the sample enterprise search and content mining applications
The code for the sample applications and the search and index API code must be
compiled with the IBM Software Development Kit (SDK) for Java 1.6.

Before you begin

Before you can build Java applications for searching or exploring collections, you
must install and configure Apache ANT, a Java-based build tool. For information
about how to install and configure Apache ANT, see http://ant.apache.org/.

The enterprise search application in the ES_INSTALL_ROOT/samples directory must
run in a JRE Version 1.6 environment.

Procedure

To compile and run a sample enterprise search or content mining application:


1. From the command line, change to the ES_INSTALL_ROOT/samples/siapi
directory. The default installation paths are:
v AIX: /opt/IBM/es/samples/siapi
v Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi
2. Run the ANT script. The directory includes a build.xml file that ANT uses to
build the file:
ant
You see the following message after the Java source code compiles:
BUILD SUCCESSFUL
Total time: xx seconds
3. Run the application by specifying the following commands in a single line,
where application is the application that you want to run:
AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/esapi.jar:$ES_INSTALL_ROOT
/lib/siapi.jar:. application
Windows
java -classpath "%ES_INSTALL_ROOT%\lib\esapi.jar;%ES_INSTALL_ROOT%
\lib\siapi.jar;." application

Simple and advanced sample enterprise search applications


The SearchExample class provides a simple example of the minimum tasks that
are required to submit a search query to the search server. The
AdvancedSearchExample class shows the same tasks as the simple example, but it
prints the full ResultSet object instead of only a few values.

The simple sample application demonstrates how to:


v Access the service
v Specify a collection
v Specify an application
v Submit a query
v Process the returned results

The advanced sample application does the same tasks as the simple sample except
that it processes the returned results differently.

The simple sample application (SearchExample.java) and the advanced sample
application (AdvancedSearchExample.java) are in the following default directories:
v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi
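
The following minimal sketch outlines the pattern that the SearchExample class
follows. It is a sketch, not a copy of the shipped sample: the host name, port,
application name, and collection ID are placeholder values, and the factory class
name follows the naming convention of the fetch sample later in this guide, so
verify all of them against SearchExample.java.

import java.util.Properties;

import com.ibm.siapi.common.ApplicationInfo;
import com.ibm.siapi.search.Query;
import com.ibm.siapi.search.Result;
import com.ibm.siapi.search.ResultSet;
import com.ibm.siapi.search.SearchFactory;
import com.ibm.siapi.search.SearchService;
import com.ibm.siapi.search.Searchable;

public class MinimalSearchSketch {
    public static void main(String[] args) throws Exception {
        // access the service: obtain the remote SIAPI search factory
        SearchFactory factory = (SearchFactory)
            Class.forName("com.ibm.es.api.search.RemoteSearchFactory").newInstance();

        // placeholder host name and port of the search server
        Properties config = new Properties();
        config.setProperty("hostname", "wca.example.com");
        config.setProperty("port", "8394");
        SearchService searchService = factory.getSearchService(config);

        // specify an application (placeholder application name)
        ApplicationInfo appInfo = factory.createApplicationInfo("Default");

        // specify a collection (placeholder collection ID)
        Searchable searchable = searchService.getSearchable(appInfo, "col_12345");

        // submit a query and process the returned results
        Query query = factory.createQuery("ibm");
        ResultSet resultSet = searchable.search(query);
        for (Result result : resultSet.getResults()) {
            System.out.println(result.getDocumentID() + " : " + result.getTitle());
        }
    }
}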

Browse and navigation sample application


The BrowseExample class provides a sample application that accesses a collection's
taxonomy tree and displays some basic navigation properties.

This sample demonstrates how to:


v Obtain the browse factory
v Obtain a browse service
v Obtain a browser reference
v Get and display the root category
v Get the root's first child category
v Display the child category and its path from root

The sample BrowseExample.java application is in the following directories:


v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi

Time scale view sample application


The TimeScaleViewSearchExample class provides a simple example of the minimum
number of tasks that are required to submit a faceted search query to the search
server and get the time scale view result for an enterprise search collection.

This sample demonstrates how to:


v Obtain the search, browse, and facet factories
v Obtain a browse service



v Obtain a browser reference
v Obtain a facets service
v Obtain a faceted searchable object
v Create a faceted query
v Get the category corresponding to the specified path, such as /modifieddate/New
v Create a qualified category
v Create a target facet
v Create a facet context
v Set the facet context to the query to be issued
v Perform facet counting with a specified date granularity, such as "Year",
"Month", "Day" or "Hour" by using the time scale taxonomy
v Display the facet count results of the specified date scale

The sample TimeScaleViewSearchExample.java application is in the following
directories:
v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi

Retrieve all search results sample application


This sample code shows how to set a query to return unsorted results and loop
over the query results. You can retrieve all results either sorted or unsorted
because, when a sort key is specified, sorting is performed across all matches.

The following sample code shows you how to:


v Obtain a SearchFactory and a Searchable object
v Create a new Query object
v Set the query to return unsorted results
v Run the search

Obtain a SearchFactory and a Searchable object

Obtain a SearchFactory and a Searchable object as explained in the “Simple and
advanced sample enterprise search applications” on page 111 topic.
SearchFactory factory;
Searchable searchable;

... // obtain a SearchFactory and Searchable object

Create a new Query object

Query q = factory.createQuery("big apple");

Set the query to return unsorted results

q.setSortKey(Query.SORT_KEY_NONE);

Run the search

Run the query in a loop to obtain one page of results at a time. The maximum
result page size that is allowed is 100.



When you receive the results pages, you need to interpret the
getAvailableNumberOfResults method and getEstimatedNumberOfResults method
differently from the way that you interpret them for sorted query results:
v The getEstimatedNumberOfResults method always returns 0 because the system
does not provide a number-of-results estimate for unsorted results.
v The getAvailableNumberOfResults method returns one of two values: 0 if this is
the last result page, and 1 if more results exist.
v You can use the length of the array that is returned by the getResults method to
find out how many results are within this result page.
int fromResult = 0;
int pageSize = 100;
boolean moreResults = true;

// loop over query results, pageSize results at a time
while (moreResults) {

  // set the result range for the next page of results
  q.setRequestedResultRange(fromResult, pageSize);

  // execute the search
  ResultSet resultPage = searchable.search(q);

  // loop over the results from the ResultSet
  Result[] results = resultPage.getResults();
  for (int i = 0; i < results.length; i++) {
    ... // process result
  }

  // check if there are more available results
  moreResults = (resultPage.getAvailableNumberOfResults() == 1);

  // modify the range for getting the next page of results
  fromResult += pageSize;
}

Fetch document content sample application


This sample code shows how to fetch the content of documents that cannot be
viewed by clicking a clickable URI in the search results.

For a complete example, see the sample program, FetchSearchExample, in the
ES_INSTALL_ROOT/samples/siapi directory.

The fetch API provides the com.ibm.es.fetch package in the esapi.jar file and
the following interfaces:
v com.ibm.es.fetch.Document
v com.ibm.es.fetch.Fetcher
v com.ibm.es.fetch.FetchRequest
v com.ibm.es.fetch.FetchService
v com.ibm.es.fetch.FetchServiceFactory
You can use these classes the same way that you use other SIAPI classes.

Fetching a document

First, create the factory object. Using this factory class, create the FetchService
object and FetchRequest object. The Fetcher class can be created through the
FetchService object. You can then get the Document object by calling the fetch
method of the Fetcher object. Finally, you can get the binary data by calling the
getBytes method of the Document object.
// obtain the FetchServiceFactory implementation
Class cls = Class.forName("com.ibm.es.api.fetch.RemoteFetchFactory");
FetchServiceFactory factory = (FetchServiceFactory) cls.newInstance();

// create a valid application ID that will be used
// by the search node to authorize this access to the collection
ApplicationInfo applicationInfo = factory.createApplicationInfo(applicationName);

// obtain the FetchService implementation
FetchService fetchService = factory.getFetchService(config);

// create a new FetchRequest object using the specified URI string
FetchRequest fetchRequest = factory.createFetchRequest(uri, null);

// obtain a Fetcher object for the specified collection ID
Fetcher fetcher = fetchService.getFetcher(applicationInfo, collectionId);

// execute the fetch by calling the Fetcher's fetch method.
// A Document object will be returned
Document doc = fetcher.fetch(fetchRequest);

// dump the binary content of the document
byte[] buf = doc.getBytes();

Enforcing document security

You can set ACL constraints on the FetchRequest object. If a value is set, the ACL
constraints are delivered to the search server, and the search server verifies the
user's authority to access the document by checking the ACL constraints.
String aclConstraints = (String) parameters.get("SecurityContext");
aclConstraints = "@SecurityContext::'" + aclConstraints + "'";
FetchRequest fetchRequest = factory.createFetchRequest(uri, aclConstraints);

The ACL constraints value is a String value that must conform to the SIAPI format.

Federated search sample application


The FederatedSearchExample class provides a simple example of the minimum
tasks that are required to submit a federated search to the search server.

Restriction: The FederatedSearchExample class does not support faceted search
queries and results.

The FederatedSearchExample application shows how to:


v Obtain a RemoteFederator object with a federator ID. This ID is the same as the
ApplicationInfo object ID.
v Create a new query object.
v Set the result range.
v Run the search by calling the RemoteFederator object's default search method.
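
A minimal sketch of those steps follows. Obtaining the RemoteFederator object is
covered in “Search and index API federators” on page 33 and is elided here, and
the import that is shown for it is an assumption, so verify both against the
shipped FederatedSearchExample.java file.

import com.ibm.siapi.search.Query;
import com.ibm.siapi.search.RemoteFederator; // assumed package; verify in the sample
import com.ibm.siapi.search.Result;
import com.ibm.siapi.search.ResultSet;
import com.ibm.siapi.search.SearchFactory;

public class FederatedSearchSketch {
    static Result[] federatedSearch(SearchFactory factory, RemoteFederator federator)
            throws Exception {
        // create a new query object
        Query query = factory.createQuery("ibm");
        // set the result range
        query.setRequestedResultRange(0, 10);
        // run the search by calling the federator's default search method
        ResultSet resultSet = federator.search(query);
        return resultSet.getResults();
    }
}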

The FederatedSearchExample.java file is in the following directories:


v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi
Related concepts:
“Search and index API federators” on page 33
Related reference:



“Enterprise search applications” on page 13

Federated faceted search sample application


The FederatedFacetedSearchExample class provides a simple example of the
minimum tasks that are required to submit a federated faceted search to the search
server. A federated faceted search allows you to gather results of faceted search
from multiple collections.

The FederatedFacetedSearchExample application shows how to:


v Obtain a RemoteFacetedFederator object with a federator ID. This ID is the same
as the ApplicationInfo object ID.
v Create a new query object.
v Set the result range.
v Run the faceted search by calling the default search method of the
RemoteFacetedFederator object and browse the facets that are retrieved.

The FederatedFacetedSearchExample.java file is in the following directories:


v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi
Related concepts:
“Search and index API federators” on page 33
Related tasks:
“Creating a faceted enterprise search application” on page 25
Related reference:
“Faceted search API” on page 26
“Faceted search queries in content analytics collections” on page 28

Faceted search sample application


The FacetedSearchExample class provides a simple example of the minimum tasks
that are required to submit a faceted search query to the search server. The
FacetedSearchExample sample code can be used with enterprise search
collections, not content analytics collections.

The sample application demonstrates how to:


v Access the service
v Specify a collection
v Specify an application
v Submit a query to get all of the facets
v Process the returned results
v Submit a query to get the specified facet
v Process the returned results

The sample application, FacetedSearchExample.java, is in the following default
directories:
v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi
Related tasks:
“Creating a faceted enterprise search application” on page 25



Related reference:
“Faceted search API” on page 26
“Faceted search queries in content analytics collections” on page 28

Content mining sample applications


Watson Content Analytics provides several sample applications that can be used
with content analytics collections. The applications demonstrate various views of
how you can explore analysis results in a content mining application.

The sample content mining applications are in the following default directories:
v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi

Documents view sample application

The DocumentsViewExample class provides a simple example of the minimum
tasks that are required to submit a faceted search query to the search server and
get the Documents view result. The sample application,
DocumentsViewExample.java, demonstrates how to:
v Access the service
v Specify a collection
v Specify an application
v Submit a query
v Process the returned results

Facets view sample application

The FacetsViewExample class provides a simple example of the minimum tasks
that are required to submit a faceted search query to the search server and get
the Facets view result. The sample application, FacetsViewExample.java,
demonstrates how to:
v Access the service
v Specify a collection
v Specify an application
v Get the available taxonomy browsers
v Submit a query to get the specified facet
v Process the returned results

Time Series view sample application

The TimeSeriesViewExample class provides a simple example of the minimum
tasks that are required to submit a faceted search query to the search server and
get the Time Series view result. The sample application,
TimeSeriesViewExample.java, demonstrates how to:
v Access the service
v Specify a collection
v Specify an application
v Get the available taxonomy browsers
v Submit a query to get the specified facet
v Process the returned results



Deviations view sample application

The DeviationsViewExample class provides a simple example of the minimum
tasks that are required to submit a faceted search query to the search server and
get the Deviations view result. The sample application,
DeviationsViewExample.java, demonstrates how to:
v Access the service
v Specify a collection
v Specify an application
v Get the available taxonomy browsers
v Submit a query to get the specified cube
v Process the returned results

Trends view sample application

The TrendsViewExample class provides a simple example of the minimum tasks
that are required to submit a faceted search query to the search server and get
the Trends view result. The sample application, TrendsViewExample.java,
demonstrates how to:
v Access the service
v Specify a collection
v Specify an application
v Get the available taxonomy browsers
v Submit a query to get the specified cube
v Process the returned results

Facet Pairs view sample application

The FacetPairsViewExample class provides a simple example of the minimum
tasks that are required to submit a faceted search query to the search server and
get the Facet Pairs view result. The sample application,
FacetPairsViewExample.java, demonstrates how to:
v Access the service
v Specify a collection
v Specify an application
v Get the available taxonomy browsers
v Submit a query to get the specified cube
v Process the returned results
Related tasks:
“Creating a faceted enterprise search application” on page 25
Related reference:
“Faceted search API” on page 26
“Faceted search queries in content analytics collections” on page 28



Sample real-time NLP application
The real-time natural language processing (NLP) API makes use of a subset of
SIAPI.

The client program must have the siapi.jar and esapi.jar files in the
ES_INSTALL_ROOT/lib directory of the Watson Content Analytics server.

The sample program ES_INSTALL_ROOT/samples/siapi/RealtimeNLPExample.java
demonstrates all of the supported operations of the real-time NLP API. See the
sample program for usage information.
Related concepts:
“Real-time NLP API” on page 59



Sample plug-in application for non-web crawlers
The sample crawler plug-in application shows how you can change security token
values, metadata, and the content of crawled documents.

The sample application, MyCrawlerPlugin.java, is provided in the following
directories:
v For type A data source crawlers: $ES_INSTALL_ROOT/samples/dscrawler
v For type B data source crawlers: $ES_INSTALL_ROOT/samples/ilelcrawler
The following crawlers are type B data source crawlers:
– Agent for Windows file systems crawler
– BoardReader crawler
– Case Manager crawler
– Exchange Server crawler
– FileNet P8 crawler
– SharePoint crawler
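
For orientation, the following heavily abbreviated skeleton shows the shape of a
type A crawler plug-in class like the one in the sample. The base class and
method names follow “Creating a crawler plug-in for type A data sources” on
page 67, but treat them as assumptions to verify against MyCrawlerPlugin.java;
the security token value is a placeholder.

import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData;
import com.ibm.es.crawler.plugin.CrawlerPluginException;

public class MinimalCrawlerPlugin extends AbstractCrawlerPlugin {

    public void init() throws CrawlerPluginException { }
    public void term() throws CrawlerPluginException { }
    public void activate() throws CrawlerPluginException { }
    public void deactivate() throws CrawlerPluginException { }

    // tell the crawler which parts of each document this plug-in needs
    public boolean isMetadataUsed() { return false; }
    public boolean isContentUsed() { return false; }

    // called once for each crawled document
    public CrawledData updateDocument(CrawledData crawledData)
            throws CrawlerPluginException {
        // replace the security token values (placeholder value)
        crawledData.setSecurityTokens("group_sales");
        return crawledData;
    }
}
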
Related concepts:
“Crawler plug-ins” on page 65
“Crawler plug-ins for non-web sources” on page 66
Related tasks:
“Creating a crawler plug-in for type A data sources” on page 67



Sample plug-in for post-filtering search results
The SampleSecurityPostFilterPlugin.java sample plug-in shows how you
can apply your own security logic for post-filtering search results.
package sample.plugin;

import com.ibm.es.security.plugin.NameValuePair;
import com.ibm.es.security.plugin.SecurityPostFilterIdentity;
import com.ibm.es.security.plugin.SecurityPostFilterPlugin;
import com.ibm.es.security.plugin.SecurityPostFilterPluginException;
import com.ibm.es.security.plugin.SecurityPostFilterResult;
import com.ibm.es.security.plugin.SecurityPostFilterUserContext;

/**
 * The sample SecurityPostFilterPlugin class.
 */
public class SampleSecurityPostFilterPlugin implements SecurityPostFilterPlugin {

    /**
     * We should reuse the context for a bunch of results.
     */
    private SecurityPostFilterUserContext context = null;

    /**
     * Default constructor.
     * The <code>SecurityPostFilterPlugin</code> implementation is initialized
     * using this constructor.
     */
    public SampleSecurityPostFilterPlugin() {
        // Initialize resources required for the entire lifetime of this
        // instance. For example, logging.
    }

    /* (non-Javadoc)
     * @see com.ibm.es.security.plugin.SecurityPostFilterPlugin#init
     * (com.ibm.es.security.plugin.SecurityPostFilterUserContext)
     */
    public void init(SecurityPostFilterUserContext context)
            throws SecurityPostFilterPluginException {
        // Initialize resources for the bunch of results,
        // that is, for the results from a query.
        this.context = context;
    }

    /* (non-Javadoc)
     * @see com.ibm.es.security.plugin.SecurityPostFilterPlugin#term()
     */
    public void term() throws SecurityPostFilterPluginException {
        // Finalize the plug-in here after verifying access to documents,
        // that is, deallocate system resources, close remote
        // data source connections, and so on.
    }

    /* (non-Javadoc)
     * @see com.ibm.es.security.plugin.SecurityPostFilterPlugin#verifyUserAccess
     * (com.ibm.es.security.plugin.SecurityPostFilterResult)
     */
    public boolean verifyUserAccess(SecurityPostFilterResult result)
            throws SecurityPostFilterPluginException {

        if (false) {
            // If you do not want to return this result to the user,
            // return false.
            return false;
        }

        // We can refer to a result's information.
        String domain = result.getDomain();
        String source = result.getDocumentSource();
        NameValuePair[] fields = result.getFields();

        SecurityPostFilterIdentity id = null;

        // If domain and source information is associated with a result,
        // we can utilize them.
        if (domain != null && source != null) {
            // We can validate the current credential here in case a domain
            // is assigned to the result.
            // But usually in such cases, we can ask for the system-defined
            // post-filtering.
            id = this.context.getIdentity(domain, source);
        } else {
            // We should walk through the identities that are specified for
            // the query in case we cannot retrieve domain information from
            // a result.
            SecurityPostFilterIdentity[] identities = this.context.getIdentities();
            // EXAMPLE: we choose the first identity
            id = identities[0];
        }

        // EXAMPLE:
        // Verify access to documents from the document source "OmniFindDocs".
        // Only users in the group "OmniFind" are allowed to see documents
        // from "OmniFindDocs".
        if ("OmniFindDocs".equals(source)) {
            // obtain a list of user groups from the identity
            String[] groups = null;
            if (id != null) {
                groups = id.getGroups();
            }

            for (int i = 0; groups != null && i < groups.length; i++) {
                // This user belongs to the "OmniFind" group and
                // therefore has access to the document.
                if ("OmniFind".equals(groups[i])) {
                    return true;
                }
            }
            return false;
        }

        // EXAMPLE:
        // Always allow access to documents from other sources
        // (winfs, notes, quickplace, ...).
        return true;
    }

}
Related tasks:
“Creating and deploying a plug-in for post-filtering search results” on page 83



Sample plug-ins for custom document export
Use these sample plug-ins to apply custom logic when exporting documents.

The DocumentIDListCreator class is an example of a document export plug-in that
generates a file that lists the IDs of all exported documents. This sample code
shows how you can access content and metadata of the documents to be exported
and how you can get properties that are set by using the administration console.

The ExtensionFilter class is an example of a document export filter plug-in that
filters the documents to be exported according to the document extension. This
sample code shows how you can implement custom logic to determine which
documents are exported.

The sample code is installed in the ES_INSTALL_ROOT/samples/export directory.

To use the sample plug-ins for custom document export:


1. Ensure that the ES_INSTALL_ROOT/lib/es.indexservice.jar file is in your class
path before you compile the code.
2. Compile the sample code and create a JAR file that includes both compiled
classes.
3. On the export configuration page in the administration console, select the
Export documents by using a custom plug-in option.
4. In the Export plug-in class path, specify the class path of the JAR file that you
created.
5. In the Publisher area, specify
com.ibm.es.oze.api.export.sample.DocumentIDListCreator in the Class name
field.
6. In the Publisher Properties area, specify docid.list.save.path as the name
and specify the path of an existing directory in which to save the generated file
as the value.
7. In the Filter area, specify com.ibm.es.oze.api.export.sample.ExtensionFilter
in the Class name field.
8. In the Filter Properties area, specify filter.extensions as the name and
specify the file extension of the files to export. For example, specify txt.
9. Export documents.
Related tasks:
“Creating and deploying a plug-in for exporting documents or deep inspection
results” on page 85
Exporting documents for use in other applications
Exporting documents from search results
Exporting deep inspection results
Related reference:
Exporting crawled or analyzed documents



Sample plug-ins for custom widgets in user applications
The sample widget plug-ins show how you can add custom widgets to search and
analytics applications.

The following samples are provided in the ES_NODE_ROOT/master_config/
searchapp/icaplugin directory:
v MyFirstSearchPane is a simple plug-in to add a custom widget for search
applications. This sample widget displays query input fields and lists the
events that occur when the query is processed.
v MyFirstAnalyticsPane is a simple plug-in to add a custom widget for analytics
applications. This sample widget displays the values of the selected facet in a
simple HTML table.

To use the sample custom widgets plug-ins:


1. Register the widgets.
a. Back up and edit the appropriate widgets.json file for the type of
application to which you want to add the sample custom widget:
v To register the MyFirstSearchPane widget for a search application, edit
the ES_NODE_ROOT/master_config/searchserver/repo/search/
Application_Name/widgets.json file.
v To register the MyFirstAnalyticsPane widget for an analytics application,
edit the ES_NODE_ROOT/master_config/searchserver/repo/analytics/
Application_Name/widgets.json file.
Application_Name is the application ID, such as default, social, or advanced.
You can determine the ID by viewing the list of applications in the
application customizer.
b. Add an entry for the widget in the following format. If an entry for the
sample widget already exists, ensure that the value of the available field is
true.
},
"MyFirstAnalyticsPane" : {
  "available" : true,
  "label" : "My First Analytics Pane",
  "widgetName" : "icaplugin/MyFirstAnalyticsPane",
  "properties" : []
}
The MyFirstAnalyticsPane field is the internal ID of this widget. You can
assign any value that includes alphabetic and numeric characters only.
The label field is the name to display for this widget in the list of available
widgets in the layout customizer.
The widgetName field is the Dojo module path of the plug-in. You must
include the icaplugin directory in the path.
The properties field defines the customizable properties for the widget. No
properties are defined for the sample plug-ins.

Tip: Ensure that you include a comma (,) before each entry to conform to
JSON syntax.
2. Restart the search and analytics applications.



v If you use the embedded web application server, enter the following
commands, where node_ID identifies the search server:
esadmin searchapp.node_ID stop
esadmin searchapp.node_ID start
To determine the node ID for the search server, run the esadmin check
command to view a list of session IDs. Look for the node ID that is listed for
the searchapp session.
v If you use WebSphere Application Server:
a. Enter the following command:
esadmin config sync
b. Stop and restart the search and analytics applications.
3. Add the widget to a layout in your user application by using the layout
customizer.
a. In a web browser, open the search or analytics application and click
Customize Layout from the User Session menu in the banner.
b. Select a container.
c. Select the custom widget from the Widget list and click Add Widget.
Related tasks:
“Creating and deploying a plug-in to add custom widgets for user applications”
on page 87



Sample plug-ins for custom global analysis
The sample plug-ins for custom global analysis show how you can use custom
logic in addition to the default global analysis tasks that occur during the indexing
process.

The following samples are provided in the ES_INSTALL_ROOT/samples/jaql
directory:
v The simple.zip sample is an example of a plug-in that uses only the built-in
functions of Jaql. As specified in the install.json configuration file, this plug-in
runs the ../modules/dateRanking.jaql script. This script sorts and assigns a
rank to all documents according to the value of the date field. The script then
adds the rank for each document to the rank field and facet in the index.
v The javaudf.zip sample is an example of a plug-in that uses Java user-defined
functions (UDF) and multiple Jaql module scripts. As specified in the
install.json configuration file, this plug-in runs the ../modules/
tfidfMain.jaql script. This script computes the TF-IDF (term frequency–inverse
document frequency) weight for all nouns in the entire document set, and then
adds the TF-IDF values for each document to the tfidf field and facet in the
index.

Prerequisite: Before you can build the samples, you must install and configure
Apache ANT, a Java-based build tool. For information about how to install and
configure Apache ANT, see http://ant.apache.org/.

To use the sample plug-ins for custom global analysis:


1. Compile the custom global analysis archive files. From the command line,
change to the ES_INSTALL_ROOT/samples/jaql directory and enter the following
command to run Apache ANT on the provided build.xml file.
ant -f build.xml
If you receive a ClassNotFoundException error, update the following line in the
build.xml file to specify the absolute file path to the jaql.jar file. The
jaql.jar file is installed by IBM InfoSphere BigInsights in the $JAQL_HOME
directory.
<property name="path.jaql" value="/opt/ibm/biginsights/jaql/jaql.jar" />
2. In the administration console, create a collection and select the Use IBM
InfoSphere BigInsights option.
v For the simple.zip sample, create an enterprise search collection.
v For the javaudf.zip sample, create a content analytics collection.
3. Create the search fields that are used by the samples:
v For the simple.zip sample, create a field with the name rank and select the
Returnable, Faceted search, and Fielded search attributes.
v For the javaudf.zip sample, create a field with the name tfidf and select the
Returnable and Faceted search attributes. Because the value generated by
this sample is a string, ensure that the Parametric search attribute is not
selected.
4. Configure the custom global analysis task. In the Parse and Index pane of the
administration console, click Configure > Global processing > Custom global
analysis and click the Add icon.



a. On the Custom Global Analysis Fields and Custom Global Analysis Facets
pages, select the fields and facets to pass to the script for analysis.
v For the simple.zip sample, select the date field. You do not need to select
any facets.
v For the javaudf.zip sample, select the Part of Speech ($._word) facet. You
do not need to select any fields.
b. On the Custom Global Analysis Archive File page, specify the path to the
sample archive file on your local computer.
5. Restart the parse and index services for the collection. For the javaudf.zip
sample, you must also deploy the analytic resources. In the Parse and Index
pane, click Analytic Resources and click the icon to start the resource
deployment task.
6. Configure a crawler for the collection and build the index.
7. After the documents are indexed, you can view the results of the custom global
analysis processing.
v For the simple.zip sample, open the enterprise search application and search
for documents. Each document now has a custom_rank field and a rank
facet.
v For the javaudf.zip sample, open the content analytics miner, and explore
documents. Each document now has a custom_tfidf field and tfidf facet.
However, the value is not added if the TF-IDF value does not exceed the
threshold, as specified in the $ES_INSTALL_ROOT/samples/jaql/javaudf/
modules/tfidf.jaql file.
Related concepts:
Custom global analysis
Custom global analysis
Related tasks:
“Creating and deploying a custom global analysis plug-in” on page 91
Related reference:
“Jaql scripts for custom global analysis” on page 92



Sample custom analyzer for document ranking filters
The personname sample analyzer shows how you can create custom analyzers for
use with document ranking filters.

When documents are parsed, this custom analyzer detects the occurrence of person
names that have nicknames. Then the analyzer inserts the nicknames in place of
the original names when the text is extracted to the specified field so that users can
search for documents by entering the nickname. For example, if the value of a field
is "William Smith", the document will be returned if a user enters the search terms
"Will Smith" or "Bill Smith". If you created a document ranking filter for this
analyzer and added it to the Top document ranking filter group, documents that
contain the nicknames in the specified field will be ranked higher in the results.

The sample is provided in the ES_INSTALL_ROOT/samples/customAnalyzer directory.

Important: Before you can upload custom analyzers or associate analyzers with
fields in the administration console, you must enable the custom analyzer support.

To use the sample analyzer:


1. Upload the package that contains the sample analyzer. In the administration
console, open the System view, click the Parse tab, and click Configure custom
analyzer packages. Click Add Package and browse to the ES_INSTALL_ROOT/
samples/customAnalyzer/SamplePackage.zip file.
2. Define a document ranking filter to associate an index field that is enabled for
fielded search with the sample analyzer. In the Parse and Index pane for an
enterprise search collection, click Configure > Custom analyzers for document
ranking filters and click Associate Analyzer with Field. For example, associate
the author field with the personname sample analyzer.
3. Configure the document ranking filter.
a. In the Search pane for the enterprise search collection, click Configure >
Rules to tune queries and results.
b. In the Document Ranking Filters area, select the Enable document ranking
filters check box.
c. Click Edit Document Ranking Filters and add the document filter that you
created in step 2 to a document ranking filter group. For example, click the
tab for the Top document ranking filter group, click Add filters, and select
the filter.
4. Restart the search servers to apply the changes.
5. In a web browser, open the enterprise search application and search for a
nickname. If any documents in the collection contain the name of a person that
corresponds to the specified nickname in the specified field, the documents will
be ranked higher in the results.
Related tasks:
“Creating and deploying a custom analyzer for document ranking filters” on page
97
Configuring document ranking filters
Enabling custom analyzer support



Notices
This information was developed for products and services offered in the U.S.A.
This material may be available from IBM in other languages. However, you may be
required to own a copy of the product or product version in that language in order
to access it.

IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing


IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM
Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing


Legal and Intellectual Property Law
IBM Japan Ltd.
19-21, Nihonbashi-Hakozakicho, Chuo-ku
Tokyo 103-8510, Japan

The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply
to you.

This information could include technical inaccuracies or typographical errors.


Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements
and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those Web



sites. The materials at those Web sites are not part of the materials for this IBM
product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.

Such information may be available, subject to appropriate terms and conditions,


including in some cases, payment of a fee.

The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.

Any performance data contained herein was determined in a controlled


environment. Therefore, the results obtained in other operating environments may
vary significantly. Some measurements may have been made on development-level
systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been
estimated through extrapolation. Actual results may vary. Users of this document
should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of


those products, their published announcements or other publicly available sources.
IBM has not tested those products and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which


illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating



platform for which the sample programs are written. These examples have not
been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
imply reliability, serviceability, or function of these programs. The sample
programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows: © (your company name) (year). Portions of
this code are derived from IBM Corp. Sample Programs. © Copyright IBM Corp.
2004, 2010. All rights reserved.

If you are viewing this information softcopy, the photographs and color
illustrations may not appear.

Additional notices
Portions of this product are:
v Oracle® Outside In Content Access, Copyright © 1992, 2014, Oracle.
v IBM XSLT Processor Licensed Materials - Property of IBM © Copyright IBM
Corp., 1999-2014.

This product uses the FIPS 140-2 approved cryptographic provider(s): IBMJCEFIPS
(certificate 376) and/or IBMJSSEFIPS (certificate 409) and/or IBM Crypto for C
(ICC, certificate 384) for cryptography. The certificates are listed on the NIST web
site at http://csrc.nist.gov/cryptval/140-1/1401val2004.htm.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at "Copyright and
trademark information" at http://www.ibm.com/legal/copytrade.shtml.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
trademarks or trademarks of Adobe Systems Incorporated in the United States,
and/or other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other


countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.

UNIX is a registered trademark of The Open Group in the United States and other
countries.

Other company, product, and service names may be trademarks or service marks
of others.

Privacy policy considerations
IBM Software products, including software as a service solutions, (“Software
Offerings”) may use cookies or other technologies to collect product usage
information, to help improve the end user experience, to tailor interactions with
the end user or for other purposes. In many cases no personally identifiable
information is collected by the Software Offerings. Some of our Software Offerings
can help enable you to collect personally identifiable information. If this Software
Offering uses cookies to collect personally identifiable information, specific
information about this offering’s use of cookies is set forth below.

This Software Offering does not use cookies or other technologies to collect
personally identifiable information.

If the configurations deployed for this Software Offering provide you as customer
the ability to collect personally identifiable information from end users via cookies
and other technologies, you should seek your own legal advice about any laws
applicable to such data collection, including any requirements for notice and
consent.

For more information about the use of various technologies, including cookies, for
these purposes, see IBM’s Privacy Policy at http://www.ibm.com/privacy and
IBM’s Online Privacy Statement at http://www.ibm.com/privacy/details, in the
section entitled “Cookies, Web Beacons and Other Technologies”, and the “IBM
Software Products and Software-as-a-Service Privacy Statement” at
http://www.ibm.com/software/info/product-privacy.





