IBM Watson Content Analytics
Version 3.5
Programming Guide
SC27-6331-00
Note
Before using this information and the product it supports, read the information in “Notices” on page 133.
This edition applies to version 3, release 5, modification 0 of IBM Watson Content Analytics (product number
5724-Z21) and to all subsequent releases and modifications until otherwise indicated in new editions.
© Copyright IBM Corporation 2009, 2014.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents
ibm.com and related resources . . . . . . . . . . . . . . v
  How to send comments . . . . . . . . . . . . . . . . . . vi
  Contacting IBM . . . . . . . . . . . . . . . . . . . . . vi
API overview . . . . . . . . . . . . . . . . . . . . . . . 3
  API documentation . . . . . . . . . . . . . . . . . . . . 5
REST APIs . . . . . . . . . . . . . . . . . . . . . . . . 7
Search and index APIs . . . . . . . . . . . . . . . . . . 11
  SIAPI implementation restrictions . . . . . . . . . . . 12
  Enterprise search applications . . . . . . . . . . . . . 13
  Controlling query behavior . . . . . . . . . . . . . . . 17
  Creating a faceted enterprise search application . . . . 25
  Search and index API federators . . . . . . . . . . . . 33
  Retrieving targeted XML elements . . . . . . . . . . . . 36
  Fetching search results . . . . . . . . . . . . . . . . 37
Query syntax . . . . . . . . . . . . . . . . . . . . . . . 39
  Query syntax structure . . . . . . . . . . . . . . . . . 53
Real-time NLP API . . . . . . . . . . . . . . . . . . . . 59
Application security . . . . . . . . . . . . . . . . . . . 61
  Document-level security . . . . . . . . . . . . . . . . 61
  Identity management for single sign-on security . . . . 62
    Creating the user's security context XML string with the identity management API . . . 63
Crawler plug-ins . . . . . . . . . . . . . . . . . . . . . 65
  Crawler plug-ins for non-web sources . . . . . . . . . . 66
    Creating a crawler plug-in for type A data sources . . 67
    Creating a crawler plug-in for type B data sources . . 69
    Creating and deploying a plug-in for archive files . . 71
    Extending the archive plug-in to view extracted files . 74
  Web crawler plug-ins . . . . . . . . . . . . . . . . . . 75
    Creating a prefetch plug-in for the web crawler . . . . 76
    Deploying a prefetch plug-in . . . . . . . . . . . . . 78
    Creating a postparse plug-in for the web crawler . . . 79
Creating and deploying a plug-in for post-filtering search results . . . 83
Creating and deploying a plug-in for exporting documents or deep inspection results . . . 85
Creating and deploying a plug-in to add custom widgets for user applications . . . 87
Creating and deploying a custom global analysis plug-in . . . 91
  Jaql scripts for custom global analysis . . . . . . . . 92
Creating and deploying a custom analyzer for document ranking filters . . . 97
Sample REST API scenarios . . . . . . . . . . . . . . . . 103
  Compiling the sample REST API applications . . . . . . . 106
  Running the sample REST API applications in Eclipse . . 107
Sample SIAPI enterprise search and content mining applications . . . 109
  Compiling the sample enterprise search and content mining applications . . . 110
  Simple and advanced sample enterprise search applications . . . 111
  Browse and navigation sample application . . . . . . . . 111
  Time scale view sample application . . . . . . . . . . . 111
  Retrieve all search results sample application . . . . . 112
  Fetch document content sample application . . . . . . . 113
  Federated search sample application . . . . . . . . . . 114
  Federated faceted search sample application . . . . . . 115
  Faceted search sample application . . . . . . . . . . . 115
  Content mining sample applications . . . . . . . . . . . 116
Notices . . . . . . . . . . . . . . . . . . . . . . . . . 133
  Additional notices . . . . . . . . . . . . . . . . . . . 135
You can view the product documentation with a web browser in the IBM
Knowledge Center. Content in the IBM Knowledge Center might be more current
than the PDF publications.
PDF publications
You can view the PDF files online by using the Adobe Acrobat Reader for your
operating system. If you do not have the Adobe Reader installed, you can
download it from the Adobe Web site at http://www.adobe.com.
Contacting IBM
To contact IBM customer service in the United States or Canada, call
1-800-IBM-SERV (1-800-426-7378).
To learn about available service options, call one of the following numbers:
v In the United States: 1-888-426-4343
v In Canada: 1-800-465-9600
For more information about how to contact IBM, see the Contact IBM Web site at
http://www.ibm.com/contact/us/.
You can:
v Develop custom applications for searching collections and exploring the results
of text analysis. You can use the application programming interfaces to create
new applications and use the applications that are provided with Watson
Content Analytics as a model for your own applications.
For information about how to use the Watson Content Analytics APIs, see the
examples in the ES_INSTALL_ROOT/samples directory.
REST APIs
Use the REST APIs to create search, content mining, and administration
applications. The search REST API is available on Watson Content Analytics search
servers and is deployed on the search application port, which by default is port
8393 if you use the embedded web application server. If you use WebSphere
Application Server, the default port is 9081 or 80 if IBM HTTP Server is configured.
The administrative REST API is available on the master server if you use the
embedded web application server and uses the same port number as the
administrative console, which by default is 8390. If you use WebSphere Application
Server, the administrative REST API is available on the search application port,
which by default is 9081 or 80 if IBM HTTP Server is configured. You can change
these port numbers when you install Watson Content Analytics.
For more information about using the REST APIs, see the API documentation in
the ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios that demonstrate
how to perform administrative and search tasks are available in the
ES_INSTALL_ROOT/samples/rest directory.
You can use the search and index application programming interfaces to create
custom enterprise search applications. The Watson Content Analytics
implementation of the search and index API (SIAPI) allows the search server to be
accessed remotely.
Restriction: The SIAPI administration APIs are deprecated and are no longer
supported. The SIAPI search APIs are being deprecated and will not be supported
in future releases. Use the REST APIs instead of the SIAPI APIs to create custom
applications.
You can use applications that are provided with Watson Content Analytics as a
base from which to develop your custom applications.
search
    This application shows you how to do basic search and retrieval tasks,
    such as selecting collections for search, querying those collections,
    configuring the display of search results, and narrowing results through
    faceted browsing.
analytics
    This application shows you how to use content mining capabilities.
Plug-in APIs
Plug-in APIs allow you to customize the Watson Content Analytics system in the
following ways:
v Use the crawler plug-ins to modify documents after they are crawled, but before
they are parsed and indexed for search. You can add, change, or delete
information in the document or the document metadata. You can also indicate
that the document is to be ignored (skipped) and not indexed.
v Use the post-filtering plug-in to apply your own security logic for post-filtering
search results.
v Use the export plug-in to apply your own logic for exporting crawled, analyzed,
or searched documents and the output from deep inspection requests.
See the Javadoc documentation for details about the APIs that can be used to
create your own identity management component or customize the provided
solution.
Use this API to perform ad-hoc text analytics on documents without adding the
documents to the index. Both SIAPI and REST API versions of the real-time NLP
API are provided. The NLP REST API accepts both text and binary content, but the
SIAPI version only accepts content in text format.
Restriction: The SIAPI version of the real-time NLP API is being deprecated and
will not be supported in future releases. Use the REST API version instead of the
SIAPI version to create custom applications.
Related concepts:
“API documentation” on page 5
API documentation
API documentation is available for the REST APIs, search and index APIs,
plug-ins, and the identity management component.
Related concepts:
“Search and index APIs” on page 11
“Crawler plug-ins” on page 65
“Extending the archive plug-in to view extracted files” on page 74
“API overview” on page 3
Related tasks:
“Creating and deploying a plug-in for post-filtering search results” on page 83
“Creating and deploying a plug-in for archive files” on page 71
“Creating and deploying a plug-in for exporting documents or deep inspection
results” on page 85
Related reference:
“Enterprise search applications” on page 13
REST APIs
The Watson Content Analytics REST application programming interfaces (APIs)
enable you to create applications to search, explore, and administer collections.
The REST APIs provide capabilities that IBM search and index APIs (SIAPI) offer,
such as:
v Managing collections
v Controlling and monitoring components
v Adding documents to a collection
v Searching a collection and federated collections
v Searching and browsing facets
The search REST API is available on search servers and listens on the search
application port, which by default is port 8393 if you use the embedded web
application server. If you use WebSphere Application Server, the default port is
9081 or 80 if IBM HTTP Server is configured. The administrative REST API is
available on the master server if you use the embedded web application server and
uses the same port number as the administrative console, which by default is 8390.
If you use WebSphere Application Server, the administrative REST API is available
on the search application port, which by default is 9081 or 80 if IBM HTTP Server
is configured. You can change these port numbers when you install Watson
Content Analytics.
HTTP methods
You can use both HTTP GET and HTTP POST methods to call most REST APIs.
For the HTTP GET method, you can directly enter a REST API URL into a web
browser. The POST method is recommended for security reasons.
REST API URLs begin with the following base format:
http://host:port/api/v10/
For the administration REST APIs, create URLs in the following format:
http://Index_server_hostname:Administration_console_port/api/v10/admin/API_name?method=method_name&parameters
For example, use the following administration REST API URL to return
information about the status of the index:
http://Index_server_hostname:8390/api/v10/admin/indexer?method=monitor
&api_username=user_name&api_password=password&collectionId=collection_ID
For the search REST APIs, create URLs in the following format:
http://Search_server_hostname:Search_server_port/api/v10/API_name?parameters
For example, use the following search REST API URL to return a list of all
available namespaces of facets for the specified collection:
http://Search_server_hostname:8393/api/v10/facets/namespaces?collection=sample
The /about/providers API returns all available search REST APIs. The
/about/providerdetail API returns detailed information about all of the available
search REST APIs. These APIs are especially useful if you develop an application
that uses the REST API and you cannot access a computer on which Watson
Content Analytics is installed to view the REST API reference documentation.
For example:
http://Search_server_hostname:8393/api/v10/about/providerdetail?path=/collections&output=application/xml
Tips:
v To create proper URLs, ensure that they are URL encoded. For example,
output=application/atom+xml should be encoded as output=application/atom%2Bxml.
For more information about using the REST APIs, see the API documentation in
the ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios that demonstrate
how to perform administrative and search tasks are available in the
ES_INSTALL_ROOT/samples/rest directory.
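Because the REST APIs are plain HTTP interfaces, you can call them from any HTTP
client. The following minimal sketch issues an HTTP GET request to the facets
namespaces API that is shown above and prints the raw response. The host name and
collection name are placeholder values, and the sketch uses only standard java.net
classes; it is an illustration rather than the product's own client code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class RestApiGetExample {
    public static void main(String[] args) throws Exception {
        // Placeholder search server host and collection ID; replace with your own values.
        String base = "http://search.example.com:8393/api/v10";
        String parameters = "collection=" + URLEncoder.encode("sample", "UTF-8")
                + "&output=" + URLEncoder.encode("application/xml", "UTF-8");
        URL url = new URL(base + "/facets/namespaces?" + parameters);

        // Issue the HTTP GET request.
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        // Read and print the response body.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}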
Restriction: The following functions are not available in the REST API:
v com.ibm.es.siapi.admin.AdminServiceImpl.associateApplicationWithCollection
v com.ibm.es.siapi.admin.AdminServiceImpl.registerApplication
v com.ibm.es.siapi.admin.AdminServiceImpl.unregisterApplication
v com.ibm.es.siapi.admin.AdminServiceImpl.disassociateApplicationFromCollection
v com.ibm.es.siapi.admin.AdminServiceImpl.performAdminCommand (changeRankingModel and revisitURLs options)
Related reference:
“Sample REST API scenarios” on page 103
Search and index APIs
The IBM search and index API (SIAPI) is a programming interface that enables you
to search and explore collections.
Restriction: The SIAPI search APIs are being deprecated and will not be supported
in future releases. Use the REST APIs instead of the SIAPI APIs to create custom
applications. For more information about using the REST APIs, see the API
documentation in the ES_INSTALL_ROOT/docs/api/rest directory. Sample scenarios
that demonstrate how to perform administrative and search tasks are available in
the ES_INSTALL_ROOT/samples/rest directory.
SIAPI is a factory-based interface that allows for different implementations of the
search engine. By using SIAPI, your custom application can use different search
engines that are provided by IBM without changing your SIAPI application. For
example, if you create a SIAPI application in WebSphere® Portal that uses the
portal search engine, you can use the Watson Content Analytics search engine
without the need to change your enterprise search application.
SIAPI supports the following types of search and content mining tasks:
v Searching collections
v Customizing the information that is returned in the search results
v Searching and browsing facets
v Querying several enterprise search collections as if they were one collection
(search federation)
v Viewing results with URIs that you can click and viewing scoring information
(ranking)
v Searching and retrieving documents from a broad range of data sources, such as
IBM Content Integrator repositories and Lotus Notes® databases
v Performing real-time text analytics on documents without adding the analyzed
documents to the index
The following figure shows the relationships among the SIAPI search APIs.
[Figure: the SearchFactory obtains a SearchService object, and the SearchService
object obtains Searchable objects. SearchFactory methods shown: createApplicationInfo,
createQuery, getSearchService, createLocalFederator. SearchService (Java interface)
methods shown: getSearchable, getFederator, getAvailableSearchables,
getAvailableFederators, getAvailableFields, getAvailableAttributeValues,
getCollectionInfo, getProperty, setProperty, getProperties. Searchable methods shown:
search, count, setSpellCorrectionEnabled, isSpellCorrectionEnabled,
getSpellCorrections, setSynonymExpansionEnabled, isSynonymExpansionEnabled,
getSynonymExpansions, getDefaultLanguage.]
Related concepts:
“API documentation” on page 5
Deprecated packages
Use the REST APIs instead of the SIAPI APIs to create custom applications. For
more information about using the REST APIs, see the API documentation in the
ES_INSTALL_ROOT/docs/api/rest directory.
Unsupported methods
Class:
com.ibm.siapi.search.RemoteFederator
Methods:
searchStreaming(Query)
searchStreaming(Query, String[])
searchStreaming(Query, String[], String[])
Class:
com.ibm.siapi.search.StreamingResultSet
Methods:
getEstimatedNumberOfResults()
getPredefinedResults()
getProperties()
getProperty(String)
getSearchState()
getSpellCorrections()
getSynonymExpansions()
hasUnconstrainedResults()
isEvaluationTruncated()
addMessage(SiapiMessage)
addMessages(List)
clearMessages()
getMessages()
Class:
com.ibm.siapi.browse.BrowseFactory
Methods:
createApplicationInfo(String, String)
createApplicationInfo(String, String, String)
See the Javadoc documentation for examples of the search and index APIs.
To create an enterprise search application with the search and index APIs:
1. Instantiate an implementation of a SearchFactory object.
2. Use the SearchFactory object to obtain a SearchService object.
The SearchService object is configured with the connection information that is
necessary to communicate with the search engine. With the SearchService
object, you can access searchable collections. Configure the SearchService
object with the Watson Content Analytics administrator user name and
password, host name, and port. Configuration parameters are set in a
java.util.Properties object. The parameters are then passed to the
getSearchService factory method that generates the SearchService object.
The search and index APIs are a factory-based Java API. All of the objects that are
used in the enterprise search application are created by calling search and index
API object-factory methods or are returned by calling methods of factory-generated
objects. You can easily switch between search and index API implementations by
loading different factories.
The search and index API implementation in Watson Content Analytics is provided
by the com.ibm.es.api.search.RemoteSearchFactory class.
Use the following search and index API packages to create an enterprise search
application:
com.ibm.siapi
Root package
com.ibm.siapi.browse
Contains taxonomy browsing interfaces
com.ibm.siapi.common
Common SIAPI interfaces
com.ibm.siapi.search
Interfaces for searching collections
com.ibm.siapi.search.facets
Interfaces for faceted search
To create a search and index API enterprise search application, obtain the
implementation of the SearchFactory object as in the following example:
Class cls = Class.forName("com.ibm.es.api.search.RemoteSearchFactory");
SearchFactory factory = (SearchFactory) cls.newInstance();
When you request a Searchable object, you need to identify your application by
using an application ID. Contact your Watson Content Analytics administrator for
the appropriate application ID.
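Putting these steps together, the following sketch obtains a SearchService object and
a Searchable object. The connection property names, the application ID MyApplication,
and the collection ID sample are illustrative assumptions; check the Javadoc
documentation and the sample programs in the ES_INSTALL_ROOT/samples/siapi directory
for the exact configuration keys and method signatures, and assume that the
com.ibm.siapi.search and com.ibm.siapi.common interfaces are imported.

// Obtain the SearchFactory implementation.
Class cls = Class.forName("com.ibm.es.api.search.RemoteSearchFactory");
SearchFactory factory = (SearchFactory) cls.newInstance();

// Connection parameters for the search server (the key names shown are assumptions).
Properties config = new Properties();
config.setProperty("hostname", "search.example.com");
config.setProperty("port", "8393");
config.setProperty("username", "esadmin");
config.setProperty("password", "password");

// Obtain the SearchService object and identify the application by its application ID.
SearchService searchService = factory.getSearchService(config);
ApplicationInfo appInfo = factory.createApplicationInfo("MyApplication");

// Obtain a Searchable object for one collection.
Searchable searchable = searchService.getSearchable(appInfo, "sample");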
Issuing queries
After the Searchable object is obtained, you issue a query to that Searchable
object. To issue a query to the Searchable object:
1. Create a Query object.
2. Customize the Query object.
3. Submit the Query object to the Searchable object.
4. Get the query results, which are specified in a ResultSet object.
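A minimal sketch of those four steps, continuing from the Searchable object that was
obtained earlier; the query string is an arbitrary example and the exact createQuery
signature should be confirmed in the Javadoc documentation:

// 1. Create a Query object from a query string.
Query query = factory.createQuery("computer software");

// 2. Customize the Query object, for example with setProperty or setSortKeys.

// 3. Submit the Query object to the Searchable object.
ResultSet resultSet = searchable.search(query);

// 4. The query results are in the ResultSet object and can be processed as shown
//    in the next example.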
With the ResultSet interface and Result interface, you can process query results,
as in the following example:
Result[] results = resultSet.getResults();
for (int i = 0; i < results.length; i++) {
    System.out.println("Result " + i + ": " + results[i].getDocumentID()
        + " - " + results[i].getTitle());
}
Related concepts:
“API documentation” on page 5
Related reference:
“Federated search sample application” on page 114
See the Javadoc documentation for more details about each method and property.
Related concepts:
“Query syntax” on page 39
Related reference:
“Query syntax structure” on page 53
The setProperty method for query object has the following format:
query.setProperty("String name", "String value");
In the following example, the capital letters represent characters of the query term
and the at sign (@) represents other characters in the document. If the query term is
ABCDE, a typical n-gram fuzzy search returns a document that includes a character
sequence such as @@AB@@@BC@@@@@@CD@@@@@DE because this document contains all of the
n-grams that are generated from the specified query. However, for some languages,
this query result is not preferable because the text often has a completely different
meaning when those n-grams are far apart.
To improve fuzzy search results, you can control the level of ambiguity in the
query by specifying the FuzzyNGramAmbiguity property and optionally the
FuzzyNGramAmbiguityCondition property.
FuzzyNGramAmbiguity property
The ambiguity must be greater than 0.0 and less than or equal to 1.0. If the
ambiguity is set to 1.0, it is equivalent to an exact match. The lower the ambiguity
value, the more ambiguity is allowed when determining whether each document has
character sequences that are similar to the search term. Thus, the search query
retrieves more documents.
Ambiguity is similar to the ratio of characters that appear in the same position and
the same order as in the search query.
v Format: ambiguity
v Ambiguity: float value, 0.0 < ambiguity <= 1.0, to specify the ambiguity
This property sets the ambiguity that is applied to all search terms of the query
except for the terms that are specified by the FuzzyNGramAmbiguityCondition
property. The higher the ambiguity, the more similar the returned documents are to
the search term. In other words, the returned documents include character sequences
closer to the original search term when a higher ambiguity is specified.
FuzzyNGramAmbiguityCondition property
In the following example, Watson Content Analytics searches for all terms except
tablename:DATA_TBL with ambiguity 0.8, and searches for tablename:DATA_TBL with
ambiguity 1.0 (exact match):
q.setProperty("FuzzyNGramAmbiguity", "0.8");
q.setProperty("FuzzyNGramAmbiguityCondition", "tablename:DATA_TBL=1.0");
Use Unicode identifiers for languages to set a specific language. For example, for
English, the query language parameter is en. For Chinese, use zh-CN for simplified
Chinese and zh-TW for traditional Chinese.
The setLinguisticMode(int mode) method sets the linguistic mode for a query. You
can set one of the following modes:
LINGUISTIC_MODE_ENGINE_DEFINED
Unmodified terms are matched according to the engine's best-effort policy.
This is the default mode. Base and exact form matching is performed by
default.
LINGUISTIC_MODE_EXACT_MATCH
Unmodified terms are matched as entered without undergoing linguistic
processing. This method allows the search engine to find exact results.
LINGUISTIC_MODE_BASEFORM_MATCH
Unmodified terms are matched by their base form after undergoing
linguistic processing. For example, the query term jumping matches
documents that contain jump, jumped, jumps, and so on.
LINGUISTIC_MODE_EXACT_AND_BASEFORM
Unmodified terms are matched by their base form and their exact form
after undergoing linguistic processing. For example, the query term
jumping matches documents that contain jump, jumped, jumps, and so on.
The difference from the LINGUISTIC_MODE_BASEFORM_MATCH mode
is that although linguistic base form matching relies on the query language
that matches the identified languages of the result documents, the
LINGUISTIC_MODE_EXACT_AND_BASEFORM mode assures that
documents that contain the exact form jumping are returned regardless of
their identified language.
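For example, assuming a Query object named query and that the mode constants are
declared on the query interface (confirm the declaring class in the Javadoc
documentation), base form matching can be requested as follows:

// Match unmodified terms by their base form: jumping also matches jump, jumped, jumps.
query.setLinguisticMode(Query.LINGUISTIC_MODE_BASEFORM_MATCH);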
If your enterprise search application supports the ability to search within results,
the linguistic mode that you specify for the application influences the number of
results returned. If the application is configured to use
LINGUISTIC_MODE_ENGINE_DEFINED, then a search within results might
return more documents than the original search. For example, if a user searches for
the term Lien, and then searches within results for the term Custody, the query is
expanded to be the query Lien ^Custody, which can show documents that contain
Lien or Custody.
If this is not the behavior that you want to see in your enterprise search
application, use one of the other linguistic modes. If you do not want users to see
the No preference option when they configure preferences for the enterprise search
application, you can edit the WebContent/options.jsp file to comment out the
HTML text, including the item prompt.selection.mode.engine.
By default, no metadata fields are returned, so you must use this method to return
metadata fields.
By default, all of the predefined result attribute values are returned except for the
RETURN_RESULT_FIELDS metadata fields attribute.
The fromResult value controls which ranked document your result set starts from.
For example, a value of 0 means that you are requesting the first document in the
query results.
The numberOfResults value controls how many results to return in the current page
of results. The numberOfResults value must be smaller than the maximum number
of results that is configured in the administration console minus the fromResult
value.
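As a sketch, assuming that the query object exposes a paging method named
setRequestedResultRange that takes these two values (the method name is an assumption
here; confirm it in the Javadoc documentation), requesting the second page of 10
results looks like this:

// Request results 10 through 19: fromResult = 10, numberOfResults = 10.
// The method name is an assumption; check the Javadoc documentation for your release.
query.setRequestedResultRange(10, 10);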
To retrieve more results from that same site, use the samegroupas:result URL query
syntax or re-issue the same query with the site http://www.ibm.com added to the
query string. See “Query syntax” on page 39 for more information.
You can specify whether query results contain predefined links in addition to the
regular results. Predefined links are enabled by default.
Any field that is defined for the collection and declared as "sortable" (for text
fields) or "parametric" (for numeric fields) can be specified as one of the sort keys
that are represented in SortKey objects in the call to the setSortKeys method.
Textual keys are sorted lexicographically according to the specified sort locale and
numeric keys are sorted arithmetically. Create a SortKey object by calling the
SearchFactory#createSortKey(String) method and modifying the sort order and
locale for the object. Then construct an array of SortKey objects and associate the
array to the query by calling the setSortKeys method.
Important:
v Because specifying multiple sort keys requires increased system resources such
as memory, specifying multiple sort keys might affect performance.
The collating sequence (that is, the order of characters in the alphabet to use for
sorting) is by default the sequence that is used by the collection. You can specify a
different sequence by providing a locale name as a second argument to the
setSortKeys method. For example, if you create a sortKey object
SearchFactory#createSortKey("title"), call the method setLocale("de_AT") for the
object, and then call the method Query#setSortKeys(new SortKey[]{sortKey}),
results are sorted by the value of their title field by using the alphabetic order that
is common in German as used in Austria. Use the standard five character locale
format xx_XX. For example, the locale for American English is en_US. The locale
for Japanese is ja_JP.
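Combined, the calls that are described in the previous paragraph look like the
following sketch; the field name title and the locale de_AT are the same illustrative
values that are used above:

// Create a sort key for the "title" field and use the German (Austria) collating sequence.
SortKey sortKey = factory.createSortKey("title");
sortKey.setLocale("de_AT");

// Attach the sort key to the query.
query.setSortKeys(new SortKey[] { sortKey });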
You can specify multiple SortKey objects. The order of SortKey objects in the array
specifies priority of sort keys. Search results are initially sorted by the first SortKey
object. After the search results are grouped by the first sort key, the results are
sorted by the next sort key.
The default sort order is descending: The first results to be output are those at the
top of the order. For example, the most relevant results are displayed at the top if
they are sorted by relevance, the most recent results are displayed at the top if they
are sorted by date, and so on.
Results whose sort key value is missing, undefined, or unavailable are sorted to
the end of the results list regardless of their sort order.
To indicate that the results from a query should be sorted by a field, use the
following methods:
BaseQuery.setSortKeys(<field_name>) or BaseQuery.setSortKeys(<field_name>, <locale>)
BaseQuery.setSortOrder({SORT_ORDER_ASCENDING | SORT_ORDER_DESCENDING})
BaseQuery.setSortPoolSize({<int> | SORT_ALL_RESULTS})
See the Javadoc documentation for examples of the search and index APIs.
Procedure
To create a faceted enterprise search application with the search and index APIs:
1. Instantiate an implementation of a FacetsFactory object. The FacetsFactory
can then be used to obtain a FacetsService object.
2. Use the FacetsFactory object to obtain a FacetsService object. The
FacetsService object is configured with the connection information that is
necessary to communicate with the search engine. With the FacetsService
object, you can access faceted searchable collections. Configure the
FacetsService object with the host name, port, and, if WebSphere Application
Server global security is enabled, a valid WebSphere user name and password
for the search server. Configuration parameters are set in a
java.util.Properties object. The parameters are then passed to the
getFacetsService factory method that generates the FacetsService object.
3. Obtain a FacetedSearchable object. After you obtain a FacetsService object,
you can use it to obtain one or more FacetedSearchable objects. Each search
and index API searchable object is associated with one enterprise search or
content analytics collection. You can also use the FacetsService object to obtain
a federator object. A federator object is a special kind of FacetedSearchable
object that enables you to submit a single FacetedQuery object across multiple
FacetedSearchable objects (collections) at the same time.
When you request a FacetedSearchable object, you need to identify your
application by using an application ID. Contact your administrator for the
appropriate application ID.
4. Issue queries. The faceted enterprise search application passes search queries to
the search runtime on the search server. After the FacetedSearchable object is
obtained, you issue a query to that FacetedSearchable object. To issue a query
to the FacetedSearchable object:
a. Create a FacetedQuery object.
b. Customize the FacetedQuery object.
c. Submit the FacetedQuery object to the FacetedSearchable object.
The faceted search and index API implementation in Watson Content Analytics is
provided by the com.ibm.es.api.search.facets.RemoteFacetsFactory class.
Configure the FacetsService object with the host name, port, and, if WebSphere
Application Server global security is enabled, a valid WebSphere user name and
password for the search server.
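A sketch of obtaining the factory and service objects, parallel to the enterprise
search example earlier in this guide; the connection property names are assumptions,
so check the Javadoc documentation and the faceted search samples in the
ES_INSTALL_ROOT/samples/siapi directory for the exact keys:

// Obtain the FacetsFactory implementation.
Class cls = Class.forName("com.ibm.es.api.search.facets.RemoteFacetsFactory");
FacetsFactory facetsFactory = (FacetsFactory) cls.newInstance();

// Connection parameters for the search server (the key names shown are assumptions).
Properties config = new Properties();
config.setProperty("hostname", "search.example.com");
config.setProperty("port", "8393");

// Obtain the FacetsService object.
FacetsService facetsService = facetsFactory.getFacetsService(config);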
If you do not want to retrieve the facets, set the empty facet context, as in the
following example:
FacetContext facetContext = facetsFactory.createFacetContext();
query.setFacetContext(facetContext);
FacetedResultSet resultSet = searchable.search(query);
If you want to refine search results for a particular facet, specify faceted query
terms such as /country/Japan in the query string. See “Query syntax” on page 39
for more information.
With the FacetedResultSet, Result, FacetSet and Facet interfaces, you can process
query results, as in the following example:
FacetedResultSet resultSet = searchable.search(query);
Result[] results = resultSet.getResults();
if (results != null) {
    for (int i = 0; i < results.length; i++) {
        System.out.println("Result " + i + ": " + results[i].getDocumentID()
            + " - " + results[i].getTitle());
    }
}
FacetSet[] facetSets = resultSet.getFacetSets();
if (facetSets != null) {
    for (int i = 0; i < facetSets.length; i++) {
        Facet selfFacet = facetSets[i].getSelfFacet();
        // Process selfFacet and any child facets here.
    }
}
Sample programs
The following sample programs for faceted search are provided in the
ES_INSTALL_ROOT/samples/siapi directory:
v FacetedSearchExample
v DocumentsViewExample
Related concepts:
“Search and index API federators” on page 33
Related reference:
“Faceted search sample application” on page 115
“Content mining sample applications” on page 116
“Federated faceted search sample application” on page 115
Use a date taxonomy browser to issue a faceted query to get the date facets on the
Time Series view, Deviations view, and Trends view of a content analytics
collection. The root category of the date taxonomy browser has the following
categories as children:
v Year (id = "$.year")
v Month (id = "$.month")
v Week (id = "$.week")
v Day (id = "$.day")
v Month of Year (id = "$.month_of_year")
v Day of Month (id = "$.day_of_month")
v Day of Week (id = "$.day_of_week")
Use the FacetsFactory object to obtain the QualifiedCategory object to issue a
faceted query to get the date facets, as in the following example:
QualifiedCategory qualifiedCategory = facetsFactory.createQualifiedCategory(
dateBrowser.getTaxonomyInfo().getID(),
dateBrowser.getCategory("$.day").getInfo());
Constraint constraint = facetsFactory.createConstraint();
constraint.set(Constraint.SUBCATEGORY_COUNT_MODE, -1, false, null);
qualifiedCategory.setConstraint(constraint);
Use facet value and subcategory taxonomy browsers to issue a faceted query to get
the facet values and subcategory facets on the following views of a content
analytics collection:
v Facets view
v Deviations view
v Trends view
v Facet Pairs view
The root category of facet value and subcategory taxonomy browsers has the
following system-defined categories and user-defined categories in the
administration console as children:
v Part of Speech (id = "$._word")
v Phrase Constituent (id = "$._phrase")
Use flag, range, and rule-based taxonomy browsers to issue a faceted query to get
the flag, range, and rule-based facets on the following views of a content analytics
collection:
v Facets view
v Deviations view
v Trends view
v Facet Pairs view
These taxonomy browsers are available if you configure document flagging, range
facets and rule-based categories in the administration console.
Add the TargetFacet object that was obtained from the FacetsFactory object to the
facet context of a faceted query on the Facets view and Time series view of a
content analytics collection, as shown in the following example:
TargetExpressions targetExpressions = facetsFactory.createTargetExpressions();
// if you want to get the correlation value of facet
targetExpressions.addExpression("correlation", "#correlation");
// if you want to get the expected count value of facet
targetExpressions.addExpression("expected_count", "#expected_count");
TargetFacet targetFacet = facetsFactory.createTargetFacet(qualifiedCategory,
targetExpressions);
FacetContext facetContext = facetsFactory.createFacetContext();
facetContext.add(targetFacet);
query.setFacetContext(facetContext);
FacetedResultSet resultSet = searchable.search(query);
On the Deviations view, Trends view, and Facet Pairs view of a content analytics
collection, add the TargetCube object to the facet context of a faceted query to get a
two dimensional facet. To get the correlation value of the cube, you can use the
following expressions:
v "#topic_view_correlation" on the Deviations view
v "#delta_view_correlation" on the Trends view
v "#2dmap_view_correlation" on the Facet Pairs view
The following example shows how to issue a faceted query on facet pairs view:
QualifiedCategory[] dimensions = { verticalQualifiedCategory,
horizontalQualifiedCategory };
Expression expression = facetsFactory.createExpression(
"correlation", "#2dmap_view_correlation");
TargetExpressions targetExpressions = facetsFactory.createTargetExpressions();
targetExpressions.add(expression);
TargetCube targetCube = facetsFactory.createTargetCube(dimensions,
    targetExpressions);
Sample programs
The following sample programs for content analytics collections are provided in
the ES_INSTALL_ROOT/samples/siapi directory:
v FacetsViewExample
v TimeSeriesViewExample
v DeviationsViewExample
v TrendsViewExample
v FacetPairsViewExample
Related reference:
“Faceted search sample application” on page 115
“Content mining sample applications” on page 116
“Federated faceted search sample application” on page 115
The following example shows how you can get available browsers from
BrowseService:
TaxonomyBrowser[] browsers = browseService.getAvailableTaxonomyBrowsers
(applicationInfo, collectionId);
Use a time scale taxonomy browser to issue a faceted query to get the specified
date scale counts in an enterprise search collection. The following example shows
how to issue a faceted query to get specified date scale counts.
TaxonomyBrowser browser = timescaleBrowser;
// get category id corresponding to the facet path
// /<the date facet name>/<the specified granularity>/
// Available date facet names and granularities can be obtained by browsing.
Category category = browser.getCategory(getIdFromTaxonomyBrowser(browser,
this.facetPath));
Use a facet taxonomy browser to issue a faceted query to get the facets and facet
values in an enterprise search collection. The first-level child categories of the
facet taxonomy browser return user-defined metadata facets.
Use flag, range, scope, and rule-based taxonomy browsers to issue a faceted query
to get the flag, range, scope, and rule-based facets in an enterprise search
collection. These taxonomy browsers are available if you configure document
flagging, range facets, scopes, and rule-based categories in the administration
console.
To issue a faceted query to get flag, range, or rule-based facets, add the
TargetFacet object that was obtained from the FacetsFactory object to the facet
context of a faceted query, as shown in the previous example for facet taxonomy
browsers.
Sample programs
Search federators are intermediary components that exist between the requestors of
a service and the agents that perform that service. They coordinate resources to
manage the multitude of searches that are generated from a single request.
The following types of search and index API federators are available:
v Local federator
v Remote federator
Search federators are search and index API searchable objects. Multiple-level
federation is allowed, but too many levels of federation will decrease search
performance.
The local and remote federators can federate over collections that are created with
Watson Content Analytics or collections that are created with another product. You
can federate over collections that are not created with Watson Content Analytics if
those collections use lightweight directory access protocol (LDAP) or Java database
connectivity (JDBC).
Local federator
A local federator federates from the client over a set of searchable objects. In
addition to using a local federator to perform traditional searches, you can use a
local faceted federator to gather results of a faceted search from multiple
collections.
Before you can create a local federator, you must create or retrieve searchable
objects by using a search and index API SearchFactory. The searchable object that is
passed to the local federator must be ready for search without any additional
information. The local federator uses the searchable object to issue a federated
search request. To complete this request, the local federator environment must have
all the necessary software components for using various searchable objects.
The following code sample shows how to create a LocalFederator object and issue
a search request:
// create searchables
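// The rest of this sample is a sketch: the searchable variables, the
// createLocalFederator signature, and the returned type are assumptions, so see
// the Javadoc documentation and the federated search sample application for details.
Searchable searchable1 = searchService.getSearchable(appInfo, "collection1");
Searchable searchable2 = searchService.getSearchable(appInfo, "collection2");
Searchable[] searchables = new Searchable[] { searchable1, searchable2 };

// Create a local federator over the searchables. A federator is itself a searchable
// object, so a single query can be issued across all of the collections at once.
Searchable localFederator = factory.createLocalFederator(searchables);
ResultSet resultSet = localFederator.search(factory.createQuery("computer software"));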
Remote federator
A remote federator federates from a server over a set of searchable objects. In
addition to using a remote federator to perform traditional searches, you can use a
remote faceted federator to gather results of a faceted search from multiple
collections.
A remote federator is run on the server and consumes server resources. A remote
federator requires an extra step in which input collection IDs are mapped to the
matching searchable object.
Each enterprise search application will have its own federator, so the federator ID
is the same value as the ApplicationInfo ID value.
The following code sample shows how to create a RemoteFederator object and
issue a search request. Use the com.ibm.siapi.search.SearchService.getFederator()
method to obtain a remote federator.
// obtain the SearchFactory implementation
Class cls = Class.forName("com.ibm.es.api.search.RemoteSearchFactory");
SearchFactory factory = (SearchFactory) cls.newInstance();
Properties properties;
String applicationName="All", federatorId="Default";
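// The remainder of this sample is a sketch: the connection property names and the
// getFederator signature are assumptions, so see the Javadoc documentation and the
// federated search sample application for details.
properties = new Properties();
properties.setProperty("hostname", "search.example.com");
properties.setProperty("port", "8393");
SearchService searchService = factory.getSearchService(properties);
ApplicationInfo appInfo = factory.createApplicationInfo(applicationName);

// Obtain the remote federator by its federator ID and issue a federated search.
Searchable remoteFederator = searchService.getFederator(appInfo, federatorId);
ResultSet resultSet = remoteFederator.search(factory.createQuery("computer software"));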
The XML element is designated as the targeted XML element whose occurrences
are to be enumerated. When the semantic search is expressed by XPath, then by
definition of XPath, the deepest element that is not inside the bracketed phrase [..]
and not inside a predicate is the target element.
For example, the query <book language=en> <#author> </#author> </book>, or the
equivalent query <book language=en> <#author/> </book>, returns documents that
include at least one occurrence of the annotation book that has the attribute
language=en and includes within its span an occurrence of the annotation author.
The query also returns the enumeration of all the occurrences of the tag <author>
that appear within the occurrence of the tag <book> that has the attribute
language=en.
Each occurrence is enumerated by its unique ID. The UIMA annotators assign a
unique ID to each annotation that they generate. XML elements that are part of the
raw document rather than annotations that are generated by UIMA annotators do
not have unique IDs, and they are not enumerated in that result field. If the
summary field of the retrieved document includes text that is covered in the
document by an enumerated occurrence, that text is highlighted.
The following occurrences of the tag <author> in the retrieved document will not
be enumerated:
v An occurrence of the tag <author> within the span of the tag <journal>
v An occurrence of the tag <author> within the span of the tag <book> that has the
attribute language=ge
v An occurrence of the tag <author> within the span of the tag <book> that does
not have the attribute language
The enterprise search application can access the enumeration of the occurrences of
the target element through the TargetElement property of the Result object, for
example, Result.getProperty("TargetElement"). The returned value of that
property is a string of integers that are separated by spaces. Each integer is an ID
of a single occurrence of the target element.
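Because the returned value is a space-separated string of integers, the application
can split it to get the individual occurrence IDs. A minimal sketch, assuming a
Result object named result from the result set:

// Read the enumeration of target element occurrences and split the space-separated IDs.
String targetElements = (String) result.getProperty("TargetElement");
if (targetElements != null && !targetElements.isEmpty()) {
    for (String occurrenceId : targetElements.split(" ")) {
        System.out.println("Target element occurrence ID: " + occurrenceId);
    }
}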
The actual target elements that correspond to these integer values cannot be
retrieved by the API. If an application must access those elements, it must create
its own mapping table during parsing. For example, you can create a common
analysis for relational database mapping.
The fetch API enables users to view content by clicking documents in the search
results. This API is especially useful for data sources that do not return a clickable
URI, such as documents from IBM DB2, IBM Content Manager Enterprise Edition,
and file system sources.
The fetch API uses client libraries that are installed when Watson Content
Analytics is installed. In a multiple server installation, the libraries are installed on
the crawler server. No additional application development work is required to take
advantage of this API because the API is provided with the esapi.jar file.
The fetch API supports security at the search server level, collection level (through
application IDs), and at the document level (through indexed access controls and
current user validation). The security policy relies on the security settings in the
enterprise search application. If the enterprise search application returns a
The following list describes the characters that you can use in enterprise search
and content mining applications to refine query results.
Free style query syntax
Free style query syntax is used to describe queries that do not have an
explicit interpretation and for which there is no default behavior defined.
The default implementation for this type of query is to return documents
only if they match all terms in the free style query.
Query: computer software
Result: This query returns documents that include the term computer and
the term software, or something else depending on the semantics
implemented in the application.
~ (prefix)
Precede a term with a tilde sign (~) to indicate that a match occurs
anytime a document contains the word or one of its synonyms.
Query: ~fort
Result: This query finds documents that include the term fort or one of its
synonyms (such as garrison and stronghold).
~ (postfix)
Follow a single term with a tilde sign (~) to indicate that a match occurs
anytime a document contains a term that has the same linguistic base form
as the query term (also known as a lemma or stem).
Query: run~
Result: This query finds documents that include the term run, running, or
ran because run is the base form of the verb.
+ Precede a term with a plus sign (+) to indicate that a document must
contain the term for a match to occur. Because the plus sign is the default,
it is usually omitted. The plus sign is not needed because documents are
included in the search results only if they match all terms in a free style
query. In a free text query (without the plus sign) only matches in exact
form are returned.
Query: +computer +software
Result: This query returns documents that include the term computer and
the term software.
− Precede a term with a minus sign (-) to indicate that the term must be
absent from a document for a match to occur. The minus sign acts as a
filter to remove documents and must be associated with a query that
returns positive results.
Query: computer -hardware
Result: This query finds documents that include the term computer but do not
include the term hardware.
To search for phrases that contain double quotation marks (") or backslash
characters (\), use the backslash character to escape the restricted character.
For example, "\"The Godfather\"" or "hardware\\software requirements".
/facet_name/value_level_1/.../value_level_n
If you search a collection that contains facets, you can search for
documents that contain a specific facet or facet value. For facets with
multiple value levels, such as hierarchical and date facets, you can search
for multiple-level facet values.
Query: /country/Japan
Result: This query finds documents that include the facet country with the
facet value Japan.
Query: /date/2009/1/15 /location/US/California
Result: This query finds documents that include the facet date with the
multiple-level facet values 2009, 1, and 15, and the facet location with the
multiple-level facet values US and California.
^boost Follow a search term by a boost value to influence how documents that
contain a specified term are ranked in the search results.
Query: ibm Germany^5.0
Result: This query finds documents that include the terms IBM and
Germany, and increases the relevance of these documents by a factor of 5 in
the search results.
~ambiguity
Follow a search term with a tilde sign (~) and an ambiguity value between 0.0 and
1.0 to do a fuzzy search for terms that are similar to the query term.
Query: ibm analytics~0.5
Result: This query does a fuzzy search and finds documents that include
the terms IBM and analytics, IBM and analyze, IBM and analysis, and so
on.
() Use parentheses ( ) to indicate that a document must contain one or more
of the terms within the parentheses for a match to occur. Use OR or a
vertical bar ( | ) to separate the terms in parentheses.
Do not use plus signs (+) or minus signs (-) within the parentheses.
Query: +computer (hardware OR software)
Query: +computer (hardware | software)
Result: Both of these queries find documents that include the term
computer and at least one of the terms hardware or software.
An OR of terms is designated as required (+) by default. Therefore, the
previous queries are equivalent to +computer +(hardware | software).
The following list describes keywords that you can use to limit a search to specific
documents or specific parts of documents.
IN contextual view
If a content analytics collection contains contextual views, you can include
the IN keyword with other query operators and keywords to search only
the documents that belong to a specific contextual view.
Query: computer IN question "software maintenance" IN answer
Result: This query returns documents that contain the term computer in the
question view and contain the phrase software maintenance in the answer
view.
Query: /keyword$._word.noun/computer IN question IN answer
Result: This query returns documents that include the noun facet with the
facet value computer in the intersection of the question and answer views.
Query: (software maintenance) WITHIN 5 IN answer
Result: This query returns documents that contain the words software and
maintenance, or matching forms of the words, in any order, within 5 words
of each other in the answer view.
Query: @xmlf2::'<title>IBM computers</title>' IN question
Result: This query returns documents that contains the phrase IBM
computers in the <title> element of an XML fragment in the question view.
(terms) WITHIN context IN ORDER
Follow a search term or phrase by proximity search operators to find
documents that contain terms within a specified number of words of each
other, in the same sentence, or in a specified order within a sentence. The
IN ORDER option is optional and specifies that words must appear in the
same order that you specify them in the query. The context can be:
v A positive number. For example, (a b c) WITHIN 5 matches documents
that contain the three specified words or matching forms of the words,
in any order, within 5 words of each other (that is, up to two words
between them ).
The query ("a" "b" "c") WITHIN 5 INORDER means that the three words
must appear in the same order, and in their exact form, within five
words of each other. No lemmatization is performed for the terms a, b,
or c.
v WITHIN SENTENCE means that the terms must appear in the same
sentence. Lemmatization does not occur if the terms are specified in
quotation marks.
The WITHIN context requires all terms to appear in the same field. For
example, all terms must appear in the subject field or in the body field. In
addition, the terms must appear in the same document part. For example,
a match does not occur across the body of a document and an attachment.
Sample proximity queries:
( x y z ) WITHIN 5
("x" y z ) WITHIN SENTENCE
( x "y z") WITHIN SENTENCE
subject:(world star) WITHIN SENTENCE
(lemmatization is done of world and star, in any order)
("Hello" "World") WITHIN SENTENCE INORDER
(no lemmatization and order is maintained)
(terms) ANY number
Use the ANY keyword to find documents that contain a certain number of
the specified query terms.
Query: (x y z) ANY 2
Result: This query returns documents that contain at least two of the
specified query terms.
site:text
If you search a collection that contains web content, use the site keyword
to search a specific domain. For example, you can return all pages from a
particular website.
Do not include the prefix http:// in a site query.
Query: +laptop site:www.ibm.com
Result: This query finds all documents on the www.ibm.com domain that
contain the word laptop.
url:text
If you search a collection that contains web content, use the url keyword
to find documents that contain specific words anywhere in the URL.
Query: url:support
Result: This query finds documents that have a URL with the word
support, such as http://www.ibm.com/support/fr/.
Query: url:support url:fr
Result: This query finds documents that have a URL with the words
support and fr in any order.
Query: url:support&fr
Result: This query finds documents that have a URL with the phrase
support fr. This query is similar to using double quotation marks to
search for an exact phrase.
link:text
If you search a collection that contains web content, use the link keyword
to find documents that contain at least one hypertext link to a specific web
page.
Query: link:http://www.ibm.com/us
Result: This query finds all documents that include one or more links to
the page http://www.ibm.com/us .
field:text
If the documents in a collection include fields (or columns), and the
collection administrator made those fields searchable by field name, you
can query specific fields in the collection.
Query: lastname:smith div:software
Result: This query returns all documents about employees with the last
name Smith (lastname:smith) who work for the Software division
(div:software).
file:///myfileserver1.com/db2/sales/ sale
file:///myfileserver1.com/websphere/sales/ sale
file:///myfileserver2.com/db2/sales/ sale
file:///myfileserver2.com/websphere/sales/ sale
In this example, all the URIs with the prefix http://mycompany.server1.com/hr/ or
http://mycompany.server2.com/hr/ or http://mycompany.server3.com/hr/ belong to one
group: hr. All URIs with the prefix http://mycompany.server1.com/finance/ belong to
another group: finance. And all the URIs with the prefix
file:///myfileserver1.com/db2/sales/ or file:///myfileserver1.com/websphere/sales/
or file:///myfileserver2.com/db2/sales/ or file:///myfileserver2.com/websphere/sales/
belong to yet another group: sale. If
file:///myfileserver2.com/websphere/sales/mypath/mydoc.txt is a URI in the
collection, a query with the following search term will restrict the search to the
URIs in the sale group:
samegroupas:file:///myfileserver2.com/websphere/sales/mypath/mydoc.txt
All results for this query will have one of the following prefixes:
file:///myfileserver1.com/db2/sales/
file:///myfileserver1.com/websphere/sales/
file:///myfileserver2.com/db2/sales/
file:///myfileserver2.com/websphere/sales/
Query: samegroupas:http://www.ibm.com/solutions/us/
Result: This query finds all documents with URIs, in this case URLs, that
belong to the same group as http://www.ibm.com/solutions/us/.
facetName::/facet_name_1/.../facet_name_n
In a content analytics collection, you can search for documents that contain
a specific facet.
Query: facetName::/"Part of Speech"/Noun/"General Noun"
Result: This query finds documents that include the facet General Noun in
a content analytics collection.
facetValue::/facet_name_1/.../facet_name_n/value
In a content analytics collection, you can search for documents that contain
a specific facet value.
Query: facetValue::/"Part of Speech"/Noun/"General Noun"/Car
Result: This query finds documents that include the value Car of the facet
General Noun in a content analytics collection.
date::/facet_name/time_scale/value
In a content analytics collection, you can search for documents that contain
a specific date facet value.
Query: date::/date/Year/2010
Result: This query finds documents that include the value 2010 for the
year time scale of the default date facet in a content analytics collection.
Query: date::/modifieddate/Month/200905
Result: This query finds documents that include the value 200905 for the
month time scale of the modifieddate date facet in a content analytics
collection.
facet::/facet_name/value_level_1/.../value_level_n
In an enterprise search collection, you can search for documents that
contain a specific facet or facet value. For facets with multiple value levels,
such as hierarchical and date facets, you can search for multiple-level facet
values.
Query: facet::/country/Japan
Result: This query finds documents that include the facet country with the
facet value Japan in an enterprise search collection.
Query: facet::/date/2009/1/15 facet::/location/US/California
Result: This query finds documents that include the facet date with the
multiple-level facet values 2009, 1, and 15, and the facet location with the
multiple-level facet values US and California.
flag::/flag_name
If an administrator configured document flags for the collection, you can
use the flag prefix to search for documents that are assigned a particular
flag.
Query: flag::/"Important"
Result: This query finds documents that are flagged as Important.
scope::/scope_name
If an administrator configured scopes for the collection, you can use the
scope prefix to search for documents that are in a particular scope.
Query: scope::/TechSupport
Result: This query finds documents that are in the TechSupport scope.
rulebased::category_ID
Use the rulebased keyword to find documents that belong to a specific
rule-based category.
Sample category tree:
#field::=value
Use parametric constraint syntax to find documents that have a numeric
field with a value equal to the specified number.
Query: #price::=1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value equal to 1700.
#field::>value
Use parametric constraint syntax to find documents that have a numeric
field with a value greater than the specified number.
Query: #price::>1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than 1700.
#field::<value
Use parametric constraint syntax to find documents that have a numeric
field with a value less than the specified number.
Query: #price::<1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value less than 1700.
#field::>=value
Use parametric constraint syntax to find documents that have a numeric
field with a value greater than or equal to the specified number.
Query: #price::>=1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than or equal to 1700.
#field::<=value
Use parametric constraint syntax to find documents that have a numeric
field with a value less than or equal to the specified number.
Query: #price::<=1700 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value less than or equal to 1700.
#field::>value1<value2
Use parametric constraint syntax to find documents that have a numeric
field with a value that falls between a range of specified numbers.
Query: #price::>1700<3900 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than 1700 and less than 3900.
#field::>=value1<=value2
Use parametric constraint syntax to find documents that have a numeric
field with a value that matches or falls between two specified numbers.
Query: #price::>=1700<=3900 laptop
Result: This query finds documents that contain the term laptop and a
price field with a value greater than or equal to 1700 and less than or
equal to 3900.
You can create query syntax for two types of opaque terms. An opaque term is one
that is expressed and handled by another query language, such as the XML query
languages XML Fragment and XPath. XML Fragment can also be used to query
UIMA structures. An opaque term is introduced by @xmlf2:: (XML fragment) or
@xmlxp:: (XPath query). The XML fragment or the XPath query is enclosed in
single quotation marks (' ').
The expression xmlf2 is used for XML fragments, and xmlxp is used for XPath
terms. An opaque term has the following syntax: @syntax_name::'value'. The
expression starts with the @ sign, followed by the syntax name (xmlf2 or xmlxp),
two colons (::), and a value that is enclosed in single quotation marks (' '). The
value parameter is sometimes preceded by -, +, or ^. If you need to use a single
quotation mark in the value section of the expression, escape the single quotation
mark by using a backslash (\), for example, \'.
For negative terms, use a minus sign (-) before the @ symbol, for example,
-@xmlf2::'<person>michelle</person>'. However, Watson Content Analytics does
not accept negative unique query terms. The query -@xmlf2::'<person>michelle</
person>' does not return results. To get results, use one positive term in the query,
for example, documentation -@xmlf2::'<person>michelle</person>'.
In an XML fragment query, specify term modifiers inside an XML element. For
example:
@xmlf2::'<Element>IBM +computers</Element>'
@xmlf2::'<Element>IBM =computers</Element>'
@xmlf2::'<Element>IBM computers~</Element>'
In an XPath query, use the contains operator instead of the ftcontains operator to
restrict search results by the occurrence of a word. For example:
@xmlxp::'personarecord[country contains("Germany") or title contains("IBM")]'
@xmlf2::'<tag1> text1 </tag1>'
Use the @xmlf2:: prefix and enclose the query in single quotation marks to
indicate a fragment query as a new search and index API opaque term.
Query: @xmlf2::'<title>"Data Structures"</title>'
Result: This query finds documents that contain the phrase Data Structures
within the span of an indexed annotation called title.
@xmlf2::<tag1><.depth value="$number"><tag2> ... </tag2></.depth></tag1>
@xmlf2::<tag1><.depth value='$number'><tag2> ... </tag2></.depth></tag1>
The first query uses double quotation marks. The second query uses single
quotation marks. However, each query returns the same results. This query
syntax looks for occurrences of tag2 exactly $number levels under tag1.
$number is a positive integer. You can use single quotation marks (' ') or
double quotation marks (" ") around the numerical value. This query
syntax is not applicable to Unstructured Information Management
Architecture (UIMA).
Query: (This query should appear on one line.)
@xmlf2::'<author>Albert Camus<.depth value='1'>
<publisher>Carey Press</publisher></.depth></author>'
Result: This query finds documents in which the publisher element occurs exactly
one level under the author element. A document with the following XML elements
<author>Albert Camus
<ISBN>002-12345</ISBN>
<country>USA
<publisher>Carey Press</publisher>
</country>
</author>
will not be returned with the example query because the publisher
(<publisher>) element occurs two levels under the author (<author>)
element.
Result: This query finds documents from Reuters about events in Pakistan
in March that are contained in the concatenated annotation formed by the
“Report” and “HoldsDuring” annotations.
@xmlf2::'<annotation1*annotation2> ... </annotation1*annotation2>'
You can express the intersection of annotations in a fragment query using
the asterisk sign (*) between the start and end tags of an element. The
intersection of two or more overlapping annotations is a new virtual
annotation that spans just the text that is covered by the intersection of the
overlapping annotations.
Query: @xmlf2::'<Inhibits*Activates>Aspirin</Inhibits*Activates>'
Result: This query finds documents in which Aspirin occurs in both the
'Inhibits' and 'Activates' annotations.
@xmlxp::'/tag1/@tag1'
You can distinguish between elements (XML start and end tags) and
attributes. Attributes are written explicitly with a leading @ sign. The @
sign enables you to distinguish between elements and attributes that might
have the same name. Concatenations and intersections are applicable only
to UIMA documents, and not to pure XML documents, where spans do
not cross over by definition.
Query: @xmlxp::'/author[@country="USA"]'
Result: This query finds documents in which USA is included in the
character string that is the value of the attribute country that is associated
with author.
@xmlxp::'/tag1[tag2 or tag3 and tag4]'
Use full Boolean expressions to express AND and OR scope in an XPath query.
Query: @xmlxp::'book[author ftcontains("Jose Perez") or title
ftcontains("XML -Microsoft")]'
Result: This query finds documents that specify a book whose author is
Jose Perez or where the title of the book includes the word XML, but not
Microsoft.
@xmlxp::'tag1//tag2/tag3'
You can distinguish between descendant nodes (//) and child nodes (/).
Query: @xmlxp::'/books//book/name'
Result: This query finds documents that specify a book element as a
descendant of a books element and that specify a name element as a direct
child of the book.
@xmlxp::'tag1/.../tagn'
Use the @xmlxp:: prefix and enclose the query in single quotation marks to
indicate an XPath query as a search and index API opaque term.
Query: @xmlxp::'books[booktitle ftcontains("Data Structures")]'
Result: This query finds documents that contain the phrase "Data
Structures" within the span of an indexed annotation called "booktitle."
Related reference:
“Controlling query behavior” on page 17
Appearance modifiers
Prematch type modifiers appear just before the word that they modify:
PreMatch_Type = { = | ~ }
v = denotes that the word should be matched as is, that is, it should not be
stemmed or lemmatized, and that the search should not be expanded to include
synonyms of the word
v ~ denotes that the search should be expanded to include synonyms of the word
Postmatch modifiers appear directly after the word that they modify:
PostMatch_Type = { * | ~ }
v * matches words having the indicated prefix
v ~ matches words that share the same base form, for example, stem or lemma
with this word
By default, words that are explicitly modified by an appearance modifier but not
by a match type use exact-match (“as is”) semantics.
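For example (the terms are illustrative): =computers matches only the exact form
computers and is not expanded to stems, lemmas, or synonyms; ~computers
expands the search to synonyms of computers; computers~ also matches words
that share the same base form, such as computer; and comput* matches any word
that begins with comput.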
Fielded search notation
OR terms
Semantically, at least one of the OR-ed terms must appear in documents that
qualify as search results.
ORable_term = Query_term \ { OrTerm, OpaqueTerm }
OR-SIGN = "|" | OR
The parametric field must be greater than (or equal to in the second case) the
double value:
Grelation = > double_value | >= double_value
The parametric field must be less than (or equal to, in the second case) the double
value:
Lrelation = < double_value | <= double_value
The # character is followed by the field name, two colons, and at least one relation
(or =). The Appearance_Modifier can be either + or ^ (- is not allowed). If no
Appearance_Modifier is given, a + is implicitly assumed:
RangeConstraint = Appearance_Modifier?# field :: Grelation Lrelation? |
Appearance_Modifier?# field :: Grelation? Lrelation |
Appearance_Modifier?# field :: =double_value
Opaque terms
An @ sign is followed by some syntax name, two colons, and a value enclosed in
single quotation marks. The opaque term can be preceded by an appearance
modifier. If a single quote is needed in the value part, it should be escaped by \,
as in \':
OpaqueTerm = Appearance_Modifier?@ syntax_name :: ' value '
For the semantics of opaque terms, the search and index APIs:
v Do not attempt to parse the value inside the single quotation marks; rather, that
string will be passed as-is to a parser that corresponds to the syntax_name.
v Do not define which external query languages should be supported by
implementations.
v Do not define how many opaque terms can exist inside a query, and how they
interact with the rest of the terms. All this is implementation defined. It is
assumed that in most cases, a query either consists solely of an opaque term, or
does not contain such terms at all.
v An exception to the previous rule is that a '(' can begin a token inside an
OrTerm because those cannot be nested and so '(' has no special meaning there.
v The characters + - ^ have special meaning only if they are preceded by a space,
by one of = ~ (, or at the beginning of the query string.
v The colon has meaning only as a separator between a field/constraint-type and
a value. The colon is considered a regular character in all other cases.
v The character ) has special meaning only inside an OrTerm, but outside of a
phrase inside the OrTerm. There, it will terminate the OrTerm. In all other cases,
it is considered a regular character.
v The character * has special meaning only for values; that is, wildcard characters
are not applied to field names.
v The sequences <, <=, >, >= have special meaning only within a range constraint.
v All special characters except " are considered regular characters inside a phrase:
they lose their special functions inside phrases. The " ends the phrase. This rule
trumps all previous rules.
v Wildcard characters are allowed inside phrases.
The behavior of the query parser is undefined for nonconforming strings. In some
cases, the parser implicitly overcomes problems, such as ending phrases that are
not terminated, and in some cases it does not overcome such problems.
The syntax of ACL expressions is a subset of the full query syntax. Basically, it
consists of words, OR expressions over several words, and opaque terms.
The semantic disclaimers that were specified with respect to opaque terms in query
strings apply here.
The behavior of the ACL expression parser is undefined for nonconforming strings.
In some cases, the parser implicitly overcomes problems, such as ending OR-terms
that are not terminated, and in some cases it does not overcome such problems.
Real-time NLP API
The real-time natural language processing (NLP) API allows users to perform
ad-hoc text analytics on documents.
Real-time text analysis uses the existing text analytics resources that are defined for
a collection, but analyzes documents without adding them to the index. Users can
immediately check the analysis results without waiting for the index to be built or
updated.
Requirements
The following system setup is required to use the real-time NLP API:
v Real-time NLP requires a content analytics collection that hosts text analytics
resources. The collection must not be enabled to use IBM InfoSphere BigInsights.
v Administrators configure the collection for real-time NLP by configuring the
facet tree, dictionaries, and patterns for text extraction, just as they would for
typical content analytics collections. The result of real-time NLP reflects the
configuration of that collection.
v The parse and index sessions for the collection must be running because these
sessions provide the document processing engine for the real-time NLP API.
v Search sessions for the collection must be running because these sessions serve
as the gateway for the real-time NLP API.
Typical usage
The following steps summarize the typical workflow for using real-time NLP:
v A dictionary developer creates a content analytics collection with dictionaries for
testing results, and uses the real-time NLP API to examine how the dictionaries
attach facets for various input documents.
v A workflow system uses real-time NLP to determine how to process documents
based on the facets attached to the documents.
v An alert system constantly processes input documents, such as chat logs or news
feeds, and sends email to managers immediately if a particular facet is attached
to an input document.
A call of the real-time NLP API might require additional time if the call needs to
initialize a document processor. Document processors are initialized when parse
and index or document processors are started, or analytic resources are deployed.
Document processors are also initialized after the parse and index configuration is
changed. Real-time NLP API requests and normal document processing, such as
building the index, share the resources of the document processors. Therefore,
index creation might affect real-time NLP performance.
Similarly, real-time NLP API requests might affect the performance of the index
creation.
Both SIAPI and REST API versions of the real-time NLP API are provided. The
NLP REST API accepts both text and binary content, but the SIAPI version only
accepts content in text format.
The real-time NLP API is also supported with enterprise search collections for
advanced users.
When the application issues remote search and index API requests that must be
secure, you must set the user name and password on the Service classes with a
valid user name that is stored in the user registry that is used for authentication.
Any requests that do not contain valid user names and passwords are rejected.
In an enterprise search application, the Properties object is passed in the call to the
getSearchService method or getBrowseService method. The Properties object
specifies property names called username and password.
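For illustration only, the following fragment shows one way to pass the
credentials. The property names username and password are the ones described
above; the factory variable, its exact type, and the literal credential values are
placeholders that depend on how your application obtains its search and index
API factory.
import java.util.Properties;

Properties serviceProperties = new Properties();
serviceProperties.setProperty("username", "searchuser"); // a user ID that exists in the user registry
serviceProperties.setProperty("password", "secret");     // that user's password

// "factory" stands for the search and index API factory object that your
// application obtained elsewhere.
SearchService searchService = factory.getSearchService(serviceProperties);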
Applications and collections must have IDs. For applications that need to access
specific collections, the collection ID must be associated with the application ID.
You can specify which collections the application can access in the administration
console.
Related reference:
Programming guidance for developing secure search applications with Java
API
Document-level security
To support prefiltering and post-filtering of search results, the search request must
provide a user's security context by using the setACLConstraints method on the
Query object.
You can create the user's security context XML string in two ways:
v By using the identity management API to programmatically create the XML
string.
v By using Java String classes to create the XML string.
Use this method only if you cannot build applications with the identity
management API.
Related reference:
Programming guidance for developing secure search applications with Java
API
With the identity management Java APIs, you can create an application to manage
the security credentials of your users. The following graphic shows how users log
in to a system such as WebSphere Portal and authenticate with the registry.
To run the Java sample program, make sure that you have the following JAR files
in your class path:
v esapi.jar
v siapi.jar
v es.security.jar
v es.oss.jar
To run the sample program, enter the following command on a single command
line.
Windows
java -classpath %ES_INSTALL_ROOT%\lib\esapi.jar;%ES_INSTALL_ROOT%\lib\
siapi.jar;%ES_INSTALL_ROOT%\lib\es.security.jar;.
IdentityManagementExample
AIX® or Linux
java -classpath $ES_INSTALL_ROOT/lib/esapi.jar:$ES_INSTALL_ROOT/lib/
siapi.jar:$ES_INSTALL_ROOT/lib/es.security.jar:.
IdentityManagementExample
Related reference:
Programming guidance for developing secure search applications with Java
API
To create the USC XML string for a particular user, first instantiate a
SecurityContext object. The SecurityContext object contains a user name, an array
of Identity objects, and optionally a Single Sign-On (SSO) token. The user name
that is assigned to the SecurityContext is typically the value that the user specified
to log in to your application.
After you create a SecurityContext object, you create an array of Identity objects.
Each Identity object contains a user name and a password, a String array of
group tokens, a source type, and a domain identifier. If the SecurityContext object
contains an SSO token, then the user name is required but the password is
optional. For example:
SecurityContext context = new SecurityContext();
context.setUserID("uid=wpsadmin,o=default organization");
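// ... create the array of Identity objects and set the user name, password,
// group tokens, source type, and domain identifier for each one (that part
// of the example is not shown here) ...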
identities[0].setGroups(groups);
identities[0].setProperties(new Properties());
context.setIdentities(identities);
After you create the context, you can easily set the ACL constraints in the query by
calling the context.serialize(true) method. The Boolean parameter indicates that
the XML string values should be Base64 encoded to ensure proper transmission to
the search server. For example:
q.setACLConstraints("@SecurityContext::'" + context.serialize
(true) + "'");
Related reference:
Programming guidance for developing secure search applications with Java
API
You can apply business and security rules to enforce document-level security and
add, update, or delete the crawled metadata and document content that is
associated with documents in an index. The data source crawler plug-in APIs
cannot be used with the web crawler.
You can also create a plug-in that extracts entries from archive files. The extracted
files can then be parsed individually and included in collections.
Restriction: The following type B data source crawlers do not support plug-ins to
extract or fetch documents from archive files:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler
You can add fields to the HTTP request header that is sent to the origin server to
request a document. You can also view the content, security tokens, and metadata
of a document after the document is downloaded. You can add to, delete from, or
replace any of these fields, or stop the document from being parsed.
Web crawler plug-ins support two kinds of filtering: prefetch and postparse. You
can specify only a single Java class to be the web crawler plug-in, but because the
prefetch and postparse plug-in behaviors are defined in two separate Java
interfaces and because Java classes can implement any number of interfaces, the
web crawler plug-in class can implement either or both behaviors.
For detailed information about each plug-in API, see the Javadoc documentation in
the following directory: ES_INSTALL_ROOT/docs/api/.
Related concepts:
“Crawler plug-ins for non-web sources”
“Web crawler plug-ins” on page 75
“API documentation” on page 5
Related tasks:
“Creating and deploying a plug-in for archive files” on page 71
Related reference:
“Sample plug-in application for non-web crawlers” on page 121
With the crawler plug-in for data source crawlers, you can add, change, or delete
crawled content or metadata. You can also create a plug-in for extracting files from
archive files and extend that plug-in to enable users to view the extracted content
when they view the search results.
Restriction: The following type B data source crawlers do not support plug-ins to
extract or fetch documents from archive files:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler
When you specify the Java class as the new crawler plug-in, the crawler calls the
class for each document that it crawls.
For each document, the crawler passes to your Java classes the document identifier,
the security tokens, the metadata, and the content that was specified by an
administrator. Your Java class can return a new or modified set of security,
metadata, and content.
Restriction: The crawler plug-in allows you to add security tokens, but it does not
allow you to access the native access control lists (ACLs) that are collected by the
crawlers that are provided with Watson Content Analytics.
Related concepts:
“Crawler plug-ins” on page 65
Tip: For information about creating a crawler plug-in for the following type B data
sources, see “Creating a crawler plug-in for type B data sources” on page 69:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler
Procedure
To create a Java class for use as a crawler plug-in with content-related functions for
type A data sources:
1. Extend com.ibm.es.crawler.plugin.AbstractCrawlerPlugin and implement the
following methods:
init()
isMetadataUsed()
isContentUsed()
activate()
deactivate()
term()
updateDocument()
The AbstractCrawlerPlugin class is an abstract class. The init, activate,
deactivate, and term methods are implemented to do nothing. The
isMetadataUsed method and isContentUsed method are implemented to return
false by default. The updateDocument method is an abstract method, so you
must implement it. A minimal skeleton is shown after this procedure.
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
2. Compile the implemented code and make a JAR file for it. Add the
ES_INSTALL_ROOT/lib/dscrawler.jar file to the class path when you compile.
3. In the administration console, follow these steps:
a. Edit the appropriate collection.
b. Select the Crawl page and edit the crawler properties for the crawler that
will use the custom Java class.
c. Specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.ibm.plugins.MyPlugin. When you specify the class name, ensure that
you do not specify the file extension, such as .class or .java.
v The fully qualified class path for the JAR file and the directory in which
all files that are required by the Java class are located. Ensure that you
include the name of the JAR file in your path declaration, for example,
C:\plugins\Plugins.jar. If you need to specify multiple JAR files, ensure
that you use the correct separator depending on your platform, as shown
in the following examples:
– AIX or Linux: /home/esadmin/plugins/Plugins.jar:/home/esadmin/
plugins/3rdparty.jar
– Windows: C:\plugins\Plugins.jar;C:\plugins\3rdparty.jar
4. On the Crawl page, click Monitor. Then, click Stop and Start to restart the
session for the crawler that you edited. Click Details and start a full crawl.
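The following minimal skeleton illustrates step 1. It is a sketch only: the class and
package names are placeholders, and the exact method signatures, in particular the
argument and return type of the updateDocument method (shown here as an
assumed CrawledData object), should be verified against the Javadoc
documentation in the ES_INSTALL_ROOT/docs/api/ directory.
import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData; // assumed type of a crawled document

public class MyPlugin extends AbstractCrawlerPlugin {
    public void init() {
        // Acquire any resources that the plug-in needs. The default
        // implementation does nothing.
    }
    public boolean isMetadataUsed() {
        return true;   // ask the crawler to pass metadata to updateDocument
    }
    public boolean isContentUsed() {
        return false;  // this sketch does not read the document content
    }
    public void activate() {
        // Called when crawling starts.
    }
    public void deactivate() {
        // Called when crawling finishes.
    }
    public void term() {
        // Release any resources that were acquired in init().
    }
    public CrawledData updateDocument(CrawledData crawledData) {
        // Add, change, or delete security tokens, metadata, or content here,
        // and return the (possibly modified) document.
        return crawledData;
    }
}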
Results
If the crawler stops when it is loading the plug-in, view the log file and verify that:
v The class name and class path that you specified in the crawler properties page
are correct.
v All necessary libraries are specified for the plug-in class path.
v The crawler plug-in does not throw a CrawlerPluginException error.
Metadata field definitions: If you want to add a new metadata field in your
crawler plug-in, you must create an index field and add the metadata field to the
collection by configuring parsing and indexing options in the administration
console. Ensure that the name of the metadata field is the same as the name of the
index field.
The following methods in the FieldMetadata class are deprecated. These field
characteristics are overwritten by field definitions in the parser configuration:
public void setSearchable(boolean b)
public void setFieldSearchable(boolean b)
public void setParametricSearchable(boolean b)
public void setAsMetadata(boolean b)
public void setResolveConflict(String string)
public void setContent(boolean b)
public void setExactMatch(boolean b)
public void setSortable(boolean b)
Using PluginLogger to log messages: The PluginLogger is a class that you can
use to include log statements from the plug-in in the Watson Content Analytics log
files. To use the PluginLogger, specify the following statement in the import
statements:
import com.ibm.es.crawler.plugin.logging.PluginLogger;
Add the following statements after the start of the class declaration:
With the default collection settings, these statements cause warning and error
messages to be shown in the collection log file. For example:
W FFQD2801W 2013/04/27 23:02:05.619 CDT plugin plugin.WIN_50605.crawlerplugin
FFQD2801W A warning was generated from the crawler plug-in.
Message: This is a warning message.
E FFQD2800E 2013/04/27 23:02:05.681 CDT plugin plugin.WIN_50605.crawlerplugin
FFQD2800E An error was generated from the crawler plug-in.
Message: This is an error message.
To show informational messages in the collection log file, open the administration
console. Select the collection, click Actions > Logging > Configure log file
options, and then select All messages for the type of information to log and trace.
After you stop and restart the crawler session, informational messages appear in
the collection log file.
Related tasks:
Configuring search fields
Related reference:
“Sample plug-in application for non-web crawlers” on page 121
Unlike type A data source crawler plug-ins, the type B data source crawler plug-in
process is not forked. The plug-in always runs in the same process of the crawler.
When the crawler session starts, a CrawlerPlugin object is instantiated with the
default constructor. During the crawler session, the activate method is called
when the crawler starts its crawling and the deactivate method is called when the
crawler finishes its crawling. When the crawler session ends, the object is
destroyed. If the crawler scheduler is enabled, the activate method is called when
the crawling is scheduled to start and the deactivate method is called when the
crawling is scheduled to end. Because a single crawler session runs continuously
when the crawler scheduler is enabled, the object is not destroyed.
Procedure
To create a Java class for use as a crawler plug-in for type B data sources:
1. Extend com.ibm.ilel.crawler.plugin.CrawlerPlugin and implement the
following methods:
activate()
deactivate()
updateDocument()
The CrawlerPlugin class is an abstract class. The activate and deactivate
methods are implemented to do nothing. The updateDocument method is an
abstract method, so you must implement it. A minimal skeleton is shown after this
procedure.
Deprecated methods: The init and term methods in the CrawlerPlugin class
are deprecated. For compatibility purposes, the init method is called at the
same time as the activate method when the crawler starts its crawling and the
term method is called at the same time as the deactivate method when the
crawler stops its crawling. Do not use the init and activate methods in the
same plugin. Similarly, do not use the deactivate and term methods in the
same plugin.
For name resolution, use one of the following JAR files:
v AIX or Linux: $ES_INSTALL_ROOT/lib/ilel-crawler.jar
v Windows: %ES_INSTALL_ROOT%\lib\ilel-crawler.jar
2. Compile the implemented code and create a JAR file for it. Add the
ilel-crawler.jar file to the class path when you compile.
3. In the administration console, follow these steps:
a. Edit the appropriate collection.
b. Select the Crawl page and edit the crawler properties for the crawler that
will use the custom Java class.
c. Specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.ibm.plugins.MyPlugin. When you specify the class name, ensure that
you do not specify the file extension, such as .class or .java.
v The fully qualified class path for the JAR file and the directory in which
all files that are required by the Java class are located. Ensure that you
include the name of the JAR file in your path declaration, for example,
C:\plugins\Plugins.jar. If you need to specify multiple JAR files, ensure
that you use the correct separator depending on your platform, as shown
in the following examples:
– AIX or Linux: /home/esadmin/plugins/Plugins.jar:/home/esadmin/
plugins/3rdparty.jar
– Windows: C:\plugins\Plugins.jar;C:\plugins\3rdparty.jar
4. On the Crawl page, click Monitor. Then, click Stop and Start to restart the
session for the crawler that you edited. Click Details and start a full crawl.
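The following minimal skeleton illustrates step 1 for a type B data source crawler
plug-in. As in the type A example, the class name is a placeholder and the exact
signature of the updateDocument method (shown here with an assumed
CrawledData argument and return type) should be verified against the Javadoc
documentation in the ES_INSTALL_ROOT/docs/api/ directory.
import com.ibm.ilel.crawler.plugin.CrawlerPlugin;
// CrawledData is an assumed document type; check the ilel-crawler.jar Javadoc
// for the actual argument and return type of updateDocument.

public class MyTypeBPlugin extends CrawlerPlugin {
    public void activate() {
        // Called when the crawler starts crawling; acquire resources here.
    }
    public void deactivate() {
        // Called when the crawler finishes crawling; release resources here.
    }
    public CrawledData updateDocument(CrawledData crawledData) {
        // Add, change, or delete crawled content or metadata, then return
        // the document.
        return crawledData;
    }
}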
Results
If the crawler stops when it is loading the plug-in, view the log file and verify that:
v The class name and class path that you specified in the crawler properties page
are correct.
v All necessary libraries are specified for the plug-in class path.
v The crawler plug-in does not throw a CrawlerPluginException error.
Metadata field definitions: If you want to add a new metadata field in your
crawler plug-in, you must create an index field and add the metadata field to the
collection by configuring parsing and indexing options in the administration
console. Ensure that the name of the metadata field is the same as the name of the
index field.
Ensure that the correct version of Java is installed. The crawler plug-in for archive
files must be compiled with the IBM Software Development Kit (SDK) for Java
Version 1.6.
Restriction: You cannot use this plug-in with the following type B data source
crawlers:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler
Type A data source crawlers provide a plug-in interface that enables you to extend
their crawling capabilities and crawl archive files in Watson Content Analytics. The
crawler uses the specified crawler plug-in for archive files to extract archive entries
from an archive file and send the extracted archive entries to the parsers.
To use this capability, you must develop a crawler plug-in for archive files that
implements the com.ibm.es.crawler.plugin.archive.ArchiveFile interface and register
the plug-in in the crawler configuration file.
Important: To enable users to fetch and view files that are extracted from an
archive file when they view search results, you must extend your archive plug-in
to view extracted files.
Procedure
/**
* Close this archive file.
*/
public void close() throws IOException;
/**
* Reads the next archive entry and positions stream at the beginning of
* the entry data.
*
* @param charset the name of charset
* @return the next entry
*/
public ArchiveEntry getNextEntry(String charset) throws IOException;
/**
* Returns an input stream of the current archive entry.
*
* @return the input stream
*/
public InputStream getInputStream() throws IOException;
}
/**
* Returns the modify time of this entry.
*
* @return the modify time of this entry
*/
public long getTime();
/**
* Returns the length of file in bytes.
*
* @return the length of file in bytes
*/
public long getSize();
/**
* Tests whether the entry is a directory.
*
* @return true if the entry is a directory
*/
public boolean isDirectory();
}
c. Compile the implemented code and create a JAR file for it. Add the
dscrawler.jar file to the class path when you compile. The crawler plug-in
for archive files must be compiled with the IBM Software Development Kit
(SDK) for Java Version 1.6.
2. Verify the crawler plug-in with the
com.ibm.es.crawler.plugin.archive.ArchiveFileTester class. Add the
dscrawler.jar file and your plug-in code to the class path when you run this
Java application.
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">archive_file_extension</AppendChild>
</ExtendedProperties>
where:
archive_file_type
Specifies the type of the archive files.
plugin_classname
Specifies the fully qualified class name of your crawler plug-in for
archive files.
path_to_required_jars
Specifies the class path, delimited by the path separator, that are
required to run your crawler plug-in for archive files.
archive_file_extension
Specifies the file extension of the archive files that you want to process
with your crawler plug-in for archive files.
d. Restart the crawler that you stopped.
Example
Here is a sample crawler configuration for enabling the crawler plug-in for LZH
archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">lzh</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">.lzh</AppendChild>
</ExtendedProperties>
Related concepts:
“Crawler plug-ins” on page 65
“API documentation” on page 5
Watson Content Analytics provides Java APIs for implementing a crawler plug-in
that extracts archive entries from archive files that are crawled by type A data
source crawlers. The fetch capabilities, however, do not allow users to view the
extracted files. You can extend the archive plug-in so that users can fetch and view
the extracted files.
Restriction: You cannot use this plug-in with the following type B data source
crawlers:
v Agent for Windows file systems crawler
v BoardReader crawler
v Case Manager crawler
v Exchange Server crawler
v FileNet P8 crawler
v SharePoint crawler
where:
type
Specifies the identifier of the archive document type, such as .rar or .lzh. You
can also choose your own type.
classpath
Specifies the list of paths for the class path that is required to run your archive
plug-in. Separate the paths by a semicolon (;) on Windows or a colon (:) on
AIX or Linux.
classname
Specifies the class name of your archive plug-in.
extension
Specifies the file extension. Your archive plug-in is invoked for the files that
match this extension.
es.ext.dirs.rar=C:\\rarplugin;C:\\rarplugin\\rarplugin.jar;
archive.plugin.rar=RarFile;.rar
Related concepts:
“API documentation” on page 5
With the prefetch plug-in, you can use Java APIs to add fields to the HTTP request
header that is sent to the origin server to request a document.
With the postparse plug-in, you can use Java APIs to view the content, security
tokens, and metadata of a document before the document is parsed and tokenized.
You can add to, delete from, or replace any of these fields, or stop the document
from being sent to the parser.
If your plug-in requires Java classes or non-Java libraries or other files besides the
plug-in, you must write the plug-in to handle that requirement. For example, your
plug-in can invoke a class loader to bring in more Java classes and can also load
libraries, make network connections, make database connections, or do anything
else that it needs.
Plug-ins run as part of the crawler JVM process. Exceptions and errors will be
caught, but crawler performance is affected by plug-in execution. You should write
plug-ins to do the minimum amount of processing and catch all anticipated
exceptions. Plug-in code must be multithread-safe. If you have 200 concurrent
downloads, you might have 200 concurrent calls to your plug-in.
The plug-in is required if you use the web crawler to crawl any sites through
WebSphere Portal, including Workplace Web Content Management sites and
Lotus® Quickr® sites.
Related concepts:
“Crawler plug-ins” on page 65
Procedure
Results
If an error occurs and the web crawler stops while it is loading the plug-in, view
the log file and verify that:
v The class name and class path that you specified on the crawler properties page
is correct.
v All necessary JAR files were specified for plug-in class path.
v The crawler plug-in does not throw CrawlerPluginException or any other
unexpected exception, and no fatal errors occur in the plug-in.
You must write this method to be thread-safe. You can do that by wrapping its
entire contents in a synchronized block, but that approach permits only one thread
to execute the method at a time, which makes the crawler effectively
single-threaded during plug-in operation and creates a performance bottleneck.
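As an illustration of that trade-off, the following sketch contrasts the two
approaches. It assumes the processDocument(PrefetchPluginArg[]) signature that is
used in the prefetch plug-in example that follows; the AtomicLong counter is only
a stand-in for whatever shared state your plug-in keeps.
// Simple but slow: only one crawler thread at a time can run the plug-in.
public synchronized boolean processDocument(PrefetchPluginArg[] args) {
    return true;
}

// Usually better: keep per-document work in local variables and make only the
// shared state thread-safe, so that concurrent downloads are not serialized.
private final java.util.concurrent.atomic.AtomicLong documentsSeen =
        new java.util.concurrent.atomic.AtomicLong();

public boolean processDocument(PrefetchPluginArg[] args) {
    documentsSeen.incrementAndGet(); // thread-safe without holding a lock
    return true;
}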
Prefetch plug-in example
You can use a prefetch plug-in to add a cookie to the HTTP request header before
the document is downloaded.
package com.mycompany.ofpi;
import com.ibm.es.wc.pi.PrefetchPlugin;
import com.ibm.es.wc.pi.PrefetchPluginArg;
import com.ibm.es.wc.pi.PrefetchPluginArg1;
public class MyPrefetchPlugin implements PrefetchPlugin {
public boolean init() { return true; }
public boolean release() { return true; }
public boolean processDocument(PrefetchPluginArg[] args) {
PrefetchPluginArg1 arg = (PrefetchPluginArg1)args[0];
String header = arg.getHTTPHeader();
header = header.substring(0, header.lastIndexOf("\r\n"));
header += "Cookie: class=TestPrefetchPlugin\r\n\r\n";
arg.setHTTPHeader(header);
return true;
}
}
Requirement: Use the Java JAR utility that is included in the Java Development
Kit (JDK).
2. Copy this JAR file to the computer that runs the web crawler. Enter the
absolute path for the JAR file in the administration console on the crawler
window when you enable the plug-in.
3. In the administration console, specify the following items:
v The fully qualified class name of the implemented Java class, for example,
com.mycompany.ofpi.MyPrefetchPlugin
v The qualified class path for the JAR file
Ensure that the information that you enter is correct. The system does not
check that the JAR file exists.
When the crawler is started and finds a plug-in JAR file and class name, the
crawler loads the JAR and instantiates the class by using the no-argument
constructor. The crawler then initializes the instance by calling the init method.
If that method returns true, the plug-in is added to the list of prefetch plug-ins.
Results
After you run the crawler, the return value is logged in the collection log file as an
informational message. To see informational messages, choose All messages as the
type of information to log.
To create a postparse plug-in, you write a Java class that implements the interface
com.ibm.es.wc.pi.PostparsePlugin, for example:
public class MyPostparsePlugin implements
com.ibm.es.wc.pi.PostparsePlugin {
public MyPostparsePlugin () { ... }
public boolean init() { ... }
public boolean processDocument(PostparsePluginArg[] args) { ... }
public boolean release() { ... }
}
The plug-in class can implement both interfaces, but it needs only one init
method and one release method. If the class does both prefetch and postparse
processing, you need to initialize and release resources for both tasks. Both the
init method and the release method are called once.
The processDocument method is called on the single plug-in instance for every URL
for which a download was attempted. Not all downloads return content. The
HTTP return codes, such as 200, 302, or 404, can be used by your plug-in to
determine what to do when called. If content was obtained and if the content was
suitable for HTML parsing, the content is put through the parser, and the results of
parsing are available when your plug-in is called.
The following example shows how to add security ACLs to the metadata that the
crawler sends with documents that are downloaded from a particular site. You can
use a postparse plug-in to add those ACLs just before the crawler writes the
document to the parser's input buffer:
You can also use a postparse plug-in to add a new metadata field to your crawled
documents. For example, if some of your documents contain a particular facet
value, you might want to add a metadata field called "MyUserSpecificMetadata" to
the search record that contains a string that you need to query when the crawler is
running with various "searchability" attributes. In another example, because the
built-in parsers cannot extract metadata from binary documents, you might want
to add enterprise-specific metadata to binary documents after they are crawled to
ensure that the metadata fields can be searched when users search the collection.
package com.mycompany.ofpi; // Plug-ins
import com.ibm.es.wc.pi.*;
public class MyPostparsePlugin implements PostparsePlugin {
public MyPostparsePlugin() { }
public boolean init() { return true; }
public boolean release() { return true; }
public boolean processDocument(PostparsePluginArg[] args) {
try {
PostparsePluginArg1 arg = (PostparsePluginArg1)args[0];
if (arg.getContent() != null && arg.getContent().length > 0) {
String content = new String( arg.getContent(), arg.getEncoding() );
if (content != null && content.indexOf(keyword) > 0) {
final String userdata = null; // look up string by keyword.
FieldMetadata mf = new FieldMetadata(
"MyUserSpecificMetadata", // field name
userdata, // field value
false, // searchable?
true, // field-searchable?
false, // parametric-searchable?
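// ... the remaining FieldMetadata arguments and the code that adds the new
// field to the document are omitted here ...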
To define a new metadata field, create an instance of the FieldMetadata object and
set its field values.
Creating and deploying a plug-in for post-filtering search
results
You can create a Java class to programmatically apply your own security logic for
post-filtering search results.
When a search for a collection is started, the plug-in is also initialized. An object
that implements the SecurityPostFilterPlugin interface is instantiated with the
default constructor. When the search is stopped, the object is destroyed. Before the
interim search result candidates that are returned by a query are post-filtered by
the plug-in, the init method is invoked. The term method is called after the plug-in
finishes filtering results for a query.
If you want to apply only your custom plug-in for post-filtering the search results
and not use the system-defined post-filtering functions, you can disable the
system-defined function. If any crawler is configured to use the system-defined
post-filtering function, your custom plug-in is applied in addition to the
post-filtering that is done automatically by the system.
Procedure
To create a Java class and deploy a plug-in for post-filtering search results:
1. Create a Java class that implements the
com.ibm.es.security.plugin.SecurityPostFilterPlugin interface and implement the
following methods:
v init()
v term()
v verifyUserAccess()
For name resolution, use the ES_INSTALL_ROOT/lib/trevi.tokenizer.jar file.
2. Compile the implemented code and create a JAR file for it. To deploy the
plug-in, you must provide the plug-in as a JAR file. Add the file
trevi.tokenizer.jar to the class path when you compile.
3. Do the following steps on all search servers.
Results
If you see an error message similar to the following message, and no results are
returned the first time that you submit a search after you configure the plug-in,
then it is possible that your plug-in was not applied successfully:
FFQR0648E A general exception was caught while processing document level security.
Exception text: com.ibm.es.security.plugin.SecurityPostFilterPluginException:
Failed to load plug-in class.
In this case, check the ESSearchServer log file and verify the following conditions:
v The class name and class path that you specified in the ESSearchServer
properties file are correct. To ensure that the plug-in can be found on Windows,
escape characters such as colons (:) and backslashes in the class path by using
the backslash character (\).
v All necessary libraries are specified for the plug-in class path.
v The plug-in does not throw a SecurityPostFilterPluginException exception in the
ESSearchServer log file.
Example
The sample plug-in application for post-filtering search results shows how you can
eliminate documents that users are not authorized to view from the search results.
Related concepts:
“API documentation” on page 5
Related reference:
“Sample plug-in for post-filtering search results” on page 123
Procedure
To create a Java class and deploy a plug-in for exporting documents or deep
inspection results:
1. Create a Java class that extends the
com.ibm.es.oze.api.export.ExportDocumentPublisher abstract class. The
com.ibm.es.oze.api.export.ExportDocumentPublisher class has the following
methods:
v init()
v initPublish()
v publish()
v termPublish()
v term()
The init, initPublish, termPublish, and term methods are implemented to do
nothing. The publish method is an abstract method, so you must implement it.
If you plan to export content from an InfoSphere BigInsights collection and
export directly from Hadoop MapReduce tasks, the plug-in class must have the
annotation com.ibm.es.oze.api.export.ExecuteOnHadoop. The plug-in can
override the abortPublish method that cleans up the output of an aborted
Hadoop task. The abortPublish method is called when a Hadoop task is
aborted and it calls the termPublish method by default.
2. Optional: If you want to control which documents are exported, extend the
com.ibm.es.oze.api.export.ExportDocumentFilter abstract class. The class has the
following method:
v accept()
3. Optional: If you want to export deep inspection results, implement the
following interfaces:
interface: com.ibm.es.oze.api.export.document.InspectionContent
Use this interface to export metadata about the deep inspection request.
package com.ibm.es.oze.api.export.document;
public interface InspectionContent extends Content {
public InspectionRecord[] getInspectionRecords();
}
You can create custom widgets by using the Dojo Toolkit. You must create a
separate plug-in for each custom widget.
Procedure
To create and deploy a plug-in that adds a custom widget in a search or analytics
application:
1. Develop the JavaScript (.js) file for the custom plug-in by using the Dojo
Toolkit. Develop the plug-in as a Dojo widget that extends the
ica/pane/PanePluginBase class. For information about the available functions,
see the MyFirstSearchPane and MyFirstAnalyticsPane sample plug-ins.
2. Add the plug-in file to the ES_NODE_ROOT/master_config/searchapp/icaplugin
directory.
3. Register the widget.
a. Back up and edit the appropriate widgets.json file for the type of
application to which you want to add the custom widget:
v To register a custom widget for a search application, edit the
ES_NODE_ROOT/master_config/searchserver/repo/search/
Application_Name/widgets.json file.
v To register a custom widget for an analytics application, edit the
ES_NODE_ROOT/master_config/searchserver/repo/analytics/
Application_Name/widgets.json file.
Application_Name is the application ID, such as default, social, or advanced.
You can determine the ID by viewing the list of applications in the
application customizer.
b. Add an entry for the widget in the following format:
} ,
"MyCustomAnalyticsPane" : {
"available" : true,
"label" : "My Custom Analytics Pane" ,
"widgetName" : "icaplugin/MyCustomAnalyticsPane" ,
"properties": [
{"value":"test", "name":"defaultQuery", "editable":true, "sync":false,
"type":"TextBox", "label":"Default Query", "widgetOptions":{},
"requried":false}
]
}
The MyCustomAnalyticsPane field is the internal ID of this widget. You can
assign any value that includes alphabetic and numeric characters only.
Tip: Ensure that you include a comma (,) before each entry to conform to
JSON syntax.
4. Restart the user application.
v If you use the embedded web application server, enter the following
commands, where node_ID identifies the search server:
esadmin searchapp.node_ID stop
esadmin searchapp.node_ID start
To determine the node ID for the search server, run the esadmin check
command to view a list of session IDs. Look for the node ID that is listed for
the searchapp session.
v If you use WebSphere Application Server:
a. Enter the following command:
esadmin config sync
b. Stop and restart the user application.
Tip: To test plug-in code without restarting the server, you can add the
plug-ins to the ES_INSTALL_ROOT/wlpapps/servers/searchapp/apps/
commonui.ear/commonui.war/icaplugin directory. After you update the contents
of this directory, clear the browser cache to immediately view the changes in
your application. However, this directory is automatically overwritten when the
server is restarted. When the server is restarted, the ES_NODE_ROOT/
master_config/searchapp/icaplugin directory is copied to the
ES_INSTALL_ROOT/wlpapps/servers/searchapp/apps/commonui.ear/
commonui.war/icaplugin directory.
Creating and deploying a custom global analysis plug-in
You can create plug-ins to use custom logic in addition to the default global
analysis tasks that occur during the indexing process.
Restriction: Custom global analysis is available only for collections that use IBM
InfoSphere BigInsights. Jaql must be installed on the InfoSphere BigInsights server.
Procedure
The inputs for the script are the fields, facets, and text that are extracted from the
content during the document processing stage. Use the readGAInput(GAOptions)
function to get document fields, facets, and text content in JSON format. The
output from the script can be stored as document fields or facets in the Watson
Content Analytics index by using the writeGAOutput(GAOptions) function.
GAOptions is a JSON record that contains the necessary parameters. GAOptions can
be obtained by using the getGAOptions($MetaTrackerJaqlVars) function.
$MetaTrackerJaqlVars is always needed as an argument. To call these functions,
modules with the namespace ica::ga must be imported. The following example
shows a sample custom global analysis Jaql script:
import ica::ga(*);
options:=getGAOptions($MetaTrackerJaqlVars);
readGAInput(options)
-> someOperation()
-> anotherOperation()
-> writeGAOutput(options);
To store the output into the Watson Content Analytics index, pass an array of
JSON records to the first argument of the writeGAOutput() function. Each record
must include a field with the name uri. The specified values of the record are
stored in the index for the document whose URI matches the value of the uri field.
Any other field in the record besides the uri field is stored as an index field or
document-level facet for the document. The index field or facet in which the data
is stored is determined by the field name in the JSON record. For example, the
following array of JSON records adds values for the rank field and ranking facet in
the index for the documents with the URIs jdbc://ICA/APP.CLAIM/ID/0 and
jdbc://ICA/APP.CLAIM/ID/1.
[{"uri":"jdbc://ICA/APP.CLAIM/ID/0","rank":"1","$.ranking":"1"},
{"uri":"jdbc://ICA/APP.CLAIM/ID/1","rank":"2","$.ranking":"2"}
]
Requirement: To store data in fields and facets in the index, you must first create
the fields and facets in the administration console. If a field or facet does not exist,
the value is not added to the index.
For index fields, the value of the JSON record field is stored in a new index field.
For the name of the new index field, the prefix custom_ is added to the name of
the index field. In the previous example, if an index field with the name rank is
defined for the collection, the value is stored in an index field that is named
custom_rank.
For facets, if the collection includes a facet with the same facet path as the JSON
record field name, the value of the JSON record field is stored to that facet. In the
previous example, if there is an existing facet with the facet path $.ranking, the
value 1 is stored in that facet. When you specify the facet path, ensure that the
facet path starts with $ and that each facet path component is concatenated by a
period. For example, the facet path $.ranking corresponds to the root facet with
the name ranking.
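For example, assuming a hypothetical facet tree in which category is a child of the
root facet product, a JSON record field named $.product.category addresses that
child facet:
[{"uri":"jdbc://ICA/APP.CLAIM/ID/0","$.product.category":"electronics"}]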
You can also specify in the Jaql script to save the output in a file or some other
format so that another application can use the data. For example, you can output
the data to a JSON file on the local computer:
readGAInput(options)
-> someOperation()
-> write(file('/home/biadmin/ica_out.json'));
Related concepts:
Custom global analysis
Custom global analysis
Related reference:
“Sample plug-ins for custom global analysis” on page 129
Important: Before you can upload custom analyzers or associate analyzers with
fields in the administration console, you must enable the custom analyzer support.
Procedure
You can use the sample code as a guideline when you do the following application
development activities:
v Create enterprise search or content mining applications
v Run real-time text analytics on documents without adding them to the index
v Create administration applications
v Create plug-ins for crawlers, pre-filtering and post-filtering search results,
exporting documents, and exporting deep inspection results
The following sample scenarios are provided for the Search REST API:
v Search
v Faceted search against single index
For the administration REST APIs, several sample REST API Java programs are
provided. The programs illustrate two different methods of using the REST APIs:
Apache HttpClient or Java API for XML Web Services (JAX-WS). The sample
programs must be compiled; their compiled class files are packaged in the
ES_INSTALL_ROOT/samples/rest/admin/es.admin.rest.jar file.
The following REST API sample programs for HttpClient are provided in the
ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/admin/control/api/samples/
commons directory:
DocumentExample sample program
The DocumentExample class provides an example of how to add or
remove a document in a collection. The sample program allows you to add
a document to a collection and remove a document from the collection.
The type of function depends on the method that you input as a command
argument. This program requires that a collection was already created.
The program builds the HTTP request based on the specified host name,
port, method, collection ID, and other options. It uses the user name and
password to perform authentication. With the HTTP request, the program
obtains the HTTPMethod object that is executed by HttpClient. For the add
and addMultiDocs functions, the program uses the file path of an existing
file and some of the metadata of the document to add the file content as a
document in the collection.
The usage statement is as follows:
DocumentExample -hostname host_name -port port -method method
-username user_name -password password -collectionId collection_ID
additional_parameters
The following example shows how to add the documents to the col_12345
collection:
AddMultiDocsExample -hostname es.ibm.com -port 8390 -username user1
-password mypassword -collectionId col_12345
Field administration sample program
The FieldExample class provides an example of performing the available
methods on search fields within a collection. It allows you to add a search
field, list the fields, map a search field to a crawler, map the search field to
a facet, and remove a search field from the collection. The type of field
function depends on the method that you input as a command
argument. This program requires that a collection was already created.
It builds the HTTP request URL based on the specified host name, port,
method, and other options. It uses the specified user name and password
to perform authentication. With the HTTP request, the program obtains the
HTTPMethod object that is executed by HttpClient.
The usage statement is as follows:
FieldExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters
The following example shows how to obtain the state of the index in xml
format for the Test collection.
GetAccess "http://localhost:8390/api/v10/admin/indexer?
method=getState&output=xml&collectionId=Test" -username esadmin
-password password
IndexerExample sample program
The following example shows to start the index for the col_12345
collection.
IndexerExample -hostname es.ibm.com -port 8390 -username esadmin
-password password -method start -collectionId col_12345
Administering PEAR files sample program
The PearExample shows how to manipulate a custom annotator PEAR
file. The program allows you to add a PEAR file to the system, associate
and disassociate it with a collection, obtain a list of deployed PEAR files,
and remove a PEAR file from the system. The type of function depends on
the method that you input as a command argument.
The usage statement is as follows:
PearExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters
The following example shows how to add the of_regex.pear PEAR file
and name it RegexPear in Watson Content Analytics.
PearExample -hostname es.ibm.com -port 8390 -method add
-username esadmin -password password -pearName RegexPear
-content "C:\\IBM\\es\\packages\\uima\\regex\\of_regex.pear"
The following REST API sample programs for JAX-WS are provided in the
ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/admin/control/api/samples/
jaxws directory:
Administering facets sample program
The FacetExample shows how to work with facets for a specified collection
by using the JAX-WS service to access the REST API. Some of the available
functions for the Facet API include adding and removing a facet from a
specified collection and obtaining a list of facets. The type of function
depends on the method that you input as a command argument.
The usage statement is as follows:
FacetExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters
The following example shows how to obtain a list of facets for the
col_12345 collection.
FacetExample -hostname es.ibm.com -port 8390 -method getList
-username esadmin -password password -collectionId col_12345
GetAccess sample program
The GetAccess example shows how to use the REST API by directly inputting
URLs. You need to pass the URL with the required parameters to this program.
The program then uses the JAX-WS service to invoke the REST API based on the
URL that you passed as an argument and prints the result stream.
The usage statement is as follows:
GetAccess URL
The following example shows how to obtain the state of the indexer for the
Test collection.
GetAccess "http://localhost:8390/api/v10/admin/indexer?
method=getState&output=xml&collectionId=Test
&api_username=esadmin&api_password=password"
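For illustration only, the following sketch shows one way to issue such a GET request with the JAX-WS Dispatch API and print the XML response. It is not the shipped GetAccess sample; the service and port names are arbitrary placeholders, and the URL is the example URL shown above.

import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.ws.Dispatch;
import javax.xml.ws.Service;
import javax.xml.ws.handler.MessageContext;
import javax.xml.ws.http.HTTPBinding;

public class RestGetSketch {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8390/api/v10/admin/indexer"
            + "?method=getState&output=xml&collectionId=Test"
            + "&api_username=esadmin&api_password=password";
        // Arbitrary service and port names; the HTTP binding does not
        // interpret them as SOAP identifiers.
        QName serviceName = new QName("urn:sample", "restService");
        QName portName = new QName("urn:sample", "restPort");
        Service service = Service.create(serviceName);
        service.addPort(portName, HTTPBinding.HTTP_BINDING, url);
        Dispatch<Source> dispatch =
            service.createDispatch(portName, Source.class, Service.Mode.PAYLOAD);
        // Issue a GET request; the request payload is null for GET.
        Map<String, Object> ctx = dispatch.getRequestContext();
        ctx.put(MessageContext.HTTP_REQUEST_METHOD, "GET");
        Source response = dispatch.invoke(null);
        // Print the XML response to standard output.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(response, new StreamResult(System.out));
    }
}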
Administering PEAR files sample program
The PearExample shows how to manipulate a custom annotator PEAR file
by using the JAX-WS service to access the REST API. It allows you to add
a PEAR file to the system, associate and disassociate it with a collection,
obtain a list of deployed PEAR files, and remove a PEAR file from the
system. The type of function depends on the method that you input as a
command argument.
The usage statement is as follows:
PearExample -hostname host_name -port port -method method
-username user_name -password password additional_parameters
The following example shows how to add the of_regex.pear PEAR file
and name it RegexPear in Watson Content Analytics.
PearExample -hostname es.ibm.com -port 8390 -method add
-username esadmin -password password -pearName RegexPear
-content "C:\\IBM\\es\\packages\\uima\\regex\\of_regex.pear"
Related concepts:
“API overview” on page 3
“REST APIs” on page 7
Before you can build the sample REST API Java applications, you must install and
configure Apache ANT, a Java-based build tool. For information about how to
install and configure Apache ANT, see http://ant.apache.org/.
Procedure
For example:
java -cp es.admin.rest.jar
com.ibm.es.admin.control.api.samples.jaxws.GetAccess
For the samples in the ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/
admin/control/api/samples/commons directory:
java -cp "%ES_INSTALL_ROOT%\lib\axis2\commons-fileupload-1.2.jar;
%ES_INSTALL_ROOT%\lib\axis2\commons-httpclient-3.1.jar;
%ES_INSTALL_ROOT%\lib\axis2\commons-logging-1.1.1.jar;
%ES_INSTALL_ROOT%\lib\axis2\commons-codec-1.3.jar;.\es.admin.rest.jar"
com.ibm.es.admin.control.api.samples.commons.application
arguments_to_run_the_application
For example:
AIX or Linux
java -cp "/opt/IBM/es/lib/axis2/commons-fileupload-1.2.jar:
/opt/IBM/es/lib/axis2/commons-httpclient-3.1.jar:
/opt/IBM/es/lib/axis2/commons-logging-1.1.1.jar:
/opt/IBM/es/lib/axis2/commons-codec-1.3.jar:./es.admin.rest.jar"
com.ibm.es.admin.control.api.samples.commons.GetAccess
Windows
java -cp
"C:\Program Files\IBM\es\lib\axis2\commons-fileupload-1.2.jar;
C:\Program Files\IBM\es\lib\axis2\commons-httpclient-3.1.jar;
C:\Program Files\IBM\es\lib\axis2\commons-logging-1.1.1.jar;
C:\Program Files\IBM\es\lib\axis2\commons-codec-1.3.jar;
.\es.admin.rest.jar"
com.ibm.es.admin.control.api.samples.commons.GetAccess
On Windows, you can also run the applications by using the resttest.bat
sample batch file in the ES_INSTALL_ROOT/samples/rest/admin directory. The
sample batch file contains the command to run the
com.ibm.es.admin.control.api.samples.jaxws.GetAccess sample application. You
can edit the batch file to run other sample applications. To run samples in the
ES_INSTALL_ROOT/samples/rest/admin/com/ibm/es/admin/control/api/samples/
commons directory, ensure that you specify the correct class path, as shown in
the previous command.
Before you can run the sample REST API Java applications, you must install and
configure Apache ANT, a Java-based build tool. For information about how to
install and configure Apache ANT, see http://ant.apache.org/.
Procedure
After you install Watson Content Analytics, the Javadoc documentation for SIAPI
enterprise search applications is available in the ES_INSTALL_ROOT/docs/api/siapi
directory.
Before you can build Java applications for searching or exploring collections, you
must install and configure Apache ANT, a Java-based build tool. For information
about how to install and configure Apache ANT, see http://ant.apache.org/.
Procedure
The advanced sample application does the same tasks as the simple sample,
except that it processes the returned results differently.
q.setSortKey(Query.SORT_KEY_NONE);
Run the query in a loop to obtain one page of results at a time. The maximum
result page size that is allowed is 100.
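The following sketch outlines such a paging loop. It assumes that the SearchFactory object (factory) and the Searchable object (searchable) were already obtained as in the simple sample application, and the query string is a placeholder.

// Sketch only: factory and searchable are assumed to be initialized
// as in the simple sample (com.ibm.siapi.search classes).
final int PAGE_SIZE = 100;          // maximum allowed page size
Query q = factory.createQuery("myQuery");
q.setSortKey(Query.SORT_KEY_NONE);
int from = 0;
while (true) {
    // Request one page of results at a time.
    q.setRequestedResultRange(from, PAGE_SIZE);
    ResultSet rs = searchable.search(q);
    Result[] results = rs.getResults();
    if (results == null || results.length == 0) {
        break;                      // no more results
    }
    for (int i = 0; i < results.length; i++) {
        System.out.println(results[i].getDocumentID());
    }
    from += results.length;
}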
The fetch API provides the com.ibm.es.fetch package in the esapi.jar file and
the following interfaces:
v com.ibm.es.fetch.Document
v com.ibm.es.fetch.Fetcher
v com.ibm.es.fetch.FetchRequest
v com.ibm.es.fetch.FetchService
v com.ibm.es.fetch.FetchServiceFactory
You can use these classes the same way that you use other SIAPI classes.
Fetching a document
First, create the factory object. Using this factory class, create the FetchService
object and FetchRequest object. The Fetcher class can be created through the
FetchService object.
You can set ACL constraints on the FetchRequest object. If this value is set, the
ACL constraints are delivered to the search server, and the search server verifies
the user's authority to access the document by checking the ACL constraints.
String aclConstraints = (String) parameters.get("SecurityContext");
aclConstraints = "@SecurityContext::'" + aclConstraints + "'";
FetchRequest fetchRequest = factory.createFetchRequest(uri, aclConstraints);
The ACL constraints value is a String value that must conform to the SIAPI format.
The sample content mining applications are in the following default directories:
v AIX or Linux: /opt/IBM/es/samples/siapi
v Windows: C:\Program Files\IBM\es\samples\siapi
The client program must have the siapi.jar and esapi.jar files, which are in the
ES_INSTALL_ROOT/lib directory of the Watson Content Analytics server, in its class path.
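For example, on AIX or Linux you might compile and run your own client class (MySearchClient is a placeholder name) with commands such as the following:
javac -cp "/opt/IBM/es/lib/siapi.jar:/opt/IBM/es/lib/esapi.jar" MySearchClient.java
java -cp "/opt/IBM/es/lib/siapi.jar:/opt/IBM/es/lib/esapi.jar:." MySearchClient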
import com.ibm.es.security.plugin.NameValuePair;
import com.ibm.es.security.plugin.SecurityPostFilterIdentity;
import com.ibm.es.security.plugin.SecurityPostFilterPlugin;
import com.ibm.es.security.plugin.SecurityPostFilterPluginException;
import com.ibm.es.security.plugin.SecurityPostFilterResult;
import com.ibm.es.security.plugin.SecurityPostFilterUserContext;
/**
* The sample SecurityPostFilterPlugin class.
*/
public class SampleSecurityPostFilterPlugin implements SecurityPostFilterPlugin {
/**
* We should reuse the context for a bunch of results.
*/
private SecurityPostFilterUserContext context = null;
/**
* Default constructor.
* The <code>SecurityPostFilterPlugin</code> implementation is initialized
* using this constructor.
*/
public SampleSecurityPostFilterPlugin() {
// Initialize resources that are required for the lifetime of this instance.
// For example, logging.
}
/* (non-Javadoc)
* @see com.ibm.es.security.plugin.SecurityPostFilterPlugin#init
* (com.ibm.es.security.plugin.SecurityPostFilterUserContext)
*/
public void init(SecurityPostFilterUserContext context)
throws SecurityPostFilterPluginException {
// Initialize resources for the bunch of results.
// i.e. for results from a query.
this.context = context;
}
/* (non-Javadoc)
* @see com.ibm.es.security.plugin.SecurityPostFilterPlugin#term()
*/
public void term() throws SecurityPostFilterPluginException {
// finalize plugin here after verifying access to documents
// i.e. deallocate system resources, close remote
// datasource connections...
}
/* (non-Javadoc)
* @see com.ibm.es.security.plugin.SecurityPostFilterPlugin#verifyUserAccess
* (com.ibm.es.security.plugin.SecurityPostFilterResult)
*/
public boolean verifyUserAccess(SecurityPostFilterResult result)
throws SecurityPostFilterPluginException {
// Return false if you do not want to return this result to the user.
SecurityPostFilterIdentity id = null;
String source = null; // the document source of this result
// (obtain the source and identity from the result object in a real plug-in)
// EXAMPLE :
// verify access to documents from a document source "MyDocs";
// only users in group "OmniFind" are allowed to see documents
// from "MyDocs".
if ("OmniFindDocs".equals(source)) {
// obtain a list of user groups from the identity
String[] groups = null;
if (id != null) {
groups = id.getGroups();
}
// check the groups here and return false to hide this result from the user
}
// EXAMPLE :
// always allow access to documents from other sources
// (winfs, notes, quickplace...).
return true;
}
}
Related tasks:
“Creating and deploying a plug-in for post-filtering search results” on page 83
Tip: Ensure that you include a comma (,) before each entry to conform to
JSON syntax.
2. Restart the search and analytics applications.
Prerequisite: Before you can build the samples, you must install and configure
Apache ANT, a Java-based build tool. For information about how to install and
configure Apache ANT, see http://ant.apache.org/.
When documents are parsed, this custom analyzer detects the occurrence of person
names that have nicknames. Then the analyzer inserts the nicknames in place of
the original names when the text is extracted to the specified field so that users can
search for documents by entering the nickname. For example, if the value of a field
is "William Smith", the document will be returned if a user enters the search terms
"Will Smith" or "Bill Smith". If you created a document ranking filter for this
analyzer and added it to the Top document ranking filter group, documents that
contain the nicknames in the specified field will be ranked higher in the results.
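The shipped sample analyzer is not reproduced here, but the following sketch illustrates the general approach with the Apache Lucene synonym filter. The class name, the single name mapping, and the Lucene API level (a recent release, which might differ from the Lucene level that is bundled with Watson Content Analytics) are illustrative assumptions only.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class NicknameAnalyzer extends Analyzer {
    private final SynonymMap nicknames;

    public NicknameAnalyzer() throws IOException {
        // Map a formal first name to its nicknames so that a search for
        // "Bill Smith" or "Will Smith" also matches "William Smith".
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("william"), new CharsRef("bill"), true);
        builder.add(new CharsRef("william"), new CharsRef("will"), true);
        nicknames = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize, lowercase, and expand nicknames in the token stream.
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new SynonymGraphFilter(result, nicknames, true);
        return new TokenStreamComponents(source, result);
    }
}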
Important: Before you can upload custom analyzers or associate analyzers with
fields in the administration console, you must enable the custom analyzer support.
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:
For license inquiries regarding double-byte (DBCS) information, contact the IBM
Intellectual Property Department in your country or send inquiries, in writing, to:
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply
to you.
Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those Web
sites.
IBM may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:
IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.
The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.
All statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows: © (your company name) (year). Portions of
this code are derived from IBM Corp. Sample Programs. © Copyright IBM Corp.
2004, 2010. All rights reserved.
If you are viewing this information softcopy, the photographs and color
illustrations may not appear.
Additional notices
Portions of this product are:
v Oracle® Outside In Content Access, Copyright © 1992, 2014, Oracle.
v IBM XSLT Processor Licensed Materials - Property of IBM © Copyright IBM
Corp., 1999-2014.
This product uses the FIPS 140-2 approved cryptographic provider(s): IBMJCEFIPS
(certificate 376) and/or IBMJSSEFIPS (certificate 409) and/or IBM Crypto for C
(ICC; certificate 384) for cryptography. The certificates are listed on the NIST web
site at http://csrc.nist.gov/cryptval/140-1/1401val2004.htm.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at "Copyright and
trademark information" at http://www.ibm.com/legal/copytrade.shtml.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
trademarks or trademarks of Adobe Systems Incorporated in the United States,
and/or other countries.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Other company, product, and service names may be trademarks or service marks
of others.
Privacy policy considerations
IBM Software products, including software as a service solutions, (“Software
Offerings”) may use cookies or other technologies to collect product usage
information, to help improve the end user experience, to tailor interactions with
the end user or for other purposes. In many cases no personally identifiable
information is collected by the Software Offerings. Some of our Software Offerings
can help enable you to collect personally identifiable information. If this Software
Offering uses cookies to collect personally identifiable information, specific
information about this offering’s use of cookies is set forth below.
This Software Offering does not use cookies or other technologies to collect
personally identifiable information.
If the configurations deployed for this Software Offering provide you as customer
the ability to collect personally identifiable information from end users via cookies
and other technologies, you should seek your own legal advice about any laws
applicable to such data collection, including any requirements for notice and
consent.
For more information about the use of various technologies, including cookies, for
these purposes, see IBM’s Privacy Policy at http://www.ibm.com/privacy and
IBM’s Online Privacy Statement at http://www.ibm.com/privacy/details, in the
section entitled “Cookies, Web Beacons and Other Technologies” and the “IBM
Software Products and Software-as-a-Service Privacy Statement” at
http://www.ibm.com/software/info/product-privacy.
T
text miner
sample custom widgets plug-in 127
TimeScaleViewSearchExample 111
TimeSeriesViewExample class 116
trademarks 135
TrendsViewExample class 116
tuning queries
custom analyzer deployment
document ranking filters 97
custom analyzer sample
document ranking filters 131
U
UIMA annotations 36
W
web crawler plug-in (postparse)
creating 79
sample plug-in 79
web crawler plug-in (prefetch)
creating 76
deploying 79
sample plug-in 76
X
XML elements
retrieval 36
semantic search 36