In this article, we’ll talk about how to integrate the Lucene text search engine with GemFire, using Spring Data, to provide a flexible, fast, parallel search engine. Combining the two independent products lets us use each to its fullest capability. The result is an elastic search capability with the in-memory data speeds and high availability of a distributed cache platform.
Motivation—Why Combine These?
The motivation of the project was to provide an alternative search capability for GemFire while giving users a natural way to define searchable domain object attributes. Performance was also a key driver: search times should stay constant irrespective of scale. The solution outlined below provides a baseline approach for developers to build upon.
Goals and Requirements
Read the full story in the Lucene-GemFire Index Integration Guide.
- Advanced search capability using established search technology
- Deterministic search performance irrespective of scale
- Simple API to allow developers to define searchable attributes on domain classes
- Abstract complexity behind clean, simple interfaces
- Fault tolerant search process
Various business scenarios can take advantage of this integration to improve search use cases, whether through faster retrieval, more flexible criteria, highly available searching, or otherwise.
Example Business Cases
- Fast online lookups of trading symbols or companies that narrow the results to a smaller set of companies, similar to Google Finance, from which the user selects the desired symbol.
- Searching through large document stores to retrieve content that is relevant to the user. For example, a searchable legal document store can retrieve similar case notes using document abstracts as the searchable material.
- Location-based business search services that find local businesses relative to the user's own location. For example, finding local dry cleaners for a traveling salesperson.
Architecture: Lucene + GemFire
The architecture embeds Lucene search capability within GemFire cache data nodes. GemFire is a distributed caching and data management platform, generally classed as an in-memory data grid. Embedding the Lucene search engine within each GemFire cache node JVM produces a powerful, distributed, searchable store with all the functional benefits of GemFire.
Each GemFire cluster node manages its own independent set of Lucene region indexes within the same JVM process. This provides search isolation and elasticity across the GemFire cluster.
Figure 1 presents a high-level view of a single GemFire cache node. Each cache node has one or more data regions holding key/value objects. Each region has its own Lucene index repository. The index can be set to be highly available, using a configuration described later in this article.
- All user data is stored in GemFire
- Each region has its own index repository.
- Spring Index repositories contain Lucene search documents
- Index repositories and data reside within same JVM process
- Index repositories can be configured to be highly available
- User domain objects are decorated with the @Searchable annotation
- Client interfaces provided for searching and batch operations
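To make the annotation-driven approach more concrete, here is a minimal sketch of how annotated fields on a domain object could be collected for indexing. It uses only the Java standard library. The `Searchable` annotation defined here is a hypothetical stand-in for the article's @Searchable (its real definition is not shown in this article), and the `Company` class and `searchableFields` helper are invented for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

class SearchableSketch {

    // Hypothetical stand-in for the article's @Searchable annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Searchable {}

    // Hypothetical domain class: only the annotated fields should be indexed.
    static class Company {
        @Searchable final String symbol;
        @Searchable final String name;
        final double lastPrice; // deliberately not searchable

        Company(String symbol, String name, double lastPrice) {
            this.symbol = symbol;
            this.name = name;
            this.lastPrice = lastPrice;
        }
    }

    // Collects the name/value pairs of annotated fields: the raw material for a search document.
    static Map<String, String> searchableFields(Object domainObject) {
        Map<String, String> fields = new TreeMap<>();
        for (Field field : domainObject.getClass().getDeclaredFields()) {
            if (!field.isAnnotationPresent(Searchable.class)) {
                continue; // skip fields the developer did not mark as searchable
            }
            try {
                field.setAccessible(true);
                fields.put(field.getName(), String.valueOf(field.get(domainObject)));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(searchableFields(new Company("VMW", "VMware", 98.5)));
    }
}
```

In a real integration, the collected pairs would presumably be turned into fields of a Lucene document; that step is left out here to keep the sketch free of library dependencies.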
At the GemFire data region level, a Lucene index is applied using a Spring configuration. The configuration creates a dedicated index repository for the specified region. Along with this setting, a cache listener, LuceneIndexWriter, is configured to capture all region events and manage the index repository.
For example, the CacheListener (see the GemFire developer guide for details on cache event processing) receives an event on a region update and updates the index repository synchronously. This process is suitable if data updates are infrequent. Figure 2 represents this structure, where the red intersection is the LuceneIndexWriter cache listener. The index itself is not held in a region but rather in an instance of a specialized Spring Data repository.
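The listener-driven flow can be sketched without any GemFire or Lucene dependency. In the example below, `Region`, `AfterUpdateListener`, and `IndexWriterListener` are simplified stand-ins written for this illustration; they are not the GemFire CacheListener interface or the article's LuceneIndexWriter, and the tiny inverted index merely plays the role of the index repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ListenerIndexSketch {

    // Stand-in for a cache listener callback; not the GemFire CacheListener interface.
    interface AfterUpdateListener {
        void afterUpdate(String key, String value);
    }

    // Stand-in for a data region: stores the entry first, then notifies listeners on the same thread.
    static class Region {
        private final Map<String, String> data = new HashMap<>();
        private final List<AfterUpdateListener> listeners = new ArrayList<>();

        void addListener(AfterUpdateListener listener) {
            listeners.add(listener);
        }

        void put(String key, String value) {
            data.put(key, value);
            for (AfterUpdateListener listener : listeners) {
                listener.afterUpdate(key, value); // synchronous: put() returns after indexing
            }
        }
    }

    // Stand-in for the index repository: a tiny inverted index from token to entry keys.
    static class IndexWriterListener implements AfterUpdateListener {
        private final Map<String, Set<String>> postings = new HashMap<>();
        private final Map<String, String> indexedText = new HashMap<>();

        @Override
        public void afterUpdate(String key, String value) {
            // Remove postings left over from the previous value of this key.
            String previous = indexedText.put(key, value);
            if (previous != null) {
                for (String token : tokens(previous)) {
                    postings.get(token).remove(key);
                }
            }
            for (String token : tokens(value)) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(key);
            }
        }

        Set<String> search(String term) {
            Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
            return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
        }

        private static List<String> tokens(String text) {
            List<String> result = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    result.add(token);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Region region = new Region();
        IndexWriterListener index = new IndexWriterListener();
        region.addListener(index);
        region.put("c1", "Acme Dry Cleaners");
        System.out.println(index.search("dry"));
    }
}
```

Because the listener runs inside `put`, every update pays the indexing cost before it returns, which is why this style fits infrequent updates better than bulk loads.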
For high volume data loads, a GemFire function is used to batch updates into manageable Lucene index transactions and reduce update latency. This is the preferred method when the GemFire cluster is being primed or for intra-day batch update processing. This technique can be applied for high frequency updates via asynchronous processes too.
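The batching idea can be illustrated with a small dependency-free sketch. The `CountingIndex` class and the `indexInBatches` helper are hypothetical; they stand in for the index repository and for whatever the GemFire function does on each node, and the batch size is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

class BatchIndexSketch {

    // Stand-in for an index repository that counts how many transactions it commits.
    static class CountingIndex {
        final List<String> documents = new ArrayList<>();
        int commits = 0;

        // One call represents one index transaction covering the whole batch.
        void commit(List<String> batch) {
            documents.addAll(batch);
            commits++;
        }
    }

    // Splits the updates into fixed-size batches and commits each batch once.
    static void indexInBatches(List<String> updates, int batchSize, CountingIndex index) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        for (int start = 0; start < updates.size(); start += batchSize) {
            int end = Math.min(start + batchSize, updates.size());
            index.commit(updates.subList(start, end));
        }
    }

    public static void main(String[] args) {
        List<String> updates = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            updates.add("document-" + i);
        }
        CountingIndex index = new CountingIndex();
        indexInBatches(updates, 4, index);
        System.out.println(index.commits + " commits for " + index.documents.size() + " documents");
    }
}
```

Ten updates with a batch size of four result in three commits instead of ten, which is the kind of saving batching is meant to deliver during priming or intra-day loads.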
An important point is that index operations are performed synchronously and occur after the cache insert. This can be changed by switching to a CacheWriter, in which case the index update occurs before the cache update; this variant is not implemented. The choice of strategy is left to the user at design time.
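The difference between the two strategies comes down to ordering, which a small sketch can show. The `Region` and `Hook` types below are simplified stand-ins, not the GemFire CacheWriter or CacheListener interfaces. The sketch also assumes that a writer which throws aborts the put, so the cache is never updated when indexing fails.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WriterOrderingSketch {

    // Stand-in callback used for both the writer-style and listener-style hooks.
    interface Hook {
        void onEntry(String key, String value);
    }

    // Stand-in region with an optional hook before the store and one after it.
    static class Region {
        final Map<String, String> data = new HashMap<>();
        final List<String> log = new ArrayList<>();
        Hook writer;   // runs before the cache update; an exception aborts the put
        Hook listener; // runs after the cache update

        void put(String key, String value) {
            if (writer != null) {
                writer.onEntry(key, value);
            }
            data.put(key, value);
            log.add("cache:" + key);
            if (listener != null) {
                listener.onEntry(key, value);
            }
        }
    }

    public static void main(String[] args) {
        Region listenerStyle = new Region();
        listenerStyle.listener = (k, v) -> listenerStyle.log.add("index:" + k);
        listenerStyle.put("k1", "value");
        System.out.println("listener style: " + listenerStyle.log);

        Region writerStyle = new Region();
        writerStyle.writer = (k, v) -> writerStyle.log.add("index:" + k);
        writerStyle.put("k1", "value");
        System.out.println("writer style:   " + writerStyle.log);
    }
}
```

With the listener style, the cache can briefly hold an entry that is not yet searchable; with the writer style, a failed index update keeps the entry out of the cache altogether. Which trade-off is preferable depends on the application.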
Each region index can be set to be highly available. In the event of a failure, the index is rebuilt within the redundant cache node, provided the region has been configured with redundancy (see the GemFire developer guide). For partitioned regions, a GemFire PartitionListener implementation is used to determine whether a node has become primary; if so, the node builds the index from its own local data set.
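The rebuild-on-failover behaviour can be sketched as follows. The `CacheNode` class and its `onBecomePrimary` callback are hypothetical stand-ins for a cache node and for the hook a PartitionListener implementation would provide; the inverted index again plays the role of the Lucene index repository.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FailoverRebuildSketch {

    // Stand-in for a cache node that holds a redundant copy of the data but no index yet.
    static class CacheNode {
        final Map<String, String> localData = new HashMap<>();
        private final Map<String, Set<String>> index = new HashMap<>();
        boolean primary = false;

        // Hypothetical callback: invoked when this node takes over as primary.
        void onBecomePrimary() {
            primary = true;
            index.clear();
            // Rebuild the index purely from the node's own local data set.
            for (Map.Entry<String, String> entry : localData.entrySet()) {
                for (String token : entry.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (!token.isEmpty()) {
                        index.computeIfAbsent(token, t -> new TreeSet<>()).add(entry.getKey());
                    }
                }
            }
        }

        Set<String> search(String term) {
            Set<String> hits = index.get(term.toLowerCase(Locale.ROOT));
            return hits == null ? Collections.<String>emptySet() : hits;
        }
    }

    public static void main(String[] args) {
        CacheNode redundant = new CacheNode();
        redundant.localData.put("d1", "Contract dispute summary");
        System.out.println("before failover: " + redundant.search("contract"));
        redundant.onBecomePrimary();
        System.out.println("after failover:  " + redundant.search("contract"));
    }
}
```

The index is never copied between nodes in this sketch; it is derived again from data the redundant node already holds, which matches the rebuild approach described above.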
- A significant portion of this article was sourced from the Lucene-GemFire Index System Integration paper written by Lyndon Adams.
- Learn about GemFire from over 44 articles in this list.
- Visit the SpringSource blog for more about GemFire.
About the Author: Lyndon Adams is a Data Strategist from the VMware Field Center of Excellence. He specializes in fast data and big data systems. Lyndon has over 15 years of IT experience, mostly in the investment banking world, where he was involved in margin, risk, trading, tick, and reporting systems.