Have you ever built a search using a SQL LIKE statement, only to have your users complain about its shortcomings? A simple SQL-based search doesn't handle synonyms, misspellings, prefixes, suffixes, result ranking, weighting, and so on. Fret no longer: for a little more effort you can build a "smart" search using Lucene and get all of these features, plus the ability to tweak the search as much as you like.
Lucene.NET is a direct port of the popular open source Java Lucene project. Large companies such as EMC and Cisco have placed bets on Lucene and embedded the library within some of their products. The .NET version is a little behind the Java version in terms of features and releases, but by and large the library is very usable. Lucene can index just about any type of content – files, database records, web pages – and can be used in any number of architectural scenarios: searching in an ASP.NET web site, searching within a desktop app, search as a web service or Windows service, and so on.
In the simplest search scenario, you have two components to build: an Indexer and a Searcher. You can think of Lucene as a set of tools that will do most of the work for you in building these components – you use Lucene to build an index and dump your searchable content into it, and you tell Lucene how to search the index you've built. Conceptually, the index is built from the content you want to search, whether that content is files or database records. If the content changes (for example, you've added a new file), you have to either append the new content to your index or rebuild the index. One strategy is to set up a scheduled process (e.g. using Quartz.NET, a Windows service, or a scheduled task) to periodically re-index your content.
Adding Lucene to your project
First things first, you have to add the Lucene libraries to your project. On the Lucene.NET web site, you'll see the most recent release builds of Lucene. Do not grab them – they are about two years old and contain some bugs. There has not been an official release of Lucene.NET for some time, probably due to resource constraints of the maintainers. Instead, use Subversion (or TortoiseSVN) to browse the Apache SVN repository and grab the most recently updated Lucene.NET code. The solution and projects target Visual Studio 2005 and .NET 2.0, but I upgraded the projects to Visual Studio 2008 without any issues and was able to build the solution without errors. Go to the bin directory, grab the Lucene.Net dll, and add a reference to it in your project.
Building the Index
Step two is building your searchable index. A Lucene index is usually stored as a set of files on the file system, but can also be stored in memory for performance – and there are even proof of concept projects available that allow you to store the index in a database (though I’m not sure why you would).
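As a quick illustration, switching between on-disk and in-memory storage is just a matter of choosing a different directory implementation. Below is a sketch, assuming the same Lucene 2.9 API used in the examples later in this article:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

//A RAMDirectory keeps the entire index in memory - nothing touches the disk.
Directory directory = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
//...add your documents here, exactly as you would with a file system index...
writer.Close();
```

Everything else – documents, analyzers, searching – works identically regardless of which directory implementation you pick.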
A couple of Lucene concepts/classes you should be aware of for indexing include Documents, Fields, Analyzers, and the IndexWriter. Documents are what you put into your index. They’re not “documents” in the traditional sense, like a Word document – rather, a Document is just an abstraction of an indexable piece of content. It is your responsibility to create the Document objects to place into your Index.
For example, let’s say we’re creating a product search, using Product objects pulled from our database. Our searches will be based on the Product Name.
public class Product
{
    public Product() { }

    public string ProductName { get; set; }
    public decimal Price { get; set; }
    public string Color { get; set; }
    public int Id { get; set; }

    //Return a Lucene document for the product.
    public Document GetDocument()
    {
        Document document = new Document();
        //ProductName is indexed (searchable) but not stored in the index.
        document.Add(new Field("ProductName", this.ProductName, Field.Store.NO, Field.Index.ANALYZED));
        //Id is stored (returned with search results) but not indexed.
        document.Add(new Field("Id", this.Id.ToString(), Field.Store.YES, Field.Index.NO));
        return document;
    }
}
We add Fields to our Document to represent the values we want to search on or store in our index. Field.Store.YES/NO indicates whether or not we want to actually store the field's value in the index. Note how I don't store the ProductName, and don't add the Price or Color columns at all – we don't want to store complete objects in Lucene, since it's just our search index. Keep the complete objects in your database (or keep your files on the file system, etc.). We do want to store the Id, because when we get our search result documents back from querying the index, only the stored fields are returned. We need at least the Product Id so we can fetch the full objects matching our search results from the database. There is also a COMPRESS option that you can use if you need to store large fields or binary data.
Field.Index.ANALYZED/NO indicates whether or not we want to actually index the field. Indexing a field takes some processing power, so don't index every field – only index what you want to search on. Thus we don't index the Product Id – only the Name, because that's all we want to search on.
Next, we’ll create the index and add the documents to it. Below is an example of a very simple class with a single method that we can use to build our Product search index using a given list of products.
public class Index
{
    public void BuildIndex(List<Product> products)
    {
        FSDirectory directory = FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"));
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        //true = create a new index, overwriting any existing index in the directory
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        foreach (Product product in products)
        {
            indexWriter.AddDocument(product.GetDocument());
        }
        indexWriter.Optimize();
        indexWriter.Close();
    }
}
The FSDirectory is just an abstraction of the storage of the index, and there are “directory” classes that represent in-memory storage, etc. that you can use as well. You can pass a DirectoryInfo object to the Open method to specify where to store the search index.
The Analyzer's job is to parse and tokenize your data before it is indexed. There are a number of different Analyzers implemented in Lucene, but the StandardAnalyzer is the most straightforward. The StandardAnalyzer does a few things to your text, including removing junk search terms (aka "stop words") and punctuation, and normalizing the case of your text. There are a number of constructors available for the StandardAnalyzer; you can specify your own stop words if you like, but a list of common stop words is built into Lucene. Another good analyzer is the SnowballAnalyzer, which stems words down to their root forms (stripping common suffixes), and this can greatly improve your search results. The SnowballAnalyzer is a separate Lucene project outside the main source code; it can be found under the contrib folder in the Lucene source (not in the main Lucene.Net solution). Build it yourself and include it in your project if you would prefer to use it instead of the StandardAnalyzer.
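If you do build the contrib project, swapping analyzers is a one-line change. A sketch is below – note that the SnowballAnalyzer constructor signature varies between contrib versions (some also take a Version argument), so check the source you actually build:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball; //from the contrib Snowball project

//"English" selects the English stemmer; other languages are supported.
Analyzer analyzer = new SnowballAnalyzer("English");
```

Remember that whichever analyzer you choose, you must use the same one for both indexing and searching.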
The IndexWriter is responsible for creating the index. The IndexWriter is thread safe, and an index can be rebuilt while being read from at the same time without you having to manage the locking of the index files – Lucene takes care of that for you. The boolean parameter on the constructor indicates whether to recreate the index or append to an existing one. Simply call the AddDocument method on the IndexWriter to write documents to the index. When you're finished writing documents, you must call the Close method. Optionally, you can call the Optimize method before closing, which merges the index segments and can shrink the index and speed up searches – however, optimizing can take a few seconds, so you may want to skip it if you have indexing performance concerns.
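To append to an existing index instead of recreating it, pass false for that boolean. Here is a minimal sketch reusing the Product class from above (the ReindexJob name is just for illustration):

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

public class ReindexJob
{
    public void AppendProducts(List<Product> newProducts)
    {
        FSDirectory directory = FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"));
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        //false = append to the existing index rather than recreating it
        IndexWriter writer = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
        foreach (Product product in newProducts)
        {
            writer.AddDocument(product.GetDocument());
        }
        writer.Close();
    }
}
```

A method like this is what you'd call from the scheduled re-indexing process mentioned earlier.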
Now that we have the Index built, we can move on to actually searching the index…
Searching the Index
Below is an example method that you could use to search your newly created product index; you could add it to your Index class. You'll see a few of the same classes from the indexing sample being used in the search method. As in the previous example, you use the FSDirectory class to specify where the index is located. Then you create an IndexReader, passing in your directory object. The second parameter of IndexReader.Open specifies whether to open the index in read-only mode – for our simple purposes, we only need to read from the index. One thing to note about the IndexReader is that it is fairly expensive to create, so you don't want to create one for every search in, say, a web application. Create a single IndexReader – perhaps via a singleton pattern or by caching the object – and re-use it. Next, we need an IndexSearcher to actually search our index, which is fairly straightforward to construct.
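A minimal sketch of that caching idea is below (a production version would also need to re-open the searcher after the index is rebuilt – the SearcherCache name is just for illustration):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class SearcherCache
{
    private static readonly object padlock = new object();
    private static IndexSearcher searcher;

    public static IndexSearcher GetSearcher()
    {
        //Lazily create a single shared searcher; searching is thread safe,
        //so one instance can serve concurrent requests.
        lock (padlock)
        {
            if (searcher == null)
            {
                FSDirectory directory = FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"));
                searcher = new IndexSearcher(IndexReader.Open(directory, true));
            }
            return searcher;
        }
    }
}
```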
When searching, queries must be parsed and tokenized in the same way the data was parsed when it was placed into the index. For this reason, one very important thing to note is that the same type of Analyzer that was used to create the index must also be used to parse search queries – if a StandardAnalyzer was used to create the index, a StandardAnalyzer must also be used to parse queries against it. The QueryParser parses the query text against the field to be searched – as you can see in the QueryParser constructor, we'll be searching against the "ProductName" field from our documents. After that, simply call the Parse method on the QueryParser to get the Query that we'll pass to the searcher. Note that if you want to search on multiple fields – say, the Product Name and the Color – you can use the MultiFieldQueryParser class. With the MultiFieldQueryParser you can even do clever things like weighting fields differently, e.g. making product name matches rank higher than color matches.
Next, we'll create a collector that defines how the search results are gathered from the searcher – we'll use a TopScoreDocCollector. The first parameter is the maximum number of results to collect; the second is a hint telling the collector whether documents will be scored in index order (the collector always returns its results ranked by relevance, which is exactly what we want to show our customers). From there, simply call the Search method on the searcher, passing in the query and the collector, and you receive a collection of scored matches. For each match, you can call the Doc method on the searcher to retrieve the full Document that was originally placed in the index. After I've collected the Product Ids from the search result documents, I go back and fetch the full Product objects from the database. Depending on which fields you choose to store in your Lucene index, you may not need to re-fetch from the database at all – it's a good idea to store just enough data to display the search results, so you don't need a database trip just to render them.
public List<Product> SearchProductName(string productName)
{
    FSDirectory directory = FSDirectory.Open(new System.IO.DirectoryInfo("C:\\temp\\"));
    IndexReader reader = IndexReader.Open(directory, true);
    Searcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
    QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "ProductName", analyzer);
    Query query = parser.Parse(productName);
    TopScoreDocCollector collector = TopScoreDocCollector.create(100, true);
    searcher.Search(query, collector);
    ScoreDoc[] hits = collector.TopDocs().scoreDocs;
    List<int> productIds = new List<int>();
    foreach (ScoreDoc scoreDoc in hits)
    {
        //Get the document that represents the search result.
        Document document = searcher.Doc(scoreDoc.doc);
        int productId = int.Parse(document.Get("Id"));
        //The same document can be returned multiple times within the search results.
        if (!productIds.Contains(productId))
        {
            productIds.Add(productId);
        }
    }
    //Now that we have the product Ids representing our search results, retrieve the products from the database.
    List<Product> products = ProductDAO.GetProductsByIds(productIds);
    reader.Close();
    searcher.Close();
    analyzer.Close();
    return products;
}
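As mentioned earlier, searching multiple fields just means using a different parser. Here is a sketch – note it assumes you've also indexed a "Color" field, which the earlier Product example did not:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
//Parse the query text against both fields at once.
MultiFieldQueryParser parser = new MultiFieldQueryParser(
    Lucene.Net.Util.Version.LUCENE_29,
    new string[] { "ProductName", "Color" },
    analyzer);
Query query = parser.Parse("red widget");
```

There is also a constructor overload that accepts per-field boost values, which is how you'd make ProductName matches outrank Color matches.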
Again, keep in mind these are only example methods. The examples above are based around searching rows that live in a database, but they could easily be adapted to searching through a directory of files or through indexed web pages. The Lucene class structure seems, to me, highly abstracted – and that is what allows for such flexibility. Search is a finicky thing, and you'll always run into scenarios where your client doesn't like the way the search works – that's fine, because Lucene gives you the flexibility to change how it works.