Autocomplete is one feature that has become an integral part of every application. Building a fast, scalable autocomplete is pretty big project of its own especially if you have a ever growing data set. One of the best autocomplete I have seen is Quora’s search autocomplete. I know google and youtube have the best ones too but I just want to focus on a smaller companies rather than giant ones. Here is a post on Quora’s autocomplete explaining how they built their’s in C++ designed for speed.

In this post I we will use Solr search engine to power a decent auto-complete. Solr works like a document database where each record is a document. In Solr you define a schema with all possible fields. When you define fields you specify the type each field is. For example:

<field name="username" type="text" indexed="true" stored="true" multiValued="false" />
<field name="category" type="text" indexed="true" stored="true" multiValued="false" />
<field name="create_date" type="date" indexed="true" stored="true" multiValued="false" />
<field name="pins" type="slong" indexed="true" stored="true" multiValued="false" />

As you can see that we have defined username & category as text where as create_date is defined as date and pins as long. This enables solr to perform some advanced queries based on the type of field. For autocomplete we need a text field but with the ability to search substring of text. But Solr does not do a good job at matching string unless you use a specific parser designed for it. Two such parsers are NGram and EdgeNGram.

NGramTokenizerFactory is a parser which breaks the strings into small grams of specified lengths of substring. This tokenizer take 2 inputs.

<fieldType name="text" class="solr.TextField" >
<analyzer type="index"> 
   <tokenizer class="solr.KeywordTokenizerFactory"/> 
   <filter class="solr.LowerCaseFilterFactory"/> 
   <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" /> 
 </analyzer> 
 <analyzer type="query"> 
   <tokenizer class="solr.KeywordTokenizerFactory"/> 
   <filter class="solr.LowerCaseFilterFactory"/> 
 </analyzer> 
</fieldType>

Using this setting in your solr will make solr break up text into substring and perform matches. This can yield undesirable results and make solr do un-necessary substring calculations. Ex: “Steve” is broken into “st”, “te”, “ev”, “ve”, “ste”, “tev” etc.

EdgeNGramFilterFactory is a parser which makes substring only from edges. That is either from front or back of the word. For example: “Steve” will be parsed into “ste”, “eve”, “stev”, etc

<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>

The above setting define edgegram to generate gram size from 2 to 15 and only consider making grams from front(side). This works pretty well for the autocomplete purposes.

Since we defined a new field type “prefix_token” we can set a field in solr to be of the type prefix_token.

<field name="fullname" type="prefix_token" indexed="true" stored="true" multiValued="false" />

Now querying solr on fullname will yield good autocomplete results. Suppose you only need users and ignore other document type you can add another condition selecting the documents of type “user”. You can also use boost to push the users with high social media score to the top of the results.

?q={!boost b=log(social_score)}(fullname:Steve)

Note: Solr supports TermsComponent another component which returns an array of spell matches to a query string. But we needed a autocomplete where upon selecting a result redirects to that particular document’s url. Also we wanted to show user’s pic into autocomplete which means we need the complete documents rather than spelling suggestions.