mass communicating

Static Site Search

Basics

The idea is to download the RSS/Atom feed of the page and page and then to search in there:

http://joevennix.com/2011/05/25/How-I-Implement-Static-Site-Search.html

First, I changed that code to limit the search term size, and use en/decodeURIComponent to access the hash data:

var query = decodeURIComponent($.param.fragment()).replace('search=', '');
// limit query length
query = query.replace(/[^A-Za-z0-9\,\:\;\.\_\-\+\*\s]/g, "");
query = query.substr(0, 256);
$(o.elem_id).val(query);

Search Score

The original code also uses a regular expression for matching. That is nice, but means that your users have to know regular expressions, and also that the results don’t come out in a particular order.

Here is how I do it:

  1. The search query is split into search terms that are separated by whitespace.
  2. When matching a string (title, summary, link of post) to the search terms, we compute for each term it’s number of occurences within .

We compute the following score which normalises the number of occurrences over the string length:

The score for all patterns is summed:

Finally, we add the scores for matching the title, link and post/page summary:

We could play with the weights a bit here to improve the search.

// ... 
find: function (q) {
    $(o.results_id).show();

    var matches = [];

    // split the query string into words
    var qs = q.toLowerCase().split(/\s+/);

    // this is the match score function
    var rq = function(s) {
      nmatches = 0.0;
      s = s.toLowerCase()

      for (var i = 0; i < qs.length; i++) {
        if(qs[i].length == 0) {
          continue;
        }
        var xs = s;
        delta = 1.0*qs[i].length/Math.min( q.length, xs.length );
        while (xs.length > 0) {
          var kk = xs.indexOf(qs[i]);
          if(kk < 0) {
            break;
          }

          if( (kk == 0 || xs[kk-1].match(/[^A-Za-z0-9]/)) 
            && (kk+qs[i].length == xs.length || xs[kk+qs[i].length].match(/[^A-Za-z0-9]/)) ) {
            nmatches += delta; // full word matches count more!
          } else {
            nmatches += 0.1*delta;
          }

          xs = xs.substr(kk+qs[i].length);
        }
      }
      return nmatches;
    }

    for (var i = 0; i < entries.length; i++) {
        var entry = entries[i];
        var title = $(entry.getElementsByTagName('title')).text();
        var link = $(entry.getElementsByTagName('link')).attr('href');
        var summary = "";

        if ($(entry.getElementsByTagName('summary')).length > 0) {
          summary = $(entry.getElementsByTagName('summary')).text();
        } else {
          summary = $(entry.getElementsByTagName('content')).text();
        }
        
        var category_score = _.reduce($(entry.getElementsByTagName('category')), 
            function(memo, cat){ 
              return memo + rq( $(cat).attr('label') + " " + $(cat).attr('term') );
             }, 0.0);

        // here's another way to improve search sensitivity: 
        // a) change the weights
        var matchscore = rq(title) + rq(link) + rq(summary) + category_score;

        if ( matchscore > 0.1 ) {
            var updated = prettyDate(xmlDateToJavascriptDate($(entry.getElementsByTagName('updated')).text()));
            matches.push({'title':title, 'link':link, 'date':updated, 'summary': summary, 'score' : matchscore });
        }
    }

    // sort by match score
    matches.sort(function(a,b){return b.score - a.score;});

	// ... 

Here is a gist with the full file.

One issue remains: this probably doesn’t scale so well if the page gets larger. A better way to to it would probably be to generate an index of terms and their scores during the Jekyll run and store+retrieve this as JSON. I might do that once I have written enough things on this site to actually run into this problem.

Category: code Tags: javascript jquery html rss atom search