The Significance Showcase Elasticsearch Plugin

Hannes Korte
email@hkorte.com
Elasticsearch Usergroup Berlin Meetup 2014-04-29

Motivation

Classical count-ordered term facets return mostly stop words on full-text fields:

         "terms": [
            {
               "term": "of",
               "count": 14142
            },
            {
               "term": "the",
               "count": 13806
            },
            {
               "term": "in",
               "count": 13625
            },
            {
               "term": "and",
               "count": 13600
            },
            ...

Motivation (2)

We want dynamic word clouds consisting of significant terms correlated with the current search result!

Different flavours of significance

  • ES Heuristic used in SignificantTermsAggregation
  • Mutual Information
  • $\chi^2$ (Chi-squared test)
  • Kullback-Leibler Divergence
  • ...
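
As an illustration of one of the measures listed above, here is a minimal sketch of Mutual Information computed on a 2x2 contingency table of document counts (first index: document contains the word, second index: document belongs to the positive class). The class and method names are hypothetical, and the plugin's own implementation may differ, for example in smoothing or in the base of the logarithm.

```java
// Hypothetical sketch, not taken from the plugin source.
final class MutualInformationSketch {

    /**
     * Mutual information (in bits) between "document contains the word"
     * and "document belongs to the positive class".
     */
    static double compute(long n00, long n01, long n10, long n11) {
        double n = (double) n00 + n01 + n10 + n11;
        if (n == 0) {
            return 0.0; // empty table: no information
        }
        double withoutWord = (double) n00 + n01; // row total, word absent
        double withWord = (double) n10 + n11;    // row total, word present
        double negative = (double) n00 + n10;    // column total, negative class
        double positive = (double) n01 + n11;    // column total, positive class
        return term(n11, withWord, positive, n)
             + term(n10, withWord, negative, n)
             + term(n01, withoutWord, positive, n)
             + term(n00, withoutWord, negative, n);
    }

    // One cell's contribution: p(w,c) * log2( p(w,c) / (p(w) * p(c)) ).
    private static double term(double joint, double rowTotal, double colTotal, double n) {
        if (joint == 0) {
            return 0.0; // by convention 0 * log(0) = 0
        }
        return (joint / n) * (Math.log((n * joint) / (rowTotal * colTotal)) / Math.log(2));
    }
}
```

A word that is distributed independently of the class scores 0, while a word that occurs in exactly the positive half of the corpus scores 1 bit.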

Code Example: Chi-squared

  /**
   * @param n00 docs which do not contain word with negative class
   * @param n01 docs which do not contain word with positive class
   * @param n10 docs which contain word with negative class
   * @param n11 docs which contain word with positive class
   * @return The Chi-squared test result of the given distribution
   */
   @Override
   public double compute(long n00, long n01, long n10, long n11) {
      // add +1 for smoothing and to avoid division by zero;
      // work in doubles so the products below cannot overflow a long
      double d00 = n00 + 1.0, d01 = n01 + 1.0, d10 = n10 + 1.0, d11 = n11 + 1.0;
      double n = d11 + d10 + d01 + d00;
      double diff = d11 * d00 - d10 * d01;
      return (n * diff * diff) /
             ((d11 + d01) * (d11 + d10) * (d10 + d00) * (d01 + d00));
   }
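
Written as a formula, the method above computes the following, where every count $n_{ij}$ has already been incremented by one and $N = n_{00} + n_{01} + n_{10} + n_{11}$:

$$
\chi^2 = \frac{N \, (n_{11} n_{00} - n_{10} n_{01})^2}{(n_{11} + n_{01}) \, (n_{11} + n_{10}) \, (n_{10} + n_{00}) \, (n_{01} + n_{00})}
$$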

Installation

  • As this plugin is not a pure _site plugin, we cannot simply use the GitHub installation shortcut.
  • To do it right, we would have to deploy a release to Maven Central.
  • We did it the direct way: the release ZIP is hosted on gh-pages and installed by URL.
bin/plugin --url http://hkorte.github.io/significance-showcase/stable.zip \
           --install significance-showcase

Usage: Request

POST /wikipedia/page/_significance
{
    "query": {
        "match": {
            "text": "geek"
        }
    },
    "field": "text",
    "size": 10
}
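
For callers that prefer code over a REST console, the request above could be sent from Java roughly as follows. This is a sketch under assumptions: it uses the JDK 11+ `java.net.http` client, the class and method names are hypothetical, the base URL `http://localhost:9200` is taken from the _site slide, and the JSON escaping is deliberately naive.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client-side sketch; not part of the plugin.
final class SignificanceRequestSketch {

    // Builds the JSON body shown on the slide.
    static String buildBody(String field, String queryText, int size) {
        String f = escape(field);
        String q = escape(queryText);
        return "{\"query\":{\"match\":{\"" + f + "\":\"" + q + "\"}},"
             + "\"field\":\"" + f + "\",\"size\":" + size + "}";
    }

    // Naive escaping of backslashes and quotes only; use a JSON library in real code.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds POST <baseUrl>/<index>/<type>/_significance without sending it.
    static HttpRequest buildRequest(String baseUrl, String index, String type, String body) {
        URI uri = URI.create(baseUrl + "/" + index + "/" + type + "/_significance");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    static String send(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With a node running locally, `send(buildRequest("http://localhost:9200", "wikipedia", "page", buildBody("text", "geek", 10)))` would issue the request from this slide.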

Usage: Response

{
   "default": [
      {
         "term": "geek",
         "score": 722,
         "n00": 18050,
         "n01": 0,
         "n10": 0,
         "n11": 25,
         "subsetProbability": 1,
         "supersetProbability": 0.0013831258644536654,
         "absoluteProbabilityChange": 0.9986168741355463,
         "relativeProbabilityChange": 723
      },
      {
         "term": "nerds",
         "score": 65.52727272727275,
         "n00": 18044,
         "n01": 20,
         "n10": 6,
         "n11": 5,
         "subsetProbability": 0.2,
         "supersetProbability": 0.0006085753803596127,
         "absoluteProbabilityChange": 0.1993914246196404,
         "relativeProbabilityChange": 328.6363636363637
      },
      {
         "term": "cosplay",
         "score": 35.433846153846154,
         "n00": 18041,
         "n01": 21,
         "n10": 9,
         "n11": 4,
         "subsetProbability": 0.16,
         "supersetProbability": 0.0007192254495159059,
         "absoluteProbabilityChange": 0.1592807745504841,
         "relativeProbabilityChange": 222.46153846153848
      }
   ],
   "mi": [
      {
         "term": "geek",
         "score": 0.014968935346494364,
         "n00": 18050,
         "n01": 0,
         "n10": 0,
         "n11": 25
      },
      {
         "term": "nerds",
         "score": 0.0024575965051602095,
         "n00": 18044,
         "n01": 20,
         "n10": 6,
         "n11": 5
      },
      {
         "term": "video",
         "score": 0.0020564057622244735,
         "n00": 16312,
         "n01": 8,
         "n10": 1738,
         "n11": 17
      }
   ],
   ...
}
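
The probability fields in the "default" entries can be reproduced from the four counts. The sketch below is inferred from the sample values above rather than taken from the plugin source, and its class name is hypothetical: the subset is the set of documents matching the query (n01 + n11), the superset is the whole index, and the "default" score in these samples equals the absolute change multiplied by the relative change.

```java
// Hypothetical sketch; the formulas are inferred from the sample response.
final class ProbabilityFieldsSketch {
    final double subsetProbability;
    final double supersetProbability;
    final double absoluteProbabilityChange;
    final double relativeProbabilityChange;
    final double score;

    ProbabilityFieldsSketch(long n00, long n01, long n10, long n11) {
        double subsetSize = (double) n01 + n11;         // docs matching the query
        double total = (double) n00 + n01 + n10 + n11;  // all docs in the index
        // Note: an empty subset or empty index yields NaN here; real code should guard against it.
        subsetProbability = n11 / subsetSize;
        supersetProbability = ((double) n10 + n11) / total;
        absoluteProbabilityChange = subsetProbability - supersetProbability;
        relativeProbabilityChange = subsetProbability / supersetProbability;
        score = absoluteProbabilityChange * relativeProbabilityChange;
    }
}
```

For "nerds", `new ProbabilityFieldsSketch(18044, 20, 6, 5)` gives a subset probability of 5/25 = 0.2 and a superset probability of 11/18075, in line with the response shown above.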

Usage: _site

There is also a site plugin available at
http://localhost:9200/_plugin/significance-showcase/

Thanks!

Hannes Korte
email@hkorte.com
hkorte