[tutorial] Clustering search results with the plugin: tools-carrot2

This tutorial is written for this plugin: https://github.com/medcl/-

First of all, we are talking about text clustering. So what is text clustering? Wikipedia can tell you:

“Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.”

And what is the component behind it? The answer is Carrot2 (http://project.carrot2.org/):

Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents (search results but not only) into thematic categories.

This plugin simply sends the search results of Elasticsearch to Carrot2. Carrot2 does the text clustering for you and returns the clustering result. So by using this plugin, we can cluster our search results into topics, and I will show you how to achieve that.

Let's see the final result:

As you can see, the clustering result looks like some sort of classification. The labels do not come directly from any single one of the search results, and they present a high-level overview of the search results, including the size of each group. That's amazing, and that's also the magic of machine learning.
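To make the idea of "labels with sizes" concrete, here is a toy sketch in Python. It is not Carrot2's algorithm, only a crude stand-in that groups titles by shared words; the stop-word list, the function name `toy_cluster` and the sample titles are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stop-word list, only for this toy example.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def toy_cluster(titles, min_size=2):
    """Group titles by shared words; return (label, titles) pairs, largest first."""
    groups = defaultdict(list)
    for title in titles:
        # Use a set so a word repeated inside one title is counted once.
        words = {w for w in title.lower().split() if w not in STOP_WORDS}
        for word in sorted(words):
            groups[word].append(title)
    # Keep only words shared by at least `min_size` titles.
    clusters = [(label, docs) for label, docs in groups.items() if len(docs) >= min_size]
    # Largest cluster first; ties are broken alphabetically by label.
    clusters.sort(key=lambda item: (-len(item[1]), item[0]))
    return clusters


# Made-up sample data.
titles = [
    "Apple releases new phone",
    "Apple pie recipe",
    "New phone reviews",
    "Banana bread recipe",
]
for label, docs in toy_cluster(titles):
    print(f"{label} ({len(docs)})")
```

A real engine like Carrot2 produces far better labels than single words, but the output has the same spirit: a short list of topics, each with a size.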

To make Carrot2 work, let's see how to install this plugin.
I am sorry, installing directly with plugin.sh is not supported right now (kimchy, will you add my plugin to your download repo?).
OK, let's manually copy the related files to Elasticsearch's plugins folder.

Create a folder such as “tools.carrot2” (any name you wish, it doesn't matter).
Finally, you will have a folder that looks like this:

Then create a folder “carrot2” within config. This folder name must be “carrot2”.
You can download these files directly from here: https://github.com/medcl/elasticsearch-rtf/tree/master/elasticsearch/config/carrot2
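The manual steps above can be sketched as a small Python helper. This is only a sketch under assumptions: I assume the usual layout of a `plugins` and a `config` folder under the Elasticsearch home, and the function name `prepare_layout` and its parameters are made up for illustration. You pass in the files you downloaded yourself.

```python
import shutil
from pathlib import Path


def prepare_layout(es_home, plugin_files, carrot2_config_files, plugin_name="tools.carrot2"):
    """Copy plugin files and Carrot2 config files into an Elasticsearch home.

    Assumes <es_home>/plugins and <es_home>/config are the folders Elasticsearch reads.
    """
    es_home = Path(es_home)
    plugin_dir = es_home / "plugins" / plugin_name  # the plugin folder name is free
    config_dir = es_home / "config" / "carrot2"     # this name must be exactly "carrot2"
    plugin_dir.mkdir(parents=True, exist_ok=True)
    config_dir.mkdir(parents=True, exist_ok=True)
    for src in plugin_files:
        shutil.copy2(src, plugin_dir)
    for src in carrot2_config_files:
        shutil.copy2(src, config_dir)
    return plugin_dir, config_dir
```

Call it with your own Elasticsearch path and the lists of files you downloaded; it returns the two folders it filled so you can check them by hand.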

OK, you can restart your Elasticsearch now.

Now let's see what data we have. Really simple, huh?


After indexing has finished,
let's see how to write the query DSL.

In my demo, I used AJAX to send the clustering request to Elasticsearch, because clustering is CPU-bound, and loading it asynchronously makes the page load a little faster.
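My demo does this with AJAX in the browser. The same idea, written as a Python sketch: start the slow clustering call in the background and return the normal hits right away. The functions `fake_search` and `fake_clustering` are stand-ins I made up; in a real setup they would call Elasticsearch.

```python
from concurrent.futures import ThreadPoolExecutor


def load_page(run_search, run_clustering, query):
    """Return the hits at once, plus a future that will hold the clusters."""
    executor = ThreadPoolExecutor(max_workers=1)
    # The clustering call starts in a background thread.
    clusters_future = executor.submit(run_clustering, query)
    # The normal search does not wait for the clustering to finish.
    hits = run_search(query)
    # Do not block here; the submitted task still runs to completion.
    executor.shutdown(wait=False)
    return hits, clusters_future


# Stand-in functions (hypothetical), replacing real requests to Elasticsearch.
def fake_search(query):
    return [f"hit for {query}"]


def fake_clustering(query):
    return [{"label": "demo", "size": 1}]


hits, clusters_future = load_page(fake_search, fake_clustering, "apple")
print(hits)                      # available immediately
print(clusters_future.result())  # blocks only when the clusters are really needed
```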

And this is a simple query:

The query above hands 500 search results to Carrot2. This number influences the clustering result. It is not a good idea to cluster a large result set, because a large result set uses a lot of memory, and consuming too much memory in a single query is a bad idea. The CPU will go crazy too. In short: more data gives a better clustering result but makes it slower, so you have to make a trade-off.
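One way to keep that trade-off under control is to cap the number of results before building the request. Here is a minimal sketch in Python. It only builds a standard Elasticsearch search body (`query_string` plus `size`); the clustering-specific parameters of this plugin are not shown here, and the names `MAX_CLUSTER_INPUT` and `build_search_body` are made up for illustration.

```python
import json

# Assumption: 500 is the upper bound, taken from the example in this tutorial.
MAX_CLUSTER_INPUT = 500


def build_search_body(query_text, requested_size, max_size=MAX_CLUSTER_INPUT):
    """Build a search body whose size never exceeds max_size (and is at least 1)."""
    size = max(1, min(requested_size, max_size))
    return {
        "query": {"query_string": {"query": query_text}},
        "size": size,
    }


# Asking for 5000 results is silently reduced to the cap.
print(json.dumps(build_search_body("apple", 5000)))
```

Tune the cap for your own hardware: raise it if the clusters look too coarse, lower it if the requests get slow.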

The clustering result will look like this:

Now you can consume the result freely.
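For example, a page could show the biggest topics first. Here is a sketch in Python, with a big assumption: the response shape used here (a `clusters` list whose items have `label` and `documents`) is hypothetical and only for illustration, so adapt the keys to what the plugin really returns.

```python
def summarize_clusters(response, top_n=5):
    """Return (label, size) pairs for the largest clusters.

    Assumes a hypothetical shape: {"clusters": [{"label": ..., "documents": [...]}]}.
    """
    clusters = response.get("clusters", [])
    ranked = sorted(clusters, key=lambda c: len(c.get("documents", [])), reverse=True)
    return [(c.get("label", "?"), len(c.get("documents", []))) for c in ranked[:top_n]]


# Made-up sample response, only to show the call.
sample = {
    "clusters": [
        {"label": "Recipes", "documents": ["2", "4"]},
        {"label": "Phones", "documents": ["1", "3", "5"]},
    ]
}
for label, size in summarize_clusters(sample):
    print(f"{label}: {size}")
```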
The whole request and response were captured with Fiddler (www.fiddler2.com).
You can download it from: