
ElasticSearch Search Syntax and Boolean and Aggregation Searches

Walker Rowe

Here we explain how to do searches in ElasticSearch (ES). ES has a seemingly endless list of search options, which can seem overwhelming. So we will start with some simple examples and build from there.

We look at these items:

Basic Search Syntax
Boolean Searches
Aggregation

Prerequisites
If you have not installed ElasticSearch, you can follow these instructions to do that. You can also review this cheat sheet, which provides some introductory material.

Note that we are not using Kibana in these examples. Kibana is an optional graphical front-end for ElasticSearch. How to use Kibana is a topic unto itself.

Data
We need some data. So:

  1. Download this program to create student data. Change the URL in the program to point to your instance of ElasticSearch: http://(your IP):9200/universities/universities/.
  2. Then download this list of universities. Select the csv option and unzip the downloaded file. Copy the file InstitutionCampus.csv. Then download this program to load that data.

Run both programs. Both will run for several minutes since we want to create a large amount of data.

These programs create two different types of data, students and universities, in the same index, universities. We do this because join-style queries, such as nested queries, require the data to be in the same index.


Basic Search Syntax

In order to get started, you need to understand something about the ES search syntax, aka Lucene Query. Here is a list of options:

search type | example
free text (no field specified) | "October" matches documents that contain the word "October" in any field
name of field | "school":"Harvard"
range | "[1 TO 10]" for numeric fields
boolean (AND, OR) | "tall OR small". You can also leave off the operator and write "(tall small)".
scope multiple fields | "school.\*":"(North South)" would query school.name and school.location
test for value | "exists" : { "field" : "grades" } tests whether there is a field called grades
wild cards (? for one character, * for zero or more) | "Sou?h" matches "South". "So*h" matches "South" as well.

A free-text query with no field name is usually typed into the Kibana search screen or passed to curl as the q parameter, as in:

curl -XGET --header 'Content-Type: application/json' 'http://parisx:9200/universities/_search?q=DON'
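Query strings like the ones listed above contain spaces, quotes, and brackets, which have to be URL-encoded before they can travel in the q parameter. As a minimal sketch, assuming the parisx host used in this article and a hypothetical helper named es_q (not part of ElasticSearch or curl), curl's -G and --data-urlencode options can handle the encoding:

```shell
# Base URL of the cluster. parisx is the host used in this article;
# the ES_URL override is an assumption added for convenience.
ES_URL="${ES_URL:-http://parisx:9200}"

# es_q INDEX QUERY: hypothetical helper that runs a query-string search.
es_q() {
  # -G sends the --data-urlencode values as URL parameters of a GET request.
  curl -s -G "$ES_URL/$1/_search" \
    --data-urlencode "q=$2" \
    --data-urlencode "pretty=true"
}
```

With that in place, es_q universities 'ParentName:(North AND Carolina)' or es_q universities 'ParentName:Sou?h' should run the corresponding searches without any manual escaping.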

Boolean Searches

With all the available options, there are multiple ways to achieve the same results, like finding schools in North Carolina.

This search finds all universities with both North and Carolina in the ParentName field.

curl -XGET --header 'Content-Type: application/json' 'http://parisx:9200/universities/_search/?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "ParentName": "North" }},
        { "match": { "ParentName": "Carolina" }}
      ]
    }
  }
}'

Results in:

{
  "_index" : "universities",
  "_type" : "universities",
  "_id" : "fc7549cbac32fc5770c9cb7bdc72d30de9fdc5a2",
  "_score" : 10.899638,
  "_source" : {
    "ParentName" : "North Carolina Wesleyan College",
    "DapipId" : "133942003",
    "OpeId" : "",
    "LocationType" : "Additional Location",
    "AdminPhone" : "",
    "Fax" : "",
    "LocationName" : "Raleigh",
    "Address" : "2000 Perimeter Park Dr, Morrisville, NC 27560",
    "AdminName" : "",
    "ParentDapipId" : "133942",
    "UpdateDate" : "",
    "AdminEmail" : "",
    "GeneralPhone" : ""
  }
},
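As one illustration of the "multiple ways" mentioned above, a single match query with its operator set to "and" should return the same kind of result as the two must clauses, because every term in the query text then has to be present. This is a sketch only: the helper name search_universities and the ES_URL variable are assumptions made for this example.

```shell
# parisx is the host used in this article; ES_URL is an assumed override.
ES_URL="${ES_URL:-http://parisx:9200}"

# One match query: with "operator": "and", both North and Carolina must appear.
MATCH_AND_BODY='{
  "query": {
    "match": {
      "ParentName": {
        "query": "North Carolina",
        "operator": "and"
      }
    }
  }
}'

# search_universities BODY: hypothetical helper that sends a JSON search body.
search_universities() {
  curl -s -XGET --header 'Content-Type: application/json' \
    "$ES_URL/universities/_search?pretty=true" -d "$1"
}
```

Run it with search_universities "$MATCH_AND_BODY". Without the operator setting, match defaults to "or" and would also return schools that contain only one of the two words.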

You can limit the output to just certain fields using the _source parameter as shown below.

curl -XGET --header 'Content-Type: application/json' 'http://parisx:9200/universities/_search/?pretty=true' -d '{
  "_source": ["ParentName"],
  "query": {
    "bool": {
      "must": [
        { "match": { "ParentName": "Florida" }},
        { "match": { "ParentName": "University" }}
      ]
    }
  }
}'

Results in:

{
  "_index" : "universities",
  "_type" : "universities",
  "_id" : "455f7ec172260cae0afad78c6b532920d65876d3",
  "_score" : 5.5798583,
  "_source" : {
    "ParentName" : "Florida State University"
  }
}
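The must clauses above behave like a boolean AND. For an OR, the bool query also accepts a should list. The sketch below assumes the same universities index and ParentName field; the helper name es_search and the ES_URL variable are made up for this example.

```shell
# parisx is the host used in this article; ES_URL is an assumed override.
ES_URL="${ES_URL:-http://parisx:9200}"

# Match schools whose ParentName contains Florida OR Carolina.
# minimum_should_match: 1 requires at least one should clause to match.
SHOULD_BODY='{
  "_source": ["ParentName"],
  "query": {
    "bool": {
      "should": [
        { "match": { "ParentName": "Florida" }},
        { "match": { "ParentName": "Carolina" }}
      ],
      "minimum_should_match": 1
    }
  }
}'

# es_search INDEX BODY: hypothetical helper that sends a JSON search body.
es_search() {
  curl -s -XGET --header 'Content-Type: application/json' \
    "$ES_URL/$1/_search?pretty=true" -d "$2"
}
```

Calling es_search universities "$SHOULD_BODY" should list schools from either state, again showing only the ParentName field.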

Aggregation

Aggregation lets you group data to do counts.

In the program massAdd.py, change this index in this line of code:

url="http://parisx:9200/universities/universities/"

to a different index:

url="http://parisx:9200/students/students/"

Then run that program to load data into the students index.

Then run this search:

curl -XGET --header 'Content-Type: application/json' 'http://parisx:9200/students/_search/?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "group_by_school": {
      "terms": {
        "field": "school.keyword"
      }
    }
  }
}'

Results in:

"aggregations" : {
  "group_by_school" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "Arizona State University",
        "doc_count" : 9909
      },
      {
        "key" : "University of South Carolina - Columbia",
        "doc_count" : 9890
      },
      {
        "key" : "University of North Carolina at Chapel Hill",
        "doc_count" : 9791
      }
    ]
  }
}

In SQL, this is the same as:

select school, count(*) from students group by school

This shows you how many students are at each school.
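By default a terms aggregation returns at most the ten largest buckets. The sketch below, which assumes the same students index and school.keyword field, sets the aggregation's own size option to control how many schools come back. The helper name es_search and the ES_URL variable are assumptions made for this example.

```shell
# parisx is the host used in this article; ES_URL is an assumed override.
ES_URL="${ES_URL:-http://parisx:9200}"

# The top-level "size": 0 suppresses the individual hits.
# The "size": 3 inside terms limits the result to the three largest buckets.
TOP_SCHOOLS_BODY='{
  "size": 0,
  "aggs": {
    "group_by_school": {
      "terms": {
        "field": "school.keyword",
        "size": 3
      }
    }
  }
}'

# es_search INDEX BODY: hypothetical helper that sends a JSON search body.
es_search() {
  curl -s -XGET --header 'Content-Type: application/json' \
    "$ES_URL/$1/_search?pretty=true" -d "$2"
}
```

Calling es_search students "$TOP_SCHOOLS_BODY" is roughly the SQL query with an added order by count(*) desc limit 3.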

Notice that we had to append .keyword to the field name, writing school.keyword. If we had used school alone, the search would produce this error:

Fielddata is disabled on text fields by default. Set fielddata=true on [school] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.

Basically, ElasticSearch is saying that aggregating on a text field would require building extra data (fielddata) and holding it in memory, which ordinary search queries do not need. The keyword sub-field stores the exact, unanalyzed value, so it can be aggregated directly.
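To check which fields have a keyword sub-field before writing an aggregation, you can ask ElasticSearch for the index mapping. This is a minimal sketch that assumes the students index loaded earlier; the helper name show_mapping and the ES_URL variable are made up for this example.

```shell
# parisx is the host used in this article; ES_URL is an assumed override.
ES_URL="${ES_URL:-http://parisx:9200}"

# show_mapping INDEX: hypothetical helper that prints the field mappings of an index.
show_mapping() {
  curl -s -XGET "$ES_URL/$1/_mapping?pretty=true"
}
```

If the index was created with default dynamic mapping, show_mapping students should list school as type text with a fields entry named keyword of type keyword, which is the sub-field the aggregation uses.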


