Machine Learning & Big Data Blog

ElasticSearch Nested Queries: How to Search for Embedded Documents

Curl elasticsearch commands.
3 minute read
Walker Rowe

ElasticSearch is annoyingly complicated at times. You can run a search and it runs the wrong results and you are not made aware of that.

This can happen when, for example, you have a nested JSON document, i.e., one JSON document inside another. This is because Lucene (i.e., ElasticSearch) query has no understanding of object hierarchy in a JSON document.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

The Problem with Searching for nested JSON objects

To illustrate the problem and the solution, download this program massAdd.py and change the URL to match your ElasticSearch environment. Then run it.

Then look at loaded data. You can see from the brackets that classes is a JSON array. But the index, as we will see, does not reflect that.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search?pretty}
"_index" : "universities",
"_type" : "universities",
"_id" : "428b71b0b935eaf58bdf874c819f263a919da5f3",
"_score" : 0.0,
"_source" : {
"firstName" : "Stephen",
"classes" : [
{
"name" : "French",
"grades" : 2
},
{
"name" : "French",
"grades" : 3
},
{
"name" : "French",
"grades" : 2
},
{
"name" : "physics",
"grades" : 6
},
{
"name" : "physics",
"grades" : 5
}
],
"lastName" : "Shakespeare",
"school" : "Arizona State University"
}
}
]
}

Notice the index mapping does not show the JSON array. This is not because there are no brackets []. Rather the nested word is missing.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities?pretty
...
"classes" : {
"properties" : {
"grades" : {
"type" : "long"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
...

So run a search using what you would think would the logical way to find students taking physics who got a grade of 1:

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search/?pretty=true -d '{
"query": {
"bool" : {
"must" : [
{ "match" : {"classes.name" : "physics"} },
{ "match" : {"classes.grades" :  1} }
]
}
}
}'

Produces this incorrect result. This is because Lucene flattens the whole document and finds both a classes.grades 1 and classes.name physics.

{
"_index" : "universities",
"_type" : "universities",
"_id" : "5846c01f90d80448ab087bf4f476cd3f8ab6f683",
"_score" : 1.621972,
"_source" : {
"classes" : [
{
"grades" : 2,
"name" : "physics"
},
{
"grades" : 2,
"name" : "French"
},
{
"grades" : 3,
"name" : "physics"
},
{
"grades" : 1,
"name" : "math"
}
],
"firstName" : "Stephen",
"lastName" : "Rowe",
"school" : "University of South Carolina - Columbia"
}
}

Making the Index Explicit

You can create an index in ElasticSearch just by loading data. That creates it on-the-fly. But you cannot control the way the index is created if you do that. So let’s make the index creation explicit and make clear that classes is a JSON array.

First delete the index, which will also delete all the data.

curl -XDELETE http://parisx:9200/universities

Then create a new index and make it explicit that classes is an object inside of the universities object by using “type” : “nested”. Note: that we are not filling in university data, just students. So ignore those fields. (We will use university data in another post and used it in the previous post).

curl -XPUT --header 'Content-Type: application/json' http://parisx:9200/universities -d '{
"mappings" : {
"universities" : {
"properties" : {
"Address" : { "type" : "text"},
"AdminEmail" : { "type" : "text"},
"AdminName" : { "type" : "text" },
"AdminPhone" : { "type" : "text" },
"DapipId" : { "type" : "text" },
"Fax" : { "type" : "text" },
"GeneralPhone" : { "type" : "text" },
"LocationName" : { "type" : "text" },
"LocationType" : { "type" : "text" },
"OpeId" : { "type" : "text" },
"ParentDapipId" : { "type" : "text" },
"ParentName" :  { "type" : "text" },
"UpdateDate" : { "type" : "text" },
"classes" : {
"type" : "nested",
"properties" : {
"name" : { "type" : "text"},
"grades" : { "type" : "integer" }
}
},
"firstName" : { "type" : "text" },
"lastName" : { "type" : "text" },
"school" : { "type" : "text" }
}
}
}
}'

Then load the data again by running massAdd.py again:

Now add the terms nested and path to the query to find students who got a grade of 1 in physics. The path is the parent object in the JSON, which in this case is classes. And use classes.name to search by class name:

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/universities/_search/?pretty=true -d '{
"query": {
"nested" : {
"path" : "classes",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : [
{ "match" : {"classes.name" : "physics"} },
{ "match" : {"classes.grades" : 1} }
]
}
}
}
}
}'

results in the correct result.

{
"_index" : "universities",
"_type" : "universities",
"_id" : "20ce682e16d0b88487addc9ceba2a695d57692aa",
"_score" : 2.188588,
"_source" : {
"lastName" : "Shakespeare",
"firstName" : "Julie",
"classes" : [
{
"grades" : 1,
"name" : "physics"
},
{
"grades" : 6,
"name" : "math"
}
],
"school" : "Arizona State University"
}
}
]

Learn ML with our free downloadable guide

This e-book teaches machine learning in the simplest way possible. This book is for managers, programmers, directors – and anyone else who wants to learn machine learning. We start with very basic stats and algebra and build upon that.


These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

Business, Faster than Humanly Possible

BMC empowers 86% of the Forbes Global 50 to accelerate business value faster than humanly possible. Our industry-leading portfolio unlocks human and machine potential to drive business growth, innovation, and sustainable success. BMC does this in a simple and optimized way by connecting people, systems, and data that power the world’s largest organizations so they can seize a competitive advantage.
Learn more about BMC ›

About the author

Walker Rowe

Walker Rowe is an American freelancer tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming. You can find Walker here and here.